Mutation count in multi-sample vcf file
1
0
Entering edit mode
5.5 years ago
LimMo ▴ 30

Hello all,

I have a one vcf file of 26 samples, what I need is to annotate this file and count the number of mutations for each gene for each sample of those 26.

I tried to use Annovar, but it seems that it doesn't support multi-sample vcf file.

Any suggestion or help, please?

VCF Annotate Mutations Count • 3.8k views
ADD COMMENT
0
Entering edit mode
5.5 years ago

I wrote groupbygene: http://lindenb.github.io/jvarkit/GroupByGene.html

your vcf must be annotated with VEP or SNPEFF.

$ curl -s -k "https://raw.github.com/arq5x/gemini/master/test/test4.vep.snpeff.vcf" |\
java -jar dist/groupbygene.jar |\
head | column  -t
#chrom  min.POS    max.POS    gene.name  gene.type         samples.affected  count.variations  M10475  M10478  M10500  M128215
chr10   52004315   52004315   ASAH2      snpeff-gene-name  2                 1                 0       0       1       1
chr10   52004315   52004315   ASAH2      vep-gene-name     2                 1                 0       0       1       1
chr10   52497529   52497529   ASAH2B     snpeff-gene-name  2                 1                 0       1       1       0
chr10   52497529   52497529   ASAH2B     vep-gene-name     2                 1                 0       1       1       0
chr10   48003992   48003992   ASAH2C     snpeff-gene-name  3                 1                 1       1       1       0
chr10   48003992   48003992   ASAH2C     vep-gene-name     3                 1                 1       1       1       0
chr10   126678092  126678092  CTBP2      snpeff-gene-name  1                 1                 0       0       0       1
chr10   126678092  126678092  CTBP2      vep-gene-name     1                 1                 0       0       0       1
chr10   135336656  135369532  CYP2E1     snpeff-gene-name  3                 2                 0       2       1       1
ADD COMMENT
0
Entering edit mode

Just to make sure, here "https://raw.github.com/arq5x/gemini/master/test/test4.vep.snpeff.vcf" it should be my vcf file after annotated by VEP or SnpEff, right?

ADD REPLY
1
Entering edit mode

it should be my vcf file after annotated by VEP or SnpEff,

yes

ADD REPLY
0
Entering edit mode

Could you please help me with the annotation, I'm new to that field. Are there clear steps for the annotation?

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

After annotating with SnpEff and running GroupByGene, I got this error:

[SEVERE][GroupByGene]Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
    at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
    at com.github.lindenb.jvarkit.util.vcf.VCFUtils.parseHeader(VCFUtils.java:184)
    at com.github.lindenb.jvarkit.tools.groupbygene.GroupByGene.read(GroupByGene.java:379)
    at com.github.lindenb.jvarkit.tools.groupbygene.GroupByGene.doWork(GroupByGene.java:694)
    at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1145)
    at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1303)
    at com.github.lindenb.jvarkit.tools.groupbygene.GroupByGene.main(GroupByGene.java:710)
[INFO][Launcher]groupbygene Exited with failure (-1)

What's could be the problem?

ADD REPLY
0
Entering edit mode

Thanks for the help, I try to run GroupByGene using the same command and file just to see the output but it gives this error. Also when I tried my file (I was concerning if the problem from my file but that also happen for the provided file):

[SEVERE][GroupByGene]Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
    at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
    at com.github.lindenb.jvarkit.util.vcf.VCFUtils.parseHeader(VCFUtils.java:184)
    at com.github.lindenb.jvarkit.tools.groupbygene.GroupByGene.read(GroupByGene.java:379)
    at com.github.lindenb.jvarkit.tools.groupbygene.GroupByGene.doWork(GroupByGene.java:694)
    at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1145)
    at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1303)
    at com.github.lindenb.jvarkit.tools.groupbygene.GroupByGene.main(GroupByGene.java:710)
[INFO][Launcher]groupbygene Exited with failure (-1)

Any suggestion or help, please?

ADD REPLY
0
Entering edit mode

what was the input file ? how is it formatted ? what is the output of grep -m1 CHROM your_input.vcf ?

ADD REPLY
0
Entering edit mode

The input file was the one after annotated using SnpEff, This is what I got for grep -m1 CHROM myfile.ann.vcf :

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  LIBRARY_NAME=S2000  2:LIBRARY_NAME=S2000    3:LIBRARY_NAME=S2000    4:LIBRARY_NAME=S2000    5:LIBRARY_NAME=S2000    6:LIBRARY_NAME=S2000    7:LIBRARY_NAME=S2000    8:LIBRARY_NAME=S2000    9:LIBRARY_NAME=S2000    10:LIBRARY_NAME=S2000   11:LIBRARY_NAME=S2000   12:LIBRARY_NAME=S2000   13:LIBRARY_NAME=S2000   14:LIBRARY_NAME=S2000   15:LIBRARY_NAME=S2000   16:LIBRARY_NAME=S2000   17:LIBRARY_NAME=S2000   18:LIBRARY_NAME=S2000   19:LIBRARY_NAME=S2000   20:LIBRARY_NAME=S2000   21:LIBRARY_NAME=S2000   22:LIBRARY_NAME=S2000   23:LIBRARY_NAME=S2000   24:LIBRARY_NAME=S2000   25:LIBRARY_NAME=S2000   26:LIBRARY_NAME=S2000
ADD REPLY
1
Entering edit mode

please post a fragment of your vcf, including the header in something like https://gist.github.com/ or whatever...

ADD REPLY
0
Entering edit mode

See here: /................./

ADD REPLY
1
Entering edit mode

this works on my machine:

 java -jar dist/groupbygene.jar your.vcf

#chrom  min.POS max.POS gene.name   gene.type   samples.affected    count.variations    10:LIBRARY_NAME=S2000   11:LIBRARY_NAME=S2000   12:LIBRARY_NAME=S2000   13:LIBRARY_NAME=S2000   14:LIBRARY_NAME=S2000   15:LIBRARY_NAME=S2000   16:LIBRARY_NAME=S2000   17:LIBRARY_NAME=S2000   18:LIBRARY_NAME=S2000   19:LIBRARY_NAME=S2000   20:LIBRARY_NAME=S2000   21:LIBRARY_NAME=S2000   22:LIBRARY_NAME=S2000   23:LIBRARY_NAME=S2000   24:LIBRARY_NAME=S2000   25:LIBRARY_NAME=S2000   26:LIBRARY_NAME=S2000   2:LIBRARY_NAME=S2000    3:LIBRARY_NAME=S2000    4:LIBRARY_NAME=S2000    5:LIBRARY_NAME=S2000    6:LIBRARY_NAME=S2000    7:LIBRARY_NAME=S2000    8:LIBRARY_NAME=S2000    9:LIBRARY_NAME=S2000    LIBRARY_NAME=S2000
1   10402   10440   DDX11L1 ann_gene_name   2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENSG00000223972 ann_gene_id 2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENSG00000227232 ann_gene_id 2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENST00000423562 ann_feature_id  2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENST00000438504 ann_feature_id  2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENST00000450305 ann_feature_id  2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENST00000456328 ann_feature_id  2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENST00000488147 ann_feature_id  2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENST00000515242 ann_feature_id  2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENST00000518655 ann_feature_id  2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENST00000538476 ann_feature_id  2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   ENST00000541675 ann_feature_id  2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
1   10402   10440   WASH7P  ann_gene_name   2   2   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
ADD REPLY
0
Entering edit mode

This is more logical, I'm thinking that the command you provided not correct:

file.vcf |\
java -jar dist/groupbygene.jar

This is what I have been tried and give me the error.

ADD REPLY
0
Entering edit mode
file.vcf |\
java -jar dist/groupbygene.jar

I didn't wrote this.

ADD REPLY
0
Entering edit mode

Sorry but that what I understand from your post above with:

$ curl -s -k "https://raw.github.com/arq5x/gemini/master/test/test4.vep.snpeff.vcf" |\
java -jar dist/groupbygene.jar
ADD REPLY
0
Entering edit mode

Hello Pierre,

I really appreciate this tool and it can be very useful for my project.

I wonder how it handles heterozygous vs homozygous variants. Will homozygous variants be counted as 2 variations with your tool?

Yiming

ADD REPLY

Login before adding your answer.

Traffic: 1656 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6