Multiple VCF file comparision
3
2
Entering edit mode
9.0 years ago
alok.helix ▴ 120

Hallo friends,

I am interested in documenting a given chromosomal position for a mutation across "n" numbers of vcf files. How can i compare multiple vcf file for a given mutation in the position?

Than king you.

vcf iontorrent SNP mutation genome • 4.7k views
ADD COMMENT
3
Entering edit mode
9.0 years ago

a very fast and efficient way of getting this information is with a simple grep. say you want to inspect position chr21:123456, then:

grep -P "^chr21\t123456\t" *vcf

you will then end up with a list of files and their lines for that particular position. note that if the variant is not present in a particular file you won't get any output from that file. you may want to check bcftools merge if you need a more elaborated solution, since it allows you to merge the vcf files filling empty variant sites with ./. across all files, so that you can query your position of interest being sure that no empty sites are present.

ADD COMMENT
0
Entering edit mode

Thanks Jorge, I too wondered, what if there was SNP at a given position in "x" or "y" file, but then u answered the same, I am looking forward to make a database of all the "valid" mutation across N number of patients is there anything simpler I could do in my sql or R's sql package across 100's such file? I hope am not complicating the query too much!!

ADD REPLY
1
Entering edit mode

I don't know what you mean with "valid" mutations, but

  • if you want all the sites on all files, then go for grep (that's my answer's section 1)
  • if you want all the sites that vary on all files then you have to intersect your files (that's airan's answer)
  • if you want all the sites that vary on at least one file then you have to merge your files (that's my answer's section 2)
ADD REPLY
0
Entering edit mode

Oh!! by valid, I meant sanger cross-checked

ADD REPLY
2
Entering edit mode
9.0 years ago
iraun 6.2k

If you want to intersect multiple vcf files into one there are different tools. I usually use vcf-isec function, inside vcftools package (http://vcftools.sourceforge.net/perl_examples.html). Then you can just grep your exact position and see what is going on across samples. This is an example of how it works vcf-isec:

# Include positions which appear in at least two files

vcf-isec-o -n +2 A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz 

Don't forget to first compress and tabix all your single vcf files:

bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz
ADD COMMENT
0
Entering edit mode

Thanks airan I will try this above command across a sample of 5-6 custom vcf files!! & see what is the output!! I will getback to you, meanwhile you can suggest on below mentioned comment for database building across multiple vcf files!!

ADD REPLY
0
Entering edit mode

upon running the suggested method, i got an empty vcf file and the message was

Empty fields in the header line, the column 11 is empty, removing.

at /usr/share/perl5/Vcf.pm line 185.

ADD REPLY
2
Entering edit mode
9.0 years ago
Len Trigg ★ 1.6k

If you are only looking for isolated SNPs then the above suggestions should be fine. If your mutations have any complexity (indels, MNPs, or if your mutation is nearby other mutations such that the variant caller may have produced a haplotype call), then you need to use more sophisticated tools.

At the very least, you will need to run your input VCFs through a normalization procedure such as vcfallelicprimitives prior to intersection, although there are still cases that can only handled with sophisticated comparison tools such as vcfeval from RTG Tools or RTG Core. (Although vcfeval is intended for pairwise comparisons so you would have to do successive intersections)/

ADD COMMENT

Login before adding your answer.

Traffic: 2970 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6