I need to intersect multiple VCF files. I have tried vcf-isec from VCFtools, and GATK CombineVariants followed by SelectVariants, as follows:
VCFtools:
vcf-isec -f -n =4 input1.vcf.gz input2.vcf.gz input3.vcf.gz input4.vcf.gz > output.vcf
GATK:
java -Xmx2g -jar GenomeAnalysisTK.jar -T CombineVariants -R refSequence.fasta --variant input1.vcf --variant input2.vcf --variant input3.vcf --variant input4.vcf -o output_combined.vcf
java -Xmx2g -jar GenomeAnalysisTK.jar -T SelectVariants -R refSequence.fasta -V:variant output_combined.vcf -select 'set=="Intersection";' -o output_intersected.vcf
However, VCFtools gave me 4464 SNPs common to all files, while GATK gave 4031 SNPs. The VCFtools result contains all SNPs identified by GATK plus 433 additional SNPs.
Where might this difference come from?
A different way of parsing indels? FILTERed variants? ...
Try printing the variants specific to each set,
and then go back to the VCFs to see the differences at those points.
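A minimal sketch of that comparison in Python, assuming both results are plain or gzipped VCF files and that a site can be identified by CHROM, POS, REF and ALT alone. The function names and the script name diff_sites.py are made up for illustration, and multi-allelic records are compared as written, without any normalization.

```python
import gzip
import sys


def read_sites(path):
    """Return the set of (CHROM, POS, REF, ALT) keys found in a VCF file."""
    # Assumption: files ending in .gz are gzip/bgzip compressed, others are plain text.
    opener = gzip.open if path.endswith(".gz") else open
    sites = set()
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
            sites.add((chrom, int(pos), ref, alt))
    return sites


def sites_only_in_first(first_path, second_path):
    """Return the sorted sites present in the first VCF but absent from the second."""
    return sorted(read_sites(first_path) - read_sites(second_path))


if __name__ == "__main__" and len(sys.argv) == 3:
    for chrom, pos, ref, alt in sites_only_in_first(sys.argv[1], sys.argv[2]):
        print(chrom, pos, ref, alt, sep="\t")
```

Run as `python diff_sites.py output.vcf output_intersected.vcf` to list the sites that only VCFtools reported, then look those positions up in the four input files and check their FILTER columns.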
Thank you for the fast response. I will definitely investigate this. Also, in the file produced by CombineVariants I found SNPs with
set=FilteredInAll
in the INFO column. Does this mean that GATK performs some additional filtering?
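To see how many records carry each label, the set= values in the combined file can be tallied. A rough sketch in Python, assuming an uncompressed VCF with the INFO field in the eighth column; the function name count_set_labels is made up for illustration.

```python
import sys
from collections import Counter


def count_set_labels(lines):
    """Tally the values of the set= key found in the INFO column of VCF records."""
    counts = Counter()
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        info = line.rstrip("\n").split("\t")[7]
        for entry in info.split(";"):
            if entry.startswith("set="):
                counts[entry[len("set="):]] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        for label, n in count_set_labels(handle).most_common():
            print(label, n, sep="\t")
```

Running it on output_combined.vcf prints one line per label with its record count, which should show how many records are not labelled Intersection and why.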