Variant Calls: Freebayes vs GATK
2.8 years ago
oars • 150
@oars41179

I'm calling variants from three input/reference files (dedup.bam, genome.fa, and chr17.cds.bed) and producing two VCF files: gatk.vcf and freebayes.vcf.

GATK returned more variants (POS) and more dbSNP variants. Scanning the files, you quickly notice that the quality scores differ, with a far greater range of QUAL values (both very low and very high) in the freebayes.vcf file. What factors contribute to the higher variant count in the GATK VCF compared to freebayes?

freebayes gatk .vcf • 1.9k views

Thanks! vcftools has a feature called --diff

vcftools --vcf SRR1611183.gatk.vcf --diff SRR1611183.freebayes.vcf --diff-site --out gatk_freebayes.diff


It creates a neat output file with the following contents:

CHROM   POS1    POS2    IN_File  REF1 REF2 ALT1 ALT2
chr17   5036281 5036281 B   G   G   C   C
chr17   5036732 .       1   C   .   A   .
chr17   5036740 5036740 B   C   C   G   G
chr17   5036748 5036748 B   G   G   T   T
chr17   .       5036761 2   .   A   .   C
chr17   .       5036784 2   .   C   .   T


I'm trying to find some consistent themes as to the variant calling discrepancies.
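
One way to start looking for themes is to tally the fourth column of that output, which marks each site as present in both files (B), only the first file (1), or only the second file (2). Here is a minimal Python sketch, assuming the whitespace-delimited layout shown above. The function name `tally_diff_sites` and the inlined sample text are only illustrative; in practice you would read the file vcftools wrote.

```python
from collections import Counter

# Rows copied from the vcftools output shown above.
SAMPLE = """\
CHROM   POS1    POS2    IN_File  REF1 REF2 ALT1 ALT2
chr17   5036281 5036281 B   G   G   C   C
chr17   5036732 .       1   C   .   A   .
chr17   5036740 5036740 B   C   C   G   G
chr17   5036748 5036748 B   G   G   T   T
chr17   .       5036761 2   .   A   .   C
chr17   .       5036784 2   .   C   .   T
"""

# With the command above, file 1 is the GATK VCF and file 2 is the freebayes VCF.
LABELS = {
    "B": "both files",
    "1": "first file only (GATK)",
    "2": "second file only (freebayes)",
}


def tally_diff_sites(text):
    """Count sites by the value in the fourth column (B, 1, or 2)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] == "CHROM":
            continue  # skip the header and any blank lines
        counts[fields[3]] += 1
    return counts


counts = tally_diff_sites(SAMPLE)
for code, n in sorted(counts.items()):
    print(f"{LABELS.get(code, code)}: {n}")
```

From there, the sites flagged 1 or 2 can be pulled out and inspected in each VCF to see what the caller-specific calls have in common.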


When examining the tail end of each record (the FORMAT and sample columns that follow INFO), you'll notice a difference between GATK and freebayes:

GATK

Freebayes
GT:DP:RO:QR:AO:QA:GL    1/1:64:0:0:64:2261:-5,-5,0


Can anyone decipher this information?


Hello,

Have a look at the header of your VCF files. All of these entries should be described in the FORMAT header lines.

fin swimmer
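
To illustrate, the FORMAT string and the sample string pair up position by position, so the freebayes example quoted above can be decoded with a few lines of Python. This is only a sketch: the key descriptions are my recollection of what the freebayes header says, so treat them as assumptions and check the `##FORMAT` lines in your own file. The function name `decode_sample` is made up for this example.

```python
# Descriptions as recalled from a freebayes VCF header (assumption; verify
# against the ##FORMAT lines of your own file).
FORMAT_KEYS = {
    "GT": "genotype",
    "DP": "read depth",
    "RO": "reference allele observation count",
    "QR": "sum of quality of the reference observations",
    "AO": "alternate allele observation count",
    "QA": "sum of quality of the alternate observations",
    "GL": "genotype likelihoods, log10-scaled",
}


def decode_sample(format_field, sample_field):
    """Pair each FORMAT key with the value at the same position in the sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


decoded = decode_sample("GT:DP:RO:QR:AO:QA:GL", "1/1:64:0:0:64:2261:-5,-5,0")
for key, value in decoded.items():
    print(f"{key} = {value}  ({FORMAT_KEYS.get(key, 'see header')})")
```

Read this way, the example is a homozygous-alternate call (1/1) at depth 64, with 0 reference observations and 64 alternate observations.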


How big is the difference of the number of variants between them?

One reason may be that freebayes reports multiple variants that are close together as a single haplotype if they can be assigned to one allele, whereas GATK may report every change separately.

Fin swimmer
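
To make that concrete: a single multi-base record counts as one variant, but the same change written base by base counts as several, so raw record counts are not directly comparable. Below is a minimal Python sketch of splitting an equal-length substitution into per-base SNPs before comparing. The function `decompose_mnp` and the example record are hypothetical; dedicated tools such as vcflib's `vcfallelicprimitives` handle the general case (including indels), which this sketch does not.

```python
def decompose_mnp(chrom, pos, ref, alt):
    """Split an equal-length REF/ALT substitution into per-base SNPs.

    Returns a list of (chrom, pos, ref_base, alt_base) tuples, one for each
    position where the two alleles differ.
    """
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length substitutions")
    return [
        (chrom, pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base
    ]


# Hypothetical record: one 3-base haplotype call in which two bases change.
snps = decompose_mnp("chr17", 1000, "GAC", "TAG")
print(snps)
```

Here one haplotype-style record becomes two SNP records, which is the kind of difference that can inflate one caller's count relative to the other.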

2.8 years ago
vdauwera • 920
@vdauwera4658

Keep in mind that the GATK variant callers are designed to be as sensitive as possible and will therefore include many false positives, so you need to apply some filtering steps after calling to remove those false positives, as described in the GATK Best Practices. It's essentially impossible to answer your question without knowing more about how you did the variant calling in both cases, and what kind of filtering and evaluation you did on the results.
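
As a rough illustration of what a filtering step does, here is a small Python sketch that flags a record from its INFO annotations. The thresholds (QD < 2.0, FS > 60.0, MQ < 40.0) are used here as assumed starting points for SNP hard filtering, the INFO strings are made up, and `failed_filters` is a hypothetical helper; it is not a substitute for the filtering tools and recommendations in the Best Practices.

```python
def parse_info(info):
    """Turn an INFO string such as 'DP=64;QD=28.5;DB' into a dict."""
    annotations = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        annotations[key] = value if sep else True  # flags have no value
    return annotations


def failed_filters(info):
    """Return the names of the annotations that fail the assumed thresholds."""
    ann = parse_info(info)
    failed = []
    if "QD" in ann and float(ann["QD"]) < 2.0:   # low quality by depth
        failed.append("QD")
    if "FS" in ann and float(ann["FS"]) > 60.0:  # strong strand bias
        failed.append("FS")
    if "MQ" in ann and float(ann["MQ"]) < 40.0:  # low mapping quality
        failed.append("MQ")
    return failed


# Made-up INFO strings for illustration only.
good = failed_filters("DP=64;FS=0.0;MQ=60.0;QD=28.5;DB")
bad = failed_filters("DP=12;FS=75.3;MQ=35.0;QD=1.4")
print(good, bad)
```

A record with an empty list passes all three checks; anything else would be marked as filtered rather than silently dropped.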

It's also important to understand that QUAL scores are calculated differently by different variant callers, so it's tricky to compare them directly. You'll get more insights from evaluating your results relative to known callsets or truth sets.
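
For example, once each callset is reduced to a set of (chrom, pos, ref, alt) keys, comparing it against a truth set is simple set arithmetic. The sketch below uses made-up variants and a hypothetical helper `compare_to_truth`; it assumes both sets use the same variant representation, which in practice requires normalization first.

```python
def compare_to_truth(calls, truth):
    """Compute true/false positives, false negatives, precision, and recall."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    precision = tp / (tp + fp) if calls else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Made-up variants for illustration only.
truth = {
    ("chr17", 100, "A", "G"),
    ("chr17", 200, "C", "T"),
    ("chr17", 300, "G", "A"),
    ("chr17", 400, "T", "C"),
}
calls = {
    ("chr17", 100, "A", "G"),
    ("chr17", 200, "C", "T"),
    ("chr17", 300, "G", "A"),
    ("chr17", 150, "A", "T"),
    ("chr17", 250, "G", "C"),
}

result = compare_to_truth(calls, truth)
print(result)
```

Running the same comparison for each caller's output gives numbers that can be compared directly, unlike the QUAL scores.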


Many thanks for your reply! Here are the two call scripts, maybe this would be insightful:

$GATK HaplotypeCaller -I SRR1611183.dedup.bam -O SRR1611183.gatk.vcf -R genome.fa -L chr17.cds.bed

and for freebayes:

$ freebayes -f genome.fa -m 20 -q 10 -t chr17.cds.bed SRR1611183.dedup.bam > SRR1611183.freebayes.vcf