Question

Gatk Multi-Sample Vcf Variantfiltration

4

Entering edit mode

10.9 years ago

William ★ 5.3k

How does GATK VariantFiltration work on multi-sample vcf files?

VariantFiltration is used to annotate likely false positive SNP's based on certain formula's:

--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)"  --filterName "HARD_TO_VALIDATE" 

--filterExpression "DP < 5 " --filterName "LowCoverage" 

--filterExpression "QD < 1.5 " --filterName "LowQD" 

--filterExpression "SB > -10.0 " --filterName "StrandBias"

--filterExpression "QUAL > 30.0 && QUAL < 50.0 " --filterName "LowQual" 

--clusterWindowSize 10

It is easy enough to understand how this works on single sample VCF files, but how does this work on multi-sample vcf files?

For example the low coverage filter, will it annotate the SNP low coverage if

a) all of the samples combined in total have less than 5 reads for the SNP

b) each of the samples has less than 5 reads

c) one or more of the samples has less than 5 reads.

The same for the MQ0, QD and SB annotation. Are they set when:

a) all of the samples combined reach the threshold

b) each sample it self reaches the threshold

c) or one or more of the samples reach the threshold

The lowqual and snpcluster annotation are set I guess based on all samples combined.

gatk vcf • 11k views

ADD COMMENT • link updated 9.3 years ago by Biostar 20 • written 10.9 years ago by William ★ 5.3k

score 2 · Answer 1 · 2013-05-26

not a direct answer to your question: I wrote a java tools filtering the VCF with javascript. See

https://github.com/lindenb/jvarkit/wiki/VCFFilterJS

With java script, you'll be able to declare some functions and run some loops. It uses the javascript-rhino technology: http://en.wikipedia.org/wiki/Rhino_%28JavaScript_engine%29

for example, say you want to keep the variations from https://raw.github.com/jamescasbon/PyVCF/master/vcf/test/gatk.vcf having at leat two samples with DP < 200. The script would be:

function myfilterFunction()
    {
    var samples=header.genotypeSamples;
    var countOkDp=0;


    for(var i=0; i< samples.size();++i)
        {
        var sampleName=samples.get(i);
        if(! variant.hasGenotype(sampleName)) continue;
        var genotype = variant.genotypes.get(sampleName);
        if( ! genotype.hasDP()) continue;
        var dp= genotype.getDP();
        if(dp < 200 ) countOkDp++;
        }
    return (countOkDp>2)
    }
myfilterFunction();

Example:

$ curl -s "https://raw.github.com/jamescasbon/PyVCF/master/vcf/test/gatk.vcf" | java -jar  dist/vcffilterjs.jar  -f filter.js | grep -v "##"

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    BLANK    NA12878    NA12891    NA12892    NA19238    NA19239    NA19240
chr22    42526449    .    T    A    151.47    .    AC=1;AF=0.071;AN=14;BaseQRankSum=2.662;DP=1226;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=41.2083;MQ=240.47;MQ0=0;MQRankSum=0.578;QD=4.89;ReadPosRankSum=3.611    GT:AD:DP:GQ:PL    0/1:23,8:31:99:190,0,694    0/0:188,0:190:99:0,478,5376    0/0:187,0:187:99:0,493,5322    0/0:247,0:249:99:0,634,6728    0/0:185,0:185:99:0,487,5515    0/0:202,0:202:99:0,520,5857    0/0:181,1:182:99:0,440,5362
chr22    42526634    .    T    C    32.60    .    AC=1;AF=0.071;AN=14;BaseQRankSum=1.147;DP=1225;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=50.0151;MQ=240.65;MQ0=0;MQRankSum=1.151;QD=1.30;ReadPosRankSum=1.276    GT:AD:DP:GQ:PL    0/1:21,4:25:71:71,0,702    0/0:187,2:189:99:0,481,6080    0/0:233,0:233:99:0,667,7351    0/0:230,0:230:99:0,667,7394    0/0:174,1:175:99:0,446,5469    0/0:194,2:196:99:0,498,6239    0/0:174,0:175:99:0,511,5894
chr22    42527793    rs1080989    C    T    3454.66    .    AC=2;AF=0.167;AN=12;BaseQRankSum=-3.007;DB;DP=1074;DS;Dels=0.01;FS=0.000;HRun=1;HaplotypeScore=75.7865;MQ=209.00;MQ0=0;MQRankSum=3.014;QD=9.36;ReadPosRankSum=0.618    GT:AD:DP:GQ:PL    ./.    0/1:72,90:162:99:1699,0,1767    0/1:103,96:202:99:1756,0,2532    0/0:188,0:188:99:0,526,5889    0/0:160,0:160:99:0,457,4983    0/0:197,0:198:99:0,544,6100    0/0:156,0:156:99:0,439,5041

score 0 · Answer 2 · 2014-02-13

As I understand the correct answer is a) all of the samples combined in total have less than 5 reads for the SNP

It uses the DP flag in the info column of the VCF file, and this is the combined depth over all samples. Note, however, that it can be smaller than the total number of raw reads that map there because the GATK genotypers apply some read filtering before SNP calling.

Also the other statistics are calculated over all samples, as I understand.