Good filtering practices for VCF (Bcftools 1.3)
0
2
Entering edit mode
8.2 years ago
umn_bist ▴ 390

I have a tumor sample without matches. This sample is most likely impure without a certain ploidy, copy #. I am wondering for such samples I am only filtering based on allele frequency, mapping quality, read depth. But I was a bit uncertain with some options on BCFtools 1.3 (for the new call function that replaced view) that I thought you guys might know.

This is the option I am using

bcftools call -Amv > out.vcf
  • -A output all alternate alleles present in the alignments even if they do not appear in any of the genotypes.
  • -v output variant sites only.
  • -m alternative model for multiallelic and rare-variant calling designed to overcome known limitations in -c calling model. Is it recommended to just use -m for purpose of simply calling variants? Between using the old bcftools view -vc and the new -m option, I get a lot less variants.
chr1    150908526    .    A    G    38.3354    .    DP=2;VDB=0.1;SGB=-0.453602;MQ0F=0;AC=2;AN=2;DP4=0,0,2,0;MQ=50    GT:PL    1/1:66,6,0
chr1    150910145    .    A    G    8.79264    .    DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,0,1;MQ=50    GT:PL    0/1:36,3,0
chr1    150916393    .    T    C    27.3777    .    DP=5;VDB=0.74;SGB=-0.453602;RPB=0.5;MQB=1;MQSB=1;BQB=0.5;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=2,0,1,1;MQ=50    GT:PL    0/1:60,0,68
chr1    150923826    .    C    T    8.79264    .    DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,1,0;MQ=50    GT:PL    0/1:36,3,0
chr1    150924171    .    A    G    8.79264    .    DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,0,1;MQ=50    GT:PL    0/1:36,3,0
chr1    150924181    .    A    G    8.79264    .    DP=1;SGB=-0.379885;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,0,0,1;MQ=50    GT:PL    0/1:36,3,0
chr1    150924752    .    C    T    22.6577    .    DP=2;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=50    GT:PL    1/1:50,3,0
chr1    150925702    .    taaaaaaaaa    tAAaaaaaaaaa,tAaaaaaaaaa    5.83735    .    INDEL;IDV=3;IMF=0.230769;DP=5;VDB=0.516498;SGB=-0.511536;MQSB=1;MQ0F=0;AC=2,0;AN=2;DP4=1,0,2,1;MQ=50    GT:PL    1/1:32,0,1,30,0,32

Is there a reason why I am not getting GQ info for each read? What worries me is that MQ for all are 50 (?) and MQ0F = 0.

As for novel mutation rate, is this something I should define to improve my variant calling?

-n, --novel-rate float[,...]

likelihood of novel mutation for constrained -C trio calling. The trio genotype calling maximizes likelihood of a particular combination of genotypes for father, mother and the child P(F=i,M=j,C=k) = P(unconstrained) Pn + P(constrained) (1-Pn). By providing three values, the mutation rate Pn is set explictly for SNPs, deletions and insertions, respectively. If two values are given, the first is interpreted as the mutation rate of SNPs and the second is used to calculate the mutation rate of indels according to their length as Pn=float*exp(-a-b*len), where a=22.8689, b=0.2994 for insertions and a=21.9313, b=0.2856 for deletions [pubmed:23975140]. If only one value is given, the same mutation rate Pn is used for SNPs and indels.

Man page for new bcftools.

This is the filtering I am using, provided by GATK and slightly tweaked to include only >2 DP.

--filterExpression "DP > 2.0 || QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
RNA-Seq VCF bcftools samtools • 5.0k views
ADD COMMENT

Login before adding your answer.

Traffic: 2714 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6