Biostar Beta. Not for public use.
Question: Applying hard filters for variants

I am currently working on influenza virus and Ebola virus. I have 45 virus samples, so I have 45 BAM files aligned to the influenza reference genome (FASTA). I called variants with UnifiedGenotyper:

java -Xmx16g -Djava.io.tmpdir=$out_folder/tmp -jar GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-nt 12 \
-dcov 10000 \
-glm BOTH \
-R influenza.fa \
-l INFO \
-o A_California_Influenza_Virus.raw.vcf \
--sample_ploidy 1

I got a single raw VCF file (A_California_Influenza_Virus.raw.vcf) containing all 45 samples. The raw VCF file has 1400 records.

Following the GATK best practices pipeline paper, I chose the hard filtering option recommended for small datasets.

Is my set of VCF records small enough that hard filtering is the right choice?

Then I selected only the SNPs into a separate VCF file.

java -jar /data1/software/gatk/current/GenomeAnalysisTK.jar \
-T SelectVariants \
-R A_California_Influenza_Virus_H1N1.fa \
-V A_California_Influenza_Virus.raw.vcf \
-selectType SNP \
-o VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf
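
To make the SNP selection step concrete, here is a minimal Python sketch of the idea, not GATK's actual implementation. It assumes a record counts as a SNP when REF and every ALT allele is a single base; GATK's own variant typing (for example, how it handles mixed multi-allelic sites) may differ. The function names `is_snp` and `select_snps` are hypothetical.

```python
def is_snp(ref, alt):
    """Return True when REF and every ALT allele is a single base.

    Assumption for this sketch: single-base REF plus single-base ALTs
    means SNP. GATK's own type classification may differ in edge cases.
    """
    bases = set("ACGTN")
    alts = alt.split(",")  # multi-allelic ALTs are comma-separated
    return (
        len(ref) == 1
        and ref.upper() in bases
        and all(len(a) == 1 and a.upper() in bases for a in alts)
    )


def select_snps(vcf_lines):
    """Keep header lines plus the records whose REF/ALT look like SNPs."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        if is_snp(fields[3], fields[4]):
            kept.append(line)
    return kept
```

Calling `select_snps` on the lines of a raw VCF would return the header plus the SNP-like records, which is roughly what the SelectVariants command above writes to the `.raw.snps.vcf` file.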

Then I applied hard filtering to the SNPs.

java -jar GenomeAnalysisTK.jar \
-T VariantFiltration \
-R A_California_Influenza_Virus_H1N1.fa \
-V VariantFiltering/A_California_Influenza_Virus.raw.snps.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filterName "myfilter1" \
-o VariantFiltering/A_California_Influenza_Virus.filtered.snps.vcf

I understand that variants matching any of the above conditions are considered bad variants.
What does QD < 2.0 mean?
What does FS > 60.0 mean?
What does MQ < 40.0 mean?
What does MQRankSum < -12.5 mean?
What does ReadPosRankSum < -8.0 mean?
What are the threshold values for high-confidence variants for QD, FS, MQ, MQRankSum, ReadPosRankSum, and DP?
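
As a rough reading of the filter expression, each clause tests one INFO annotation against a cutoff, and a record is flagged with the filter name when any clause is true. As far as I understand the GATK annotation descriptions, QD is the variant confidence normalized by depth, FS is a Phred-scaled p-value from Fisher's exact test for strand bias, MQ is the root-mean-square mapping quality, and MQRankSum and ReadPosRankSum are Z-scores from rank sum tests comparing alt-supporting and ref-supporting reads (by mapping quality and by position in the read, respectively). The Python sketch below only mirrors that logic; it is not how GATK evaluates its JEXL expressions. The names `THRESHOLDS`, `parse_info` and `failed_tests` are hypothetical, and skipping a missing annotation is an assumption of this sketch that may not match GATK's behaviour.

```python
import operator

# Thresholds copied from the --filterExpression in the command above.
THRESHOLDS = [
    ("QD", operator.lt, 2.0),              # QD < 2.0
    ("FS", operator.gt, 60.0),             # FS > 60.0
    ("MQ", operator.lt, 40.0),             # MQ < 40.0
    ("MQRankSum", operator.lt, -12.5),     # MQRankSum < -12.5
    ("ReadPosRankSum", operator.lt, -8.0), # ReadPosRankSum < -8.0
]


def parse_info(info):
    """Turn 'QD=1.5;FS=3.2;DB' into {'QD': '1.5', 'FS': '3.2'}.

    Flag entries without '=' (such as DB) are ignored.
    """
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            out[key] = value
    return out


def failed_tests(info):
    """List the annotations in an INFO string that trip a threshold."""
    annotations = parse_info(info)
    failed = []
    for name, compare, cutoff in THRESHOLDS:
        if name not in annotations:
            continue  # assumption: a missing annotation is simply skipped
        try:
            value = float(annotations[name])
        except ValueError:
            continue  # non-numeric value, skipped in this sketch
        if compare(value, cutoff):
            failed.append(name)
    return failed
```

For example, `failed_tests("QD=1.5;FS=3.2;MQ=55.0")` returns `["QD"]`, so under this reading the record would be marked `myfilter1` in the FILTER column, while a record with an empty result would be marked `PASS`.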

Asked 4.3 years ago by bioinforesearchquestions; updated 8 months ago by Biostar
