GATK - gene panel VCF - too many variants - how to filter out
3
3
Entering edit mode
7.0 years ago
stanedav ▴ 50

Hello guys,

I have question about filtering NGS data with GATK workflow. I have data from NGS Illumina (Sureselect kit), it is targeted gene panel (about 112 genes), usually we are using two softwares - Surecall and Nextgene with default settings. Output of this analyses is VCF file with about 500 variants.

Now I am trying to implement GATK workflow from fastq to VCF. (this workflow: https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS)

Basically it is - bam created by bwa-mem, then I remove duplicates and do base recalibration. Then I use Haplotype Caller for creating gvcf and finally VCF.

Last step that I do, is hard filtering raw snps VCF with filters from GATK -

--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"

Problem is, that after this step my vcf has about 40 000 variants passing the filter - do you guys have any idea, where I need to change parameters to get about same amount of variants as running by softwares like SureCall?

Thanks for any tips :)

NGS GATK VCF • 4.1k views
ADD COMMENT
0
Entering edit mode

Hello,

40 000 variants in 112 genes. That's much to much. Can you post the commands you use for the alignment and variant calling? And also the first few entrys of your vcf file are interessting.

fin swimmer

ADD REPLY
0
Entering edit mode
6.9 years ago
stanedav ▴ 50

Here is my complete workflow. fastq data are from Sureselect kit:

./bwa mem -M -R \
    @RG\\tID:ID_1234\\tSM:SM_1234\\tLB:LB_1234\\tPL:Illumina\\tPI:150 '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    '/1234/1234_1.fastq' '/1234/1234_2.fastq' > '/1234/1234.sam'  

samtools sort -@ 8 \
    -T /path_to_temp/ \
    -o /1234/1234.bam

java -jar build/libs/picard.jar MarkDuplicates \
    INPUT='/1234/1234.bam' \
    OUTPUT='/1234/1234_dedup.bam' \
    METRICS_FILE='/1234/1234_metrics'

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup.bam' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/dbsnp_138.hg19.vcf' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf' \
    -knownSites '/home/dee/bioinformatics/_reffiles/hg19/1000G_phase1.indels.hg19.vcf' \
    -o '/1234/1234_recal.table'

java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup.bam' \
    -o '/1234/1234_remdup_recal.bam' \
    -BQSR '/1234/1234_recal.table'

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -I '/1234/1234_remdup_recal.bam' \
    -o '/1234/1234_remdup_recal.g.vcf' \
    --emitRefConfidence GVCF \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000

java -jar GenotypeAnalysisTK.jar -T GenotypeGVCFs  \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -o '/1234/1234.vcf' \
    --variant '/1234/1234_remdup_recal.g.vcf' \

java -jar GenomeAnalysisTK.jar -T VariantFiltration \
    -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \
    -V '/1234/1234.vcf' \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "GATKdef" \
    -o '/1234/1234_filtered.vcf'
ADD COMMENT
4
Entering edit mode

You should use bed flle from SureSelect kit.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \ -R '/home/dee/bioinformatics/_reffiles/hg19/ucsc.hg19.fasta' \ -I '/1234/1234_remdup_recal.bam' \ -o '/1234/1234_remdup_recal.g.vcf' \ --emitRefConfidence GVCF \ --variant_index_type LINEAR \ --variant_index_parameter 128000 -L sureselect.bed

Another question, If you have only one sample then don't need to use --emitRefConfidence GVCF as well and can ignore GenotypeGVCFs step as well.

ADD REPLY
0
Entering edit mode

Is this just a typo here? Your output file from MarkDuplicates is 1234_dedup.bam but in the next steps you use 1234_remdup.bam as the input file name.

fin swimmer

ADD REPLY
0
Entering edit mode

Yes, just typo, sorry...

ADD REPLY
0
Entering edit mode
6.9 years ago
Robert Sicko ▴ 630

targeted gene panel (about 112 genes)

Do you specify a .bed file in SureCall and Nextgene? I'd bet your abundance of variants in your pipeline come from regions outside your target region. You can filter your filtered vcf for your target region using the bed file for your Sureselect kit.

ADD COMMENT
0
Entering edit mode
6.9 years ago
stanedav ▴ 50

Bed file solved the problem! You are best guys!

ADD COMMENT

Login before adding your answer.

Traffic: 2039 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6