Question

Understanding GATK evaluation metrics

3.7 years ago

geneart$$ ▴ 50

Hi all, I recently ran the GATK pipeline on a small set of WES samples. Because I had only 10 samples, I used hard filtering, and then had GATK generate the metrics file. I used the default filtering parameters for the first pass; this is the command:

gatk VariantFiltration \
  -V myfam_output_sorted_annotated.snps.vcf \
  -filter "QD < 2.0" --filter-name "FILTER_QD2" \
  -filter "QUAL < 30.0" --filter-name "FILTER_QUAL30" \
  -filter "SOR > 3.0" --filter-name "FILTER_SOR3" \
  -filter "FS > 60.0" --filter-name "FILTER_FS60" \
  -filter "MQ < 40.0" --filter-name "FILTER_MQ40" \
  -filter "MQRankSum < -12.5" --filter-name "FILTER_MQRS-12.5" \
  -filter "ReadPosRankSum < -8.0" --filter-name "FILTER_RPRS-8" \
  -O myfam_output_sorted_filtered.snps.vcf

After running CollectVariantCallingMetrics, the metrics file gave me these values:

TOTAL_INDELS    DBSNP_INS_DEL_RATIO    NOVEL_INS_DEL_RATIO
7902            0.811012               0.528926

TOTAL_SNPS    DBSNP_TITV    NOVEL_TITV
49556         2.286389      1.522989

My questions: 1. Do I need to be concerned about these numbers? They do not fall within the ranges suggested by GATK, given below:

Filtering for    Indel Ratio
common           ~1
rare             0.2-0.5

Sequencing Type    # of Variants*    TiTv Ratio
WGS                ~4.4M             2.0-2.1
WES                ~41k              3.0-3.3

2. Do I need to change the filtering parameters? It looks like I may have a high number of false positives.
3. Why is the dbSNP TiTv that low? I used the dbSNP file from GATK, extracted only the chromosomes of interest to make a subset dbSNP file, and used that for CollectVariantCallingMetrics.
4. Above all, do I even need to worry this much about these numbers? If I can view my variants in IGV against the BAM, and also view them in a variant viewer with ClinVar and dbSNP for those specific regions, and I can see support from those databases for some of these SNPs, would that not be robust enough? My point is: how much do we rely on these numbers, and to what extent do we keep filtering and polishing?
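For anyone who wants to sanity-check the TiTv figures independently of CollectVariantCallingMetrics, here is a minimal sketch that counts transitions and transversions directly from the filtered VCF. The function name `titv_ratio` is hypothetical (it is not part of GATK or Picard), and the sketch assumes an uncompressed VCF, only looks at biallelic SNPs whose FILTER column is PASS or ".", and does not split known from novel sites, so the result will not match the DBSNP_TITV and NOVEL_TITV columns exactly.

```shell
# titv_ratio: hypothetical helper, not part of GATK or Picard.
# Reads an uncompressed VCF on stdin and prints "transitions transversions ratio"
# for biallelic SNPs whose FILTER column is PASS or ".".
titv_ratio() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    $7 != "PASS" && $7 != "." { next }             # keep unfiltered records only
    length($4) != 1 || length($5) != 1 { next }    # skip indels and multiallelic sites
    {
      pair = toupper($4 $5)
      if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ti++
      else if (pair ~ /^[ACGT][ACGT]$/) tv++       # any other base change is a transversion
    }
    END {
      if (tv > 0) printf "%d %d %.3f\n", ti, tv, ti / tv
      else        printf "%d %d NA\n", ti, tv
    }
  '
}
```

It could be run as `titv_ratio < myfam_output_sorted_filtered.snps.vcf`, and again on the unfiltered VCF to see how much the filters shift the ratio.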

Any suggestions would help. Thank you!

next-gen snp GATK metrics • 1.8k views

3.0 years ago by geneart$$ ▴ 50

Did you use the correct exome kit interval BED file across the different GATK tools? Did you try filtering with VQSR and recalculating TiTv?

3.0 years ago by Nicolas Rosewick 10k

Yes. I got the interval BED file by checking with the vendor's tech support. And yes, I eventually used VQSR to filter and recalculated TiTv.

3.0 years ago by geneart$$ ▴ 50

score 0 · Answer 1 · 2021-02-19

3.2 years ago

cdm4672 • 0

Hello. I have these same questions. Did you ever find the answer? The metrics are useless without knowing which thresholds are expected and/or ideal.

3.2 years ago by cdm4672 • 0

Hi cdm4672, nope, I could not find an answer. I decided that my numbers were pretty close to the GATK metrics, if not exactly in range. I also followed up the interesting SNPs by validating them on external sites and visualizing them in IGV to make sure they were true SNPs. So in essence I moved on ... :) But it would have been nice to find a better answer.

I have shown the default parameters here. I guess you could play with your filtration parameters and see if the metrics improve any further?
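Before changing thresholds, it may help to see which of the named filters is actually removing sites. This is only a rough sketch: `filter_tally` is a hypothetical helper, not a GATK tool, and it assumes an uncompressed VCF whose FILTER column holds the names given with `--filter-name` (several names separated by ";" when a site fails more than one filter).

```shell
# filter_tally: hypothetical helper, not a GATK tool.
# Reads an uncompressed VCF on stdin and prints "count<TAB>filter_name",
# counting each name separately when a record carries several (e.g. "A;B").
filter_tally() {
  awk -F'\t' '
    /^#/ { next }                                   # skip header lines
    {
      n = split($7, names, ";")                     # FILTER is column 7
      for (i = 1; i <= n; i++) count[names[i]]++
    }
    END { for (name in count) printf "%d\t%s\n", count[name], name }
  ' | sort -k1,1nr -k2,2                            # largest counts first
}
```

Running `filter_tally < myfam_output_sorted_filtered.snps.vcf` would list PASS alongside each FILTER_* name, which gives a starting point for deciding which single threshold to loosen or tighten and then re-check.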

ADD REPLY • link 3.2 years ago by geneart$$ ▴ 50