Question

Can We Call Variants Using HaplotypeCaller Or UnifiedGenotyper Without A dbSNP VCF File?

10.3 years ago

ivivek_ngs ★ 5.2k

Dear All,

Can I call variants with the HaplotypeCaller (HC) or UnifiedGenotyper (UG) walker of GATK without supplying a dbSNP file? I want to call variants separately for each of my normal, tumor and IPS samples and then find the mutations exclusive to the tumor and IPS samples. I am not particularly interested in finding novel mutations. If I do not provide the dbSNP file, is there any downstream impact other than not being able to distinguish novel from known variants? If anyone has run the callers this way, I would appreciate your suggestions.

exome-sequencing variant-calling gatk • 4.3k views

updated 2.9 years ago by Ram 43k • written 10.3 years ago by ivivek_ngs ★ 5.2k

Ram · Answer 1 · 2014-02-14

2

10.3 years ago

Jorge Amigo 14k

Check the GATK docs for the HaplotypeCaller and UnifiedGenotyper walkers. They both state the same thing: the dbSNP file is used only for annotation purposes.

dbSNP file. rsIDs from this file are used to populate the ID column of the output. Also, the DB INFO flag will be set when appropriate. dbSNP is not used in any way for the calculations themselves. --dbsnp binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

The VQSR step is a different matter: if you want to perform it, dbSNP is a useful resource of known variation.
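The annotation-only role described in the docs can be illustrated with a minimal Python sketch. It is not GATK's implementation: the record layout, the function name `annotate_with_dbsnp` and the rsID are hypothetical, and known sites are matched on an exact (chrom, pos, ref, alt) key as a simplifying assumption. The point is only that the set of calls and their qualities are the same with or without the dbSNP lookup.

```python
def annotate_with_dbsnp(calls, dbsnp):
    """Fill in the ID and a DB flag for known sites; the calls themselves are not changed."""
    annotated = []
    for call in calls:
        key = (call["chrom"], call["pos"], call["ref"], call["alt"])
        rsid = dbsnp.get(key)
        record = dict(call)  # copy so the input record is not modified
        record["id"] = rsid if rsid is not None else "."  # "." is the VCF missing value
        record["db"] = rsid is not None  # mimics the DB INFO flag
        annotated.append(record)
    return annotated


# Hypothetical data, for illustration only.
calls = [
    {"chrom": "chr1", "pos": 1000, "ref": "A", "alt": "G", "qual": 50.0},
    {"chrom": "chr1", "pos": 2000, "ref": "C", "alt": "T", "qual": 80.0},
]
dbsnp = {("chr1", 1000, "A", "G"): "rs0000001"}  # made-up rsID

annotated = annotate_with_dbsnp(calls, dbsnp)
without_dbsnp = annotate_with_dbsnp(calls, {})  # same calls, no known-sites resource
```

In both runs the same two calls come back with the same `qual` values; only `id` and `db` differ.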

updated 2.9 years ago by Ram 43k • written 10.3 years ago by Jorge Amigo 14k

Thank you very much for the reply. I understand that dbSNP is not used in any of the calculations, but is it advisable to skip the VQSR step, where training sets are supplied to identify the true-positive SNP calls? My plan is to call variants separately for my normal, tumor and IPS samples, then subtract the variants found in the normal from those in the tumor and IPS so that only somatic mutations remain, and finally compare the tumor and IPS to assess how consistent their mutational landscapes are. Can I skip the VQSR step in this workflow? As I understand it, in layman's terms, VQSR only shows how my known and novel variants are positioned within the clusters of the Gaussian mixture model, so that true positives can be separated from false positives. Is that right? I would like to have your suggestion, @Jorge Amigo.
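The subtraction plan described above can be sketched with plain set operations in Python. This is a minimal illustration under stated assumptions, not a validated somatic-calling workflow: the VCF lines are made-up, variants are compared on an exact (CHROM, POS, REF, ALT) key, and the helper name `variant_keys` is hypothetical. Note also that a site missing from the normal call set may simply lack coverage there, which this sketch does not check.

```python
def variant_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys from VCF text lines, skipping header lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # one key per ALT allele at multi-allelic sites
            keys.add((chrom, int(pos), ref, alt))
    return keys


# Hypothetical records, for illustration only.
normal = variant_keys([
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\t.",
    "chr2\t200\t.\tC\tT\t60\t.\t.",
])
tumor = variant_keys([
    "chr1\t100\t.\tA\tG\t55\t.\t.",
    "chr3\t300\t.\tG\tA\t70\t.\t.",
    "chr4\t400\t.\tT\tC\t40\t.\t.",
])
ips = variant_keys([
    "chr2\t200\t.\tC\tT\t65\t.\t.",
    "chr3\t300\t.\tG\tA\t72\t.\t.",
    "chr5\t500\t.\tA\tT\t45\t.\t.",
])

tumor_only = tumor - normal       # candidate somatic variants in the tumor
ips_only = ips - normal           # candidate somatic variants in the IPS sample
shared = tumor_only & ips_only    # candidates present in both tumor and IPS
```

With these toy records, `shared` holds the single chr3:300 G>A variant, while chr4:400 and chr5:500 are private to the tumor and IPS samples respectively.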

updated 2.9 years ago by Ram 43k • written 10.3 years ago by ivivek_ngs ★ 5.2k

In summary, yes, you're right. The VQSR step will "only" populate the FILTER column with PASS for each variant it considers to be a true positive. If you are going to perform all those filtering steps you will surely end up with a much reduced set of variants, but having the VQSR confirmation to increase your trust in your findings won't do any harm.
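Since the outcome of the step is recorded in the FILTER column, keeping only the variants marked PASS amounts to a one-field check. The following is a minimal Python sketch on made-up VCF lines; the function name `passing_records` and the filter label `EXAMPLE_FAIL` are placeholders rather than real GATK output.

```python
def passing_records(vcf_lines):
    """Return the data lines whose FILTER column (7th field) is exactly PASS."""
    kept = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        if fields[6] == "PASS":
            kept.append(line)
    return kept


# Hypothetical records, for illustration only.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t12\tEXAMPLE_FAIL\t.",
    "chr2\t300\t.\tG\tA\t90\tPASS\t.",
]

trusted = passing_records(records)
```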

updated 2.9 years ago by Ram 43k • written 10.3 years ago by Jorge Amigo 14k

Thank you very much for the reply. I did not run the VQSR step with the VariantRecalibrator walker of GATK; instead, I filtered on depth (DP) among the calls that passed the internal quality-control metrics. I selected only the SNPs from the raw variant file and then removed the variants with very low DP, which left 40K variants. I then annotated them with ANNOVAR in order to look only at the variants in exonic regions. In ANNOVAR's exome_summary.csv output I get only 18K variants (including non-synonymous, synonymous, stop-gain, stop-loss and unknown variants). The drop from 40K to 18K looked very suspicious to me, so I checked genome_summary.csv, where I see all 40K variants, but they include exonic, intronic, intergenic and UTR variants. I am a bit confused now. When I extract the SNPs for my sample with GATK's SelectVariants walker, I get 40K variants. Ideally these should all be in exonic regions, right? Or do they also include variants in intronic and other non-exonic regions? Is it plausible that the annotated variant count drops to 18K when restricted to exonic regions? Am I on the right path, or am I going wrong somewhere? As I explained in my previous reply, my aim is to assess the genetic consistency between the IPS and the tumor after subtracting the mutations found in the normal sample. Is this the right approach? I would appreciate some expert advice.
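The depth filter and the per-region tally described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the GATK or ANNOVAR implementation: the VCF lines are made-up, the threshold of 10 is an arbitrary placeholder, DP is read from the INFO column, and the region labels are supplied by hand rather than parsed from an ANNOVAR output file.

```python
from collections import Counter


def info_depth(info):
    """Return the DP value from a VCF INFO string, or None if it is absent."""
    for entry in info.split(";"):
        if entry.startswith("DP="):
            return int(entry[3:])
    return None


def filter_by_depth(vcf_lines, min_dp):
    """Keep data lines whose INFO DP is at least min_dp; lines without DP are dropped."""
    kept = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        info = line.rstrip("\n").split("\t")[7]
        dp = info_depth(info)
        if dp is not None and dp >= min_dp:
            kept.append(line)
    return kept


# Hypothetical records and threshold, for illustration only.
records = [
    "chr1\t100\t.\tA\tG\t50\t.\tAC=1;DP=3",
    "chr1\t200\t.\tC\tT\t60\t.\tAC=2;DP=25",
    "chr2\t300\t.\tG\tA\t70\t.\tDP=40;AC=1",
]
kept = filter_by_depth(records, min_dp=10)

# Tally hand-assigned region labels to see how many variants are exonic.
regions = ["exonic", "intronic", "exonic", "UTR3", "intergenic"]
region_counts = Counter(regions)
```

A tally like `region_counts` makes it easy to see how many of the filtered variants fall into each annotation category before restricting the analysis to exonic ones.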

updated 2.9 years ago by Ram 43k • written 10.3 years ago by ivivek_ngs ★ 5.2k