Dear All,
I have a few questions. Can I call variants with the HaplotypeCaller (HC) or UnifiedGenotyper (UG) walker of GATK without supplying a dbSNP file? I want to call variants separately for each of my normal, tumor and IPS samples and then find the mutations exclusive to my tumor and IPS samples. I am not particularly interested in identifying novel mutations. If I leave out the dbSNP file, is there any downstream impact other than not being able to distinguish novel from known variants? If any of you have done this, I would appreciate your suggestions.
Thank you very much for the reply. I understand that dbSNP is not used in any calculations, but is it advisable to skip the VQSR step, where training sets are used to identify the true-positive SNP calls? My plan is to call variants separately for my normal, tumor and IPS samples, then subtract the variants found in the normal sample from those found in tumor and IPS so that I deal only with somatic mutations, and finally compare tumor and IPS to assess the consistency of their mutational landscape. Can I skip the VQSR step in this workflow? As I understand it, in layman's terms, VQSR only shows how my known and novel variants sit within the clusters of the Gaussian mixture model, so that true positives can be separated from false positives. Is that right? I would like to have your suggestion, @Jorge Amigo.
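The subtraction described above can be sketched in Python as plain set arithmetic. This is only an illustration under stated assumptions: variants are keyed by (CHROM, POS, REF, ALT) taken from the first five VCF columns, the function names `variant_keys` and `compare_to_normal` are hypothetical, and the records are made-up toy data rather than real calls.

```python
def variant_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys from VCF text lines, skipping headers."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # A multi-allelic record contributes one key per ALT allele.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_to_normal(normal, tumor, ips):
    """Remove variants seen in normal, then intersect what remains."""
    tumor_only = tumor - normal
    ips_only = ips - normal
    return {
        "tumor_only": tumor_only,
        "ips_only": ips_only,
        "shared": tumor_only & ips_only,  # absent from normal, present in both
    }


# Toy records (hypothetical), first five VCF columns only.
normal_vcf = ["#CHROM\tPOS\tID\tREF\tALT", "chr1\t100\t.\tA\tG"]
tumor_vcf = ["chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT", "chr2\t50\t.\tG\tA"]
ips_vcf = ["chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT", "chr3\t75\t.\tT\tC"]

result = compare_to_normal(
    variant_keys(normal_vcf), variant_keys(tumor_vcf), variant_keys(ips_vcf)
)
print(sorted(result["shared"]))  # [('chr1', 200, 'C', 'T')]
```

Note that a variant missing from the normal call set is not necessarily somatic; it may simply have had too little coverage in the normal sample to be called.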
In summary, yes, you're right. The VQSR step will "only" populate the FILTER column with PASS for each variant that is considered to be a true positive. If you are going to perform all those filtering steps, you will surely end up with a much reduced set of variants, but having the VQSR confirmation to increase your trust in your findings won't do any harm.
Thank you very much for the reply. I did not run the VQSR step with the VariantRecalibrator walker of GATK; instead I filtered the calls that passed the internal quality-control metrics on depth (DP). I selected only the SNPs from the raw variant file and removed those with very low DP, which left 40K variants. I then annotated them with ANNOVAR, intending to look only at variants in exonic regions. The exome_summary.csv output of ANNOVAR contains only 18K variants (including non-synonymous, synonymous, stop-gain, stop-loss and unknown variants). This drop from 40K to 18K looked very suspicious to me, so I checked genome_summary.csv, where I see all 40K variants, annotated as exonic, intronic, intergenic or UTR. I am a bit confused now. The SelectVariants walker of GATK gives me 40K SNPs for my sample. Should these ideally all lie in exonic regions, or do they also include variants in intronic and other non-exonic regions? Is it plausible that only 18K of the annotated variants are exonic? As I explained in my previous reply, my aim is to assess the genetic consistency of the IPS with the tumor after subtracting the mutations found in the normal sample. Is this the right approach, or am I going wrong somewhere? I would appreciate some expert advice.
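For reference, the "SNPs only, then drop low DP" step described above could look roughly like the Python sketch below. It is an assumption-laden illustration, not the GATK implementation: it reads DP from the INFO column of plain-text VCF lines, treats a record as a SNP only when REF and every ALT allele are single bases, and uses a hypothetical threshold `min_dp=10` because the post does not state the cutoff actually used.

```python
def info_depth(info):
    """Return the DP value from a VCF INFO string, or None if it is absent."""
    for field in info.split(";"):
        if field.startswith("DP="):
            return int(field[3:])
    return None


def is_snp(ref, alt):
    """True when REF and every ALT allele are single A/C/G/T bases."""
    bases = {"A", "C", "G", "T"}
    return ref in bases and all(allele in bases for allele in alt.split(","))


def filter_snps_by_depth(vcf_lines, min_dp):
    """Keep header lines plus SNP records whose INFO DP is at least min_dp."""
    kept = []
    for line in vcf_lines:
        if not line.strip():
            continue
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt, info = fields[3], fields[4], fields[7]
        depth = info_depth(info)
        if is_snp(ref, alt) and depth is not None and depth >= min_dp:
            kept.append(line)
    return kept


# Toy records (hypothetical): a well-covered SNP, a low-depth SNP, and a deletion.
raw = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tAC=1;DP=35",
    "chr1\t200\t.\tC\tT\t40\t.\tAC=1;DP=4",
    "chr1\t300\t.\tCT\tC\t60\t.\tAC=1;DP=40",
]
filtered = filter_snps_by_depth(raw, min_dp=10)  # min_dp=10 is an assumed cutoff
print(len(filtered) - 1)  # number of records kept, excluding the header line
```

A filter like this makes no use of genomic region, so whatever survives it can fall anywhere the caller emitted a variant.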