Question

dbSNP annotation database and appropriate filtering in somatic variant calling pipelines

1

5.9 years ago

svlachavas ▴ 790

Dear Biostars,

I would like to ask a more general question about using external databases in somatic variant calling filtering pipelines. For example, in many scientific publications and in various forums, like here, it is mentioned that if a SNP has an rs number in the dbSNP database (https://www.ncbi.nlm.nih.gov/SNP/), it is mainly considered germline. Is that correct?

However, in one of our somatic variant filtering pipelines, prior to annotating with dbSNP, we filtered out any variants that had a MAF >= 0.01 in any of the following 4 population databases:

1000gp3, gnomad, ESP6500 and ExAC.
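A minimal sketch of the population-frequency filter described above, assuming each variant has already been annotated with one frequency per database. The record layout and the key names are hypothetical, and treating a missing frequency as "not observed in that database" is an assumption, not something stated in the pipeline.

```python
# Hypothetical record layout: one dict per variant, holding a minor allele
# frequency (or None when the variant is absent) for each population database.
POPULATION_DBS = ("1000gp3", "gnomad", "ESP6500", "ExAC")
MAF_CUTOFF = 0.01  # variants at or above this in ANY database are removed


def is_rare(variant, dbs=POPULATION_DBS, cutoff=MAF_CUTOFF):
    """Return True if the variant is below the cutoff in every database."""
    for db in dbs:
        maf = variant.get(db)
        # Assumption: a missing frequency means "not seen", so it passes.
        if maf is not None and maf >= cutoff:
            return False
    return True


def filter_rare(variants):
    """Keep only the variants that are rare in all population databases."""
    return [v for v in variants if is_rare(v)]


# Made-up example records, for illustration only.
example = [
    {"id": "var1", "1000gp3": 0.20, "gnomad": 0.18, "ESP6500": None, "ExAC": 0.19},
    {"id": "var2", "1000gp3": None, "gnomad": 0.0004, "ESP6500": None, "ExAC": None},
    {"id": "var3", "1000gp3": 0.002, "gnomad": 0.01, "ESP6500": None, "ExAC": 0.003},
]
kept = [v["id"] for v in filter_rare(example)]
print(kept)
```

Note that `var3` is dropped because a single database reaches the cutoff, which mirrors the "in any of the 4 databases" rule.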

Thus, in your opinion, could the variants that remained after the population filtering procedure, and that have an rs accession number, still be considered "germline"? Or, as they are definitely rare based on these populations, could they be considered somatic candidates?

For example, in an interesting publication for a variant calling pipeline, it is mentioned:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4561496/

"For dbSNP, we used the set of nonflagged variants (flagged variants are those for which SNPs <1% minor allele frequency [MAF; or unknown], mapping only once to reference assembly, or flagged as “clinically associated”)".

Or, additionally, does the database have extra information that could aid my understanding?

Just to add an important point: as a next stage after obtaining a list of "somatic variant candidates", especially for the SNPs, I would like to perform a subsequent analysis to interrogate these SNPs for their potential effect on TF binding and motif disruption. Thus, as I also need the rs information, I was wondering whether, with the above approach, this subset from dbSNP could be considered "somatic" candidates.

Thank you in advance,

Efstathios-Iason

dbSNP variant filtering somatic variant calling • 4.5k views

ADD COMMENT • link updated 5.8 years ago by Kevin Blighe 87k • written 5.9 years ago by svlachavas ▴ 790

score 5 · Accepted Answer · 2018-06-11

5

5.8 years ago

Kevin Blighe 87k

Yes, dbSNP is regarded as containing 'polymorphisms'. Depending on who you talk to, if that person has not stayed up to date with literature, etc., they may even assume that any variant that is listed in dbSNP is entirely harmless and unimportant in relation to pathogenicity and/or 'functionality'.

The truth is that dbSNP contains information on virtually all variant types, including somatic variants, but they are in the minority:

dbSNP accepts nucleotide variations submissions, including both mutations and polymorphisms. These submissions might include variation frequencies, and sample populations. Not only does dbSNP accept single nucleotide variations(trueSNP), it also accepts micro-satellites(STR), indels (insertion deletion variations) as well as multiple bases substitutions(MNP), and is planning to accept large structural variations as well. Please note that dbSNP even has accepted a small subset of somatic mutation submissions. The dbSNP Handbook includes a table describing all the variations accepted by, and included in, dbSNP.

[source: https://www.ncbi.nlm.nih.gov/books/NBK44447/]

Getting back to the study that you mentioned, they first removed [from dbSNP] variants that are listed as having a clinical association and those that have MAF<1% because these are assumed to be most likely functional / pathogenic. One could easily set the MAF lower, or even higher, and there would not be that much complaint.

Whilst saying this, even MAF 5% is not sufficient for some variants, as there are many statistically significant GWAS hits with MAFs >10% that have been shown to have clear functionality in relation to, for example, breast cancer.

Kevin

ADD COMMENT • link 4.8 years ago by Kevin Blighe 87k

0

Great and comprehensive answer as always, Kevin. Thus, overall, in your opinion, for my data and the goal described above:

If I proceed with the above filtering with MAF < 0.01 in the 4 aforementioned population databases and then, perhaps additionally, when annotating with dbSNP keep only those with a clinical reference, do you believe that the SNPs in my final list, even though various of them have an rs number, could be considered "putative somatic candidates", as a "small subset" of dbSNP meeting specific criteria? And could I use them (as well as the rest, of course, that do not have an rs number) for the analyses I have mentioned, like the TF binding motif analysis?

Best,

Efstathios-Iason

ADD REPLY • link 5.8 years ago by svlachavas ▴ 790

0

I am not confident that you could make that assumption, i.e., that they are putative somatic candidates. If you just filter dbSNP for those SNPs with MAF < 1% in a particular population and that are also 'clinically associated' (based on their presence in ClinVar), then you will still be left with many germline variants that increase / decrease risk for a particular disease. They are definitively germline, though, and are not being eliminated from the population based on evolution / fitness. They are obviously not lethal variants; they just alter the phenotype.

It is important to consider what is a somatic variant, in this context. In cancer, we know what it is, but what is a somatic variant in population genetics? Our DNA is being mutated 'all of the time'.

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

0

Dear Kevin,

Yes, I definitely agree about the context and concept of somatic variants. Just to complement my information from above: besides the stringent population filtering on these databases, there are also extra filters implemented, for example removing synonymous mutations, applying SnpSift with the CADD score, etc., and the last step is just annotating the remaining final list of variants with dbSNP.

Moreover, the samples that I'm currently discussing are paired for each patient (both cancer sample and adjacent normal peripheral blood).

Overall, based on this updated information about my pipeline, are any of the variants in the final list with an rs number still "spurious"? And if I perform a TF binding site disruption analysis, would it be possible to focus on and prioritize the SNPs that do not have an rs number?

Best,

Efstathios

ADD REPLY • link 5.8 years ago by svlachavas ▴ 790

1

Okay, so, they are already somatic variants from Tumour Vs Normal comparisons, and then you annotate with dbSNP. I see.

Regarding SnpSift and CADD, you may consider some other in silico prediction tools, such as these: A: pathogenicity predictors of cancer mutations

All of these listed under 'Non-coding (i.e. regulatory)' already take conservation, TF binding, and other variables into account when calculating their scores.

Also, when you think about it, it is probably a very common event whereby a 'reference' base in a normal sample is mutated to a dbSNP SNP in the tumour. Most of these should also have COSMIC IDs.

Thus, I do not think that you can regard the final variants that still have an rs ID as 'spurious'. They may just be passenger mutations.
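To illustrate the point, here is a rough sketch in which somatic status comes from the Tumour Vs Normal comparison, and the rs ID is carried along as an annotation rather than used as a filter. The record fields (`in_tumour`, `in_normal`, `rs_id`) and the example IDs are hypothetical, and real somatic callers make this decision statistically rather than with a simple presence/absence flag.

```python
def label_call(call):
    """Add a status and a dbSNP flag to one call from a matched pair."""
    if call["in_normal"]:
        status = "germline"           # seen in the patient's normal sample
    elif call["in_tumour"]:
        status = "somatic_candidate"  # seen in the tumour only
    else:
        status = "not_called"
    labelled = dict(call)
    labelled["status"] = status
    # The rs ID is kept as information; it does not change the status.
    labelled["in_dbsnp"] = call.get("rs_id") is not None
    return labelled


# Made-up calls with placeholder rs IDs, for illustration only.
calls = [
    {"id": "var1", "in_tumour": True, "in_normal": True, "rs_id": "rs_example_1"},
    {"id": "var2", "in_tumour": True, "in_normal": False, "rs_id": "rs_example_2"},
    {"id": "var3", "in_tumour": True, "in_normal": False, "rs_id": None},
]
labelled = [label_call(c) for c in calls]
somatic = [c for c in labelled if c["status"] == "somatic_candidate"]
somatic_with_rs = [c["id"] for c in somatic if c["in_dbsnp"]]
somatic_without_rs = [c["id"] for c in somatic if not c["in_dbsnp"]]
print(somatic_with_rs, somatic_without_rs)
```

Splitting the somatic candidates into the two lists at the end keeps both groups available for downstream analysis, so the rs-annotated ones are not silently discarded.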

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

0

Dear Kevin, thanks for the update and also for the very useful link. A lot of tools are included, and some of these we have already implemented in our pipeline.

Just to mention one extra important thing: the aforementioned data I refer to here are exome sequencing data. So, the point of assessing disruption of TF binding motifs based on SNPs comes from my idea of identifying any common exonic TF binding sites that are disrupted (i.e. exonic enhancers?) in the 3 patients that I have for a small initial cohort. So, perhaps based on your last comment, I could include all the variants in the final list and inspect the results.

One extra comment: in your link, you included CADD in the category of "non-coding" and also mention germline variants. However, do you think that it is not applicable or appropriate in our somatic variant annotation pipeline, where we assess the deleteriousness of putative somatic variants as an extra specific filter concerning their impact?

ADD REPLY • link 5.8 years ago by svlachavas ▴ 790

1

From what I understand, CADD is a 'general purpose' prediction program that can be used on any type of single nucleotide variant / polymorphism. Due to the way that it was developed, though, I just state that it is for germline variants. It was developed by looking at variants that differ between humans and our ape ancestors, i.e., variants that have brought about the phenotypic changes that make us human. So, it is definitively more about phenotype change and not necessarily pathogenicity.

Some of the other programs are specifically for somatic variants and look at pathogenicity, due to the fact that they are trained on, for example, ClinVar pathogenic variants.

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

0

Ok Kevin, got your point. We are actually using the CADD score through SnpSift in order to select the most deleterious variants, so I guess in this context our approach is valid.

Nevertheless, I will also take a detailed look at the other specific programs that you have mentioned.
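For reference, a small sketch of what "selecting the most deleterious variants by CADD score" could look like once the score sits in the VCF INFO column. This is plain Python, not SnpSift itself; the INFO key `CADD_PHRED` and the threshold of 20 are illustrative assumptions, and the actual key depends on how the annotation was added.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'DP=50;CADD_PHRED=23.4' into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # flags carry no value
    return fields


def cadd_phred(info, key="CADD_PHRED"):
    """Return the highest score stored under `key`, or None if absent."""
    raw = parse_info(info).get(key)
    if raw is None or raw is True:
        return None
    # A field may hold several comma-separated values; '.' means missing.
    scores = [float(x) for x in raw.split(",") if x not in ("", ".")]
    return max(scores) if scores else None


def passes_cadd(info, threshold=20.0, key="CADD_PHRED"):
    """True when a score is present and at or above the (assumed) threshold."""
    score = cadd_phred(info, key)
    return score is not None and score >= threshold


# Made-up INFO strings, for illustration only.
for info in ("DP=50;CADD_PHRED=23.4", "DP=10;CADD_PHRED=3.1", "SOMATIC;DP=12"):
    print(info, passes_cadd(info))
```

Variants without a score are excluded here; whether to keep or drop unscored variants is a design choice that should be made explicitly in a real pipeline.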

ADD REPLY • link 5.8 years ago by svlachavas ▴ 790