Find Rare SNPs (dbSNP/clinVar)
8.6 years ago

Hey guys,

I'm currently working on my final project for college, and it's about finding SNPs. The basic idea is to implement a method to find functional and rare SNPs based on dbSNP and ClinVar. I'm only allowed to use the downloaded VCF files of these two databases. Does anybody have related experience, especially with the best and most efficient way to find rare SNPs in those VCF files? Some of the SNPs come with a MAF (Minor Allele Frequency), but since plenty have no MAF, I'm afraid I will miss a lot if I filter on MAF alone. Ideas?

SNP dbSNP clinVar • 3.2k views

Can you clarify a bit more what you mean by finding SNPs? If you are working ONLY with dbSNP and ClinVar, then all of the entries in those databases ARE SNPs (unless there are also CNVs in ClinVar now). If you are searching those databases for rare SNPs, then the only thing that defines how rare a SNP is is its minor allele frequency (<1% or <0.5% are the typical cut-offs).

If you are working with an exome or whole-genome sequence and need to find rare variants in that dataset, using only dbSNP and ClinVar as resources, then that is a different story: it amounts to following some sort of best-practices protocol (GATK has one) for calling variants and filtering them. Of course, if you are doing that, dbSNP and ClinVar are only starting points. There are other population frequency databases for SNPs whose frequencies aren't necessarily included in dbSNP.


The point is to cut out common SNPs (for example, those with a MAF >5%) and find those which are not common (based on dbSNP, and later maybe more databases) and potentially pathogenic (based on ClinVar). I'm focusing only on exome data, so the first step would be to filter the common ones out using the databases.
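To make the cut-off concrete, here is a minimal sketch of that first step in plain Python. It assumes the dbSNP VCF carries allele frequencies in a `CAF` INFO tag (REF first, then each ALT, with `.` for a missing value); the tag name varies between dbSNP builds, so treat it as an assumption and check the header of your file. The function names and the "unknown" bucket are my own, not part of any tool. Records without usable frequencies are kept in a separate bucket rather than silently dropped, which addresses the worry about missing MAFs.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def minor_allele_frequency(info_fields, tag="CAF"):
    """Return the second-highest allele frequency, or None if it is unknown.

    Assumes `tag` holds comma-separated frequencies, REF first (an assumption
    about the dbSNP build; adjust the tag name to match your VCF header).
    """
    raw = info_fields.get(tag)
    if raw is None or raw is True:
        return None
    known = [float(v) for v in raw.split(",") if v not in (".", "")]
    if len(known) < 2:
        return None
    return sorted(known, reverse=True)[1]


def classify(vcf_line, max_maf=0.05):
    """Label one VCF data line as 'common', 'rare', or 'unknown'."""
    columns = vcf_line.rstrip("\n").split("\t")
    maf = minor_allele_frequency(parse_info(columns[7]))
    if maf is None:
        return "unknown"
    return "common" if maf > max_maf else "rare"


def bucket_records(lines, max_maf=0.05):
    """Group record IDs (column 3) by their rarity label, skipping headers."""
    buckets = {"common": [], "rare": [], "unknown": []}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        buckets[classify(line, max_maf)].append(line.split("\t")[2])
    return buckets
```

For a real file you would pass an open file handle (for example from `gzip.open(path, "rt")`) as `lines`. What to do with the "unknown" bucket is a project decision: keeping it is conservative, dropping it is the aggressive option.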


OK, the key point there was the last little bit: that you are working with exome data. You didn't specify that in your original post, which made it ambiguous. You could have been looking through the databases themselves and just doing filtering or data reduction.

8.6 years ago
DG 7.3k

You're right that filtering on dbSNP alone won't remove all of the common SNPs: its minor allele frequencies are incomplete because they come from only a limited set of datasets. The Exome Aggregation Consortium (ExAC) dataset, for instance, provides exome-based allele frequencies from over 60,000 samples. I routinely see variants that have no minor allele frequency in dbSNP (not in the 1000 Genomes or Exome Variant Server sets) yet are relatively common in ExAC. That said, filtering on rarity is typically done purely on minor allele frequency. You can also be more aggressive and filter out any site that appears in dbSNP at all, but this usually goes a little too far, as there are some legitimate rare SNPs in dbSNP.

There are a ton of tools out there to annotate exome data, but they use more than just these VCF files. So for your project you may want to look at vcftools, or a wrapper around it, to annotate your exome data based on the downloaded VCFs. You could also use bedtools and pybedtools, or PyVCF, to do VCF-file manipulations, etc.
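If you would rather see what such an annotation step boils down to before picking a tool, here is a rough sketch using only the Python standard library. It matches exome variants to database records on (chromosome, position, REF, ALT) and reports whether each one is in dbSNP and what ClinVar says about it. The function names are hypothetical, and it assumes the ClinVar VCF exposes clinical significance in a `CLNSIG` INFO tag, that all files use the same chromosome naming, and that indels are represented the same way everywhere. None of that is guaranteed, so check your files.

```python
def index_vcf(lines):
    """Map (chrom, pos, ref, alt) -> INFO dict, one entry per ALT allele."""
    index = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts, info = (
            columns[0], int(columns[1]), columns[3], columns[4], columns[7]
        )
        fields = {}
        for item in info.split(";"):
            key, _, value = item.partition("=")
            fields[key] = value
        for alt in alts.split(","):
            index[(chrom, pos, ref, alt)] = fields
    return index


def annotate(sample_lines, dbsnp_index, clinvar_index):
    """Yield one annotation dict per ALT allele found in the sample VCF."""
    for line in sample_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = columns[0], int(columns[1]), columns[3], columns[4]
        for alt in alts.split(","):
            key = (chrom, pos, ref, alt)
            yield {
                "variant": key,
                "in_dbsnp": key in dbsnp_index,
                # None means the variant has no ClinVar record at all.
                "clnsig": clinvar_index.get(key, {}).get("CLNSIG"),
            }


def pathogenic_candidates(annotations):
    """Keep annotations whose ClinVar significance mentions 'pathogenic'."""
    return [
        a for a in annotations
        if a["clnsig"] is not None
        and "pathogenic" in a["clnsig"].lower()
        and "benign" not in a["clnsig"].lower()
    ]
```

Holding the whole of dbSNP in a dictionary will not fit in memory on most machines, so in practice you would index only the positions present in your exome, or go through the sorted files in parallel. The sketch is only meant to show the matching logic.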

8.6 years ago

Finding statistically significant, rare SNPs is no easy feat. The most common way I know of to detect them is using GATK. If you're trying to come up with a novel method, reading the docs there might be at least a good starting point.
