Biostar Beta. Not for public use.
Question: Snp Indel Discovery - Critique A Proposed Workflow
13
Entering edit mode

Hi all,

I am very new to this kind of work so I would appreciate some opinions on the early stages of a SNP/Indel discovery workflow.

  1. Formulate list of genes of interest.
  2. Design custom capture for all exons from these genes.
  3. Sequence (Illumina, Paired End, 100bp reads, 50x coverage minimum)
  4. Remove duplicate paired end reads.
  5. Align to genome (Bfast to allow gaps and find indels as well as SNPs)
  6. Use SAMTools mpileup with varfilt to remove SNPs with a quality score less than 20 or indels with a score less than 50.
  7. Remove SNPs with low coverage (less than 30x?)
  8. Proceed to association type study (details to be worried about later - my major concern is the upstream, NGS stuff at the moment as I have never done it before).

All comments welcome. Thanks in advance!

ADD COMMENTlink 8.8 years ago Travis ♦ 2.8k • updated 8.8 years ago Doctoroots • 780
Entering edit mode
0

Hi Travis. This all looks well thought of and solid to me. I don't know what I would add other than discussing assembly details and such. Would you have more specific questions about some parts of this workflow?

ADD REPLYlink 8.8 years ago
Eric Normandeau
10k
Entering edit mode
0

In case you are working with tumors, I would add that you should sequence the germline matched DNA to substract the germline snps and be able to distinguish the somatic events.

ADD REPLYlink 8.8 years ago
toni
♦ 2.1k
Entering edit mode
0

Thanks a lot for the responses guys!

Tony: This study won't be cancer-related. Perhaps a good thing as it sounds like that complicates things.

Eric: More specifics are welcome. I am a complete newbie - I have never run any of these tools before!

ADD REPLYlink 8.8 years ago
Travis
♦ 2.8k
Entering edit mode
0

Minor detail: it's easier to remove duplicates after aligning.

ADD REPLYlink 8.8 years ago
Mikael Huss
4.6k
Entering edit mode
0

Come to think of it, I don't even know.how to remove duplicate pairs yet! Any pointers? :)

ADD REPLYlink 8.8 years ago
Travis
♦ 2.8k
Entering edit mode
0

I am not sure how removing exact paired-end duplicates is going to change anything. Moreover, if you do exon capture, you may end up with too short sequences to do paired-end, at-least if you use an array with oligos representing your exons on it. Since you have no specific question in there, it is a bit hard to discuss specific details.

ADD REPLYlink 8.8 years ago
Eric Normandeau
10k
Entering edit mode
0

Dont for get:

7.5. What to do when we got _too many_ / _no sensible_ variants.

ADD REPLYlink 8.8 years ago
brentp
23k
6
Entering edit mode

You may find the Broad's Best Practices for Variant Detection of some use. You can do duplicate detection with Picard's MarkDuplicates.

One step you're missing is local realignment around InDels. Quoting the Broad:

The Multiple Sequence Alignment (MSA) realignment tool is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. In general, a large percent of regions requiring local realignment are due to the presence of an insertion or deletion (indels) in the individual’s genome with respect to the reference genome (see Figure). Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken as SNPs. Moreover, since read mapping algorithms operate on each read independently, it is impossible to place reads on the reference genome such at mismatches are minimized across all reads. Consequently, even when some reads are correctly mapped with indels, reads covering the indel near just the start or end of the read are often incorrectly mapped with respect the true indel, also requiring realignment.

ADD COMMENTlink 8.8 years ago David Quigley 11k
Entering edit mode
0

Excellent pointers and I'll definitely look in that direction! A great help :)

ADD REPLYlink 8.8 years ago
Travis
♦ 2.8k
4
Entering edit mode

I agree that your workflow is indeed comprehensive and well thought. some additional considerations to add to those stated by David:

i would consider setting a lower limit for SNP inclusion (less than X30). you should be mindful that setting a higher lower limit was shown to decrease sensitivity without a substantial gain in specificity and should be considered in your specific experimental context.

comparing your called SNPs against the dbSNP data base is also considered good practice, if the proportion of novel SNPs in your data is suspiciously high (>10% in coding regions), you should consider setting a higher quality bar.

SNP calling quality could also be assessed by checking the Transition/Transversion ratio which is expected to be around 3.3 in whole exome sequencing.

ADD COMMENTlink 8.8 years ago Doctoroots • 780
Entering edit mode
0

These are changes and additions I'll definitely take into account. The sequence provider was offering x150 coverage on our selected genes! Do you think I should perhaps increase the number of genes/decrease the coverage?

ADD REPLYlink 8.8 years ago
Travis
♦ 2.8k
Entering edit mode
0

Hi Travis, i believe x150 is more than enough, and if you have additional genes you are interested in, you should consider adding them at the cost of lowering the coverage (financially speaking). These 2 papers: http://nar.oxfordjournals.org/content/36/16/e105.short http://nar.oxfordjournals.org/content/35/15/e97.short claim coverage of x20 is sufficient for SNP calling. i would say that if you extend this to x50 you obtain sufficient coverage.

ADD REPLYlink 8.8 years ago
Doctoroots
• 780

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0