Question

How To Align Reads Obtained From Sequence Capture

4

14.0 years ago

Istvan Albert 100k

I need to align a large number of reads obtained from a custom exon capture array to a reference. Sequence capture is a technique that lets you specify which sequences to retain for sequencing.

One possibility is to map the reads to the whole genome and then subselect the regions that correspond to the exons in question, but that seems like a lot of unnecessary work.

I could also build a custom reference for this, in which each exon gets its own sequence ID (there will be about 10K of them). In this case I am concerned that most short-read aligners may be optimized to treat the reference genome as a few long sequences rather than many short ones.
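To make the custom-reference idea concrete, here is a minimal sketch of cutting one FASTA record per exon out of a genome. It assumes the exon coordinates are available as BED-style lines (0-based, half-open) and that the genome fits in memory; the function names `read_fasta`, `exon_reference`, and `to_fasta`, the `flank` parameter, and the toy data are all hypothetical and only there for illustration.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of {sequence_id: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence id.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def exon_reference(genome, bed_lines, flank=0):
    """Yield (exon_id, sequence) for each BED interval (0-based, half-open).

    `flank` optionally pads each exon on both sides (an assumption, so that
    reads hanging over an exon edge still have something to align to).
    """
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start = max(0, int(start) - flank)
        end = min(len(genome[chrom]), int(end) + flank)
        yield f"{chrom}:{start}-{end}", genome[chrom][start:end]


def to_fasta(records, width=60):
    """Format (id, sequence) pairs as FASTA text, wrapped at `width` columns."""
    out = []
    for name, seq in records:
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Toy data: a 20 bp "genome" and two "exons".
genome = read_fasta([">chr1 toy", "ACGTACGTAC", "GTTTGGCCAA"])
bed = ["chr1\t2\t8", "chr1\t12\t18"]
custom_fasta = to_fasta(exon_reference(genome, bed))
print(custom_fasta)
```

Naming each record by its genomic coordinates keeps it possible to translate alignments back to the genome later. Whether a given aligner copes well with roughly 10K short reference sequences is something to test rather than assume.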

Other suggestions and tips are welcome. Thanks.

exome-sequencing next-gen-sequencing sequence • 7.1k views

ADD COMMENT • link updated 5 months ago by Ram 43k • written 14.0 years ago by Istvan Albert 100k

Ram · Answer 1 · 2010-04-21

I would suggest doing whole genome alignments of enriched / target capture sequences. Then the resulting alignments should be filtered for only those reads mapping uniquely within the target region.

If you map to just the target regions, you are likely to get many false placements that can skew later analysis (SNP calls, etc.).

The reasoning is this: if your source DNA came from the whole genome, then the capture process likely also captured some sequences that come from non-exon regions. If you then map to just the target region, these reads will be forced onto the target as lower-quality alignments rather than being correctly excluded.
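The map-to-genome-then-filter approach described above could be sketched roughly as follows. This is only an illustration working on SAM text lines with the standard library: it assumes the targets are given as BED-style intervals, it treats a mapping quality threshold (`min_mapq`, a hypothetical parameter) as a stand-in for "uniquely mapped", which depends on how your aligner assigns MAPQ, and it keeps reads that overlap a target rather than requiring full containment.

```python
import re
from collections import defaultdict


def load_targets(bed_lines):
    """Read BED-style lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    targets = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end = line.split()[:3]
        targets[chrom].append((int(start), int(end)))
    return targets


def reference_span(pos, cigar):
    """Return the 0-based half-open reference span of an alignment.

    Only CIGAR operations that consume the reference (M, D, N, =, X) count.
    """
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    length = sum(int(n) for n, op in ops if op in "MDN=X")
    return pos - 1, pos - 1 + length


def on_target_unique(sam_lines, targets, min_mapq=20):
    """Yield SAM records that are primary, confidently placed, and on target."""
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag, chrom, pos, mapq, cigar = int(f[1]), f[2], int(f[3]), int(f[4]), f[5]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        if mapq < min_mapq or cigar == "*":
            continue
        start, end = reference_span(pos, cigar)
        if any(start < t_end and t_start < end
               for t_start, t_end in targets.get(chrom, ())):
            yield line


# Toy data: one target and six reads.
targets = load_targets(["chr1\t120\t200"])
sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t101\t60\t50M\t*\t0\t0\t*\t*",     # overlaps target, high MAPQ
    "r2\t0\tchr1\t101\t0\t50M\t*\t0\t0\t*\t*",      # overlaps target, MAPQ 0
    "r3\t0\tchr1\t501\t60\t50M\t*\t0\t0\t*\t*",     # off target
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",             # unmapped
    "r5\t256\tchr1\t131\t60\t50M\t*\t0\t0\t*\t*",   # secondary alignment
    "r6\t0\tchr1\t191\t60\t10S40M\t*\t0\t0\t*\t*",  # soft-clipped, overlaps target edge
]
kept = [line.split("\t")[0] for line in on_target_unique(sam, targets)]
print(kept)
```

The linear scan over targets is fine for a toy case; with about 10K exons and tens of millions of reads you would want sorted intervals with binary search, or an existing tool that does region filtering on indexed alignments.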

I admit that if your source material is mRNA from a transcriptome library, then a target-only reference sequence may have some merit. In that case you must also worry about exon boundaries and rearrangements, so I doubt this applies in your case.

I found the article "Targeted capture and massively parallel sequencing of 12 human exomes", Shendure et al., to be quite informative with regard to methods.

score 1 · Answer 2 · 2010-04-20

1

14.0 years ago

Eric Normandeau 11k

Hi Istvan,

We've had success in the lab doing de novo assemblies of the transcriptome (cDNA), then saving the consensus contigs from the assembly (about 10,000 of them) and using them as a 'reference genome' to which we align millions of sequences.

To do this, we have successfully used the non-free software CLC Genomics Workbench but, given that you are talking about aligning a few thousand sequences instead of millions, I suppose many free short-read tools, such as MIRA, could do the trick for you.

It would then be pretty easy, using for example Biopython, to pass through the created contigs and keep only those for which there is enough information, let's say a minimum number of sequences aligned to the consensus.
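As a rough sketch of that filtering step: the snippet below counts aligned reads per contig and keeps the contigs that reach a minimum. It uses only the standard library rather than Biopython, and it assumes the alignments have been exported as SAM text; the function names, the `min_reads` threshold, and the toy records are hypothetical.

```python
from collections import Counter


def reads_per_contig(sam_lines):
    """Count primary, mapped alignments per reference sequence in SAM text."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.split("\t")
        flag, contig = int(fields[1]), fields[2]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x4 or flag & 0x900:
            continue
        counts[contig] += 1
    return counts


def well_supported(contig_ids, counts, min_reads=10):
    """Return the contigs with at least `min_reads` aligned reads, in input order."""
    return [c for c in contig_ids if counts[c] >= min_reads]


# Toy data: three reads on contig_1, one on contig_2, one unmapped.
sam = ["@SQ\tSN:contig_1\tLN:500"]
sam += [f"r{i}\t0\tcontig_1\t{1 + i}\t60\t36M\t*\t0\t0\t*\t*" for i in range(3)]
sam += [
    "r3\t0\tcontig_2\t5\t60\t36M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
counts = reads_per_contig(sam)
kept = well_supported(["contig_1", "contig_2", "contig_3"], counts, min_reads=2)
print(kept)
```

The list returned by `well_supported` can then be used to select the matching records from the contig FASTA file, for example with Biopython's `SeqIO`.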

Does this approach seem appropriate for your problem?

Cheers

ADD COMMENT • link 14.0 years ago by Eric Normandeau 11k

0

The 10K are the target exons; there are far more reads, about 60 million. But I will give this a shot, as it does seem like the best choice.

ADD REPLY • link 14.0 years ago by Istvan Albert 100k

0

Sorry, I misunderstood you. What you describe is then very similar in magnitude to what we did. Aligning 1.5 million reads to 10K consensus sequences took about half an hour on my desktop computer, using only one 2.66 GHz Core 2 processor core, with 8 GB of RAM, under a 64-bit OS.

ADD REPLY • link 14.0 years ago by Eric Normandeau 11k