Question

Basics of De Novo Assemblers

5

12.9 years ago

Aboozar ▴ 50

Hi, I'd appreciate your help with the following questions:

1. Why should we map our reads to a reference genome?
2. How does de novo assembly software work without a reference genome? For example, does it use other sources such as ESTs or GSSs as its reference, or does it work only from the overlapping parts of the reads?

next-gen sequencing • 2.9k views

ADD COMMENT • link updated 10.4 years ago by Biostar 20 • written 12.9 years ago by Aboozar ▴ 50

0

It would be better to split the two questions into two separate questions.

ADD REPLY • link 12.9 years ago by Jan Van Haarst ▴ 300

0

Hi Aboozar. Welcome to Biostars! I would like to point out that, without much more context, it is very unlikely that forum users will be able to give you a sensible answer. Please take some time to tell us what kind of data you have, what your experimental design is, and exactly what you are trying to accomplish. You will get much more useful answers that way, and those answers will in turn help others who have similar questions. Cheers

ADD REPLY • link 12.9 years ago by Eric Normandeau 11k

score 6 · Answer 1 · 2011-05-16

6

12.9 years ago

Jeremy Leipzig 22k

Yes, it's always easier to solve a jigsaw puzzle by looking at the box. When you don't have the box you need to compare the pieces themselves. This becomes onerous with so many pieces of different lengths and shapes, so most modern assemblers chop the pieces up into even smaller squares. This might seem ridiculous, but it allows the puzzle to be solved using an index rather than comparing a billion pieces with each other.
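To make the "smaller squares" idea concrete, here is a minimal sketch of a k-mer index. The function names (`build_kmer_index`, `reads_sharing_kmers`), the toy reads, and the choice of k=4 are all made up for illustration; real assemblers use far more compact data structures. The point is only that reads sharing a fixed-length k-mer land in the same dictionary bucket, so candidate overlaps fall out of a lookup instead of an all-against-all comparison.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map each k-mer to the (read_id, offset) positions where it occurs."""
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((read_id, i))
    return index

def reads_sharing_kmers(index):
    """Return pairs of read ids that share at least one k-mer."""
    pairs = set()
    for hits in index.values():
        ids = sorted({read_id for read_id, _ in hits})
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                pairs.add((ids[a], ids[b]))
    return pairs

# Toy reads (hypothetical): the first two overlap on "GTAC".
reads = ["ACGTAC", "GTACGG", "TTTTTT"]
index = build_kmer_index(reads, k=4)
print(index["GTAC"])               # [(0, 2), (1, 0)]
print(reads_sharing_kmers(index))  # {(0, 1)}
```

Because every key has the same length, the lookup is an exact hash match; that is what fixed-length pieces buy you over variable-length reads.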

ADD COMMENT • link 12.9 years ago by Jeremy Leipzig 22k

0

Nice answer! Why is it not possible to make an index of the reads without chopping them into (fixed-length) k-mers?

ADD REPLY • link 10.4 years ago by Irsan ★ 7.8k

score 4 · Answer 2 · 2011-05-16

4

12.9 years ago

Pierre Lindenbaum 161k

If you want to detect mutations, you need to know where the reads map.
The Velvet assembler uses the overlapping parts of the short reads (a de Bruijn graph).
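As a rough illustration of what a de Bruijn graph built from short reads looks like, here is a toy sketch. It is not Velvet's implementation: the function name `de_bruijn_graph`, the reads, and k=4 are assumptions for the example. Each k-mer contributes one edge from its first k-1 bases to its last k-1 bases, so reads that overlap end up sharing nodes.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Return edges prefix -> {suffixes}, one edge per distinct k-mer."""
    graph = defaultdict(set)
    for seq in reads:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Nodes are (k-1)-mers; the k-mer itself is the edge.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy overlapping reads (hypothetical) sampled from ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

No read is ever compared directly with another read here; the overlaps are implied by the shared (k-1)-mer nodes.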

ADD COMMENT • link 12.9 years ago by Pierre Lindenbaum 161k

score 3 · Answer 3 · 2011-05-16

My crack at question 1:

If I'm fairly certain that my current sequencing data has a high similarity to an already published reference genome, it's a lot faster to align to a reference genome than it is to try for a de novo assembly. De novo assembly will also require lots more RAM (I think upwards of 64 GB for assembly vs 3-5 GB for mapping) and perhaps deeper coverage depending on your technology.

However, you do impose a prior expectation of what you think the entire genome looks like, and that expectation will not be entirely accurate (along the lines of what Jeremy said in his answer). While >99% of any human genome (for instance) may be identical to hg19, the fraction that doesn't match could have interesting features such as indels or structural rearrangements that you may miss. Short read aligners can only tell you what resembles your reference genome, not the inevitable parts that differ too much. Reads from those parts will simply be left unmapped.

One approach to this problem I've heard of, but don't have much experience with, is to first do a short-read alignment and knock your >99% matching to hg19 out of the way, then attempt an assembly on the remaining reads to find the structural features represented by the high-quality unmapped reads. It looks like this might be called comparative genome assembly. There are also programs like BreakSeq out there that will specifically look for structural rearrangements.

These approaches are computationally cheaper than a full assembly and take advantage of a reference sequence, while still acknowledging that any genome has unique structural features.
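For the align-first, assemble-the-rest idea, the first step is pulling the unmapped reads out of the aligner's output. Here is a minimal sketch that assumes the alignments are available as SAM text lines and that base qualities are Phred+33 encoded. The function name `unmapped_reads`, the quality cutoff of 20, and the example records are my own choices for illustration, not part of any particular pipeline. In SAM, bit 0x4 of the FLAG field marks a read as unmapped.

```python
def unmapped_reads(sam_lines, min_mean_quality=20):
    """Yield (name, sequence, quality) for unmapped reads in SAM text.

    min_mean_quality is an illustrative cutoff on the mean Phred score,
    assuming Phred+33 encoded quality strings.
    """
    UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if not flag & UNMAPPED:
            continue
        if qual != "*":  # "*" means qualities were not stored
            mean_q = sum(ord(c) - 33 for c in qual) / len(qual)
            if mean_q < min_mean_quality:
                continue
        yield name, seq, qual

# Hypothetical records: r1 is mapped, r2 is unmapped and high quality,
# r3 is unmapped but low quality.
sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t6M\t*\t0\t0\tACGTAC\tIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCATT\tIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTTT\t######",
]
print(list(unmapped_reads(sam)))  # [('r2', 'GGCATT', 'IIIIII')]
```

The surviving reads could then be written out as FASTQ and handed to an assembler of your choice.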

score 1 · Answer 4 · 2011-05-19

Mapping is done because genomes such as human and mouse are almost complete and well annotated, so most sequencing-based questions can be answered by aligning the reads against the respective genome. Computationally, it is much easier and more efficient to align reads than to do a de novo assembly. It is also much easier to parallelize an aligner than an assembler, and assemblers require an order of magnitude more memory than aligners.
De novo assembly is useful if you do not have a reference genome to start with (e.g. some exotic fish). The assembler builds a graph of the overlapping parts of the reads (k-mers), then finds 'long paths' within this graph and reports them as contigs. Assembling a mammalian genome is quite expensive and requires > 30x coverage. Even then, repeats and mis-assemblies cause problems that can be difficult to correct. Ideally you would need long reads, or at least paired-end reads with a long insert size, to get good assembly results.
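To show what "finding long paths and reporting them as contigs" can mean, here is a toy sketch that walks the non-branching paths of a k-mer graph. It is a simplification under stated assumptions: error-free reads, a single strand, no coverage information, and isolated cycles are ignored. The names `build_graph`, `walk_contigs`, and `assemble` are hypothetical, and no real assembler is this naive. The second example shows how a fork in the graph, as a repeat or a variant would cause, breaks one contig into several.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Edges between (k-1)-mers, one per distinct k-mer, plus in-degrees."""
    out_edges = defaultdict(set)
    in_degree = defaultdict(int)
    for seq in reads:
        for i in range(len(seq) - k + 1):
            left, right = seq[i:i + k - 1], seq[i + 1:i + k]
            if right not in out_edges[left]:
                out_edges[left].add(right)
                in_degree[right] += 1
    return out_edges, in_degree

def walk_contigs(out_edges, in_degree):
    """Report maximal non-branching paths as contigs."""
    def is_simple(node):
        # Exactly one way in and one way out: safe to walk through.
        return in_degree.get(node, 0) == 1 and len(out_edges.get(node, ())) == 1

    contigs = []
    for node in sorted(set(out_edges) | set(in_degree)):
        if is_simple(node):
            continue  # contigs only start at path ends or branch points
        for nxt in sorted(out_edges.get(node, ())):
            contig = node + nxt[-1]
            while is_simple(nxt):
                nxt = next(iter(out_edges[nxt]))
                contig += nxt[-1]
            contigs.append(contig)
    return contigs

def assemble(reads, k):
    return walk_contigs(*build_graph(reads, k))

# Reads that tile one sequence give a single contig.
print(assemble(["ATGGCG", "GGCGTG", "CGTGCA"], 4))  # ['ATGGCGTGCA']
# A fork after GGCG splits the result into three contigs.
print(assemble(["ATGGCG", "GGCGTG", "GGCGAA"], 4))  # ['ATGGCG', 'GCGAA', 'GCGTG']
```

Longer reads or paired-end information are what let a real assembler decide which branch to follow at such forks.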