Question

A Question About Hybrid/Mixed Genome Assembly

11.1 years ago

Lhl ▴ 760

Hi There,

Recently I have been trying to improve a genome assembly. It is a plant genome. It was first assembled using 454 data and then assembled again using Illumina data.

I tried to do the job using two strategies. The first one is to work from the beginning by mixing raw reads of both types using de novo assemblers like Velvet and Ray. I call this the direct hybrid assembly. I also tried to further combine the assemblies from both assemblers using a third assembler.

The second one is to assemble the 454 reads using Newbler (i.e., GS de novo assembler) and the Illumina reads using Velvet. The two assemblies were then merged using a third assembler. I call this the stepwise hybrid assembly approach.

I found that the first strategy produced more misassemblies (assessed by comparing scaffolds to protein sequences) than the second one.

I also found that when I further combined the two assemblies produced by the two assemblers in strategy one (one assembler is better than the other, based on my assessment), even more erroneous assemblies were produced.

Could anyone suggest potential reasons for this?

Many thanks.

Lhl

assembly • 5.4k views


score 2 · Answer 1 · 2013-03-05

11.1 years ago

Ole Kristian Tørresen ▴ 150

Hi, it would have been useful to know the genome size of your plant, but I reckon it's not too large since you used Velvet.

I'm not sure I agree that those approaches are the best, but you might not have been able to include everything you've done in that post.

Neither Velvet nor Ray is actually made for a mixed assembly. Depending on the length of your 454 reads, you might lose a lot of information when they are chopped up into k-mers. (What's the third assembler? Do you shred contigs for the third assembler?)
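To make the point about k-mers concrete, here is a minimal Python sketch. The reads, the repeat and the value of k are made up for illustration and are not taken from Velvet or Ray. It builds two different long reads that contain the same repeat three times and shows that they break down into exactly the same k-mers, so the order of the segments between the repeats, which each whole read records, is gone after chopping.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every k-mer (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

K = 5
REPEAT = "GATC"  # length k - 1, so no single k-mer reaches across a repeat copy

# Two hypothetical long reads that differ only in the order of the
# segments sitting between the repeat copies.
read_a = "AA" + REPEAT + "CCC" + REPEAT + "TTT" + REPEAT + "GG"
read_b = "AA" + REPEAT + "TTT" + REPEAT + "CCC" + REPEAT + "GG"

# The reads are different sequences, yet their k-mer multisets are identical.
same_kmers = kmer_counts(read_a, K) == kmer_counts(read_b, K)
print("reads differ:", read_a != read_b)
print("same k-mers: ", same_kmers)
```

An assembler that uses whole reads can still tell the two arrangements apart, because each read spans all three repeat copies.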

Newbler is made for 454 reads and uses the whole reads instead of chopping them up. I guess this is the best assembly you've seen.

You don't supply enough details to be able to answer your question easily. It might be that by combining (how do you combine?) two assemblies, you combine the errors from the two assemblies. A 454 assembly might contain indels, and combining that with an Illumina assembly might not correct them.

To give you some advice, I would need more details. What's the estimated genome size? What kinds of libraries do you have and what are their insert sizes? How much coverage do you have?

You could try to use Newbler for all your data. Or you could try Celera Assembler (CVS version; wgs-assembler.sourceforge.net), which is tuned to handle a mix of Illumina and 454 reads. Maybe the best option would be MaSuRCA: http://www.genome.umd.edu/SR_CA_MANUAL.htm (newest version is 1.9.4), which reduces the Illumina reads to a smaller, non-redundant set before they are fed to Celera Assembler.

Good luck!


Hi Ole,

Thanks for your response.

The plant genome size is estimated to be 2.7-2.8 Gb. We do not have a reference genome.

I chose Ray because it is designed to be a hybrid assembler (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3119603/).

I used Velvet because it can assemble both long (>200 bp) and short (<200 bp) reads, as mentioned in the manual.

We have only a small amount of 454 data (80M), which was produced by randomly sampling the genome.
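As a rough sanity check on these numbers, average depth is simply sequenced bases divided by genome size. The sketch below is only an illustration: the thread does not say whether "80M" means bases or reads, so both readings are shown, and the 400 bp mean read length in the second one is a hypothetical value, not something stated here.

```python
def coverage(total_bases, genome_size):
    """Average depth: sequenced bases divided by genome size."""
    return total_bases / genome_size

GENOME_SIZE = 2.75e9  # midpoint of the 2.7-2.8 gigabase estimate

# ASSUMPTION 1: "80M" means 80 million bases of 454 sequence.
depth_if_bases = coverage(80e6, GENOME_SIZE)

# ASSUMPTION 2: "80M" means 80 million reads; the 400 bp mean read
# length is hypothetical and only used to make the arithmetic concrete.
depth_if_reads = coverage(80e6 * 400, GENOME_SIZE)

print(f"if 80M is bases: {depth_if_bases:.3f}x")
print(f"if 80M is reads: {depth_if_reads:.1f}x")
```

Under the first reading the 454 data covers only a few percent of the genome once on average, which would fit the description of it as a small amount.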

As for the Illumina data, we constructed three paired-end libraries with insert lengths of 125 bp, 250 bp and 500 bp, respectively, and we focused mainly on the gene-rich regions. We also have some Illumina sequences produced using RADseq (http://bfg.oxfordjournals.org/content/9/5-6/416.abstract).

The third assembler I used to combine assemblies (e.g., combining the Ray assembly of both types of reads AND the Velvet assembly of both types of reads; OR combining the Velvet assembly of Illumina reads with the Newbler assembly of 454 reads) is GAA (http://bioinformatics.oxfordjournals.org/content/28/1/13.full). I am sorry for forgetting to mention this in the question post.

I hope these are the details you need. And thanks a lot for your response.

Lhl

ADD REPLY • link 11.1 years ago by Lhl ▴ 760

Hi Lhl.

That's a big genome. I'm surprised you were able to use Velvet on it (or not so surprising if all you're doing is assembling exons). Ray is probably a good choice. I am still not confident that they would do the best job with the combination of reads, but other programs need a bit of tweaking to run properly.

So the Illumina reads are not randomly sampled from the genome? Most assemblers expect even coverage of the genome, and I guess some might not work well without it. If that is the case, I guess both Ray and Velvet would have problems with the combination of Illumina and 454 reads when the Illumina reads are unevenly distributed. Someone more knowledgeable than me would have to explain the reasons.

I guess you could take the 454 assembly, map your Illumina reads to it, and use the Illumina reads to correct mistakes in the 454 assembly.
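A toy Python sketch of that idea, under heavy simplifying assumptions: the reads are already placed on the draft without gaps, and only substitutions are corrected by majority vote. The `polish` function, its `min_depth` parameter and the data are all made up for illustration. A real run would use a read mapper and a consensus or polishing tool, and would have to handle indels, which are the typical 454 error and are not covered here.

```python
from collections import Counter

def polish(draft, placed_reads, min_depth=3):
    """Replace each draft base with the majority base of the reads covering it.

    placed_reads is a list of (start, sequence) ungapped placements on the
    draft. Positions covered by fewer than min_depth reads stay unchanged.
    """
    columns = [Counter() for _ in draft]
    for start, seq in placed_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(draft):
                columns[pos][base] += 1

    polished = []
    for draft_base, counts in zip(draft, columns):
        if sum(counts.values()) >= min_depth:
            polished.append(counts.most_common(1)[0][0])
        else:
            polished.append(draft_base)
    return "".join(polished)

# Toy draft contig with one wrong base at index 4 (T where the reads say A).
draft = "ACGTTGCAAC"
reads = [(0, "ACGTAGC"), (1, "CGTAGCA"), (2, "GTAGCAA"), (3, "TAGCAAC")]
fixed = polish(draft, reads)
print(draft, "->", fixed)
```

The depth threshold matters when Illumina coverage is uneven: positions with little or no read support are left as they were in the draft rather than being overwritten on thin evidence.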

What do you need the assembly for? If you just need the gene-rich regions, then you might have a good enough assembly. Combining assemblies, at least with uneven coverage, is not an easy task. I guess I need to read that GAA article to learn more about that.

Good luck.

Ole

ADD REPLY • link 11.1 years ago by Ole Kristian Tørresen ▴ 150

Hi Ole,

Thanks again for your response.

I have access to a university computer cluster, so basically I do not need to worry about computational resources. The Illumina reads are mainly sampled from the gene-rich genomic regions.

Since I only have a small number of 454 reads, which yield only a few contigs, I do not think I can use them as the base assembly and improve it further by mapping Illumina reads.

The main purpose of my study is to get as many functional components of the genome as possible.

Anyway, thanks for the discussion. It helps to some extent.

Cheers,

Lhl

ADD REPLY • link 11.1 years ago by Lhl ▴ 760