Question

How to assemble complex soil metagenome datasets?

2

6.8 years ago

Lina F ▴ 200

Hi all,

I have 27 soil WGS metagenome datasets and I am trying to assemble them into contigs that are at least 1000-2000 bp long. Each dataset on its own is 20-30 gigabytes of paired-end FASTQ reads.

I first tried the Ray Meta assembler because it's supposed to run well in parallel. That worked for most datasets, but the contigs came out very short (most are <500 bp). Then I found this paper that suggests Ray Meta does better for low-complexity datasets.

I also took a look at CONCOCT, and the strategy makes sense to me, but the code on their GitHub page is woefully outdated and I'm not sure how much of it is still maintained. It also suggests combining all datasets into one and assembling that (a "coassembly") for the downstream analysis, but that approach will be computationally challenging since my datasets are so large.

If anyone has experience with assembling complex soil datasets, I'd love to do some brainstorming, so please reach out!

Thanks!

metagenomics assembly soil • 4.7k views

ADD COMMENT • link updated 6.8 years ago by Joe 21k • written 6.8 years ago by Lina F ▴ 200

1

6.8 years ago

Joe 21k

My suggestion would also have been CONCOCT. It's developed by Chris Quince in my department. It is still under active development, but the docs etc. are a bit out of date, as you say.

It's specifically designed for binning metagenome contigs though (it runs on top of an existing assembly), so I'd give it a try.

ADD COMMENT • link 6.8 years ago by Joe 21k

0

Thanks for the feedback! I was able to do a coassembly of 7 samples (the first part of my dataset) using Megahit, so running CONCOCT should now be within reach :)

It's also good to hear that it's still under active development! I've been looking at the GitHub repository -- is this a good place to keep an eye on for new developments?

ADD REPLY • link 6.8 years ago by Lina F ▴ 200

score 3 · Accepted Answer · 2017-07-05

3

6.8 years ago

Brian Bushnell 20k

The best approach for contiguity is generally to coassemble if you have sufficient resources. Some assemblers (HipMer/MetaHipMer, Ray, and Omega/Disco) can distribute the job so that memory use is spread across multiple nodes. Since you've tried Ray, you might try the others too and see if they give better results. For a single node, we've found Megahit gives the best results with the lowest resource consumption.
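
As a rough illustration of a single-node coassembly, here is a minimal Python sketch that only builds a Megahit command line for several paired-end samples. The file names, output directory, thread count and minimum contig length are hypothetical placeholders, and the `meta-large` preset is my assumption for soil data; check `megahit --help` for your installed version before relying on any of the flags.

```python
from typing import List, Sequence, Tuple


def megahit_coassembly_cmd(
    pairs: Sequence[Tuple[str, str]],
    out_dir: str,
    threads: int = 16,
    min_contig_len: int = 1000,
) -> List[str]:
    """Build a Megahit command that coassembles all given read pairs."""
    if not pairs:
        raise ValueError("need at least one (forward, reverse) FASTQ pair")
    # Megahit takes comma-separated lists of forward and reverse files.
    forward = ",".join(p[0] for p in pairs)
    reverse = ",".join(p[1] for p in pairs)
    return [
        "megahit",
        "-1", forward,
        "-2", reverse,
        "--presets", "meta-large",  # assumption: preset aimed at complex metagenomes
        "--min-contig-len", str(min_contig_len),
        "-t", str(threads),
        "-o", out_dir,
    ]


# Hypothetical file names, for illustration only.
samples = [
    ("s1_R1.fastq.gz", "s1_R2.fastq.gz"),
    ("s2_R1.fastq.gz", "s2_R2.fastq.gz"),
]
cmd = megahit_coassembly_cmd(samples, "coassembly_out")
print(" ".join(cmd))
```

The function only returns the argument list; passing it to `subprocess.run(cmd, check=True)` would start the actual assembly.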

You can also try approaches such as binning (using e.g. MetaBAT) and then assembling just the reads that map to each individual bin. Normalization, error-correction, and/or discarding low-depth reads can also improve assemblies. With Ray and Disco, both error-correction and merging paired reads prior to assembly increase contiguity.

But in general it's not a solved problem, so you'll have to experiment a lot! Don't expect great contiguity, though; complex metagenomes often yield an L50 (length) of 200 bp or less.
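
To make "discarding low-depth reads" concrete, here is a toy Python sketch: it counts k-mers across all reads and drops reads whose median k-mer count falls below a threshold. This is not how any particular tool does it; it ignores reverse complements and quality scores, holds everything in memory, and the values of `k` and `min_depth` are arbitrary assumptions chosen for the tiny example.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def drop_low_depth_reads(
    reads: Iterable[str], k: int = 5, min_depth: int = 2
) -> List[str]:
    """Keep reads whose median k-mer count across the dataset is >= min_depth."""
    reads = list(reads)
    counts = Counter(km for r in reads for km in kmers(r, k))
    kept = []
    for r in reads:
        depths = [counts[km] for km in kmers(r, k)]
        # Reads shorter than k have no k-mers and are dropped.
        if depths and median(depths) >= min_depth:
            kept.append(r)
    return kept


# Three overlapping reads from one "genome" plus one singleton read.
reads = ["ACGTACGTAC", "ACGTACGTAC", "CGTACGTACG", "TTTTGGGGCC"]
kept = drop_low_depth_reads(reads, k=5, min_depth=2)
print(kept)
```

On real 20-30 GB datasets you would use a dedicated, memory-efficient tool instead; the sketch only shows the idea that reads made of rarely seen k-mers contribute little to an assembly.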

Note that you may be able to bin the raw reads using a binning tool based on depth covariance if the 27 samples are different (different conditions, locations, times, etc.).
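
The idea behind depth covariance is that sequences from the same organism rise and fall in coverage together across samples. A toy Python sketch under simplifying assumptions: the depth profiles are made-up numbers, the greedy grouping and the 0.95 correlation cutoff are my own choices for illustration, and real binners combine this signal with sequence composition and proper clustering.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long depth profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no covariance signal
    return cov / sqrt(vx * vy)


def group_by_depth(
    profiles: Dict[str, Sequence[float]], min_r: float = 0.95
) -> List[List[str]]:
    """Greedy grouping: join the first group whose first member correlates >= min_r."""
    groups: List[List[str]] = []
    for name, prof in profiles.items():
        for g in groups:
            if pearson(profiles[g[0]], prof) >= min_r:
                g.append(name)
                break
        else:
            groups.append([name])
    return groups


# Made-up mean depths of four sequences across four samples.
depth = {
    "contig_a": [10, 40, 5, 20],
    "contig_b": [21, 79, 11, 41],
    "contig_c": [30, 2, 25, 3],
    "contig_d": [62, 5, 49, 7],
}
bins = group_by_depth(depth)
print(bins)
```

With only a few samples the correlations are noisy; having 27 distinct samples is what makes this kind of signal usable.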

ADD COMMENT • link 6.8 years ago by Brian Bushnell 20k

0

Thanks for the feedback! I tried Megahit and reduced the assembly time for a single sample from 11 hours (for Ray) to 2 hours. This was on a large AWS EC2 instance.

I also managed to do a coassembly for 7 samples in 19 hours using Megahit, so this is very encouraging!

ADD REPLY • link 6.8 years ago by Lina F ▴ 200

0

Yep, Megahit is a great tool, and I highly recommend it.

I'd be remiss not to mention, though, that speed is not the only metric you should be considering. Please, at a minimum, check the basic assembly stats (N50, L50) as well. For example - I can guarantee you that BBMap's Tadpole is faster than Megahit or Ray, but that does not mean it's better. Rather, it has fewer misassemblies, but the contiguity is lower. The choice of assembler is dictated by your goals.

In general - I'd choose the assembler that gives the best results pursuant to your goal, rather than the fastest one. Sometimes that means the best contiguity (in which case I'd suggest SPAdes), sometimes that means the fewest misassemblies (in which case I'd suggest Tadpole), and sometimes that means the best balance of contiguity, accuracy, time, and resource usage (in which case I'd suggest Megahit).
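
The basic assembly stats mentioned above are easy to compute from a list of contig lengths. A minimal Python sketch, assuming the common convention in which N50 is a length and L50 is a count; some tools report the two labels the other way round, which appears to be the usage in "L50 (length)" earlier in this thread. The contig lengths are made up.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50): N50 is the length of the contig at which the
    running total, longest first, reaches half the assembly size; L50 is
    how many contigs it took to get there."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise AssertionError("unreachable for non-empty input")


# Made-up contig lengths in bp.
contig_lengths = [100, 200, 300, 400, 500]
stats = n50_l50(contig_lengths)
print(stats)
```

Comparing these numbers across assemblers only makes sense with the same minimum contig length cutoff, since dropping short contigs changes the total.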

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k