De novo genome assembly strategy
8.4 years ago
joneill4x ▴ 160

I am assembling a genome de novo. I have:

  • 10X coverage with PacBio reads
  • 100X coverage with Illumina short reads (150 bp paired-end reads)
  • 20X coverage with long MiSeq reads (max length 800 bp)

Given what I have to work with, what would be the best strategy to assemble the genome and why?

Thank you,
Joe

Edit: genome size is ~1 Gb
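
For scale, the coverage figures above can be converted into approximate total sequence. This is only a back-of-the-envelope sketch that assumes the ~1 Gb genome size stated in the edit; `bases_needed_gb` is a made-up helper name, not part of any assembler.

```shell
# Rough yield per data set: coverage (X) multiplied by genome size (Gb).
# genome_gb=1 is the ~1 Gb estimate from the post (an assumption, not a measurement).
genome_gb=1

# Hypothetical helper: $1 = coverage in X, $2 = genome size in Gb.
# Prints the total amount of sequence in Gb (integer arithmetic only).
bases_needed_gb() {
    echo $(( $1 * $2 ))
}

echo "PacBio:     $(bases_needed_gb 10 "$genome_gb") Gb"
echo "Illumina:   $(bases_needed_gb 100 "$genome_gb") Gb"
echo "Long MiSeq: $(bases_needed_gb 20 "$genome_gb") Gb"
```

Changing `genome_gb` shows how quickly the input grows with genome size, which is what limits some of the tools discussed in the replies.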


You should specify the genome type. Some tools will not be able to work on big genomes.


We have a similar set of data, and I was wondering what you decided to use in the end. I would also appreciate it if you could describe your experience. Thanks.


I ended up using DBG2OLC

What led me there: https://github.com/PacificBioscience...Bio-Long-Reads

The publication: http://arxiv.org/ftp/arxiv/papers/1410/1410.2801.pdf

The code: http://sourceforge.net/projects/dbg2olc/

I'm quite pleased with the results of DBG2OLC.

I corresponded with the authors, managed to closely replicate the results from their paper, and made some pretty decent draft assemblies of my own with minimal data. Fast performance and good results.

8.4 years ago
Adrian Pelin ★ 2.6k

SPAdes should give good results for your dataset. It will assemble your 100X Illumina reads using a multi-k-mer approach, then resolve some repeats using your long MiSeq reads, and additionally scaffold using the PacBio reads.

http://bioinf.spbau.ru/spades

So you can use their suggested guidelines for 150 bp reads:

spades.py -k 21,33,55,77 --careful <your reads> -o spades_output

You can specify the PacBio reads with: --pacbio

Your 100X paired-end reads with: --pe1-1 and --pe1-2

and your single-end MiSeq reads with: --s2
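
Put together, the full call might look like the sketch below. All file names are placeholders I made up; substitute your own. In SPAdes, the forward and reverse files of a single paired-end library are given as --pe1-1 and --pe1-2. The script only prints the command, so it can be checked before committing to a long run.

```shell
# Placeholder file names (hypothetical); replace with your own reads.
PE_R1=illumina_R1.fastq.gz   # forward reads of the 150 bp paired-end library
PE_R2=illumina_R2.fastq.gz   # reverse reads of the same library
MISEQ=miseq_long.fastq.gz    # long MiSeq reads, treated as single-end
PACBIO=pacbio.fastq.gz       # PacBio reads

# Print the SPAdes command instead of running it (dry run).
spades_cmd() {
    echo "spades.py -k 21,33,55,77 --careful" \
         "--pe1-1 $PE_R1 --pe1-2 $PE_R2" \
         "--s2 $MISEQ --pacbio $PACBIO -o spades_output"
}

spades_cmd
```

Once the printed command looks right, it can be run with `eval "$(spades_cmd)"` or simply pasted into the terminal.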


A nice tool. But it will work only for smaller genomes.


I have used it on genomes up to 150 Mb. Then again, the OP did not mention the genome size.


Thanks Adrian. Using SPAdes was my first thought too. However, my genome is large, ~1 Gb, so I don't think I can use it.


I found SPAdes and dipSPAdes to run extremely slowly when using PacBio reads as input.

8.4 years ago
Juke34 8.5k

ALLPATHS-LG could be a solution: it will assemble the Illumina short reads and then scaffold using the PacBio data.

For Illumina reads it needs high coverage (100X), so in your case that's fine, but on the other hand it needs very specific libraries (3 kbp mate-pair?). You should check.


Thanks Juke.


IIRC ALLPATHS-LG requires overlapping PE and one short mate-pair library. So it may not work if the above libraries don't fit this specification.

8.4 years ago

ALLPATHS‐LG requires a minimum of 2 paired‐end libraries - one short and one long. The short library average separation size must be slightly less than twice the read size, such that the reads from a pair will likely overlap - for example, for 100 base reads the insert size should be 180 bases. The distribution of sizes should be as small as possible, with a standard deviation of less than 20%. The long library insert size should be approximately 3000 bases long and can have a larger size distribution. Additional optional longer insert libraries can be used to help disambiguate larger repeat structures and may be generated at lower coverage

EDIT: Copied from the manual
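
As a quick way to check a short-fragment library against the two numeric constraints quoted above, something like the following sketch could be used. `check_short_library` is a hypothetical helper, not part of ALLPATHS-LG, and reading "less than 20%" as 20% of the mean insert size is my interpretation of the manual text.

```shell
# Hypothetical helper, not an ALLPATHS-LG tool.
# Usage: check_short_library READ_LEN INSERT_MEAN INSERT_SD  (all in bases)
check_short_library() {
    read_len=$1; insert=$2; sd=$3

    # Reads of a pair overlap on average only if the insert is shorter
    # than two read lengths.
    overlap=$(( 2 * read_len - insert ))
    if [ "$overlap" -le 0 ]; then
        echo "FAIL: no expected overlap"
        return 1
    fi

    # Assumption: "standard deviation of less than 20%" means
    # sd < 20% of the mean insert size.
    if [ $(( sd * 100 )) -ge $(( insert * 20 )) ]; then
        echo "FAIL: insert size spread too wide"
        return 1
    fi

    echo "OK: expected overlap ${overlap} bp"
}

# The manual's own example: 100-base reads with a 180-base insert.
# The standard deviation of 18 is an illustrative value.
check_short_library 100 180 18
```

For 150 bp reads, the same check requires a mean insert size below 300 bases, so a typical 150 bp paired-end library with a longer insert would not qualify.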

8.4 years ago
Juke34 8.5k

You can also use MaSuRCA mega-reads.

MaSuRCA in general gives relatively good results.

It is one of the few true hybrid assemblers (de Bruijn/OLC).


Thanks Juke. However, I don't think I should use it for my task because "We note that the modified version of CABOG 6.1 used in MaSuRCA is not capable of supporting the long high-error-rate reads generated by the PacBio technology."
