Question

Which NGS platform, coverage, and read length are appropriate for sequencing bacterial DNA for genome assembly?



8.5 years ago

jerrybug109 ▴ 10

Hello bioinformaticians!

I'm new to bioinformatics and have just been assigned to a genomic DNA sequencing and genome assembly project, and I would appreciate your advice on some basic questions!

We are conducting a population survey of ~90 strains of Bacillus subtilis (genome size = 4 Mb). We would like to do full genome sequencing on each of these strains. We already have many sequenced reference genomes and will use those as anchors.

We will purify DNA from each strain and will have 90 individual DNA samples. We want to send these DNA samples out to a university/company to be sequenced. Using that data, we intend to assemble/resequence the genome of each of these strains ourselves using Velvet, SPAdes, or an alternative genome assembler.

Right now, I have the responsibility of choosing where our DNA samples get sent out to: what NGS platform to use, and what run specifications to use for our project.

My issue right now is that I'm not sure which next-generation sequencing platform is suitable for our project, given that genome assembly is our goal. How do I pick between Illumina (MiSeq, HiSeq, etc.) and PacBio?

I also am unsure of what depth of coverage would be acceptable for our purposes; I've been told 10X should be good enough but that seems low to me - perhaps 20X would suffice?

Finally, I'm not certain what read length would be appropriate - do you know if paired-end reads of 2x150 bp or 2x250 bp would be good?

I'm from a different field so I have a lot to learn - I'd appreciate any pointers you have. Thanks!

Assembly genome genomics next-gen-sequencing • 4.8k views

updated 19 months ago by Ram 43k • written 8.5 years ago by jerrybug109 ▴ 10

Ram · Answer 1 · 2015-10-07

10X coverage is insufficient on Illumina or PacBio platforms for this, or pretty much any purpose.

The best way to assemble bacterial isolates is to aim for 1-2 SMRT cells (~80x coverage) with PacBio; you'll typically get a near-perfect single-contig assembly using Falcon. For a 4 Mbp genome, one SMRT cell is usually enough for a single-contig assembly, given high-quality input DNA and ideal loading.

If that's too expensive, target ~200x Illumina coverage with multiplexed 2x150 bp reads on a HiSeq and assemble with SPAdes. That will get you decent assemblies, probably with 50-150 contigs and not very many errors.

If you simply want to map reads and call variants with respect to a reference (assuming the strains have not diverged much), 40x Illumina coverage with 2x150 bp reads would be sufficient and probably the cheapest approach.
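To turn these coverage targets into numbers you can give a sequencing provider, here is a rough sketch of the usual arithmetic: coverage is approximately total sequenced bases divided by genome size. The function name and constants are made up for illustration. It assumes the 4 Mb genome and 90 strains from the question and ignores real-world losses (duplicates, adapter and quality trimming, uneven multiplexing), so treat the result as a lower bound and not a quote.

```python
def read_pairs_needed(genome_size_bp: int, coverage: int, read_length_bp: int) -> int:
    """Read pairs needed so that (total bases / genome size) >= coverage.

    Hypothetical helper for illustration; assumes every base of every
    read is usable, which real runs never fully achieve.
    """
    bases_needed = genome_size_bp * coverage
    bases_per_pair = 2 * read_length_bp  # paired-end: two reads per fragment
    # Integer ceiling division, so we never round below the target coverage.
    return (bases_needed + bases_per_pair - 1) // bases_per_pair


# Assumptions taken from the question, not measured values.
GENOME_SIZE_BP = 4_000_000  # approximate B. subtilis genome size
N_STRAINS = 90
READ_LENGTH_BP = 150        # 2x150 bp paired-end

for label, coverage in [("de novo assembly", 200), ("reference mapping", 40)]:
    pairs = read_pairs_needed(GENOME_SIZE_BP, coverage, READ_LENGTH_BP)
    total_gb = pairs * 2 * READ_LENGTH_BP * N_STRAINS / 1e9
    print(f"{label}: {coverage}x -> {pairs:,} read pairs per strain, "
          f"{total_gb:.1f} Gb across {N_STRAINS} strains")
```

Under these assumptions, 200x works out to roughly 2.7 million read pairs per strain (about 72 Gb for all 90 strains), and 40x to roughly 0.53 million read pairs per strain (about 14.4 Gb in total). Compare those totals against the yield your provider quotes per lane or run to see how many strains can be multiplexed together.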