Why do we need 100X coverage to get a high-quality assembly?
7.4 years ago • lamz138138

Hi!

When performing genome assembly, high-coverage data is usually required, e.g. more than 50X for PacBio. In my opinion, 10X should be enough to correct the high-error reads and assemble the genome. Since we can't get uniform coverage across the genome, we increase the total coverage to ensure that most regions are covered by at least 10 reads for correction. Likewise, for Illumina and Sanger data, high coverage ensures that most regions are covered at all. Am I right?
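
To sanity-check this, here is a rough sketch (my own, assuming idealized Poisson-distributed depth, which real libraries only approximate); it computes what fraction of the genome is covered by at least 10 reads at a given mean coverage:

    from math import exp

    def frac_covered_at_least(mean_cov, k):
        """Fraction of genome positions covered by >= k reads, assuming
        read depth is Poisson-distributed with mean mean_cov."""
        term = exp(-mean_cov)          # Poisson pmf at i = 0
        cdf = 0.0
        for i in range(k):
            if i > 0:
                term *= mean_cov / i   # pmf(i) = pmf(i-1) * mean_cov / i
            cdf += term                # accumulates P(depth <= k-1)
        return 1.0 - cdf               # P(depth >= k)

    for c in (10, 15, 20, 30, 50):
        print(f"{c:>3}X mean coverage: {frac_covered_at_least(c, 10):.4f} of bases see >= 10 reads")

Even under this ideal model, 10X mean coverage leaves nearly half the genome below 10 reads, while ~30X gets above 99.99%; real coverage biases only make this worse, which seems consistent with needing well above 10X total.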

If I am right, do we really need 100X, and how do we determine the minimal dataset we need? Is there any protocol for extracting genomic DNA from blood or tissue that gives uniform representation across the genome, so that such high coverage could be avoided?

Any suggestion would be appreciated! Best wishes!

Tags: genome • Assembly • PacBio • Coverage

It depends on the organism (genome size) and its genomic structure (complexity, repeats, ...). There are also different protocols: some suggest that 20X PacBio is enough, others say that 10X PacBio with 50X SR (short reads) will do the job; and if you have a low-complexity genome, you could use a mix of paired-end reads and jumping libraries (mate-pair).

Funds and resources also play a role in deciding what to do; for example, you can get 100X for a bacterium, but it is hard to do the same for a large genome. You should also take Nanopore sequencing into consideration; it is doing a great job now.
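
To put rough numbers on the cost point, here is a toy calculation (the genome sizes are only illustrative):

    def required_yield_gb(genome_size_bp, coverage):
        """Raw bases needed for a given average depth: genome size x coverage."""
        return genome_size_bp * coverage / 1e9

    # approximate, illustrative genome sizes
    print(f"bacterium (~5 Mb) at 100X: {required_yield_gb(5e6, 100):6.1f} Gb")
    print(f"human     (~3 Gb) at 100X: {required_yield_gb(3e9, 100):6.1f} Gb")

0.5 Gb versus 300 Gb of raw data is why 100X is routine for bacteria but a serious budget item for a mammalian-sized genome.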


Hi, Medhat!

Thank you for your reply. I am planning to assemble a human-related species, but 100X PacBio data is too expensive, so I am wondering whether lower coverage is possible. However, according to the recent human and gorilla papers, they all used 100X data, further integrated with BioNano and Hi-C data in the human case. I am curious why we need so much data. Since FALCON suggests using 20X to 25X data in the assembly step, the high coverage is only used during correction; can we reduce the data?

By the way, to correct PacBio data with short reads we can use PBcR; Canu and FALCON don't support this mode. Are any other tools available?

Best wishes!


First, coverage depends on the question you are trying to answer. In the case of Canu, the paper suggests that 20X PacBio read coverage alone is sufficient for a good assembly (also, PBcR is not supported any more).

In general, if you are going to use PacBio in your project, this picture (even though it is a little old) will give you a general idea about coverage and tools. [figure: overview of PacBio coverage levels and the assembly tools suited to each]

Conclusion (see the sketch after this list):

  • 20X PacBio is a good average coverage; you can assemble with it alone using Canu/HGAP.
  • If you have less than 20X, you can combine it with around 50X of short reads and assemble using tools like DBG2OLC.
  • If the tool you want to use for assembly does not correct the PacBio reads itself, there are standalone tools for that (LoRDEC, proovread, Jabba); you can then use the corrected reads for the assembly.
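
A small sketch codifying these rules of thumb (my paraphrase of the advice above, not an official protocol; the thresholds are the rough figures quoted in this thread):

    def suggest_strategy(pacbio_cov, short_read_cov=0):
        """Rough strategy picker based on the bullets above."""
        if pacbio_cov >= 20:
            return "assemble the PacBio data alone (e.g. Canu or HGAP)"
        if pacbio_cov > 0 and short_read_cov >= 50:
            return ("hybrid assembly (e.g. DBG2OLC), or correct the long reads "
                    "first (e.g. LoRDEC or proovread) and assemble the result")
        return "probably not enough data for a good assembly; sequence more"

    print(suggest_strategy(25))        # -> PacBio alone
    print(suggest_strategy(10, 50))    # -> hybrid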

This paper will give you in-depth info about libraries and assembly tools: http://gigascience.biomedcentral.com/articles/10.1186/2047-217X-2-10

Good luck


Hi, thanks for your reply again!

According to the Canu manual, I think we need more than 30X of raw data to get 20X of corrected data. Altogether, it seems we can't know whether we will get a better genome before assembling real data. Maybe I could sequence 20X of data and then assemble with different tools to evaluate the assembly quality, but it is risky if I can't get a better genome assembly... Thanks for your suggestion!
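
In other words, if only a fraction of the raw bases survives correction, the required raw coverage scales accordingly; a one-line sketch (the ~2/3 survival rate here is just inferred from the 30X -> 20X figure above, not a documented constant):

    def raw_coverage_needed(target_corrected_cov, survival=20 / 30):
        """Raw long-read coverage needed for a target corrected coverage."""
        return target_corrected_cov / survival

    print(f"{raw_coverage_needed(20):.0f}X raw for 20X corrected")   # -> 30X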


Bacterial genomes are tiny, simple, and non-repetitive compared to humans... and, crucially, haploid. But even so, you won't get a high-quality assembly with 20x PacBio data. It depends on the genome and data, but roughly, 100x will get you a single-contig assembly, under ~50x usually won't, and in between is "maybe". Once you drop too low, the quality of the assembly rapidly deteriorates. This is very different from Illumina assemblies, in which the coverage has a comparatively minor effect on the assembly quality, because the error rate is very low.

Remember that 20x is only 10x per ploidy. Also, PacBio reads are often claimed to have a random distribution of errors, which would make them easy to correct, but that's not true; the errors have a biased distribution, which is part of the reason a substantial amount of coverage is needed for correction.
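
To make the "per ploidy" arithmetic concrete, here is a small sketch (same idealized Poisson depth model as in the question above; real error and coverage biases make things worse):

    from math import exp

    def prob_depth_at_least(mean_cov, k):
        """P(depth >= k) when depth is Poisson-distributed with mean mean_cov."""
        term, cdf = exp(-mean_cov), 0.0
        for i in range(k):
            if i > 0:
                term *= mean_cov / i
            cdf += term
        return 1.0 - cdf

    total_cov, ploidy = 20, 2
    per_haplotype = total_cov / ploidy   # 20X diploid -> ~10X per haplotype
    print(f"{prob_depth_at_least(per_haplotype, 10):.2f}")   # -> ~0.54

So at 20X on a diploid, any given haplotype position has only about a 54% chance of being seen by ten reads.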

If you want to reduce costs, you might consider a hybrid approach using Illumina reads to correct PacBio reads, as Illumina bases are much cheaper.


Hi Brian Bushnell, thanks for the reply!

Could you explain "20X is only 10X per ploidy" a bit more? In my opinion, the assembler reverse-complements sequences to find overlaps between reads and then corrects them, so diploidy shouldn't affect the correction. But in the final step, diploidy would produce alternative paths; do many alternative paths make it difficult to find the correct path, or could there be a false path between the plus and minus strands? In other words, why should I care about ploidy?

Apart from correction, what other reasons are there for needing so much coverage? Is it because some regions are less abundant in the extracted DNA, so they get sequenced less? Or are some reads sequenced with bias, so we have to increase the overall coverage? If so, is there any experimental method to obtain uniform DNA across the genome?

Could you give me a suggestion on whether the project should be started at all? Considering that the previous genome assembly was constructed from 7X Sanger data, I think 100X PacBio data would give a better genome, but I don't know how to evaluate that likelihood, and going on instinct is very dangerous... Also, I plan to get 100X data on the Sequel; however, since the Sequel isn't performing as well as advertised at the moment and the price hasn't fallen, I am still weighing whether I should do such a project.

For the assembly method, I simulated PacBio and Illumina data for three chromosomes (with gaps removed); FALCON gave the best result, followed by Canu, with the hybrid approach last. So although I have 270 bp, 500 bp, 2K, and 5K short-read libraries (I assembled them with SOAP and ALLPATHS-LG and got a lot of contigs), I previously thought they could only be used after the genome was assembled. Now that 100X data seems impossible, I am thinking of sequencing 20X PacBio data, correcting it with short reads using PBcR, and then assembling with FALCON and Canu to compare the results (this is risky too, as I haven't evaluated whether the project should start). Do you have any suggestions? Also, I am trying to correct my 2 Gb of PacBio data with these short reads to see the result.

Regarding the biased errors, I manually checked the mapping of my PacBio data; it seemed there would be a "T" inserted between a "G" and a "T", but that is just my impression.

Best wishes!


Since humans have a relatively low het rate, the ploidy might not make a huge difference, but it does cause problems because it's difficult to tell the difference between errors and heterozygous variants. And yes, the point of high coverage is the error correction, which is fairly difficult. PacBio has a very uniform coverage distribution, so you don't need extra coverage to make up for low-depth regions; they don't really exist (assuming you are using unamplified data).

If you simulate 20X PacBio data and try assembling it, you'll get an idea of what kind of assembly you might get. See this post for generating diploid data.

Linked post: What is the coverage when allele frequency is < 1?

Incidentally, I have a program, mutate.sh, in the BBMap package that will generate mutant genomes, and another, randomreads.sh, that will generate synthetic data. It has a PacBio mode for modeling the PacBio error profile. You'd use it something like this:

1) Index the genome
bbmap.sh ref=hg19.fasta

2) Generate reads
randomreads.sh out=reads.fq.gz reads=1000000 minlength=500 maxlength=15000 pacbio adderrors=f

Despite the "adderrors=f" flag, it will add PacBio errors.


Hi, sorry for the delayed reply!

In fact, I had already used BBMap to produce my short reads, and it has been very useful for evaluating the assemblies. Thanks again for all the suggestions.

Best wishes!


Take a look at 10X Chromium. If you're doing a diploid vertebrate, you can probably get pretty good results with it plus some standard Illumina libraries. I think PacBio, unless you've got the budget and a Sequel, is going to be prohibitively expensive. Don't discount things like DISCOVAR + 60x Illumina coverage + long mate pairs; they do a very good job, and you can add in BioNano data later.
