How to extend BAC sequences with PacBio reads in order to rebuild a genomic region?
6.5 years ago
Thomas B. ▴ 30

Hi all,

I am working on the reconstruction of a ~2 Mb plant genomic region. I have several BAC sequences from this region and PacBio reads from the whole genome.

I would like to extend the non-overlapping BAC sequences with PacBio reads until they can be merged into a single reference sequence (per haplotype).

I read this post (A: Extending ends of sequences with the help of reads?) and decided to start with tadpole. As a preliminary test, I merged the reads from 3 overlapping BACs and tried the following command:

tadpole.sh in=bac.fasta extra=reads.fasta out=extended_bac.fa extendleft=10000 extendright=10000 ibb=f mode=extend k=62

The output sequence was 3 nt longer. That is not as much as expected, but it worked.

Then I tried the same command but using the whole genome dataset. Unfortunately, it ran out of memory, even when using the -Xmx20g or -Xmx200m options.

It should be said that in the latter case I used reads that were already error-corrected and trimmed by Canu.

I also wanted to normalize the data with BBNorm to decrease coverage, until I read that it is only suited to short reads. However, I have found no other tool for that purpose.

Now here are my questions:

Is there a way to work on the whole-genome dataset? And, provided it is possible, is there a trick to get longer extensions?

Thanks in advance!

Assembly tadpole • 1.6k views
ADD COMMENT

What is your PacBio coverage of the whole genome? I presume it is low; if not, just use Canu to assemble the whole thing. You'll need more than 20 GB of RAM for this project, though.

I would aim more at using MaSuRCA or Miniasm (followed by Racon) to try to assemble the PacBio reads, as these contain the most structural information.

ADD REPLY

The whole genome size is estimated at 770 Mb. The raw dataset is ~5.8M reads (coverage ~75x) and the error-corrected/trimmed dataset is ~1.4M reads (coverage ~20x).

We did assemble the whole genome using Canu and Falcon, but both assemblies contain structural misassemblies (the BAC sequences cannot be aligned over their entire length).

MaSuRCA is an alternative to those tools, but it does not seem to have been tested on a dataset as large as mine, and it requires both Illumina and PacBio data.

Racon is a consensus algorithm. I have already tested some, such as pbdagcon and samtools. They both worked well but did not perform any extension of the reference sequences. Is Racon able to do that?
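As a quick sanity check on the figures above, here is a minimal Python sketch of the coverage arithmetic (coverage = total bases / genome size). The function names are made up for illustration, and the read counts and coverages are the approximate values quoted in this thread, so the results are only rough estimates.

```python
# Back-of-the-envelope check: coverage = total sequenced bases / genome size.
GENOME_SIZE = 770e6  # estimated genome size in bp, as quoted in the thread


def total_bases(coverage, genome_size=GENOME_SIZE):
    """Total sequenced bases implied by a given fold coverage."""
    return coverage * genome_size


def mean_read_length(coverage, n_reads, genome_size=GENOME_SIZE):
    """Mean read length implied by a fold coverage and a read count."""
    return total_bases(coverage, genome_size) / n_reads


# Approximate values from the thread: ~5.8M raw reads at ~75x,
# ~1.4M corrected/trimmed reads at ~20x.
raw_mean = mean_read_length(75, 5.8e6)
corrected_mean = mean_read_length(20, 1.4e6)

print(f"raw:       {total_bases(75) / 1e9:.1f} Gb, mean read ~{raw_mean:.0f} bp")
print(f"corrected: {total_bases(20) / 1e9:.1f} Gb, mean read ~{corrected_mean:.0f} bp")
```

Both datasets imply mean read lengths on the order of 10 kb, so the read counts and coverages quoted are consistent with each other.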

ADD REPLY

You're right about Racon, I corrected my comment.

Sounds like a good dataset.

I believe SSPACE-LongRead may be able to extend assemblies, but I have never done this.

Basically, if Canu and Falcon, both leading assemblers, have failed, then the main alternative would be HGAP. 2 Mb is big for a (multi-)BAC region. Are you fully confident in all the BACs? Were they all sequenced with PacBio?

I just found one alternative that I have starred but not tried: https://github.com/ruanjue/smartdenovo

I couldn't find the hybrid assembler I used in the past, but I don't think its results would be better than those of Canu et al.

ADD REPLY

So here is the other alternative

https://github.com/yechengxi/DBG2OLC

ADD REPLY

The BACs were individually assembled using HGAP. Each output sequence was checked against the read alignments (BAM file), the expected size (fingerprinting), and the BAC ends (Sanger). I cannot exclude some errors, but I am confident in this dataset.

You are right, scaffolding may be a good option and I'll look into that.

Now I would like to come back to my specific question. Is there a way to run tadpole using the BAC sequences and reads from the whole genome? What do you think about sub-sampling or normalization of the data?
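To make the sub-sampling idea concrete, here is a minimal Python sketch of uniform random sub-sampling of a FASTA read set down to a target coverage. It is only an illustration: the function names and the tiny in-memory demo are hypothetical and not part of any existing tool, and it assumes the current coverage is already known. Note that uniform sub-sampling lowers depth everywhere, including over the region of interest.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subsample(records, current_cov, target_cov, seed=0):
    """Keep each read independently with probability target_cov / current_cov."""
    keep_prob = min(1.0, target_cov / current_cov)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    for header, seq in records:
        if rng.random() < keep_prob:
            yield header, seq


# Tiny in-memory demo; with a real file, pass an open file handle
# to read_fasta() instead of this list.
demo = [">r1", "ACGT", "ACGT", ">r2", "GGCC", ">r3", "TTAA"]
kept = list(subsample(read_fasta(demo), current_cov=20, target_cov=10))
print(f"kept {len(kept)} of 3 reads")
```

Each read is kept independently, so the final coverage is only approximately the target. Because reads are streamed one at a time, the whole dataset never has to be held in memory.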

ADD REPLY
