Question

How to extend BAC sequences with PacBio reads in order to rebuild a genomic region?

6.5 years ago

Thomas B. ▴ 30

Hi all,

I am working on the reconstruction of a ~2 Mb plant genomic region. I have some BAC sequences from this region and PacBio reads from the whole genome.

I would like to extend the non-overlapping BAC sequences with PacBio reads until they can be merged into a single reference sequence (per haplotype).

I read this post (A: Extending ends of sequences with the help of reads?) and decided to start with tadpole. As a preliminary test, I merged the reads from 3 overlapping BACs and tried the following command:

tadpole.sh in=bac.fasta extra=reads.fasta out=extended_bac.fa extendleft=10000 extendright=10000 ibb=f mode=extend k=62

The output sequence was 3 nt longer. That is not as much as expected, but it worked.
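To illustrate why a k-mer based extension can stop after only a few bases, here is a toy sketch in Python. It is not Tadpole's implementation: the function names (`kmers`, `extend_right`) and the stopping rule are assumptions made for illustration. The sketch appends a base only while exactly one next k-mer is supported by the reads, so a dead end or a branch (for example from a residual read error or a heterozygous site) halts the extension.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def extend_right(seed, reads, k, max_extend):
    """Greedily extend seed to the right using k-mers from reads.

    Toy rule (an assumption, not Tadpole's actual logic): append a base
    only if exactly one of the four possible next k-mers is present.
    """
    if k < 2 or len(seed) < k - 1:
        raise ValueError("need k >= 2 and a seed of at least k-1 bases")

    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)

    seq = seed
    seen = kmers(seed, k)  # guards against walking around a repeat forever
    added = 0
    while added < max_extend:
        suffix = seq[-(k - 1):]
        candidates = [b for b in "ACGT" if suffix + b in read_kmers]
        if len(candidates) != 1:
            break  # dead end (0 candidates) or branch (2 or more)
        next_kmer = suffix + candidates[0]
        if next_kmer in seen:
            break  # loop detected
        seen.add(next_kmer)
        seq += candidates[0]
        added += 1
    return seq


# One clean read: the seed is extended to the end of the read.
print(extend_right("ACGT", ["ACGTTGCA"], k=4, max_extend=100))
# A second read that disagrees right after the seed creates a branch,
# so nothing is added.
print(extend_right("ACGT", ["ACGTTGCA", "ACGTAGCA"], k=4, max_extend=100))
```

In this toy model, a single disagreeing read is enough to stop the walk. Real tools weigh k-mer counts rather than mere presence, so the behaviour above is only a rough analogy.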

Then I tried the same command but using the whole genome dataset. Unfortunately, it ran out of memory, even when using the -Xmx20g or -Xmx200m options.

It should be said that in the latter case I used reads that were already error-corrected and trimmed by Canu.

I also wanted to normalize the data with BBNorm to decrease coverage, until I read that it is only suited to short reads. However, I found no other tool for that purpose.

Now here are my questions:

Is there a way to work on the whole-genome dataset? And, provided it is possible, is there a trick to get longer extensions?

Thanks in advance!

Assembly tadpole • 1.6k views


What is your PacBio coverage of the whole genome? I presume it is low; if not, just use Canu to assemble the whole thing. You'll need more than 20 GB of RAM for this project, though.

I would aim more at using MaSuRCA or Miniasm (followed by Racon) to assemble the PacBio reads, as these contain the most structural information.

ADD REPLY • link 6.5 years ago by colindaven 6.4k

The whole genome size is estimated at 770Mb. The raw dataset is ~5.8M reads (coverage ~75x) and the error-corrected/trimmed dataset is ~1.4M reads (coverage ~20x).
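As a rough cross-check of these figures, the arithmetic can be sketched in Python. The function names are hypothetical, and the calculation assumes that coverage is simply total bases divided by genome size. It gives the implied mean read length for each dataset and the approximate number of corrected reads expected to fall in a 2 Mb region.

```python
def total_bases(genome_size_bp, coverage):
    """Total sequenced bases implied by a genome size and a coverage."""
    return genome_size_bp * coverage


def mean_read_length(genome_size_bp, coverage, n_reads):
    """Mean read length implied by coverage and read count."""
    return total_bases(genome_size_bp, coverage) / n_reads


def expected_reads_in_region(region_bp, coverage, mean_len):
    """Approximate number of reads needed to cover a region at a given depth."""
    return region_bp * coverage / mean_len


GENOME_BP = 770_000_000   # estimated genome size from the post
REGION_BP = 2_000_000     # target region size from the post

raw_len = mean_read_length(GENOME_BP, 75, 5_800_000)
corrected_len = mean_read_length(GENOME_BP, 20, 1_400_000)
region_reads = expected_reads_in_region(REGION_BP, 20, corrected_len)

print(f"raw: mean read length ~{raw_len:.0f} bp")
print(f"corrected: mean read length ~{corrected_len:.0f} bp")
print(f"corrected reads expected in the 2 Mb region: ~{region_reads:.0f}")
```

Under these assumptions, only a few thousand of the ~1.4M corrected reads originate from the region of interest, which is why restricting the input to relevant reads would reduce memory needs far more than a change of JVM settings.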

We did assemble the whole genome using Canu and Falcon, but both assemblies contain structural misassemblies (the BAC sequences cannot be aligned over their entire length).

MaSuRCA is an alternative to those tools. But it does not seem to have been tested on a dataset as large as mine, and it requires both Illumina and PacBio data.

Racon is a consensus algorithm. I already tested some, such as pbdagcon and samtools. They both worked well but did not extend the reference sequences at all. Is Racon able to do that?

ADD REPLY • link 6.5 years ago by Thomas B. ▴ 30

You're right about Racon, I corrected my comment.

Sounds like a good dataset.

I believe SSPACE longread may be able to extend assemblies, but have never done this.

Basically, if Canu and Falcon, both leading assemblers, have failed, then the main alternative would be HGAP. 2 Mb is big for a (multi-)BAC region. Are you fully confident in all the BACs? Were they all assayed with PacBio?

I just found one alternative I have starred but not tried: https://github.com/ruanjue/smartdenovo

I couldn't find the hybrid assembler I used in the past, but I don't think its results would be better than those of Canu et al.

ADD REPLY • link 6.5 years ago by colindaven 6.4k

So here is the other alternative

https://github.com/yechengxi/DBG2OLC

ADD REPLY • link 6.5 years ago by colindaven 6.4k

The BACs were individually assembled using HGAP. Each output sequence was checked against read alignments (BAM file), size (fingerprinting), and BAC ends (Sanger). I cannot exclude some errors, but I am confident in this dataset.

You are right, scaffolding may be a good option and I'll look into that.

Now I would like to come back to my specific question. Is there a way to run tadpole using the BAC sequences and reads from the whole genome? What do you think about sub-sampling or normalization of the data?
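For reference, plain random sub-sampling of a FASTA file can be sketched in a few lines of Python. This is a minimal illustration, not a recommendation of a specific tool: the function names and the fixed seed are assumptions, and the parser only handles simple FASTA input.

```python
import random


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subsample(records, fraction, seed=0):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    for header, seq in records:
        if rng.random() < fraction:
            yield header, seq


def write_fasta(records, handle):
    """Write (header, sequence) pairs as single-line FASTA records."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")


# Usage sketch (file names are placeholders):
# with open("reads.fasta") as fin, open("reads.sub.fasta", "w") as fout:
#     write_fasta(subsample(read_fasta(fin), fraction=0.5), fout)
```

Note that uniform sub-sampling lowers the depth in the target region by the same fraction as everywhere else, so with ~20x of corrected reads it trades memory directly against the coverage available for extension.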

ADD REPLY • link 6.5 years ago by Thomas B. ▴ 30