Question

Problem in estimating BAC size using k-mer method

0

7.7 years ago

gangireddy ▴ 160

Hi people,

I am trying to assemble a BAC clone sequence from PacBio reads. Celera Assembler and Canu each produce a single-contig assembly, but the two contigs differ in length by about 15 kb.

So, in order to estimate the target size, I followed the link below:

K-mer analysis and genome size estimate

The graph I obtained has two peaks and does not follow a Poisson distribution. I am confused about which peak to use for calculating the target size estimate; the two peaks give completely different estimates.

Assembly • 2.2k views

ADD COMMENT • link updated 7.7 years ago by SES 8.6k • written 7.7 years ago by gangireddy ▴ 160

0

Can you link to the image of the distribution? If you're dealing with a diploid, you should probably use the second peak, but seeing the distribution would clarify things a lot. Also, what organism is it for?

ADD REPLY • link 7.7 years ago by Brian Bushnell 20k

0

image link

It is the sequence of a B. mori BAC.

ADD REPLY • link 7.7 years ago by gangireddy ▴ 160

1

That does not really look like 2 peaks to me, but rather, one jagged peak. Normally, for one peak, the genome size is the area under the curve excluding error kmers. But in this case there is no clear distinction. I agree with other comments that this is not really a good scenario to try kmer-based genome-size estimation.
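For reference, here is a minimal sketch of what "area under the curve excluding error kmers" means in practice. It assumes you already have a k-mer histogram (multiplicity mapped to the number of distinct k-mers at that multiplicity) and that you choose the error cutoff yourself by eye. The function name, the cutoff, and the toy histogram are all made up for illustration and do not come from any particular tool. With a jagged or ambiguous peak like the one discussed here, the output should not be trusted.

```python
from typing import Dict, Tuple


def estimate_size_from_histogram(
    hist: Dict[int, int], error_cutoff: int
) -> Tuple[int, int, float]:
    """Estimate target size from a k-mer histogram.

    hist maps multiplicity -> number of distinct k-mers seen that many times.
    error_cutoff is a hand-picked multiplicity below which k-mers are
    treated as sequencing errors (an assumption, not a computed value).

    Returns (peak_multiplicity, distinct_kmers, total_over_peak).
    """
    # Drop the low-multiplicity bins assumed to be error k-mers.
    kept = {m: n for m, n in hist.items() if m >= error_cutoff and n > 0}
    if not kept:
        raise ValueError("no k-mers at or above the error cutoff")

    # Multiplicity with the most distinct k-mers, taken as the coverage peak.
    peak = max(kept, key=lambda m: kept[m])

    # Area under the curve: number of distinct non-error k-mers.
    distinct = sum(kept.values())

    # Alternative estimate: total k-mer occurrences divided by peak depth.
    total = sum(m * n for m, n in kept.items())
    return peak, distinct, total / peak


# Toy histogram with an error spike at low multiplicity and one clean peak.
toy = {1: 5000, 2: 800, 48: 10, 49: 30, 50: 60, 51: 30, 52: 10}
print(estimate_size_from_histogram(toy, error_cutoff=10))
```

The two estimates agree only when the sequence is mostly unique and the peak is clean; repeats inflate the total-over-peak estimate relative to the distinct count.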

ADD REPLY • link 7.7 years ago by Brian Bushnell 20k

score 0 · Answer 1 · 2016-07-20

0

7.7 years ago

SES 8.6k

I would personally forget about the k-mer approach. Having worked with BACs for years, in the wet lab and computationally, I would simply look up the library information. You should be able to find the average insert size for your library, and if there is a physical map you may be able to find information about the clone. It depends on how/who made the library, but there will be quite large differences between BACs regardless of the assembler. That first step will tell you if you are in the ballpark in terms of assembly size.

What you are showing, differences between assemblers, is also expected. I don't think those numbers are unusual. You just have to decide which is more likely correct based on the data (bearing in mind those tools were designed for different purposes), the biology, and the assembly statistics. The classic question of what is "better" depends on what you want to do. A larger N50 or total length isn't necessarily more correct. To me, that statistic on the length doesn't mean very much without some context. Sorry if that is vague, but I can be more specific if you'd like to provide more information.

ADD COMMENT • link 7.7 years ago by SES 8.6k

0

What if the BAC is the "genome"? The reference to "target size" keeps things vague, though.

ADD REPLY • link 7.7 years ago by GenoMax 141k

0

Can you elaborate please? I'm not sure what you are suggesting with either part of the comment. The sequence/size of a BAC vector is already known. What you are trying to determine is the insert size (of the clone).

ADD REPLY • link 7.7 years ago by SES 8.6k

1

I was thinking that OP is trying to use k-mer information alone to estimate the size of the "genome" (which in this case would be whole BAC). It is possible that my thinking is completely off target.

ADD REPLY • link 7.7 years ago by GenoMax 141k

0

No worries, the post is not really a computational/bioinformatics topic. When you do a BAC prep/digest to extract the clone the common approach is to run it out on a gel for QC before sequencing. Whoever picked the clone would have done that. The smart approach would be to gather this info instead of trying computational approaches IMO.

ADD REPLY • link 7.7 years ago by SES 8.6k

0

The average size of the library is 168 kb, and the assembled contigs have sizes of 219650 and 234947 bp. I don't think it was run on a gel, as the sequences also contained E. coli sequences, which I removed manually.
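To put those numbers side by side, here is a small sketch of the arithmetic. The function name is made up for illustration, and it treats the 168 kb figure as a plain mean insert size, which is an assumption since the library's size distribution is not given.

```python
from typing import List, Tuple


def compare_assemblies(
    sizes_bp: List[int], library_mean_bp: int
) -> Tuple[int, List[float]]:
    """Return the spread between assemblies and each size relative to the library mean."""
    # Difference between the largest and smallest assembly, in bp.
    spread = max(sizes_bp) - min(sizes_bp)
    # Each assembly length as a multiple of the library's average insert size.
    ratios = [round(size / library_mean_bp, 2) for size in sizes_bp]
    return spread, ratios


spread, ratios = compare_assemblies([219_650, 234_947], library_mean_bp=168_000)
print(spread, ratios)
```

The spread works out to 15297 bp, and both contigs are roughly 1.3 to 1.4 times the library average.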

ADD REPLY • link 7.7 years ago by gangireddy ▴ 160

1

All BAC data contains E. coli initially; that is the vector. This is the main reason to know the BAC library/clone information (to clean up the data). Sizing the insert on a gel would be done before the library is made, which is long before the sequencing is done (you can search the web for protocols). Those assembly size ranges are normal in my experience, and that looks like a nice BAC library! The typical approach is to try to "finish" the BAC as much as possible rather than focusing on the exact size.

ADD REPLY • link 7.7 years ago by SES 8.6k

0

We are trying to do a de novo assembly of a chromosome using a BAC library, so if the BAC assemblies are not up to the mark, the final assembly might have more problems. The difference in size is around 15 kb; this is what worries me.

ADD REPLY • link 7.7 years ago by gangireddy ▴ 160