Question

puzzled about "-kmer" options during de novo assembly

0

5.8 years ago

Yingzi Zhang ▴ 90

Hi all, I am puzzled about "-kmer" options during de novo assembly.

First, I did k-mer frequency analysis.

Reported:

For P(x): Possible peaks including: 100 the unique peak is 100

For F(x): Possible peaks including: 10 103 the unique peak is 103

Raw kmer depth estiamtion:

Curve                Peak    expect_depth
k-mer species        100     100.687
k-mer individuals    103     102.643

So I took the kmer depth of my data to be about 101, and I thought I should use this value in the following analysis.

Then I began to correct sequencing errors and trim reads containing singleton kmers using bfc. I got advice from my supervisor, who said I just need to set the -kmer value to 61 (my data is 100bp x 2). I once read a paper that also set -kmer 61. So is it right to just set the kmer value to 61? Does it have nothing to do with my own data? Why? Thank you.

Yingzi

assembly sequencing • 1.7k views

ADD COMMENT • link updated 5.8 years ago by lieven.sterck 15k • written 5.8 years ago by Yingzi Zhang ▴ 90

0

You can also use kmergenie to find the optimal range for the assembly.

Generally speaking, though, people tend to use about 2/3 of the read length as the kmer. However, it is always better to run multiple assemblies with different kmer values and evaluate each of them.

ADD REPLY • link 5.7 years ago by harish ▴ 450

score 2 · Accepted Answer · 2018-07-04

2

5.8 years ago

lieven.sterck 15k

The kmer you use for kmer-freq analysis is not (or does not have to be) related to the kmer you use for the actual assembly, and certainly not to the peak value of your freq analysis.

The rule of thumb is to set it at approx. 2/3 of your read length (at least initially), so in that sense 61 is probably not a bad choice. It certainly can NOT be bigger than your read length!

However, the kmer story is much more complex than this: it also has to do with your data quality, the heterozygosity level of your species, etc.
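To make the 2/3 rule of thumb concrete, here is a minimal Python sketch. The function names `starting_kmer` and `kmer_sweep` are hypothetical helpers, not part of any assembler. Restricting k to odd values is an assumption on my side: many de Bruijn graph assemblers prefer or require odd k, but check your own tool. The sweep simply lists a few values around the starting point so that several assemblies can be run and compared.

```python
def starting_kmer(read_length: int, fraction: float = 2 / 3) -> int:
    """Return an odd k close to fraction * read_length, strictly below read_length."""
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    k = round(read_length * fraction)
    if k % 2 == 0:
        k += 1  # bump even values up to the next odd number (assumption: odd k wanted)
    # k must stay shorter than the read; step down by 2 so it remains odd
    while k >= read_length:
        k -= 2
    return max(k, 1)


def kmer_sweep(read_length: int, step: int = 10, n_each_side: int = 2) -> list[int]:
    """List candidate k values around the rule-of-thumb value.

    Use an even step so that every candidate stays odd.
    """
    center = starting_kmer(read_length)
    candidates = [center + i * step for i in range(-n_each_side, n_each_side + 1)]
    # keep only values that are positive and shorter than the read
    return [k for k in candidates if 0 < k < read_length]


if __name__ == "__main__":
    print(starting_kmer(100))  # rule-of-thumb starting point for 100 bp reads
    print(kmer_sweep(100))     # candidate values to assemble with and compare
```

For 100 bp reads the starting point lands in the 60s, the same neighbourhood as the 61 suggested in the question, which is why 61 is a reasonable first try rather than a value derived from the data.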

ADD COMMENT • link 5.8 years ago by lieven.sterck 15k

0

Cool. Additionally, where should the kmer peak value be used? Would you please explain a little bit? I know kmer frequency analysis can help estimate genome size and the extent of heterozygosity; is that where the peak value is used? Also, I don't know how to evaluate the heterozygosity level (unfortunately I have to, because some options depend on it). If the report looks like this, is the heterozygosity level low enough?

for hybrid: a[1/2]=0.226337 a1=0.728825

kmer-species heterozygous ratio is about 0.12761

for hybrid: b[1/2]=0.167228 b1=0.569748

kmer-individual heterozygous ratio is about 0.0912432

ADD REPLY • link 5.8 years ago by Yingzi Zhang ▴ 90

1

Indeed, the kmer peak value is (to my knowledge) only used in genome size estimation.
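As a rough illustration of how the peak enters a genome size estimate, the usual back-of-the-envelope relations are: genome size ≈ total number of kmers / peak kmer depth, and base coverage ≈ kmer depth × L / (L − k + 1) for read length L. The Python sketch below applies them. The peak depth of 100 comes from the report in the question; the total kmer count and the counting k of 17 are made-up placeholders, since neither is given in this thread. Real estimators usually also discard low-frequency error kmers first, which this sketch does not do.

```python
def estimate_genome_size(total_kmers: int, peak_depth: float) -> float:
    """Genome size is roughly the total kmer count divided by the peak kmer depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


def base_coverage(kmer_depth: float, read_length: int, k: int) -> float:
    """Convert kmer depth to base coverage: each read of length L yields L - k + 1 kmers."""
    if not 0 < k <= read_length:
        raise ValueError("k must be between 1 and read_length")
    return kmer_depth * read_length / (read_length - k + 1)


if __name__ == "__main__":
    peak_depth = 100                # from the kmer-species peak reported in the question
    total_kmers = 50_000_000_000    # HYPOTHETICAL: not given in the thread
    counting_k = 17                 # HYPOTHETICAL: k used for the frequency analysis

    size = estimate_genome_size(total_kmers, peak_depth)
    cov = base_coverage(peak_depth, read_length=100, k=counting_k)
    print(f"estimated genome size: {size / 1e6:.0f} Mb")
    print(f"approximate base coverage: {cov:.1f}x")
```

Note that the k here is the one used for counting, which, as said above, does not have to match the k used for assembly.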

You can always upload your kmer count table to, for instance, the GenomeScope website, which will give you a nice overview (and graphs) of what your data looks like, including the heterozygosity estimation.

Minor EDIT: some assembly software (e.g. ABySS) does use this kmer-freq plot info to find, for instance, the lower-bound coverage (below which data is considered noise).
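To show what "lower-bound coverage" can mean in practice, here is a minimal Python sketch that takes a kmer frequency histogram and returns the first valley, i.e. the depth where the counts stop falling from the error peak and start rising towards the main peak. This is only an illustration of the idea, not how ABySS or any other assembler actually does it; the function name `first_valley` and the toy histogram are made up. Real histograms are noisier and may need smoothing first.

```python
def first_valley(hist: list[int]) -> int:
    """Return the first depth d (d >= 1) where hist[d] < hist[d + 1].

    hist[d] is the number of distinct kmers observed exactly d times;
    hist[0] is unused. If the histogram has no low-depth error peak,
    this simply returns 1.
    """
    for depth in range(1, len(hist) - 1):
        if hist[depth] < hist[depth + 1]:
            return depth
    raise ValueError("histogram never rises again; no valley found")


if __name__ == "__main__":
    # TOY DATA (hypothetical): error kmers pile up at low depth,
    # then the counts climb to the main peak around depth 9.
    toy_hist = [0, 900, 400, 150, 60, 40, 55, 120, 300, 500, 420, 200]
    cutoff = first_valley(toy_hist)
    print(f"kmers at or below depth {cutoff} would be treated as likely noise")
```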