kmergenie keeps suggesting the highest k in the range as the suitable one (?)
8.7 years ago
pbigbig ▴ 250

Hi,

I ran kmergenie on a merged data file (concatenated from 6 individual FASTQ files: 3 R1 and the 3 corresponding paired-end R2):

$ ./kmergenie -k 40 -l 16 -s 2 -t 16 -o 16-40histo merged.fq.gz

and got these results:

Predicted best k: 40
Predicted assembly size: 772582595 bp

Then I ran

$ ./kmergenie -k 60 -l 42 -s 2 -t 16 -o 42-60histo merged.fq.gz

which gave:

Predicted best k: 60
Predicted assembly size: 802168662 bp

Then I ran

$ ./kmergenie -k 120 -l 60 -s 10 -t 16 -o 60-120histo merged.fq.gz

and obtained:

Predicted best k: 120
Predicted assembly size: 831614678 bp

Here are the graphs created by kmergenie: https://drive.google.com/file/d/0Bx7ogD6DesvEWE1MWVNQU0pGN00/view?usp=sharing

Besides, the k-mer multiplicity graphs seem okay, with a clear second peak that decreases from around 27 to around 8 as k increases from 16 to 120.

I don't know why the suggested best k keeps going higher. Is maximizing the total number of distinct k-mers always the right choice, or could there be something wrong with our data or with any parameter used with kmergenie? Our paired-end data are from a MiSeq, with read lengths ranging from 35 to 310 bp and a very sharp peak at 310 bp.

Any suggestions and help are welcome, thank you very much!

Phuong

kmergenie • 4.3k views
8.7 years ago
Lesley Sitter ▴ 600

I have used KmerGenie a number of times on HiSeq paired-end reads, and it's not that strange that the suggested k is high.

KmerGenie takes a certain k-mer size, for example starting at 16, and looks at how many distinct k-mers it finds at that size. Because the k-mer is small, the sequences obtained in that window are short and easily match each other, so you get a low number of distinct k-mers.

When the k-mer size becomes too large, the count starts to drop again, for two reasons: some reads are shorter than k, so no k-mer can be extracted from them at all (for example, if your read is 60 bp and your k is 100, nothing can be extracted because the read is too short), and a larger k leaves fewer k-mer positions per read, so each k-mer becomes more unique but there are fewer of them.

Somewhere in between is the peak.
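
To make that trade-off concrete, here is a minimal Python sketch (toy read lengths, not the actual data) of how many k-mers a set of reads contributes at different k:

# Each read of length L contributes max(L - k + 1, 0) k-mers:
# short reads stop contributing once k exceeds their length.
read_lengths = [35, 60, 150, 310, 310]  # hypothetical MiSeq-like mix

for k in (16, 40, 100, 120, 250):
    total = sum(max(L - k + 1, 0) for L in read_lengths)
    print(f"k={k}: {total} k-mers extracted")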

Keeping that in mind, it is perfectly normal to get these graphs when you have longer reads. You use ~300 bp reads, so if you want to see the peak, just run KmerGenie somewhere around k = 100-250 and you will probably see it peak there.

Whether it is wise to actually use such high k-mers (and whether your assembler accepts them) is another question that I unfortunately cannot answer. Some assemblers have a maximum possible k, and even then I don't know whether assembly quality improves with such high k values.


I agree with this answer. To complete it: yes, we've found it is wise to use such high k values when kmergenie suggests them, as they allow better resolution of repeats.


Thanks a lot!

8.7 years ago
Rayan Chikhi ★ 1.5k

I've taken a look at your histograms and will answer some of the points the (nice) answer above didn't cover.

Is maximizing the total number of distinct k-mers always the right choice?

We observed that it generally is. You can verify this in your case by running an assembler twice (with a small and a large k) and comparing the assembly quality.

Could there be something wrong with our data or with any parameter used with kmergenie?

Nope, your command lines and result histograms look normal.

What is the expected genome coverage of your library? Generally, the higher the coverage, the larger the k value for assembly should be.

It looks like you have high coverage, and in fact, you could test for k values well above 120. Try running kmergenie with -k 310; you might need to recompile it (make clean && make k=310).
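
For example (the -l and -s values below are just one plausible choice, following the pattern of the commands earlier in this thread):

$ make clean && make k=310
$ ./kmergenie -k 310 -l 130 -s 10 -t 16 -o 130-310histo merged.fq.gz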


Thank you very much for your answers! I will test it with higher k.

Best regards.


Hi Rayan,

I have obtained the best k suggested by kmergenie, at k=127.

However, I observe that the peak multiplicity in the 127-mer histogram is at 7 (this is the k-mer coverage depth), from which I could calculate the base coverage = 7*300/(300-127+1) = 12.1. Meanwhile, if I run k=29 for example, it shows a peak multiplicity at 25, giving base coverage = 25*300/(300-29+1) = 27.6.
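
For reference, a minimal Python sketch that just re-does this arithmetic:

def base_coverage(M, L, k):
    # N = M * L / (L - k + 1): peak multiplicity scaled by read length
    # over the number of k-mers per read
    return M * L / (L - k + 1)

print(base_coverage(7, 300, 127))   # ~12.1
print(base_coverage(25, 300, 29))   # ~27.6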

So the k we choose significantly affects the base coverage depth? But as I understand it, base coverage is an intrinsic characteristic of our read data, indicating how deeply the genome bases are covered by the reads, so it should be a fixed value for our raw data, shouldn't it?

I would be grateful for any suggestions and explanation from you, thank you very much!

Phuong.


Hi Phuong,

I'm not sure that this base coverage formula is accurate.

First, the peak multiplicity is not necessarily the mean, as k-mer coverage is generally not a perfect Gaussian distribution (for example, due to repeats). Second, I'm not sure the number of read bases per k-mer is relevant here (the ratio of the read length to the number of k-mers in a read that you had in your formulas). Consider that many read bases were "discarded" because they were in erroneous k-mers.

What does it mean to compute a base coverage in that context? There is a total number of bases in correct k-mers, which can be computed as (total_number_of_correct_kmers * k) or (mean_multiplicity * number_of_distinct_kmers * k). This number of bases, divided by the estimated genome size, is in my opinion an estimate of the base coverage of the correct k-mers. In contrast, the classical base coverage (of read bases) is (number_of_reads * read_length / genome_size).
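
A small sketch that simply transcribes these two formulas (every number below is a placeholder, only to make the expressions concrete):

k = 127
mean_multiplicity = 7                    # placeholder
number_of_distinct_kmers = 800_000_000   # placeholder
estimated_genome_size = 800_000_000      # placeholder

# bases contained in correct k-mer occurrences, divided by genome size
correct_kmer_base_cov = mean_multiplicity * number_of_distinct_kmers * k / estimated_genome_size

# classical base coverage of read bases
number_of_reads = 40_000_000             # placeholder
read_length = 300
classical_base_cov = number_of_reads * read_length / estimated_genome_size

print(correct_kmer_base_cov, classical_base_cov)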


Thanks Rayan,

I adapted this calculation from aaronrjex's post in this thread: http://seqanswers.com/forums/showthread.php?t=10988

"You can relate this value to the actual coverage of your genome using the formula M = N * (L - k + 1) / L, where M is the mean k-mer coverage, N is the actual coverage of the genome, L is the mean read length, and k is the k-mer size.

L - k + 1 gives you the number of k-mers created per read.

So basically, the formula says that the k-mer coverage for a genome is equal to the mean read coverage multiplied by the number of k-mers per read, divided by the read length.

Because you know L (your mean read length) and k (the k-mer size you used to estimate peak k-mer coverage), and you've calculated M (soapdenovo comes with a script called kmerfreq that will do this), you simply solve the equation for N as:

N = M / ((L - k + 1) / L).

Once you have that, divide your total sequence data by N and you have your genome estimate."

So I am confused that with different k we obtain different N (12.1 or 27.6, as in the examples above), which leads to different genome size estimates (because our total sequence data is fixed).
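
To make the discrepancy concrete, a small Python sketch (the total-bases figure is a placeholder, not our real data):

def coverage_from_peak(M, L, k):
    # N = M / ((L - k + 1) / L), from the formula quoted above
    return M * L / (L - k + 1)

total_bases = 20_000_000_000  # placeholder for total sequence data

for M, k in ((7, 127), (25, 29)):
    N = coverage_from_peak(M, 300, k)
    print(f"k={k}: N = {N:.1f}x, genome estimate = {total_bases / N / 1e6:.0f} Mb")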

I am not sure whether this approach is correct. If possible, could you explain the formula kmergenie uses to produce its estimated genome size? Thanks a lot!

Phuong


Sequencing errors are not taken into account in this formula. Yet a significant fraction of k-mers in reads contain sequencing errors, are thus useless for assemblers, and are therefore discarded by kmergenie. Kmergenie estimates assembly size by attempting to count only non-erroneous k-mers, based on the frequency distribution of all k-mers. Its formula is just: estimated assembly size ~= number of non-erroneous k-mers. Note that kmergenie does not attempt to estimate genome size, because estimating the contribution of repetitive k-mers is trickier to handle.
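
A crude sketch of that idea (kmergenie's actual model fits the abundance histogram statistically; this naive valley cutoff only illustrates "count the non-erroneous k-mers"):

# histogram[m] = number of distinct k-mers observed with multiplicity m (toy counts)
histogram = {1: 5_000_000, 2: 900_000, 3: 200_000, 4: 150_000,
             5: 300_000, 6: 700_000, 7: 900_000, 8: 650_000, 9: 250_000}

ms = sorted(histogram)
# naive cutoff: the valley between the error peak at multiplicity 1
# and the genomic peak (here around multiplicity 7)
valley = min(ms[1:-1], key=lambda m: histogram[m])
estimate = sum(c for m, c in histogram.items() if m >= valley)
print(f"cutoff = {valley}, estimated assembly size ~= {estimate} non-erroneous k-mers")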


Thank you Rayan!
