How to normalize K-mer counts for genes of different length

0

8.5 years ago

jolespin ▴ 150

So I'm trying to look at k-mer frequencies for a bunch of genes and they are different lengths. If they were all the same length then counts would be a good measure. I'm going to normalize them by dividing each count by the length of the sequence. Is that the right way to do it? Is there another normalization method that is typically used for this type of analysis?

gene RNA-Seq kmer genome sequence • 2.9k views

updated 20 months ago by Ram 43k • written 8.5 years ago by jolespin ▴ 150

1

Well, to obtain frequencies, you should divide by the total number of counts. For a single gene, this will usually be L-k+1 (L = length of the gene, k = k-mer size). However, IMHO it is safer to sum up the counts and then divide by that sum. For example, if you do some unusual kind of counting, like counting only k-mers at even positions (for whatever reason!), and then divide by L-k+1, your normalized count vector would not sum to 1.
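To make the difference concrete, here is a minimal Python sketch. The function names (`kmer_counts`, `kmer_frequencies`), the `step` parameter, and the example sequence are made up for illustration and are not from any particular library. It counts k-mers with plain string slicing and divides by the summed counts rather than by L-k+1.

```python
from collections import Counter

def kmer_counts(seq, k, step=1):
    """Count k-mers in seq, starting at every `step`-th position (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))

def kmer_frequencies(counts):
    """Divide each count by the sum of all counts, so the result sums to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}

seq = "ACGTACGTAC"  # made-up example sequence, L = 10
k = 3

# Standard counting: every position, so the total equals L - k + 1.
all_counts = kmer_counts(seq, k)
freqs_all = kmer_frequencies(all_counts)

# Unusual counting: only even positions, so the total is smaller than L - k + 1.
even_counts = kmer_counts(seq, k, step=2)
freqs_even = kmer_frequencies(even_counts)

# Dividing the even-position counts by L - k + 1 instead does not sum to 1.
naive_even_sum = sum(n / (len(seq) - k + 1) for n in even_counts.values())
```

With standard counting both approaches agree, because the counts sum to L-k+1 anyway. With the even-position counting, `freqs_even` still sums to 1, while `naive_even_sum` does not.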

[EDIT:] Depending on your downstream analysis, you can also normalize the count vector to a length of 1 (Euclidean norm), i.e. divide each count by the square root of the sum of the squared counts.
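A rough sketch of the Euclidean-norm variant, again in plain Python. The function name `l2_normalize` and the example counts are assumptions for illustration only.

```python
import math
from collections import Counter

def l2_normalize(counts):
    """Scale a k-mer count vector to Euclidean length 1 (hypothetical helper)."""
    norm = math.sqrt(sum(n * n for n in counts.values()))
    if norm == 0:
        return {}
    return {kmer: n / norm for kmer, n in counts.items()}

# Made-up example counts for one gene.
counts = Counter({"ACG": 2, "CGT": 2, "GTA": 2, "TAC": 2})
unit = l2_normalize(counts)

# The squared entries of the normalized vector sum to 1; the entries themselves do not.
squared_sum = sum(v * v for v in unit.values())
```

Note that a vector normalized this way is no longer a frequency vector (its entries do not sum to 1), so which normalization fits depends on what you compute downstream, e.g. dot products or cosine similarity between genes.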

updated 20 months ago by Ram 43k • written 8.5 years ago by Manuel Landesfeind ★ 1.4k
