Question

Representing a sequence as k-mer composition

6.9 years ago

mdurrant ▴ 10

Hello,

Could someone please explain this paragraph from McHardy et al., 2007:

Compositional sequence patterns.
For compositional feature analysis, we map a given piece of DNA sequence to a higher-dimensional space of nucleotide patterns o = {o1, o2, ..., oq}, where o is defined by the pattern length w and the number of literals l. In this space, s is represented by the compositional input vector v = (a1, a2, ..., aq); where ai is the frequency of pattern oi in s. Input vectors are normalized by the total number of patterns for each sequence.

I specifically want to understand how they would generate o from a given pattern length w and number of literals l. How would this be applied to an example DNA sequence?

sequence DNA kmer • 1.7k views

updated 6.9 years ago by marsvetlana ▴ 10 • written 6.9 years ago by mdurrant ▴ 10

score 1 · Answer 1 · 2017-06-12

Hi! w is the "word" length and l is the number of "letters" in the "alphabet". DNA usually has a four-letter alphabet (a, c, t, g), so l = 4; w is chosen by the researcher. For example, with w = 2 and the alphabet {a, c, t, g} we have o = {aa, ac, at, ag, ca, cc, ct, cg, ta, tc, tt, tg, ga, gc, gt, gg}. In this case, any DNA sequence can be characterized by 16 numbers, each of which represents the frequency (or number of occurrences) of one of these patterns (k-mers/motifs/words). The number of possible patterns is l to the power of w, here 4^2 = 16; with w = 3 it would be 4^3 = 64.
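To make this concrete, here is a minimal Python sketch of the idea, not the code from the paper. The function names `kmer_patterns` and `composition_vector` are made up for illustration. It assumes overlapping windows on a single strand and skips windows containing letters outside the alphabet (e.g. n); the paper may handle these details differently.

```python
from collections import Counter
from itertools import product


def kmer_patterns(w, alphabet="acgt"):
    """Return all len(alphabet)**w patterns of length w."""
    return ["".join(p) for p in product(alphabet, repeat=w)]


def composition_vector(seq, w, alphabet="acgt"):
    """Return (patterns, normalized frequencies) for one sequence.

    Assumption: overlapping windows, one strand only.
    """
    seq = seq.lower()
    patterns = kmer_patterns(w, alphabet)
    # Count every overlapping window of length w in the sequence.
    counts = Counter(seq[i:i + w] for i in range(len(seq) - w + 1))
    # Only windows that match a pattern count toward the total,
    # so windows with other letters (e.g. 'n') are ignored.
    total = sum(counts[p] for p in patterns)
    freqs = [counts[p] / total if total else 0.0 for p in patterns]
    return patterns, freqs


patterns, v = composition_vector("acgtacgt", w=2)
for pattern, freq in zip(patterns, v):
    if freq > 0:
        print(pattern, round(freq, 3))
```

For "acgtacgt" with w = 2 there are 7 overlapping windows (ac, cg, gt, ta, ac, cg, gt), so ac, cg and gt each get 2/7, ta gets 1/7, and the other 12 entries of the 16-long vector are 0.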
Hope it helps