Representing a sequence as k-mer composition
6.9 years ago
mdurrant ▴ 10

Hello,

Could someone please explain this paragraph from McHardy et al., 2007:

Compositional sequence patterns.
For compositional feature analysis, we map a given piece of DNA sequence s to a higher-dimensional space of nucleotide patterns o = {o1, o2, ..., oq}, where o is defined by the pattern length w and the number of literals l. In this space, s is represented by the compositional input vector v = (a1, a2, ..., aq), where ai is the frequency of pattern oi in s. Input vectors are normalized by the total number of patterns for each sequence.

I specifically want to understand how they would generate o from a given pattern length w and number of literals l. How would this be applied to an example DNA sequence?

sequence DNA kmer • 1.7k views
6.9 years ago
marsvetlana ▴ 10

Hi! w is the "word" length and l is the number of "letters" in the "alphabet". DNA usually has four letters (a, c, t, g), so l = 4; w is chosen by the researcher. For example, with w = 2 and the alphabet {a,c,t,g} we have o = {aa,ac,at,ag,ca,cc,ct,cg,ta,tc,tt,tg,ga,gc,gt,gg}. In this case, any DNA sequence can be characterized by 16 numbers, each of which represents the frequency (or number of occurrences) of one of these patterns (k-mers/motifs/words). The number of possible patterns is q = l to the power of w (here 4^2 = 16), not w to the power of l; the two only happen to agree for w = 2 and l = 4.
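To make this concrete, here is a minimal Python sketch of how o and v could be built for an example sequence. It is only an illustration, not the code from McHardy et al.: the function names `kmer_patterns` and `composition_vector` are made up, and it assumes that patterns are counted in overlapping windows and that "normalized by the total number of patterns" means dividing by the number of windows counted in the sequence.

```python
from itertools import product


def kmer_patterns(alphabet="acgt", w=2):
    """Return all l**w patterns of length w, where l = len(alphabet)."""
    return ["".join(p) for p in product(alphabet, repeat=w)]


def composition_vector(s, alphabet="acgt", w=2):
    """Return (o, v): the pattern list and the normalized frequency vector.

    Assumption: overlapping windows are counted, and windows containing
    characters outside the alphabet (e.g. 'n') are skipped.
    """
    patterns = kmer_patterns(alphabet, w)
    counts = dict.fromkeys(patterns, 0)
    s = s.lower()
    total = 0
    for i in range(len(s) - w + 1):
        kmer = s[i:i + w]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        # Sequence shorter than w, or no valid windows: return all zeros.
        return patterns, [0.0] * len(patterns)
    return patterns, [counts[p] / total for p in patterns]


# Example: "acgtac" has the windows ac, cg, gt, ta, ac.
patterns, v = composition_vector("acgtac", w=2)
for pattern, freq in zip(patterns, v):
    if freq > 0:
        print(pattern, freq)
```

For the sequence "acgtac" with w = 2, only ac, cg, gt and ta get a non-zero value; the other 12 entries of v are 0, and the entries of v sum to 1.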
Hope it helps
