DNA sequences degeneration
2
0
Entering edit mode
7.8 years ago
Chadi Saad ▴ 100

What's the fastest way to degenerate exact DNA sequences (motifs of length L) using IUPAC ambiguity code. In other words, how to obtain all the possible IUPAC motifs from DNA exact motifs. I search for an efficient algorithm.

For example

I have these motifs of length 3 :

  • AAA
  • ATA
  • TTA
  • TAA
  • GAA
  • CAA

And I have to obtain:

  • AWA (AAA + ATA)
  • WWA (AAA + ATA + TAA + TTA)
  • RAA (AAA + GAA)
  • WAA (AAA + TAA)
  • YAA (CAA + TAA)
  • NAA (AAA + TAA + CAA + GAA)
  • ...
DNA motif IUAPC • 2.5k views
ADD COMMENT
0
Entering edit mode
7.8 years ago

Make a lookup table. For only two possibilities per site this becomes a 4x4 symmetric table.

ADD COMMENT
0
Entering edit mode

thank you, can you give an example please ?

ADD REPLY
0
Entering edit mode
|   | A | C | T | G |
+---+---+---+---+---+
| A | A | M | W | R |
| C | M | A | Y | S |
| T | W | Y | T | K |
| G | R | S | K | G |

It seems that the table markdown doesn't render nicely. Anyway, there are many similar approaches, depending on your exact needs.

ADD REPLY
0
Entering edit mode

but it means that each position is studied separately (a lookup table for each position), no ?

ADD REPLY
0
Entering edit mode

There's only one lookup table. Yes, each position is set separately, but that's not a problem and you can then generate these for arbitrary lengths efficiently.

ADD REPLY
0
Entering edit mode
7.8 years ago

I would recommend translating to 4-bit with a linear lookup table, combining with OR, then using another linear lookup table to translate from 4-bit to IUPAC.

A -> 1
C -> 2
G -> 4
T -> 8

For example, the value at location of the ASCII value of 'A' should be 1.

So, A or C or T becomes:

1|2|8 = 11

The value at index 11 in the second lookup table is whatever the IUPAC symbol is for A or C or T. Specifically:

byte[] numberToIUPAC={
    'N','A','C','M','G','R','S','V', //0-7
    'T','W','Y','H','K','D','B','N', //8-15
};
ADD COMMENT
0
Entering edit mode

that's mean, that i have to compare each motif to all the other motifs, if the distance of hamming is equal to 1 (the 2 compared motifs differs by 1 letter) => get the corresponding IUPAC letter using your method ?

ADD REPLY
0
Entering edit mode

No. You process the first letter of all sequences at once. For example, if you have:

AAA
ACC
ACT
GCT

Then you get a motif like this:

First letter: 
A+A+A+G = 1|1|1|4 = 5 = R
Second letter:
A+C+C+C = 1|1|1|2 = 3 = M
Third letter:
A+C+T+T = 1|2|8|8 = 11 = H

So your motif is RMH.

Edit: I think, perhaps, I am not answering your original question, though. That's mainly because I cannot imagine a scenario in which it would be useful to obtain each degenerate sequence representing every possible subset of a set of sequences. What are you trying to accomplish?

ADD REPLY
0
Entering edit mode

for me each every possible exact sequence of a degenerated sequence should be represent. What I try to do, is to combine some over-represented exact motifs into some degenerated motifs. So it's not just a regular expression. thanks

ADD REPLY

Login before adding your answer.

Traffic: 2133 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6