Degenerate Nucleotide Sequences
2
2
Entering edit mode
13.1 years ago
Max ▴ 20

Are there any programs available (Python, Perl, whatever), that can take a degenerate nucleotide sequence and translate it into its multiple possible oligos?

Any help would be appreciated.

sequence perl • 10k views
ADD COMMENT
0
Entering edit mode

can you give use a typical example ? because the simple sequence ATGCNNNNATGC would generate 256 oligos....

ADD REPLY
8
Entering edit mode
13.1 years ago

This program should do the job:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define RUN(base) copy[index]=base; recurs(seq,copy,index+1,len)

static void recurs(const char *seq,char* copy,int index,const int len)
   {
   if(index==len)
      {
      fwrite(copy,sizeof(char),len,stdout);
      fputc('\n',stdout);
      }
   else
      {
      switch(toupper(seq[index]))
           {
           case 'A':case 'T': case 'G': case 'C':  RUN(seq[index]); break;
           case 'N':RUN('A');RUN('T');RUN('G');RUN('C');break;
           case 'W':RUN('A');RUN('T');break;
           case 'S':RUN('G');RUN('C');break;
           case 'B':RUN('T');RUN('G');RUN('C');break;
           case 'D':RUN('A');RUN('T');RUN('G');break;
           case 'H':RUN('A');RUN('T');RUN('C');break;
           case 'V':RUN('A');RUN('G');RUN('C');break;
           case 'K':RUN('G');RUN('T');break;
           case 'M':RUN('A');RUN('C');break;
           case 'R':RUN('A');RUN('G');break;
           case 'Y':RUN('C');RUN('T');break;
           default: fprintf(stderr,"Bad base in %s (%c)\n",seq,seq[index]); exit(EXIT_FAILURE);break;
           }
      }
   }

int main(int argc,char** argv)
   {
   char* seq;
   int len,i;
   if(argc!=2)
      {
      fprintf(stderr,"Usage : %s <dna>",argv[0]);
      return EXIT_FAILURE;
      }
   seq=argv[1];
   len=strlen(seq);
   char* copy=malloc((len+1)*sizeof(char));
   if(copy==NULL)
      {
      fprintf(stderr,"Out of memory\n");
      exit(EXIT_FAILURE);
      }
   copy[len]='\0';
   recurs(seq,copy,0,len);
   free(copy);
   return 0;
   }

Compilation:

gcc -O3 -o prg prg.c

test

> ./prg ATGCTGATCGAGCTANATCGATCGGACTACY
ATGCTGATCGAGCTAAATCGATCGGACTACC
ATGCTGATCGAGCTAAATCGATCGGACTACT
ATGCTGATCGAGCTATATCGATCGGACTACC
ATGCTGATCGAGCTATATCGATCGGACTACT
ATGCTGATCGAGCTAGATCGATCGGACTACC
ATGCTGATCGAGCTAGATCGATCGGACTACT
ATGCTGATCGAGCTACATCGATCGGACTACC
ATGCTGATCGAGCTACATCGATCGGACTACT
ADD COMMENT
4
Entering edit mode

Now, does this show that Lars is faster than Pierre or that Perl is faster to code than C? (Sorry couldn't resist).

ADD REPLY
0
Entering edit mode

nice macro recursion!

ADD REPLY
0
Entering edit mode

Actually, you can compile just by

gcc -o prg prg.c

ADD REPLY
0
Entering edit mode

or even:

gcc -o prg.o -Wall -c prg.c  && gcc -o prg prg.o
ADD REPLY
3
Entering edit mode
13.1 years ago

Like Pierre said, this is a combinatorial problem that can quickly blow up in your face. However, that aside, it easy to generate all the combinations using a recursion. Here is a compact implementation in Perl:

#!/usr/bin/perl -w

# Lookup table of degenerate IUPAC nucleotide codes.
my %deg2nuc = (
    "R" => ["A", "G"],
    "Y" => ["C", "T"],
    "S" => ["G", "C"],
    "W" => ["A", "T"],
    "K" => ["G", "T"],
    "M" => ["A", "C"],
    "B" => ["C", "G", "T"],
    "D" => ["A", "G", "T"],
    "H" => ["A", "C", "T"],
    "V" => ["A", "C", "G"],
    "N" => ["A", "C", "G", "T"]
);

# Recursive function that replaces degenerate nucleotides with all combinations.
sub generate
{
    if ($_[0] =~ /(.*)([RYSWKBDHVN])(.*)/) {
        my $head = $1;
        my $tail = $3;
        my @seqs;
        foreach my $nuc (@{$deg2nuc{$2}}) {
            push @seqs, generate($head.$nuc.$tail);
        }
        return @seqs;
    }
    else {
        return $_[0];
    }
}

# Demo: print all sequences generated from ANCRG.
print join("\n", generate("ANCRG")), "\n";

EDIT:

On second thought, that was not at all compact by Perl standards. Here is the really compact version that ventures into code golf territory:

#!/usr/bin/perl -w

my %A = ("R"=>1, "W"=>1, "M"=>1, "D"=>1, "H"=>1, "V"=>1, "N"=>1);
my %C = ("Y"=>1, "S"=>1, "M"=>1, "B"=>1, "H"=>1, "V"=>1, "N"=>1);
my %G = ("R"=>1, "S"=>1, "K"=>1, "B"=>1, "D"=>1, "V"=>1, "N"=>1);
my %T = ("Y"=>1, "W"=>1, "K"=>1, "B"=>1, "D"=>1, "H"=>1, "N"=>1);

$_ = "ANCRG\n";
while (s/(.*)([RYSWKBDHVN])(.*)\n/(exists $A{$2} ? "$1A$3\n" : "").(exists $C{$2} ? "$1C$3\n" : "").(exists $G{$2} ? "$1G$3\n" : "").(exists $T{$2} ? "$1T$3\n" : "")/e) {}
print;
ADD COMMENT
0
Entering edit mode

PS. I said "compact", not "efficient" ;-)

ADD REPLY
0
Entering edit mode

cool code~I was shocked~

ADD REPLY
0
Entering edit mode

What happened to "M"?

ADD REPLY
0
Entering edit mode

This is amazing Lars. One humble request though. Would you be able to give a brief description of what you did here? I am a novice at perl and trying to understand the login behind it. Sorry for the trouble. Thank you once again for this.

ADD REPLY

Login before adding your answer.

Traffic: 2370 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6