Back Translating Protein Sequence Without Ambiguities...
4
5
Entering edit mode
13.5 years ago
Badgerichard ▴ 50

(first question so be gentle...!)

I am wanting to do an exhaustive search over the human genome for DNA sequences that could encode a particular amino acid sequence. I got as far as back translating the amino acids into a DNA sequence with IUPAC ambiguity codes, but could not find a utility to disambiguate it (i.e. generate all possible sequences that could be encoded by the ambiguous sequence). I am intending to take the set of sequences and utilise a short read mapper that does not accept ambiguities, so need a set of discrete nucleotide sequences with no ambiguities.

Thanks in advance!

protein dna • 9.9k views
ADD COMMENT
6
Entering edit mode
13.5 years ago
Neilfws 49k

I don't know of a tool to generate all possible nucleotides from an ambiguous back-translation. In fact, it's quite a tricky computational problem.

Let's take this very short peptide: MVIL.

Using Bioperl's Bio::Tools::SeqPattern, we can back-translate and expand to a regular expression:

use Bio::Tools::SeqPattern;
my $pat     = 'MVIL';
my $pattern = Bio::Tools::SeqPattern->new(-seq =>$pat, -type =>'amino');
print $pattern->backtranslate->expand, "\n";
# result
ATGGT.AT[ACT][CT]T.

If you picture moving along that regular expression, one position at a time and splitting each time that there is more than one option, you can see that there are 4 x 3 x 2 x 4 splits, giving 96 possible combinations of bases for just a 4 residue peptide.

I'm surprised that there is not an algorithm to convert a regular expression into all possible string combinations. It should not be too hard to program, based on the example above, but I'd be tempted as Dan says to look at another approach which uses the ambiguity.

ADD COMMENT
0
Entering edit mode

Thanks for the code snippet (at least insured my bioperl install was working). I've understood that the . means [GATC] and can see logic. So what is left to do is right a little perl to iterate through all possible sequences...

ADD REPLY
0
Entering edit mode

Exactly, nicely explained

ADD REPLY
4
Entering edit mode
13.5 years ago
User 59 13k

I think you might want to try this approach: "Fast-Find: A novel computational approach to analyzing combinatorial motifs"

They index the 'target' DNA sequences you wish to search then allow you to run a degenerate search against those sequences to return matches.

This seems like it will achieve what you want to do, rather than write something that would have to 'disambiguate' into quite a large sequence space if you were to perform every possible back-translation (depending on the size of your peptide of course).

ADD COMMENT
0
Entering edit mode

+1 that looks useful. And pretty fast according to a quick read.

ADD REPLY
4
Entering edit mode
13.1 years ago
Jvb ▴ 50

Slightly long-winded approach in Python that does back-translation on ambiguous nucleotides instead of amino acids:

#!/usr/bin/env python

# 1. define codon table
codon_table = {
    'A': ('GCT', 'GCC', 'GCA', 'GCG'),
    'C': ('TGT', 'TGC'),
    'D': ('GAT', 'GAC'),
    'E': ('GAA', 'GAG'),
    'F': ('TTT', 'TTC'),
    'G': ('GGT', 'GGC', 'GGA', 'GGG'),
    'I': ('ATT', 'ATC', 'ATA'),
    'H': ('CAT', 'CAC'),
    'K': ('AAA', 'AAG'),
    'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
    'M': ('ATG',),
    'N': ('AAT', 'AAC'),
    'P': ('CCT', 'CCC', 'CCA', 'CCG'),
    'Q': ('CAA', 'CAG'),
    'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
    'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
    'T': ('ACT', 'ACC', 'ACA', 'ACG'),
    'V': ('GTT', 'GTC', 'GTA', 'GTG'),
    'W': ('TGG',),
    'Y': ('TAT', 'TAC'),
    '*': ('TAA', 'TAG', 'TGA'),
}

# 2. create dictionary of base possibilities mapped to ambiguous bases 
ambiguous_bases = {
    frozenset(['A']): 'A',
    frozenset(['C']): 'C',
    frozenset(['G']): 'G',
    frozenset(['T']): 'T',
    frozenset(['U']): 'T',
    frozenset(['A', 'G']): 'R',
    frozenset(['C', 'T']): 'Y',
    frozenset(['G', 'C']): 'S',
    frozenset(['A', 'T']): 'W',
    frozenset(['G', 'T']): 'K',
    frozenset(['A', 'C']): 'M',
    frozenset(['C', 'G', 'T']): 'B',
    frozenset(['A', 'G', 'T']): 'D',
    frozenset(['A', 'C', 'T']): 'H',
    frozenset(['A', 'C', 'G']): 'V',
    frozenset(['A', 'C', 'G', 'T']): 'N',
}

# 3. naively determine ambiguous codons from ambiguous bases
ambiguous_codon_table = {}
for amino_acid, codons in codon_table.iteritems():
    ambiguous_codon = ''
    for i in range(3): # len(codon)
        bases = frozenset([codon[i] for codon in codons])
        ambiguous_codon += ambiguous_bases[bases]
    ambiguous_codon_table[amino_acid] = (ambiguous_codon,)

# 4. correct the incorrectly determined codons from 2 
ambiguous_codon_table['L'] = ('TTR', 'CTN')
ambiguous_codon_table['R'] = ('CGN', 'AGR')
ambiguous_codon_table['S'] = ('TCN', 'AGY')
ambiguous_codon_table['*'] = ('TAR', 'TGA')

# backtranslation functions that work for any string with a codon_table that is
# a dict mapping characters to a sequence of strings.

def backtranslate_permutations(protein, codon_table=codon_table):
    '''Returns the number of back-translated nucleotide sequences for a protein
    and codon table combination.

    >>> protein = 'ACDEFGHIKLMNPQRSTVWY*'
    >>> backtranslate_permutations(protein)
    1019215872
    >>> backtranslate_permutations(protein, codon_table=ambiguous_codon_table)
    16
    '''
    permutations = 1
    for amino_acid in protein:
        permutations *= len(codon_table[amino_acid])
    return permutations

def backtranslate(protein, codon_table=codon_table):
    '''Returns the back-translated nucleotide sequences for a protein and codon 
    table combination.

    >>> protein = 'FVC'
    >>> len(backtranslate(protein))
    16
    '''
    # create initial sequences == list of codons for the first amino acid
    sequences = [codon for codon in codon_table[protein[0]]]
    for amino_acid in protein[1:]:
        # add each codon to each existing sequence replacing sequences
        # leaves (num_codons * num_sequences) for next amino acid 
        to_extend = sequences
        sequences = []
        for codon in codon_table[amino_acid]:
            for sequence in to_extend:
                sequence += codon
                sequences.append(sequence)
    return sequences

# 5. reverse ambiguous_bases to get a kind of codon table 
ambiguous_bases_reversed = dict((value, sorted(list(key))) for (key, value) in ambiguous_bases.iteritems())

# 6. Reuse the backtranslation functions with ambiguous nucleotides instead
def disambiguate(ambiguous_dna):
    '''Call backtranslate with ambiguous DNA and reversed ambiguous bases.

    >>> ambiguous_dna = 'ACGTRYSWKMBDHVN'
    >>> backtranslate_permutations(ambiguous_dna, ambiguous_bases_reversed)
    20736
    >>> len(disambiguate(ambiguous_dna))
    20736
    '''
    return backtranslate(ambiguous_dna, ambiguous_bases_reversed)

if __name__ == '__main__':
    import doctest
    doctest.testmod()
ADD COMMENT
0
Entering edit mode

great code! Just a small bug that affects the behavior of ambiguous_bases_reversed (when a T is given to that dict), to fix it, delete the following line:

frozenset(['U']): 'T',
ADD REPLY
0
Entering edit mode
13.5 years ago
Peter 6.0k

Have you ruled out using TBLASTN? It takes a protein query and compares it to a nucleotide database - see http://blast.ncbi.nlm.nih.gov/Blast.cgi

ADD COMMENT

Login before adding your answer.

Traffic: 2077 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6