Question

Back Translating Protein Sequence Without Ambiguities...

5

Entering edit mode

13.5 years ago

Badgerichard ▴ 50

(first question so be gentle...!)

I am wanting to do an exhaustive search over the human genome for DNA sequences that could encode a particular amino acid sequence. I got as far as back translating the amino acids into a DNA sequence with IUPAC ambiguity codes, but could not find a utility to disambiguate it (i.e. generate all possible sequences that could be encoded by the ambiguous sequence). I am intending to take the set of sequences and utilise a short read mapper that does not accept ambiguities, so need a set of discrete nucleotide sequences with no ambiguities.

Thanks in advance!

protein dna • 9.9k views

ADD COMMENT • link updated 10.5 years ago by Biostar 20 • written 13.5 years ago by Badgerichard ▴ 50

Ram · Answer 1 · 2010-10-26

I don't know of a tool to generate all possible nucleotides from an ambiguous back-translation. In fact, it's quite a tricky computational problem.

Let's take this very short peptide: MVIL.

Using Bioperl's Bio::Tools::SeqPattern, we can back-translate and expand to a regular expression:

use Bio::Tools::SeqPattern;
my $pat     = 'MVIL';
my $pattern = Bio::Tools::SeqPattern->new(-seq =>$pat, -type =>'amino');
print $pattern->backtranslate->expand, "\n";
# result
ATGGT.AT[ACT][CT]T.

If you picture moving along that regular expression, one position at a time and splitting each time that there is more than one option, you can see that there are 4 x 3 x 2 x 4 splits, giving 96 possible combinations of bases for just a 4 residue peptide.

I'm surprised that there is not an algorithm to convert a regular expression into all possible string combinations. It should not be too hard to program, based on the example above, but I'd be tempted as Dan says to look at another approach which uses the ambiguity.

score 4 · Answer 2 · 2010-10-26

4

Entering edit mode

13.5 years ago

User 59 13k

I think you might want to try this approach: "Fast-Find: A novel computational approach to analyzing combinatorial motifs"

They index the 'target' DNA sequences you wish to search then allow you to run a degenerate search against those sequences to return matches.

This seems like it will achieve what you want to do, rather than write something that would have to 'disambiguate' into quite a large sequence space if you were to perform every possible back-translation (depending on the size of your peptide of course).

ADD COMMENT • link 13.5 years ago by User 59 13k

0

Entering edit mode

+1 that looks useful. And pretty fast according to a quick read.

ADD REPLY • link 13.5 years ago by brentp 24k

Ram · Answer 3 · 2011-03-09

Slightly long-winded approach in Python that does back-translation on ambiguous nucleotides instead of amino acids:

#!/usr/bin/env python

# 1. define codon table
codon_table = {
    'A': ('GCT', 'GCC', 'GCA', 'GCG'),
    'C': ('TGT', 'TGC'),
    'D': ('GAT', 'GAC'),
    'E': ('GAA', 'GAG'),
    'F': ('TTT', 'TTC'),
    'G': ('GGT', 'GGC', 'GGA', 'GGG'),
    'I': ('ATT', 'ATC', 'ATA'),
    'H': ('CAT', 'CAC'),
    'K': ('AAA', 'AAG'),
    'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
    'M': ('ATG',),
    'N': ('AAT', 'AAC'),
    'P': ('CCT', 'CCC', 'CCA', 'CCG'),
    'Q': ('CAA', 'CAG'),
    'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
    'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
    'T': ('ACT', 'ACC', 'ACA', 'ACG'),
    'V': ('GTT', 'GTC', 'GTA', 'GTG'),
    'W': ('TGG',),
    'Y': ('TAT', 'TAC'),
    '*': ('TAA', 'TAG', 'TGA'),
}

# 2. create dictionary of base possibilities mapped to ambiguous bases 
ambiguous_bases = {
    frozenset(['A']): 'A',
    frozenset(['C']): 'C',
    frozenset(['G']): 'G',
    frozenset(['T']): 'T',
    frozenset(['U']): 'T',
    frozenset(['A', 'G']): 'R',
    frozenset(['C', 'T']): 'Y',
    frozenset(['G', 'C']): 'S',
    frozenset(['A', 'T']): 'W',
    frozenset(['G', 'T']): 'K',
    frozenset(['A', 'C']): 'M',
    frozenset(['C', 'G', 'T']): 'B',
    frozenset(['A', 'G', 'T']): 'D',
    frozenset(['A', 'C', 'T']): 'H',
    frozenset(['A', 'C', 'G']): 'V',
    frozenset(['A', 'C', 'G', 'T']): 'N',
}

# 3. naively determine ambiguous codons from ambiguous bases
ambiguous_codon_table = {}
for amino_acid, codons in codon_table.iteritems():
    ambiguous_codon = ''
    for i in range(3): # len(codon)
        bases = frozenset([codon[i] for codon in codons])
        ambiguous_codon += ambiguous_bases[bases]
    ambiguous_codon_table[amino_acid] = (ambiguous_codon,)

# 4. correct the incorrectly determined codons from 2 
ambiguous_codon_table['L'] = ('TTR', 'CTN')
ambiguous_codon_table['R'] = ('CGN', 'AGR')
ambiguous_codon_table['S'] = ('TCN', 'AGY')
ambiguous_codon_table['*'] = ('TAR', 'TGA')

# backtranslation functions that work for any string with a codon_table that is
# a dict mapping characters to a sequence of strings.

def backtranslate_permutations(protein, codon_table=codon_table):
    '''Returns the number of back-translated nucleotide sequences for a protein
    and codon table combination.

    >>> protein = 'ACDEFGHIKLMNPQRSTVWY*'
    >>> backtranslate_permutations(protein)
    1019215872
    >>> backtranslate_permutations(protein, codon_table=ambiguous_codon_table)
    16
    '''
    permutations = 1
    for amino_acid in protein:
        permutations *= len(codon_table[amino_acid])
    return permutations

def backtranslate(protein, codon_table=codon_table):
    '''Returns the back-translated nucleotide sequences for a protein and codon 
    table combination.

    >>> protein = 'FVC'
    >>> len(backtranslate(protein))
    16
    '''
    # create initial sequences == list of codons for the first amino acid
    sequences = [codon for codon in codon_table[protein[0]]]
    for amino_acid in protein[1:]:
        # add each codon to each existing sequence replacing sequences
        # leaves (num_codons * num_sequences) for next amino acid 
        to_extend = sequences
        sequences = []
        for codon in codon_table[amino_acid]:
            for sequence in to_extend:
                sequence += codon
                sequences.append(sequence)
    return sequences

# 5. reverse ambiguous_bases to get a kind of codon table 
ambiguous_bases_reversed = dict((value, sorted(list(key))) for (key, value) in ambiguous_bases.iteritems())

# 6. Reuse the backtranslation functions with ambiguous nucleotides instead
def disambiguate(ambiguous_dna):
    '''Call backtranslate with ambiguous DNA and reversed ambiguous bases.

    >>> ambiguous_dna = 'ACGTRYSWKMBDHVN'
    >>> backtranslate_permutations(ambiguous_dna, ambiguous_bases_reversed)
    20736
    >>> len(disambiguate(ambiguous_dna))
    20736
    '''
    return backtranslate(ambiguous_dna, ambiguous_bases_reversed)

if __name__ == '__main__':
    import doctest
    doctest.testmod()

score 0 · Answer 4 · 2010-10-28

0

Entering edit mode

13.5 years ago

Peter 6.0k

Have you ruled out using TBLASTN? It takes a protein query and compares it to a nucleotide database - see http://blast.ncbi.nlm.nih.gov/Blast.cgi

ADD COMMENT • link 13.5 years ago by Peter 6.0k