Approximate Dna Pattern Matching In Python
4
6
Entering edit mode
12.3 years ago
Abhi ★ 1.6k

Hey guys

I am trying to find something already written in python which will allow me to do approximate pattern matching. So I have billions of query sequences which I want to match against just one search sequence or pattern on both strands allowing up to n mismatches.

Anything out there that would help.

Thanks! -Abhi

python • 11k views
ADD COMMENT
0
Entering edit mode

Hi Abhi,

Were you able to implement approximate searching?

Thanks, Jitendra

ADD REPLY
5
Entering edit mode
12.3 years ago

You should try MOODS: it's a suite of algorithms for matching position weight matrices (PWM) against DNA sequences. I use it on a daily basis, it's a very good piece of software written in C++ with interface for python (a simple import MOODS, and up you go!). The must difficult step is to convert your query sequences in PWM. Here is the function I use:

def primer2pwm(primer):
    """
    Write a primer sequence as a position weight matrix.
    """
    # Create 4 lists of length equal to primer's length.
    matrix = [[0] * len(primer) for i in range(4)]
    # List of correspondance IUPAC.
    IUPAC = {
        "A" : ["A"], 
        "C" : ["C"], 
        "G" : ["G"],        
        "T" : ["T"],        
        "U" : ["U"],        
        "R" : ["G", "A"], 
        "Y" : ["T", "C"], 
        "K" : ["G", "T"], 
        "M" : ["A", "C"], 
        "S" : ["G", "C"], 
        "W" : ["A", "T"], 
        "B" : ["C", "G", "T"], 
        "D" : ["A", "G", "T"], 
        "H" : ["A", "C", "T"], 
        "V" : ["A", "C", "G"], 
        "N" : ["A", "C", "G", "T"]
    }
    # Position of nucleotides in the PWM.
    dico = {"A" : 0,  "C" : 1, "G" : 2, "T" : 3}
    # Read each IUPAC letter in the primer.
    for index, letter in enumerate(primer):
        for nuc in IUPAC.get(letter):
            i = dico.get(nuc)
            matrix[i][index] = 1
    return matrix
ADD COMMENT
0
Entering edit mode

Thanks for a quick reply. Just wondering if PWM model will be right in this case. I dont have any emperical observation score for my pattern ?

ADD REPLY
0
Entering edit mode

No problem. Your PWM model will be filled with zeros and ones (for instance, if your first nucleotide is a G, your PWM will start with A = 0 ; C = 0 ; G = 1 ; T = 0).

ADD REPLY
0
Entering edit mode

Ok great and one more question. For reverse compliment matches do I need to construct to matches and subsequently calls moods twice for each query ?

ADD REPLY
0
Entering edit mode

I think so. Anyway, its easy to reverse-complement a sequence with biopython. See http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc23 for examples.

ADD REPLY
0
Entering edit mode

Is there any limitation on size of query sequence?

ADD REPLY
2
Entering edit mode
12.3 years ago

Perhaps it would also be helpful to use the "matching core" of cutadapt?

It's used to strip adapters from sequencing data, which sounds like a problem very much (in spirit) to what you are trying to solve.

ADD COMMENT
0
Entering edit mode

Steve : I think I might just be able to use cutadapt as it. You gussed it right. I am trying to get rid of the linker which is somewhat similar to an adaptor but there are some edge cases where this method might not be directly applicable. I will try talking to authors and better understand the core.

ADD REPLY
2
Entering edit mode
12.3 years ago

There is a fast Levenshtein edit distance module for Python here:

http://code.google.com/p/pylevenshtein/

ADD COMMENT
0
Entering edit mode

Albert : thanks for your comment. I think for this particular comparison using Levenshtein distance might not be appropriate as I dont want to allow indels and more so the length of the pattern and query strings are different so edit distance will always be > 1. This is based on my limited understanding. Feel free to correct me if I am wrong.

ADD REPLY
1
Entering edit mode
12.3 years ago
Andreas ★ 2.5k

Have a look at Pyhton's difflib. You can use it for approximate pattern matching.

An example taken from there:

s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75

The first argument to SequenceMatcher allows you to ignore certain characters, which might be handy for ambiguity characters like N. If you want to search on the reverse complement as well, then you will have to create it (e.g. using Biopython's reverse_complement())

Matching against billions of sequences will be slow though. Two additional functions (quick_ratio() and real_quick_ratio()) might help a bit, but if calling an external program is an option for you, then have a look at e.g. Mummer or similar programs.

Andreas

ADD COMMENT

Login before adding your answer.

Traffic: 2971 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6