Question

Approximate Dna Pattern Matching In Python

6

Entering edit mode

12.3 years ago

Abhi ★ 1.6k

Hey guys

I am trying to find something already written in python which will allow me to do approximate pattern matching. So I have billions of query sequences which I want to match against just one search sequence or pattern on both strands allowing up to n mismatches.

Anything out there that would help.

Thanks! -Abhi

python • 11k views

ADD COMMENT • link updated 12.3 years ago by Andreas ★ 2.5k • written 12.3 years ago by Abhi ★ 1.6k

0

Entering edit mode

Hi Abhi,

Were you able to implement approximate searching?

Thanks, Jitendra

ADD REPLY • link 5.6 years ago by jeetsahu ▴ 10

Ram · Answer 1 · 2012-01-23

5

Entering edit mode

12.3 years ago

Frédéric Mahé ★ 3.2k

You should try MOODS: it's a suite of algorithms for matching position weight matrices (PWM) against DNA sequences. I use it on a daily basis, it's a very good piece of software written in C++ with interface for python (a simple import MOODS, and up you go!). The must difficult step is to convert your query sequences in PWM. Here is the function I use:

def primer2pwm(primer):
    """
    Write a primer sequence as a position weight matrix.
    """
    # Create 4 lists of length equal to primer's length.
    matrix = [[0] * len(primer) for i in range(4)]
    # List of correspondance IUPAC.
    IUPAC = {
        "A" : ["A"], 
        "C" : ["C"], 
        "G" : ["G"],        
        "T" : ["T"],        
        "U" : ["U"],        
        "R" : ["G", "A"], 
        "Y" : ["T", "C"], 
        "K" : ["G", "T"], 
        "M" : ["A", "C"], 
        "S" : ["G", "C"], 
        "W" : ["A", "T"], 
        "B" : ["C", "G", "T"], 
        "D" : ["A", "G", "T"], 
        "H" : ["A", "C", "T"], 
        "V" : ["A", "C", "G"], 
        "N" : ["A", "C", "G", "T"]
    }
    # Position of nucleotides in the PWM.
    dico = {"A" : 0,  "C" : 1, "G" : 2, "T" : 3}
    # Read each IUPAC letter in the primer.
    for index, letter in enumerate(primer):
        for nuc in IUPAC.get(letter):
            i = dico.get(nuc)
            matrix[i][index] = 1
    return matrix

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 12.3 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

Thanks for a quick reply. Just wondering if PWM model will be right in this case. I dont have any emperical observation score for my pattern ?

ADD REPLY • link 12.3 years ago by Abhi ★ 1.6k

0

Entering edit mode

No problem. Your PWM model will be filled with zeros and ones (for instance, if your first nucleotide is a G, your PWM will start with A = 0 ; C = 0 ; G = 1 ; T = 0).

ADD REPLY • link 12.3 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

Ok great and one more question. For reverse compliment matches do I need to construct to matches and subsequently calls moods twice for each query ?

ADD REPLY • link 12.3 years ago by Abhi ★ 1.6k

0

Entering edit mode

I think so. Anyway, its easy to reverse-complement a sequence with biopython. See http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc23 for examples.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.3 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

Is there any limitation on size of query sequence?

ADD REPLY • link 5.6 years ago by jeetsahu ▴ 10

score 2 · Answer 2 · 2012-01-23

2

Entering edit mode

12.3 years ago

Steve Lianoglou 5.2k

Perhaps it would also be helpful to use the "matching core" of cutadapt?

It's used to strip adapters from sequencing data, which sounds like a problem very much (in spirit) to what you are trying to solve.

ADD COMMENT • link 12.3 years ago by Steve Lianoglou 5.2k

0

Entering edit mode

Steve : I think I might just be able to use cutadapt as it. You gussed it right. I am trying to get rid of the linker which is somewhat similar to an adaptor but there are some edge cases where this method might not be directly applicable. I will try talking to authors and better understand the core.

ADD REPLY • link 12.3 years ago by Abhi ★ 1.6k

score 2 · Answer 3 · 2012-01-23

2

Entering edit mode

12.3 years ago

Istvan Albert 100k

There is a fast Levenshtein edit distance module for Python here:

http://code.google.com/p/pylevenshtein/

ADD COMMENT • link 12.3 years ago by Istvan Albert 100k

0

Entering edit mode

Albert : thanks for your comment. I think for this particular comparison using Levenshtein distance might not be appropriate as I dont want to allow indels and more so the length of the pattern and query strings are different so edit distance will always be > 1. This is based on my limited understanding. Feel free to correct me if I am wrong.

ADD REPLY • link 12.3 years ago by Abhi ★ 1.6k

score 1 · Answer 4 · 2012-01-25

Have a look at Pyhton's difflib. You can use it for approximate pattern matching.

An example taken from there:

s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75

The first argument to SequenceMatcher allows you to ignore certain characters, which might be handy for ambiguity characters like N. If you want to search on the reverse complement as well, then you will have to create it (e.g. using Biopython's reverse_complement())

Matching against billions of sequences will be slow though. Two additional functions (quick_ratio() and real_quick_ratio()) might help a bit, but if calling an external program is an option for you, then have a look at e.g. Mummer or similar programs.

Andreas