Script For Parsing A Biological Sequence From A Public Database In Python
2
4
Entering edit mode
13.0 years ago
Spyros ▴ 40

Greetings to the Biostar community,

I am currently following a bioinformatics module as part of a biomedical degree (I am basically a python newbie) and the following task is required as part of a Python programming assignment:

extract motif sequences (amino acid sequences, so basically strings in programmatic-speak, that have been excised from algorithms implementing a multiple sequence alignment and subsequently iterative database scanning to generate the best conserved sequences. The ultimate idea is to infer functional significance from such "motifs").

These motifs are stored on a public database in files which have multiple data fields corresponding to each protein (uniprot ID, Accession Number, the alignment itself stored in a hyperlink .seq file), currently one of which is of interest in this scope. The data field is called "extracted motif sets".

My question is how to go about writing a script that will essentially parse the "motif strings" and output them to a file. I have now coded the script so that it looks as follows (I don't write the results to files yet):

import os, re, sys, string 

printsdb = open('/users/spyros/folder1/python/PRINTSmotifs/prints41_1.kdat', 'r')

protname = None  
final_motifs = []

for line in printsdb.readlines():
    if line.startswith('gc;'):
            protname = line.lstrip()    
            #string.lower(name)  # convert to lowercase
            break

def extract_final_motifs(protname):

    """Extracts the sequences of the 'final motifs sets' for a PRINTS entry.
    Sequences are on lines starting 'fd;' A simple regex is used for retrieval"""

    for line in printsdb.readlines():
            if line.startswith('fd;'):
                    final_motifs = re.compile('^\s+([A-Z]+)\s+<')
                    final_motifs = final_motifs.match(line)
                    #print(final_motifs.groups()[0])
                    motif_dict = {protname : final_motifs}
                    break 
    return 

motif_dict = extract_final_motifs('ADENOSINER')
print(motif_dict)

The problem now is that while my code loops over a raw database file (prints41_!.kdat) instead of connecting to the public database using urllib module, as suggested by Simon Cockell below, the ouput of the script is simply "none" on the python shell, whereas it should be creating a list such as [AAYIGIEVLI, AAYIGIEVLI, AAYIGIEVLI, etc..]

Does anybody have any idea where the logic error is? Any input appreciated!!

biopython python parsing sequence • 5.4k views
ADD COMMENT
4
Entering edit mode

@spyros: If the input is in a "public database" why not link to samples of the data being processed. Also, it'd be of use to know you crossed posted this question here.

ADD REPLY
1
Entering edit mode

@blunders is right, a link to the data you are trying to parse would really help people to help you out.

ADD REPLY
0
Entering edit mode

@blunders: I originally posted the question on stackoverflow, where a member suggested I post it here as well since it is a bioinformatics-field forum, hence the double posting.

A link to one of the files on database: http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/PRINTS/DoPRINTS.pl?cmd_a=Display&qua_a=/Full&fun_a=Code&qst_a=ADENOSINER

What I would like is basically to parse the sequences (strings of capitals) under the field "FINAL MOTIF SETS" (it is the last data field) which in this case is divided into 6 sections according to "motif number". I need to collect these sequences.

ADD REPLY
0
Entering edit mode

IMHO, you should address it to BioStar community, not stackoverflow :)

ADD REPLY
0
Entering edit mode

Spyros, if your edited code is exactly as you've pasted here then I think I can see the problem... extract_final_motifs() doesn't return anything :)

ADD REPLY
0
Entering edit mode

... actually looking back over it, there are few little errors. (The best way to deal with complex problems is to break them down into the little tasks you have to do, and test each step as you build your solution)

ADD REPLY
0
Entering edit mode

Yes, there are a number of problems with this script, not least of which is the fact that extract_final_motifs() returns None (which means that will always be the output of your program, regardless of whether what precedes it works or not, which it doesn't). You should read about what break does, and when to use it appropriately http://docs.python.org/release/2.5.2/ref/break.html

ADD REPLY
3
Entering edit mode
13.0 years ago

What I would like is basically to parse the sequences (strings of capitals) under the field "FINAL MOTIF SETS"

here is a quick script that can get you started, note that the data is an HTML file that may or may not have a rigid enough structure for easy parsing

import re, urllib

# get the data
stream = urllib.urlopen('http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/PRINTS/DoPRINTS.pl?cmd_a=Display&qua_a=/Full&fun_a=Code&qst_a=ADENOSINER')

# forward to the region of interest
for line in stream:
    if line.startswith('FINAL MOTIF SETS'):
        break

patt = re.compile('^\s+([A-Z]+)\s+<')

for line in stream:
    m = patt.match(line)
    if m:
        print m.groups()[0]

It prints

AAYIGIEVLI
AAYIGIEVLI
AAYIGIEVLI
AAYIGIEVLI
AAYIGIEVLI
AAYIGIEVLI
AAYIGIEVLI
AAYISIEVLI
SVYITVELAI
SVYITVELVI
SVYIMVELAI
SVYITVELAI
WVYITVELAI
TTYIVLELII
ALYVALELVI
ALYVALELVI

...

ADD COMMENT
0
Entering edit mode

@Istivan: Thank you a lot for this, it does more or less exactly what I needed! So regexs are the way to go for tasks like this I guess, despite HTML source code having somewhat of a bad reputation when it comes to regexs... A loop that iterates over the database implementing the your code should now do the trick for extracting all motifs from database (with the possibility of badly written HTML code that won't generally bother a browser but might bother a regex implementation?) Thanks again!

ADD REPLY
2
Entering edit mode
13.0 years ago

A little bit of digging around in the PRINTS website found me the raw data. The .dat file that is distributed from there contains all the information in the webpage you link to, but in a slightly more regimented, plain text format. This should be simpler to parse, and all the information you are after will be in this single file.

Here's some (very crude) code:

def set_prints_dict():
    entry = None
    prints = {}
    fh = open('prints41_1.dat', 'r')
    for line in fh.readlines():
        if line.startswith('gc;'):
            if entry:
                prints[name] = entry
            name = line.rstrip().split()[1]
            entry = line
        else:
            entry += line
    return prints

def get_final_motifs(prints, entry):
    prints_entry = prints[entry]
    final_motifs = []
    for line in prints_entry.split('\n'):
        if line.startswith('fd;'):
            final_motifs.append(line.split()[1])
    return final_motifs

if __name__ == '__main__':
    printsdb = set_prints_dict()
    motifs = get_final_motifs(printsdb, 'ADENOSINER')

This will give you a list of motif sequences for the named PRINTS entry. I'm perfectly sure there are better, and more elegant ways of doing this.

The list motifs looks like this:

['AAYIGIEVLI', 'AAYIGIEVLI', 'AAYIGIEVLI', 'AAYIGIEVLI', 'AAYIGIEVLI', 'AAYIGIEVLI',
'AAYIGIEVLI', 'AAYISIEVLI', 'SVYITVELAI', 'SVYITVELVI', 'SVYIMVELAI', 'SVYITVELAI',
'WVYITVELAI', 'TTYIVLELII', 'ALYVALELVI', 'ALYVALELVI', 'ALYVALELVI', 'VTYITVEILI',
'VTYITVEILI', 'VTYITMEIFI', 'ITYITMEAAI', 'ASYIVFEIVI',...

PS - the code is in my BitBucket.

ADD COMMENT
0
Entering edit mode

@Simon Cockell: Thank you a lot for your input, I have tried a very similar implementation to yours (using regexs for actual protein sequence extraction) but my code has a logic error since it does not output the compiled motifs. I show my version with its issue above. Thanks again!

ADD REPLY

Login before adding your answer.

Traffic: 1567 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6