Question

MultiProcessing on SeqIO biopython objects

0

Entering edit mode

8.7 years ago

Maxime B ▴ 10

Hi, I want to parse a big fasta file (253721 sequences) in order to extract one sequence per specie present in my list (which consists of 86859 different species).

I wrote some code which works great using BioPython SeqIO and regexp, but for the amount of sequences and species, it will take days to run...)

I started to look for some way of parallelizing the work using multiprocessing in python, however, I can't really connect the examples (for example here) with my code...

Here is my not-parallelized code:

#!/usr/bin/env python
from Bio import Entrez
from Bio import SeqIO
import subprocess
import itertools
import re

def read_names (file_to_open) :
    name=[]
    names = open (file_to_open , "r")
    for I in names :
        name.append(i.rstrip('\n'))
    name=set(name)
    names.close()

    return(name)

names = read_names ("specie_to_ncbi.txt")
filename = "all_its.fa"
lowsize = 300
highsize = 1000

seqs=[]
seq_ids=[]
j = 0
for seq_record in (SeqIO.parse(filename, "fasta")) :
    j+=1
print "there are "+str(j)+" sequences in the fasta file"

i = 0
for seq_record in (SeqIO.parse(filename, "fasta")) :
    i+=1
    numb = str(round((i/j)*100,2))
    print numb+" % of the sequences already computed"
    longueur = (len(seq_record.seq))
    if (longueur > lowsize and longueur < highsize) :
         ident = (seq_record.description)
        for specie in names :
            if specie in str(ident) :
                myseq = str((seq_record.seq))
                seq_ids.append(str(seq_record.id)+"|"+str(specie))
                seqs.append(myseq)
                names.remove(specie)
                break




ITS =  open("very_simple_db_ITS.fa" , "w")
for i,s in zip(seq_ids, seqs) :
    ITS.write(">"+i)
    ITS.write("\n")
    ITS.write(s)
    ITS.write("\n")
ITS.close()
   </pre>

---
And here is my failed attempt to parallelize it using multiprocessing

---
<pre>
#!/usr/bin/env python
from Bio import Entrez
from Bio import SeqIO
from multiprocessing import Pool
import subprocess
import itertools
import re

def read_names (file_to_open) :
    name=[]
    names = open (file_to_open , "r")
    for I in names :
        name.append(i.rstrip('\n'))
    name=set(name)
    names.close()

    return(name)

names = read_names ("head_specie_to_ncbi.txt")
filename = "head_all_its.fa"
lowsize = 300
highsize = 1000

def calc(seq_record , specie_names) :
    dictio = {}

    filename = "all_its.fa"

    longueur = (len(seq_record.seq))
    if (longueur > lowsize and longueur < highsize) :
         ident = (seq_record.description)
        for specie in specie_names :
            if specie in str(ident) :
                myseq = str((seq_record.seq))
                seq_ids.append(str(seq_record.id)+"|"+str(specie))
                dictio[str(ident)] = myseq
                break
    return dictio

if __name__ == '__main__':
    pool = Pool()
    mydict = pool.map(calc(SeqIO.parse(filename, "fasta"), names))
    print str(mydict)

Edit : I reduced the computing time to 20 minutes by replacing the regex with "word in string" python function, but still wondering about the multiprocessing...

multiprocessing pool biopython • 4.7k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.7 years ago by Maxime B ▴ 10

0

Entering edit mode

You might mention why it failed and with what exception. From what I see you are passing a function call to the map method, which expects an uncalled function and a sequence over which to map that function.

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by Matt Shirley 10k

0

Entering edit mode

Also be aware that 'bee' in 'beetle' will return True as the string __contains__ method does substring matches and you probably are interested in word boundaries.

ADD REPLY • link updated 18 months ago by Ram 43k • written 8.7 years ago by Matt Shirley 10k