Question

calling a function using glob.glob

6.8 years ago

bio90029 ▴ 10

Hi, hopefully someone can help me with this. I have prepared a script to extract data from a file; this part works very well and does what I need. The problem comes when I am using glob.glob, and subprocess to call the function. I keep getting the error message below, and I do not know how to handle it. Error message:

File "parsing_blast.py", line 45, in <module>
    my_file=subprocess.Popen(cmd)
File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
File "/usr/lib64/python2.6/subprocess.py", line 1238, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Thanks for your help.

from Bio.Blast import NCBIXML
from Bio import SeqIO, SearchIO
import sys, glob, subprocess, os

folders = glob.glob('/home/me/my_folder/H*')
print folders
for folder in folders:
    my_files=glob.glob(folder + '/*.xml')
    print my_files
    def parsing_blast():

        results_handle=open(my_files[0])
        blast_results=NCBIXML.parse(results_handle)
        #blast_results=NCBIXML.parse(results_handle)
        output_handle=open(folder + ' my_data_parse.xml','w')   
        #to extract some information from the blast file
        for blast_result in blast_results:
            sequence_length=blast_result.query_letters #this is the length of the sequence
            gene=blast_result.query #gene name
            #print 'The length is:', sequence_length #check point
            #print gene         #check point
            for description in blast_result.descriptions:
                title=description.title  #query seq name
                #print description.title #check point
                for alignment in blast_result.alignments:               
                    for hsp in alignment.hsps:
                        identity=hsp.identities #matching bases
                        num_gaps=hsp.gaps       #number of gaps
                        #print identity         #check point
                        #print num_gaps         #check point

                        per_identities=float(identity)/float(sequence_length)*float(100) 
                        #print per_identities   #check point
                        #sys.exit()

                        extracted_data= (gene + ',' + title + ','+ 'number_gaps: ' + str(num_gaps) +','+ 'per_identity: '+ str(per_identities) +'\n')

                        output_handle.write(extracted_data)
        output_handle.close()               

                    #sys.exit()   
    parsing_blast()



    print 'The file has been created'

biopython python glob.glob • 3.5k views

The problem comes when I am using glob.glob, and subprocess to call the function

Why do you use subprocess to call the function?!

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

I have a hundred files, all starting by H, and in all of them I have an xml file I would like to parse. I do not want to do it one by one. So I want the script to extract the information I need from the file within each H* folder and store it in another file. Once the file has been created in one folder, the script should move on to the next folder, and so on. I used glob.glob and subprocess before but within a function. I just wanted to use them from outside the function so I could add another function.

ADD REPLY • link 6.8 years ago by bio90029 ▴ 10

I used glob.glob and subprocess before but within a function.

As far as I understood this is an entirely different use case. Instead, you need the multiprocessing module for parallelizing a function across many files.
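As a rough sketch of what that could look like (Python 3): parse_one below is a hypothetical stand-in that only returns each file's size, so the example runs without Biopython, and the pool size of 4 is an arbitrary assumption. The real parsing code would replace the body of parse_one.

```python
import glob
import os
from multiprocessing import Pool


def parse_one(xml_path):
    """Hypothetical stand-in for the real parsing function.

    Returns the file path and its size in bytes; the actual BLAST
    parsing would go here instead.
    """
    return xml_path, os.path.getsize(xml_path)


def parse_many(xml_paths, processes=4):
    """Apply parse_one to every path using a pool of worker processes."""
    if not xml_paths:
        return []  # nothing to do, so do not start any workers
    with Pool(processes=processes) as pool:
        return pool.map(parse_one, xml_paths)


if __name__ == "__main__":
    # Path pattern taken from the question; adjust to your own layout.
    xml_files = sorted(glob.glob("/home/me/my_folder/H*/*.xml"))
    for path, size in parse_many(xml_files):
        print(path, size)
```

For a hundred small files a plain for loop is probably fast enough; a pool only pays off when parsing each file takes noticeable time.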

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

Hi, I have re-edited the script, and now it works perfectly. But now I need to find out how to tell the program to store the created file inside each H folder. Any help with that, please?

ADD REPLY • link 6.8 years ago by bio90029 ▴ 10

I edited my answer to set the output_handle file to a file within the XML file source directory. Is that what you meant?

ADD REPLY • link 6.8 years ago by steve ★ 3.5k

score 2 · Answer 1 · 2017-06-21

6.8 years ago

Dan D 7.4k

The error message you're seeing is unrelated to your use of the glob function.

This line:

cmd=['parsing_blast']

is equivalent to typing

parsing_blast

on the command line. There's apparently no executable by that name available.

Are you trying to asynchronously call the parsing_blast function you've defined?
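To illustrate the difference, here is a small sketch (Python 3). The body of parsing_blast is a stand-in, and it assumes no program called parsing_blast is installed on your PATH.

```python
import errno
import subprocess


def parsing_blast():
    """Stand-in for the function defined in the question's script."""
    return "parsed"


def try_as_command():
    """Try to run 'parsing_blast' as an external program, as Popen does."""
    try:
        subprocess.Popen(["parsing_blast"])
    except OSError as exc:  # raised as FileNotFoundError on Python 3
        return exc.errno
    return None


# Popen searches for an executable file, not for a Python function,
# so this reports the same "No such file or directory" condition.
print(try_as_command() == errno.ENOENT)

# A function defined in the same script is simply called by name.
print(parsing_blast())
```

So if the goal is only to run the function for each file, calling it directly inside the loop is enough and subprocess is not needed.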

Some quick feedback while I'm looking at your code:

You can simplify your glob query:

my_files=glob.glob('/home/me/my_folder/H*/*.xml')

And it's more efficient to define your function outside of the loop. Else you're recreating it with every loop iteration, which seems unnecessary unless I'm missing something.
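A minimal sketch of both suggestions together (Python 3), which also writes each result next to its source file. The body of parsing_blast is a placeholder, and the output name my_data_parse.csv is my assumption: an output file ending in .xml inside the same folder would be matched by the *.xml pattern the next time the script runs.

```python
import glob
import os


def parsing_blast(xml_path, output_path):
    """Placeholder for the parsing logic; defined once, outside the loop."""
    with open(output_path, "w") as output_handle:
        output_handle.write("parsed: " + os.path.basename(xml_path) + "\n")


def parse_all(parent_dir):
    """Run parsing_blast on every .xml file found inside the H* folders."""
    outputs = []
    pattern = os.path.join(parent_dir, "H*", "*.xml")
    for xml_path in sorted(glob.glob(pattern)):
        # Write the result into the same folder as the source XML file.
        output_path = os.path.join(os.path.dirname(xml_path), "my_data_parse.csv")
        parsing_blast(xml_path, output_path)
        outputs.append(output_path)
    return outputs


# Parent folder taken from the question; returns an empty list if it is absent.
for created in parse_all("/home/me/my_folder"):
    print("The file has been created:", created)
```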

Thanks. I had tried putting the whole path in my_files, but I could not make it work. I would like the script to parse the xml files in my H* folders. I have a hundred folders, and each contains an xml file. If I run just the function in a single folder it works perfectly; I am trying to get the script to extract the data from one folder after another. Thanks for your help.

ADD REPLY • link 6.8 years ago by bio90029 ▴ 10

I have re-edited the script, and now it is working. Now I need to find out how to store the created file inside each of my H folders.

ADD REPLY • link 6.8 years ago by bio90029 ▴ 10

score 1 · Answer 2 · 2017-06-22

In addition to the others' comments, if I were to try to accomplish the task you've described:

I have a hundred files, all starting by H, and in all of them I have an xml file I would like to parse.

I would use a script like this:

#!/usr/bin/env python

import os

def find_H_dirs(parent_dir):
    '''
    Find all the dirs in the parent_dir that start with H
    '''
    matches = []
    for item in os.listdir(parent_dir):
        item_path = os.path.join(parent_dir, item)
        if os.path.isdir(item_path) and item.startswith("H"):
            matches.append(item_path)
    return(matches)

def find_XML_files(dir):
    '''
    Find all the .xml files in a dir
    '''
    matches = []
    for item in os.listdir(dir):
        item_path = os.path.join(dir, item)
        if os.path.isfile(item_path) and item.endswith(".xml"):
            matches.append(item_path)
    return(matches)

def process_XML_file(XML_file, output_handle):
    '''
    Do a thing to the XML file
    '''
    print("Put your code for processing the {0} file here.".format(XML_file))


parent_dir = "/path/to/parent_dir"
# output_handle = "/path/to/my_data_parse.xml" # if you want it to always go to the same file

H_dirs = find_H_dirs(parent_dir = parent_dir)

for H_dir in H_dirs:
    output_handle = os.path.join(H_dir, "my_data_parse.xml")
    for XML_file in find_XML_files(dir = H_dir):
        process_XML_file(XML_file = XML_file, output_handle = output_handle)

It may be technically less efficient, but it is much simpler to write and understand, and will be easier to expand and re-use in the future.

edit: updated output_handle as per request in the comments

score 0 · Answer 3 · 2017-06-21

6.8 years ago

Rodrigo ▴ 190

It seems the problem is that cmd = ['parsing_blast'] names a Python function, not an executable program. The subprocess module is for spawning processes and working with their input/output, not for running functions, as explained here.
