Biostar Beta. Not for public use.
Read In Multiple Text Files to Calculate a Threshold Python
0
Entering edit mode
2.3 years ago

I have ~ 400 text files of BLASTp results.
P_1L_GRL111.out (contents)

NP_387917.1 ADZ06570.1 44.29 289 153 4 2 289 1 282 7e-77 236

I have a file that reads in files ALL my P_1 files, sorts the e values, and then calculates a threshold. I would like to convert this into a function. I have made some attempts to define a function that can read in all my files, but as of now, the files only reads ALL P_1 files, or ALL P_2 files. I also feel like the function is not sorting the values correctly.

def threshold_proteinsVStarget(protein_Files):
# List of Lactobacillus databases created from BLAST commands
Lactobacillus_DB = [ 'L_GRL1112', 'L_214','L_CTV-05','L_JV-V01','L_ST1','L_MV-1A', 'L_202-4','L_224-1','L_JV-V03',    'L_MV-22','L_DSM_13335', 'LactinV_03V1b', 'SPIN_1401G', 'UPII_143_D', 'L_1153','L_269-3', 'L_JV-V16','L_49540']

list_e_values = []
for prot in protein_Files:
    for db in Lactobacillus_DB:
        line = open (prot + db + '.out').readline()
        print line
        line2 = line.strip('\n')
        fields = line2.split('\t') 
        e_val = float(fields[10])
        list_e_values.append(e_val)
    return list_e_values

file1 = ['P_3']

print threshold_proteinsVStarget(file1)

Python E-Value • 1.7k views
ADD COMMENTlink
1
Entering edit mode
18 months ago
st.ph.n ♦ 2.5k
Philadelphia, PA

You don't specify what the threshold is. You don't have to specify the prefixes of the blast output. Look into glob, to read all the files in the directory, that end with .out, similarly if you were to do ls *.out . The code below will get you to a sorted list of e-values, from all files.

#!/usr/bin/env python
import glob
for file in glob.glob('*.out'):
    with open(file, 'r') as f:
        for line in f:
                    if float(line.strip().split('\t')[10]) < *X*:
                    print line.strip().split('\t')[0], '\t', line.strip().split('\t')[1], '\t', line.strip().split('\t')[10])
ADD COMMENTlink
0
Entering edit mode

This is very useful.... I was trying to work with the glob module, but I initially thought it would be best to combine ALL my files into one text file, then sort, then return the e-value...but that was not working properly.

import glob 
list_files = []
for file in glob.glob('/Users/sueparks/BlastP_Results/*'):

    myfile = open(file,'r')  #open each file
    lines = myfile.readlines() # read lines of each file

    with open('merge_BLASTP_results.out', 'a') as f:
        for line in lines:
            line.strip('\n')
            f.write(line)
ADD REPLYlink
1
Entering edit mode

Using python to merge your files is unnecessary. Just cat all your files into one, then run your code on it, then. cat *.out > allblast.out

Then you can use my above code, without the for loop with glob, and just open and read the file, append each evalue to a list, and sort it.

ADD REPLYlink
0
Entering edit mode

Thanks st.ph.n! My last question, if I write another function to check e value, how can I append the first two columns with the e value.

My last step:

if eval is less than threshold: # append fields [0], [1],[10] print NP_387917.1 ADZ06570.1 7e-77

ADD REPLYlink
0
Entering edit mode

See my updated answer. Replace X with the value you want. This will check the evalue as it loops through each line, instead of appending to list. If you need a list, use a tuple or dictionary:

evalues = []
if line.strip().split('\t')[10] < *X*:
        evalues.append((line.strip().split('\t')[0], line.strip().split('\t')[1], line.strip().split('\t')[10]))

for e in evalues:
        print e[0], '\t', e[1], '\t', e[2]
ADD REPLYlink
0
Entering edit mode

The updated answer seemed to do the trick! How could I transform the updated code into a function?

ADD REPLYlink
0
Entering edit mode

if all you're going to do is return a list from the function, you don't need to use a function. Just append everything to a list as int he comment. If you do need function, it should be trivial for you to do so. If this answer solves your problem, please accept the answer.

ADD REPLYlink
0
Entering edit mode
2.3 years ago

You are correct, I did not upload the entire code. I have a separate function for the threshold (the median calculation).

import subprocess, sys, math


def median(a):
a.sort()
if len(a)%2!=0:
    median=a[len(a)/2]
else:
    median=(float(a[len(a)/2]+a[(len(a)/2)-1]))/2
return median


# Define a function that reads in .out files to a list
def threshold_proteinsVStarget(protein_Files):

# List of Lactobacillus databases created from BLAST commands
Lactobacillus_DB = [ 'L_GRL1112', 'L_214','L_CTV-05','L_JV-V01','L_ST1','L_MV-1A', 'L_202-4','L_224-1','L_JV-V03', 'L_MV-22','L_DSM_13335', 'LactinV_03V1b', 'SPIN_1401G', 'UPII_143_D', 'L_1153','L_269-3', 'L_JV-V16','L_49540']

list_e_values = []
for prot in protein_Files:
    #print prot
    for db in Lactobacillus_DB:
        line = open (prot + db + '.out').readline()
        print line
        line2 = line.strip('\n')

        fields = line2.split('\t')
#print fields

        e_val = float(fields[10])
        list_e_values.append(e_val)
    return list_e_values


print median (list_e_values)
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1