Biostar Beta. Not for public use.
Group The Data In Python
1
Entering edit mode
2.1 years ago
User000 • 270

I have a huge input file like this,

con1    Traes4    99.26    135    1    0   139   543     1     135   5.0   279  1506   135
con13    Traes7    97.61    544    13    0   482   2113   200  544   0.0   101  2392  544
con13    Traes8    100.00    467    0    0   713   2113   1     467   0.0   901  2392   467
con15    EMT18    95.27    148    7    0   73     516    35    48    6.0   256   2560   148
con15    EMT29    89.86    148    15    0   73    516    100   148   3.0   243   2560   48


I am trying to extract contigs that have full lengths and % identity > 30, but I also allow +-10 aa differences,

start_to_end=(hsp[9]-hsp[8])+1
if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30


However, I want the data be grouped by line[0]. For example, con13 has Traes8 that fits my request, and Traes7 that does not, but since I want all conX be grouped, it is enough that one of con13 meets my request, my output should print both of con13 in result1 and non of con13 should ever appear in result2 file. Note: I am also interested to print single con1 since it meets my requirement. Note 2: con13 might have >20 different Traes. Here is my script, It prints results but separately without grouping by con, how can this be solved?

from itertools import groupby

with open('file.txt') as f1:
with open('result1.txt', 'w') as f2:
with open('result2.txt','w') as f3:
for k, g in groupby(f1, key=lambda x:x.split()[0]):
hits = []
for line in g:
hsp = line.split()
hsp[9], hsp[13], hsp[8], hsp[2]= int(hsp[9]), int(hsp[13]),int(hsp[8]),float(hsp[2])
hits.append(hsp)
print line.rstrip()
#for i in range(1,len(hits)):
start_to_end=(hsp[9]-hsp[8])+1
if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30:
f2.write(line.rstrip() + '\n')
else:
f3.write(line.rstrip() + '\n')


This is the result I get,Result1.txt

 con1    Traes4    99.26    135    1    0   139   543     1     135   5.0   279  1506   135
con13    Traes8    100.00    467    0    0   713   2113   1     467   0.0   901  2392   467


Result2.txt

con13    Traes7    97.61    544    13    0   482   2113   200  544   0.0   101  2392  544
con15    EMT18    95.27    148    7    0   73     516    35    48    6.0   256   2560   148
con15    EMT29    89.86    148    15    0   73    516    100   148   3.0   243   2560   48


But my desired outputs should be like this, Result1.txt

con1    Traes4    99.26    135   1      0   139   543    1     135   5.0   279  1506  135
con13    Traes7    97.61    544   13      0   482   2113    200  544   0.0  1010  2392  544
con13    Traes8    100.00   467  0      0   713   2113    1     467   0.0   901   2392  467


Result2.txt

con15    EMT18    95.27  148      7    0   73   516    35  48    6.0    256     2560  148
con15    EMT29    89.86  148     15    0   73   516   100 148   3.0   243     2560  148

1
Entering edit mode

As currently presented, this is just a programming question. Can you edit it a bit to make it clear how this relates to bioinformatics? Also, are you committed to using python or are you also open to using R (R has a number of functions that make splitting data like this and looking at the groups simple)?

0
Entering edit mode

I am trying to extract contigs that have full lengths, but I also allow +-10 aa differences, I just did not go into details, since calculations seem to be OK, and the file prints me results, but I am facing problem of grouping contigs and print them by group, as I explained above..so may be I am missing some important line in the code? I am not good in R, as I program only in Python, so I prefer using python, many thanks

0
Entering edit mode

OK, btw what is the purpose of "Result2" and why are both of the results from con15 in there? From the python code it looks like it should contain the Traes7 and EMT18 lines, since those 2 don't meet the filtering metrics that you set.

1
Entering edit mode

this is the point, the script I provided above does print Traes7, EMT18 and EMT29 in result2.txt, since all of them does not meet my requirement, but that is not my purpose, as I tried to explain above, if one of my contigs meet the requirement all other hits under that particular contig should be in result1.txt (even if others do not meet my requirement) so I need to group on line[0]. Later on, I extract only those single reads that meet my requirement from result1.txt and work with them separately. I need result2.txt, the rest of contigs as I perform other overlapping non overlapping operations on them.

1
Entering edit mode

Ah, your update has clarified things!

0
Entering edit mode

I am glad it is clearer now, it is not so easy to explain in few words..Anyway, my gut feeling tells me that the problem is because the program is reading group by line, so printing lines that meet my condition obviously, so may be I need to add a dictionary or set before grouping, not sure how to proceed, hope you guys can help me

0
Entering edit mode

You can look into Python Pandas dataframe groupby functionalities:

1) documentation: pandas.DataFrame.groupby

2
Entering edit mode
15 months ago
Freiburg, Germany

The following seems to be more what you're going for:

#!/usr/bin/env python
from itertools import groupby, tee
import csv

f = csv.reader(open("foo.txt", "r"), dialect="excel-tab")
of1 = csv.writer(open("results1.txt", "w"), dialect="excel-tab")
of2 = csv.writer(open("results2.txt", "w"), dialect="excel-tab")

for k,g in groupby(f, lambda x: x[0]) :
passed=0
g, g2 = tee(g)
for line in g :
if(float(line[2]) >= 30 and abs(int(line[13])-int(line[9])+int(line[8])+1) < 10) :
passed += 1
break

if(passed>0) :
for line in g2 :
of1.writerow(line)
else :
for line in g2 :
of2.writerow(line)


This will write an entire group to f1 if any of its members pass the filtering step. Otherwise, all of the members of the group will be written to f2. Note that in your original example, each of the 3 groups had at least one member passing the filter (EMT29 passes, so all of contig 15 would pass). Of importance is the usage of tee(). Once you start using an iterator (e.g., for line in g) you can't then rewind or reuse it. Ways around that would be to (1) store everything in an array yourself or (2) just use tee to make a copy of the iterator for you.

1
Entering edit mode

nice use of itertools! You can collapse the last bit depending on your aesthetic preference:

(of1 if passed else of2).writerows(g2)

0
Entering edit mode

0
Entering edit mode

THANK YOU SO MUCH, it works perfectly! I just had a moment of flash in my head and modified my script like this, so it works as well ))

from itertools import groupby

with open('file.txt') as f1:
with open('result1.txt', 'w') as f2:
with open('result2.txt','w') as f3:
contig = set([])
for k, g in groupby(f1, key=lambda x:x.split()[0]):
hits = []
for line in g:
hsp = line.rsplit()
hsp[9], hsp[13], hsp[8], hsp[2]= int(hsp[9]), int(hsp[13]),int(hsp[8]),float(hsp[2])
hits.append(hsp)
start_to_end=(hsp[9]-hsp[8])+1
if abs(hsp[13]-start_to_end) < 10 and hsp[2] >= 30:
if k in contig:
outfile = f2
else:
outfile=f3
for i in hits:
outfile.write('\t'.join(map(str,i)))
outfile.write('\n')

0
Entering edit mode

what do you think of the script from the memory usage point of view? I have > 15 mln reads, I guess using tee is more efficient in this case? p.s. I have never used tee

1
Entering edit mode

I imagine that the memory usage would be similar, since I expect tee() is making a similar sort of duplication in the background. One thing to change would be to not bother with a contig set, since you'll always just be interested in the last value anyway (the longer that gets, the slower things will get). It would be better to use a method like my passed=0 prior to for line in g and then just set passed=1 if it passes the filter (my passed +=1 is a relic of having an earlier iteration print out a status message so I could debug a silly mistake).