How to separate a BLAST output file (tabular format) into files containing the hits for each protein searched
5.0 years ago
Becca • 0

Hi there,

I was wondering if you could help me. I ran a multi-sequence blastp search, which generated an output.tsv file. I need to split that file into separate files, one per query protein, each containing all the hits for that protein: all the hits for protein 1 in one file, all the hits for protein 2 in another, and so on. I tried limiting the target sequences to 10 and then splitting the file by line number, so that 10 hits go into each file, but some proteins have only 3 or 4 hits, which throws off the split. And I have to do this in Python!

I am in dire need of some help!

I know you can parse a BLAST output file, but how would I then direct all the hits for each protein into a different file?

Any help would be really appreciated!! Thank you

blast python • 1.3k views

Providing an outline for a non-python solution.

Cut the first column out and then uniq that list to get the sequence IDs that have hits. Then use grep in a loop with the -w option to extract the lines that contain each ID.
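For reference, the same outline can be sketched in plain Python with no third-party packages. This is only a minimal sketch under a few assumptions: the query ID is the first tab-separated column, the whole file fits in memory, and the function name `split_blast_by_query` and the `.tsv` output names are made up for illustration. Characters such as `|` in IDs are replaced with `_` so the names are safe on any file system.

```python
import re
from collections import defaultdict
from pathlib import Path


def split_blast_by_query(tsv_path, out_dir):
    """Write one file per query ID; return a dict of query ID -> output path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 1 (the "cut" + "uniq" part): group every line under its first column.
    hits = defaultdict(list)
    with open(tsv_path) as src:
        for line in src:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment lines
            query_id = line.rstrip("\n").split("\t", 1)[0]
            hits[query_id].append(line)

    # Step 2 (the "grep -w" loop): write each group of lines to its own file.
    written = {}
    for query_id, lines in hits.items():
        # Hypothetical naming scheme: replace anything awkward in a file name.
        safe_name = re.sub(r"[^\w.-]", "_", query_id)
        path = out_dir / f"{safe_name}.tsv"
        path.write_text("".join(lines))
        written[query_id] = path
    return written
```

Calling `split_blast_by_query("output.tsv", "per_query")` would then leave one file per query in the `per_query` directory. Two IDs that differ only in replaced characters would end up sharing a file name, so check for that if your IDs are unusual.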

And I have to do this in Python!

Is this an assignment?


No, it's not an assignment. But I'm working with someone who only uses Python. I could do it if I didn't have to use Python, but using Python is confusing me slightly... Thank you for your input, though.


Is this standard -outfmt 6 tabular output? Are you Python-savvy or not?

It may be something like this (something I found on the web):

OR

https://www.reddit.com/r/bioinformatics/comments/4ef5p8/how_to_filter_blast_results_using_biopython/


Yeah maybe. I will have a look into that. I usually use pandas to make a dataframe to make plots and things.

Or I was thinking something like this:

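# Note: `blast` is not a standard-library module; this assumes a parser
# module that provides parse() and records with qid and hits attributes.
from blast import parse

# The with-statement closes the file automatically when the loop ends.
with open('blast.tsv') as fh:
    for blast_record in parse(fh):
        print('query id: {}'.format(blast_record.qid))
        for hit in blast_record.hits:
            for hsp in hit:
                print('****Alignment****')
                print('sequence:', hsp.sid)
                print('length:', hsp.length)
                print('e value:', hsp.evalue)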

Output would look like:

query id: cgl|CAGL0A00187g
****Alignment****
('sequence:', u'ecy|Ecym_8168')
('length:', 90)
('e value:', 0.0001)
****Alignment****
('sequence:', u'ecy|Ecym_8168')
('length:', 44)
('e value:', 0.0007)
****Alignment****
('sequence:', u'ecy|Ecym_4273')
('length:', 84)
('e value:', 0.64)

But I don't know how to direct all of that into a file... I'll have a think...
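One way to send that kind of report to files instead of the screen is the `file=` argument of `print()`. The sketch below is not the parser used above: it reads the tabular file directly with the standard `csv` module and assumes the default 12-column -outfmt 6 layout (subject ID in column 2, alignment length in column 4, e-value in column 11). The function name `write_hit_reports` and the numbered output names are hypothetical.

```python
import csv
from itertools import groupby


def write_hit_reports(tsv_path, out_prefix="hits_"):
    """Write one small text report per query; return the file names written."""
    written = []
    with open(tsv_path, newline="") as src:
        rows = (row for row in csv.reader(src, delimiter="\t") if row)
        # groupby only merges *consecutive* rows that share a query ID, which
        # matches BLAST tabular output where each query's hits sit together.
        groups = groupby(rows, key=lambda row: row[0])
        for number, (query_id, group) in enumerate(groups, start=1):
            name = f"{out_prefix}{number}.txt"
            with open(name, "w") as out:
                # file=out sends the printed text to the file, not the screen.
                print(f"query id: {query_id}", file=out)
                for row in group:
                    print("****Alignment****", file=out)
                    print("sequence:", row[1], file=out)
                    print("length:", row[3], file=out)
                    print("e value:", row[10], file=out)
            written.append(name)
    return written
```

Numbering the files avoids putting characters such as `|` from the query IDs into file names; the query ID itself is still recorded on the first line of each report.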


Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!


And to add to this, as soon as you have it in a DataFrame you could use something like the following loop (untested):

# Write the rows for each distinct query ID to its own tab-separated file.
for query in df["query_id"].unique():
    df.loc[df["query_id"] == query].to_csv("blast_{}.txt".format(query), sep="\t", index=False)
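For completeness, here is a rough sketch of getting the file into a DataFrame first. It assumes default -outfmt 6 output, which has 12 columns and no header row. The names in `COLUMNS` are chosen here so that the first one matches `query_id` above; `split_with_pandas` and the cleaned-up file names are likewise just illustrative.

```python
import re

import pandas as pd

# Assumed layout: the 12 default -outfmt 6 fields, with the first two renamed.
COLUMNS = ["query_id", "subject_id", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def split_with_pandas(tsv_path, out_prefix="blast_"):
    """Write one tab-separated file per query ID; return the file names."""
    # header=None because the tabular output has no header line of its own.
    df = pd.read_csv(tsv_path, sep="\t", header=None, names=COLUMNS)
    written = []
    # groupby yields each query ID together with the rows that belong to it.
    for query, hits in df.groupby("query_id", sort=False):
        # Replace characters such as "|" that are awkward in file names.
        safe = re.sub(r"[^\w.-]", "_", str(query))
        name = f"{out_prefix}{safe}.txt"
        hits.to_csv(name, sep="\t", index=False)
        written.append(name)
    return written
```

Like the loop above, this writes a header line with the column names into every output file; pass `header=False` to `to_csv` if the files should look exactly like the original BLAST rows.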