how to parse blast output using biopython
5
1
Entering edit mode
10.0 years ago
Camilla ▴ 10

Dear all,

I am new to Biopython and have tried to parse my output blast file.

My script is:

from Bio.Blast import NCBIXML
result=open("blast_anoC.xml","r")
records= NCBIXML.parse(result)
item=next(records)
for alignment in item.alignments:
          for hsp in alignment.hsps:
                 if hsp.expect <0.01:
                         print('****Alignment****')
                         print('sequence:', alignment.title) 
                         print('length:', alignment.length)
                         print('score:', alignment.score)
                         print('gaps:', alignment.gaps)
                         print('e value:', hsp.expect)
                         print(hsp.query[0:90] + '...')
                         print(hsp.match[0:90] + '...')
                         print(hsp.sbjct[0:90] + '...')

When I run it, Biopython executes it but doesn't print nothing...Why?

Many thanks,
Camilla

biopython blast • 23k views
ADD COMMENT
0
Entering edit mode

Is there a reason you're using XML output? A tab-delimited output (-outfmt 6), would be much easier to parse with standard python, or Linux commands. You can also run BLAST from inside your python script with BioPython.

ADD REPLY
1
Entering edit mode

Biopython doesn't totally support the plain text parsing now, not enough stable, so XML is the best recommanded output for Biopython parsing.

ADD REPLY
0
Entering edit mode

So, how to extract the query coverage from XML output?

ADD REPLY
0
Entering edit mode

Hello, I'm trying to get also the features of the alignment (the one is written in the picture) but I can't figure out how to do it.

My file is a xml that corresponds to several blast sequences, so I also have the question how to distinguish between queries. For each query I would have to get the one with highest score from genomics sequences (no transcript) and the feature corresponding to the name of the transcript (name of the gene that codify for protein, in this example Apoptosis regulator BCL2).

Thank you in advance,

Miguel

Example BCL2

ADD REPLY
0
Entering edit mode

Please open a new question and reference this post (how to parse blast output using biopython) there. Do not add an answer unless you're answering the top level question.

ADD REPLY
4
Entering edit mode
10.0 years ago

Hi Camilla,

There are two BUGS in your code. score and gaps are not member of alignment object. Correct code should be:

print('score:', hsp.score)
print('gaps:', hsp.gaps)

Use dir() to check members:

>>> dir(item.alignments[0])
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'accession', 'hit_def', 'hit_id', 'hsps', 'length', 'title']
>>> dir(item.alignments[0].hsps[0])
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'align_length', 'bits', 'expect', 'frame', 'gaps', 'identities', 'match', 'num_alignments', 'positives', 'query', 'query_end', 'query_start', 'sbjct', 'sbjct_end', 'sbjct_start', 'score', 'strand']

And here is how your code would run if you fix the bugs (I am using it on my BLAST file):

>>> for alignment in item.alignments:
... for hsp in alignment.hsps:
... if hsp.expect < 0.01:
... print('****Alignment****')
... print('sequence:', alignment.title)
... print('length:', alignment.length)
... print('score:', hsp.score)
... print('gaps:', hsp.gaps)
... print(hsp.query[0:90] + '...')
... print(hsp.match[0:90] + '...')
... print(hsp.sbjct[0:90] + '...')
...
****Alignment****
('sequence:', u'gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X5, mRNA')
('length:', 878)
('score:', 412.0)
('gaps:', 11)
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAA...
||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ||||||||||||| |...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGA...
****Alignment****
('sequence:', u'gi|568824605|ref|XM_006466625.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X4, mRNA')
('length:', 911)
('score:', 412.0)
('gaps:', 11)
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAA...
||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ||||||||||||| |...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGA...
****Alignment****
('sequence:', u'gi|568824603|ref|XM_006466624.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X3, mRNA')
('length:', 894)
('score:', 412.0)
('gaps:', 11)
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAA...
||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ||||||||||||| |...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGA...
****Alignment****
('sequence:', u'gi|568824601|ref|XM_006466623.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X2, mRNA')
('length:', 922)
('score:', 412.0)
('gaps:', 11)
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATATCAATGAGCTTAAAA...
||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ||||||||||||| |...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATTTCAATGAGCTTAAGA...

Best Wishes,
Umer

ADD COMMENT
1
Entering edit mode
10.0 years ago
Peter 6.0k

Your code only looks at the first BLAST query, and presumably that gave no matches (under the e-value threshold). You could replace ``item=next(records)`` with ``for item in records:`` (and indent the rest of the code) to try looking at all the records.

ADD COMMENT
0
Entering edit mode
10.0 years ago

Maybe the threshold value in the if statement is too small.

ADD COMMENT
0
Entering edit mode
10.0 years ago
Camilla ▴ 10

I also tried with "hsp.expect <100"

...but is the same

:(

ADD COMMENT
0
Entering edit mode

did the blast return any significant hits?

ADD REPLY
0
Entering edit mode
7.2 years ago

I used the same script above to parse my blast output in XML. The error in the script was the line item=next(records). Instead I used item=records.next() and I got the output on screen for the same e-value without any compromise. Maybe you should try this!

ADD COMMENT
0
Entering edit mode

I think you are 2 years late.. And everything about biopython is writen here

ADD REPLY

Login before adding your answer.

Traffic: 2346 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6