Biopython BLAST or Blast parser returns "no"
6.6 years ago
yarmda ▴ 40

I'm trying to parse my BLAST results and keep only the hits whose gi identifier is not in a given list. However, parsing the file is not working as expected, and the result is not informative.

Sorry if this type of question has been posted before. Search engines don't do well with the term "no".

from Bio import SeqIO, SearchIO
from Bio.Blast import NCBIWWW, NCBIXML

# Forming the BLAST file. "record" is SeqIO.read("input.fasta", "fasta"), where "input.fasta" holds the
# sequence of the Bacillus anthracis strain with gi 1033843585

result_handle = NCBIWWW.qblast("blastn", "nt", record.seq, megablast=True)
with open("new_tmp_blast.xml", "w") as result:
    result.write(result_handle.read())  # returned 919480409 in my interactive session
result_handle.close()

#And the BLAST file is output in xml format, just like I wanted.

#Trying to parse the BLAST

>>> hits = []
>>> for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     if entry.alignments:
...         hits.append(entry.query.split()[0])
...
>>> hits
['No']
>>> for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     hits.append(entry.query.split()[0])
...
>>> hits
['No', 'No']
>>> for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     entry.query.split()[0]
...
'No'
>>> NCBIXML.parse(open("new_tmp_blast.xml"))
<generator object parse at 0x7fe6772430a0>

I have also tried using SearchIO and gotten identical results. I don't understand where the issue is.

Example of the BLAST result:

$ more new_tmp_blast.xml 

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>BLASTN 2.7.0+</BlastOutput_version>
  <BlastOutput_reference>Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14.</BlastOutput_reference>
  <BlastOutput_db>nt</BlastOutput_db>
  <BlastOutput_query-ID>Query_123951</BlastOutput_query-ID>
  <BlastOutput_query-def>No definition line</BlastOutput_query-def>
  <BlastOutput_query-len>5227292</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_sc-match>1</Parameters_sc-match>
      <Parameters_sc-mismatch>-2</Parameters_sc-mismatch>
      <Parameters_gap-open>0</Parameters_gap-open>
      <Parameters_gap-extend>0</Parameters_gap-extend>
      <Parameters_filter>L;m;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_123951</Iteration_query-ID>
  <Iteration_query-def>No definition line</Iteration_query-def>
  <Iteration_query-len>5227292</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|1033843585|gb|CP015779.1|</Hit_id>
  <Hit_def>Bacillus anthracis strain Tangail-1, complete genome</Hit_def>
  <Hit_accession>CP015779</Hit_accession>
  <Hit_len>5227292</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>9652680</Hsp_bit-score>
      <Hsp_score>5227132</Hsp_score>
      <Hsp_evalue>0</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>5227292</Hsp_query-to>
      <Hsp_hit-from>1</Hsp_hit-from>
      <Hsp_hit-to>5227292</Hsp_hit-to>
      <Hsp_query-frame>1</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>5227292</Hsp_identity>
      <Hsp_positive>5227292</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>5227292</Hsp_align-len>
      <Hsp_qseq>ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGTTGTGGACAATTTTATTCCACAAGGTATTGATTTTGTGGATAACTTTCTTAATTTCATTGCTATAGCTACTTTTTTTTGATATTATAGTTGTGTTTTCACTTTGAATAAGTTTTCCACATCTTTATCTTATCCACAATTTGTGTATAACATGTGGACAGTTTTAATCACATGTGGGTAAATGATTATCCACAT
TTGCTTTTTTGTCGAAAACCCTATCTCATATACAAACGACGTTTTTAGGTTTTAAAATACGTTTCGTATAAATATACATTTTATATTTATTCAGGTTGTACATTTGTTGCACAACCTTATTCTTTTACCATCTTAGTAAAGGAGGGACACCTTTGGAAAACATCTCTGATTTATGGAACAGCGCCTTAAAAGAACTCGAAAAAAAGGTCAGTAAACCAAGTTATGAAACATGGTTAAAATCAACAACCGCACATAATTTAAAGAAAGATG
TATTAACAATTACGGCTCCAAATGAATTCGCCCGTGATTGGTTAGAATCTCATTATTCAGAGCTAATTTCGGAAACACTTTATGATTTAACGGGGGCAAAATTAGCTATTCGCTTTATTATTCCCCAAAGTCAAGCTGAAGAGGAGATTGATCTTCCTCCTGCTAAACCAAATGCAGCACAAGATGATTCTAATCATTTACCACAGAGTATGCTAAACCCAAAATATACGTTTGATACATTTGTTATTGGCTCTGGTAACCGTTTTGCTC
ACGCTGCTTCATTGGCCGTAGCCGAAGCGCCAGCTAAAGCATATAATCCCCTCTTTATTTATGGGGGAGTTGGACTTGGAAAAACCCATTTAATGCATGCAATTGGCCATTATGTAATTGAACATAACCCAAATGCCAAAGTTGTATATTTATCATCAGAAAAATTTACAAATGAATTCATTAATTCTATTCGTGATAATAAAGCGGTCGATTTTCGTAATAAATACCGCAATGTAGATGTTTTATTGATAGATGATATTCAATTTTTAG
CGGGAAAAGAACAAACTCAAGAAGAGTTTTTCCATACATTCAATGCATTACACGAAGAAAGTAAACAAATTGTAATTTCCAGTGATCGGCCACCAAAAGAAA

>>> for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     entry.query
...
'No definition line'

I'm not sure what that means, either. Is something missing from my BLAST command that I should include?
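
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the XML above records the query definition as "No definition line", presumably because only the bare sequence (record.seq) was submitted, so there was no FASTA header to report. Taking .split()[0] of that string gives 'No'. The helper to_fasta and the identifier "my_query" below are hypothetical and use only the standard library. In Biopython, record.format("fasta") produces FASTA text that includes the record's header, and that text could be passed to qblast in place of record.seq.

```python
# Value reported in <Iteration_query-def> in the XML shown above.
query_def = "No definition line"

# The parsing loop keeps only the first whitespace-separated token.
first_token = query_def.split()[0]
print(first_token)


def to_fasta(seq_id, sequence, width=60):
    """Hypothetical helper: wrap a bare sequence as FASTA text with a header line."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + seq_id + "\n" + "\n".join(lines) + "\n"


# "my_query" is a made-up identifier; the bases are the start of the query shown above.
fasta_text = to_fasta("my_query", "ATATTTTTTCTTGTTTTTTATATCC")
print(fasta_text)
```

With a header line present, the query definition in the result would be expected to carry that identifier instead of the placeholder, so entry.query.split()[0] would return something meaningful.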

BLAST biopython NCBIXML SearchIO • 1.7k views
6.6 years ago

Not Biopython, but you can use a simple XSLT stylesheet, if your XML fits in memory:

usage:

xsltproc --novalid  transform.xsl blast.xml
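
For comparison, here is a rough Python sketch of the same idea (load the XML into memory, then select hits) using only the standard library. It is not the stylesheet referred to above. The function names gi_of and hits_not_in, the exclusion set, and the second hit in the sample are all made up for illustration; the first hit is copied from the question's output.

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the shape of the BLAST XML from the question.
# The second <Hit> is invented purely to show the filter keeping something.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>No definition line</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|1033843585|gb|CP015779.1|</Hit_id>
          <Hit_def>Bacillus anthracis strain Tangail-1, complete genome</Hit_def>
        </Hit>
        <Hit>
          <Hit_id>gi|111|gb|XX000001.1|</Hit_id>
          <Hit_def>made-up second hit for illustration</Hit_def>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def gi_of(hit_id):
    """Return the gi number from an ID like 'gi|123|gb|ACC.1|', or None."""
    parts = hit_id.split("|")
    if len(parts) > 1 and parts[0] == "gi":
        return parts[1]
    return None


def hits_not_in(root, excluded_gis):
    """Collect (Hit_id, Hit_def) pairs whose gi is not in excluded_gis."""
    kept = []
    for hit in root.iter("Hit"):
        hit_id = hit.findtext("Hit_id", default="")
        if gi_of(hit_id) not in excluded_gis:
            kept.append((hit_id, hit.findtext("Hit_def", default="")))
    return kept


root = ET.fromstring(SAMPLE)
kept = hits_not_in(root, {"1033843585"})
for hit_id, hit_def in kept:
    print(hit_id, hit_def)
```

For the real file, ET.parse("new_tmp_blast.xml").getroot() would take the place of ET.fromstring(SAMPLE), with the same caveat that the whole document has to fit in memory.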