Can't access individual attributes of tabular outfmt 6 blast output while parsing
2
0
Entering edit mode
9.4 years ago
Bara'a ▴ 270

Hi

I'm trying to parse a blast tabular format file (-outfmt 6) to access it's individual elements for further manipulation .

I need to retrieve the qseqid, qstart, qend, qlen, qseq, sseqid, sstart, send, slen, sseq and pident for each hit and here's the script :

blast_qresult=SearchIO.parse('Aegilops_Brachypodium.txt','blast-tab')
for i,blast_hit in enumerate(blast_qresult) :
    for m,blast_hsp in enumerate(blast_hit):
        for n in range(len(blast_hsp)):
            print("Query ID: %s"%blast_hit.id)
            print("Query start: %s"%blast_hsp[n].query_start)
            print("Query end: %s"%blast_hsp[n].query_end)
            print("Query seq: %s"%blast_hsp[n].query)
            print("Hit ID: %s"%blast_hit[m].id)
            print("Hit start: %s"%blast_hsp[n].hit_start)
            print("Hit end: %s"%blast_hsp[n].hit_end)
            print("Hit seq: %s"%blast_hsp[n].hit)
            print("Identity: %s"%blast_hsp[n].ident_pct)

Till here it does produce some results , and halts at certain hit throwing this error:

builtins.ValueError: Hit 'BRADI2G54940.1' already present in this QueryResult.

And when I try to access the other individual values like:

print(blast_hsp[n].seq_len)
print(blast_hit[m].seq_len)

They give these errors respectively :

builtins.AttributeError: 'HSP' object has no attribute 'seq_len'
builtins.AttributeError: 'Hit' object has no attribute 'seq_len'

My questions are :

  1. How can I access the qlen and slen attributes?
  2. Why does

    print("Query seq: %s"%blast_hsp[n].query)
    print("Hit seq: %s"%blast_hsp[n].hit)
    

    give this output

    Query seq: None
    Hit seq: None
    

    How can I retrieve the sequences for both query and hit?

  3. Both

    print(blast_hit[m].id)
    print(blast_hit[n].id)
    

    give the sseqid, which one of them is more accurate?

  4. How can I overcome the error that complains about redundant Hit ID's in the tabular report?

  5. I noticed that the qstart and sstart returns values decremented by one (values -1) indices ... why is that? Is it safe to use them for manipulation as is or shall I increment them by 1?

I would be grateful for any help ... thanks.

biopython blast • 6.8k views
ADD COMMENT
1
Entering edit mode

What version of Biopython do you have? Can you share the data file causing the problem? e.g. using http://gist.github.com

ADD REPLY
0
Entering edit mode

I'm using Biopython 1.64 .

Here's a portion of the file that's originally contains 13072 hits:

https://gist.github.com/Baraa

Edit : modified the portion of the file where the hit causing the error locates.

ADD REPLY
1
Entering edit mode

That sample does not include the problematic identifier BRADI2G54940.1 so probably won't help :(

ADD REPLY
0
Entering edit mode

Speaking in general ... how can I overcome this problem?

Edit: updated the file that contains the hit in question.

ADD REPLY
1
Entering edit mode

I can now see BRADI2G54940.1 in the file once. Can you run this at the command line to see where else it appears in the file?:

grep "BRADI2G54940.1" your_blast_file

Perhaps also try with some neighbouring lines for context:

grep -C 2 "BRADI2G54940.1" your_blast_file

ADD REPLY
0
Entering edit mode

I didn't have to run that command ... the error was gone - suddenly - for this particular file Brachyoidium_Japonica.txt, but not for the Horedum_Japonica.txt!

Previously , the interpreter raised the error complaining about the redundant hit ID BRADI2G54940.1 , but now blast produces the result containing one hit with BRADI2G54940.1 ID !!

BTW ... I haven't changed anything in my script lately !!

I'm really confused now ... what a strange blast behavior .

Would you please explain to me what that command is used for?

ADD REPLY
3
Entering edit mode
9.4 years ago
Peter 6.0k

Bio.SearchIO is not intended to simply expose the BLAST tabular fields as they are (for that you could use the Python standard library module csv or similar).

Rather Bio.SearchIO turns BLAST tabular and many other search results into a common abstracted object representation. For example, start/end values are converted into Python style interval counting (which is where the -1 on the start values comes from).

ADD COMMENT
0
Entering edit mode

Thank you for explaining me how does python represent QueryResult objects of Bio.SearchIO.

In fact, I managed to get access to the attributes I wasn't able to retrieve in the previous code... I've just invoked the customized fields option in parsing command:

blast_qresult=SearchIO.parse('Brachypodium_Japonica.txt','blast-tab',fields=['std','qlen', 'slen'])

and it does produce some results , but I'm not completely sure if they were accurate, since the following error remains unhandled:

builtins.ValueError: Hit 'OS01T0274800-01' already present in this QueryResult.
ADD REPLY
1
Entering edit mode

That list of fields should match whatever you asked BLAST+ to produce with the -outfmt 6 option (by default the standard 12 fields it uses by default). This is not a fields you are interested in!

ADD REPLY
1
Entering edit mode

The output file seems to contain 14 columns, though. @Bara'a, could you paste the command you used to get this results? If you did indeed specify two extra columns (qlen and slen), then I think your fields argument is correct.

ADD REPLY
1
Entering edit mode

Opps. I missed you used "std" which is short hand for the standard 12 columns.

ADD REPLY
0
Entering edit mode
comline=NcbiblastxCommandline(cmd='blastn', query=query+".fasta", db=database+".db", evalue=0.01, outfmt='"6 std qlen slen"', max_target_seqs=1, out=outname )
blast_qresult=SearchIO.parse('Bracypodium_Japonica.txt','blast-tab',fields=['std','qlen','slen'])

But the problem now is not about accessing those attributes, it's about parsing them smoothly without duplicates as the latter error complains.

The max_target_seq used to temporarily overcome this problem, but after I missed up with the Query.py module it didn't work any more!!

Any suggestions?

ADD REPLY
0
Entering edit mode

I've consulted my thesis supervisor for the latter error in the code , and suggested me to alter the Query.py module to stop raising the error !!

He asked me to comment the following lines in the append function in the module :

 # if a custom hit_key_function is supplied, use it to define th hit key
        if self._hit_key_function is not None:
            hit_key = self._hit_key_function(hit)
        else:
            hit_key = hit.id

        '''if hit_key not in self:
            self[hit_key] = hit
        else:
            raise ValueError("Hit '%s' already present in this QueryResult." %
                    hit_key)'''

Of course, I was faint-hearted to miss up with the module structure ... and it was indeed a big mistake as it result in a very strange behavior!!

When I tried to re-run the code on a different blast tabular file, it gave me this again:

builtins.ValueError: Hit 'OS01T0274800-01' already present in this QueryResult.

beside:

builtins.SystemError: Parent module '' not loaded, cannot perform relative import

despite of rolling back the change on the module!!

ADD REPLY
0
Entering edit mode

I have noticed that printing those attributes directly to the screen does work, but not when I try to save them to some external file!

Why is that?

ADD REPLY
0
Entering edit mode

I'm not sure what you are describing here.

ADD REPLY
0
Entering edit mode

Never mind @Peter ... the real question here was scattered by the process workflow!

I will try to put things together , and come back with an appropriate solution soon - with GOD willing - :)

ADD REPLY
3
Entering edit mode
9.4 years ago
bow ▴ 790

I see some of the questions have been addressed in the comments, so I will respond to some that have not (paraphrased):

1. How can I access the qlen and slen attributes ?

Since qlen is a property of the whole query, accessing it can be done via blast_qresults.seq_len in your case. Similarly, slen can be accessed via blast_hit.seq_len. The full API documentation for the BLAST format in SearchIO is here: http://biopython.org/DIST/docs/api/Bio.SearchIO.BlastIO-module.html.

2. Why does ... does not give None attributes?

The parser can only parse data present in the input file of course. In your case, the tabular data does not store any sequence information, so the query and hit attributes of your blast_hsp object does not contain anything. If there were sequences, they would be SeqRecord objects. To add the sequence to your output, I think you need to add a custom column for this (or just use another format like BLAST XML).

3. Both ... gives the sseqid, which one of them is more accurate?

Did you get different results from them? They should all give the same answer (if retrieved from the same Hit object).

4. How can I overcome the error that complains about redundant Hit ID's in the tabular report ?!

I need to take a look into this first. I can't see the problematic hit (OS01T0274800-01) in your uploaded gist, but my guess is the database you have may have duplicate entries of OS01T0274800-01. If that is the case, then you will need to rebuilt your database without the duplicate and then rerun the query.

5. I noticed that the qstart and sstart returns values decremented by one ( values -1 ) indices ... why is that? Is it safe to use them for manipulation as is or shall I increment them by 1 ?!

This one has been addressed by Peter :).

ADD COMMENT
1
Entering edit mode

One of the quirks of makeblastdb is it currently tolerates duplicate identifiers in the input FASTA file when making a database, which results in ambiguous output - see http://blastedbio.blogspot.co.uk/2013/12/blast-should-keep-its-blordid.html - I think this is an NCBI bug and hope they change it.

ADD REPLY
0
Entering edit mode

If so , then I think I have to report this as a bug to the official NCBI site .

Hope they can fix this soon ... till then , I should figure something out to work around this obstacle , like writing a function to handle the duplicates in the blast result files .

Wish me luck with that :)

ADD REPLY
0
Entering edit mode

If that was the problem then please do email the NCBI BLAST team (link to the blog post and this page too) to help them prioritise.

ADD REPLY
0
Entering edit mode

I will as soon as I get my hands clear of the tasks I'm working on.

Thanks a lot @Peter... I indeed appreciate your help.

ADD REPLY
0
Entering edit mode

@bow .. thank you for your comprehensive reply, it really helped a lot.

Regarding the commands:

print(blast_hit[m].id)
print(blast_hit[n].id)

Yes, they do give the same output.

But my question was: which command is more accurate, from a conceptual perspective?

Don't they refer to two different levels of hierarchy in the QueryResult object?

ADD REPLY
0
Entering edit mode

@bow ... I've just added the file that contains the redundant hit (OS01T0274800-01): https://gist.github.com/Baraa

Please , note that this redundancy issue was raised previously by the file (Brachypodium_Japonica.txt) , and handled suddenly in a strange manner on it's own without any interference from me of any kind neither by modifying the script nor by correcting the raw data , as I mentioned above in my reply to @Peter !!

But it raised again in this new blast query :(

Why do you think this is happening ?!

It's driving me crazy .. Uhhh :s

ADD REPLY

Login before adding your answer.

Traffic: 2400 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6