News: Misunderstood parameter of NCBI BLAST
5

Hi,

An interesting paper that I wanted to share with Biostars:

Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows

17 months ago Farbod ♦ 3.3k
1

I wrote a related entry and, following the suggestion made by genomax, moved it to a separate post: "I couldn't reproduce the problem of max_target_seqs".

18 months ago
fishgolden
• 390
0

You should create a new post if you think this is a reproducible problem. This thread is not the best place for it.

18 months ago
genomax
68k
0

Uh, really? I never thought of that. Thank you for the suggestion. OK, I'll create a new post. But I do not think the problem (in the paper) is reproducible, and I thought I had written as much.

18 months ago
fishgolden
• 390
0

What I meant was that not being able to reproduce the results in the paper may be a reproducible problem, if others get the same results as you did.

18 months ago
genomax
68k
0

I see, thank you. (I'm googling how to close my answer...) Edited: OK, I think I did it correctly.

18 months ago
fishgolden
• 390
0

Certainly not a recent 'problem', as this issue was raised many years ago:

blast-max-target-sequences-bug

18 months ago
lieven.sterck
5.1k
4

I will quote this under fair use:

To enable the efficient processing of large data sets, researchers frequently rely on shortcuts aimed at reducing the number of BLAST results that need to be processed. A common strategy involves using the "-max_target_seqs" parameter of the NCBI BLAST+ suite. According to the BLAST documentation itself (2008-), this parameter represents the "number of aligned sequences to keep". This statement is commonly interpreted as meaning that BLAST will return the top N database hits for a sequence query if the value of max_target_seqs is set to N. For example, in a recent article (Wang, et al., 2016) the authors explicitly state "Setting “max target seqs” as “1,” only the best match result was considered."

To our surprise, we have recently discovered that this intuition is incorrect. Instead, BLAST returns the first N hits that exceed the specified E-value threshold, which may or may not be the highest scoring N hits. The invocation using the parameter "-max_target_seqs 1" simply returns the first good hit found in the database, not the best hit as one would assume.

Worse yet, the output produced depends on the order in which the sequences occur in the database. For the same query, different results will be returned by BLAST when using different versions of the database even if all versions contain the same best hit for this database sequence.
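
The difference between "first N" and "best N" described in the quote can be sketched with a toy model in Python. This is only an illustration of the behaviour as the paper describes it, not of BLAST's actual search code; the function names, sequence IDs, scores and E-values are all made up.

```python
# Toy model only: each hit is (subject_id, bit_score, evalue), listed in
# database order. Names and numbers are hypothetical.

def first_n_hits(hits, n, evalue_cutoff):
    """Return the first n hits that pass the E-value cutoff, in database order."""
    passing = [h for h in hits if h[2] <= evalue_cutoff]
    return passing[:n]

def best_n_hits(hits, n, evalue_cutoff):
    """Return the n highest-scoring hits that pass the E-value cutoff."""
    passing = [h for h in hits if h[2] <= evalue_cutoff]
    return sorted(passing, key=lambda h: h[1], reverse=True)[:n]

# The same three hits, stored in two different database orders.
db_order_a = [("seqA", 80.0, 1e-20), ("seqB", 250.0, 1e-70), ("seqC", 40.0, 1e-3)]
db_order_b = [db_order_a[1], db_order_a[0], db_order_a[2]]

print(first_n_hits(db_order_a, 1, 1e-5))  # depends on database order
print(first_n_hits(db_order_b, 1, 1e-5))  # different answer, same hits
print(best_n_hits(db_order_a, 1, 1e-5))   # independent of database order
```

In this model the "first N" selection changes with the order of the database, while the "best N" selection does not, which is the discrepancy the quoted passage points to.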

18 months ago Istvan Albert 80k
3

To be fair the option does not say -max_**best**_target_seqs or -max_**high_scoring**_target_seqs.

18 months ago
genomax
68k
1

That is fair to some extent. In my opinion the source of the confusion is that it says max; it should instead be called limit or, even better, first, to make it unambiguous.

Usually there are many things to juggle and consider in a typical analysis, so it is very easy to slip up and read max as maximal with respect to something else: the score or some other attribute.

18 months ago
Istvan Albert
80k
1

Also from the same paper:

The confusion is further compounded by the fact that in the online BLAST portal, the max_target_seqs parameter behaves in the expected way – the best (rather than first) N hits are returned

18 months ago
Istvan Albert
80k
0

NCBI does things with the web version that are not available in the command line package. That does lead to some confusion since it is easy to assume/think that those two are equivalent.

18 months ago
genomax
68k
0

Yeah, but the option is basically useless as it is. Like, in what kind of use case would the output make sense? Why not have an option -max_random_hits while at it..

17 months ago
5heikki
8.4k
1

I just noticed this in the book A Primer for Computational Biology, which explicitly states "best":

https://www.amazon.com/Primer-Computational-Biology-Shawn-ONeil/dp/0870719262

[Image: passage from the book]

18 months ago Istvan Albert 80k
0

Is that in reference to the web portal?

18 months ago
genomax
68k
0

I did not check at the time, but note how it talks about output format 6, 7 or 10. That sounds to me like command-line use; I doubt that is a parameter one would set online.

18 months ago
Istvan Albert
80k
1

Author here! I had no idea about this behavior until recently. I'm glad to have learned of it though - I've updated the online version of the book with some errata linking to the paper for details.

18 months ago
oneilsh
• 10
1

It's only unfortunate that the paper did not state that this affects all the filtering parameters in the BLAST algorithm, such as Evalue, num_alignments, etc.

Otherwise good to finally see this issue described in a manuscript context instead of blog posts.
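
One way to sidestep search-time limits, sketched here under assumptions, is to run the search with generous settings and apply the E-value cutoff and top-N selection afterwards on tabular output. The Python below assumes the default 12-column layout of `-outfmt 6`; the function name `top_hits` and the example rows are made up. Note that it ranks rows, so several HSPs against the same subject would count separately, and it cannot recover hits the search never reported.

```python
import io
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def top_hits(handle, n=1, max_evalue=1e-5):
    """Per query, keep the n rows with the highest bit score among rows
    whose E-value is at most max_evalue. (Hypothetical helper.)"""
    per_query = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip blanks and -outfmt 7 comments
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        if float(row["evalue"]) <= max_evalue:
            per_query[row["qseqid"]].append(row)
    return {
        query: sorted(rows, key=lambda r: float(r["bitscore"]), reverse=True)[:n]
        for query, rows in per_query.items()
    }

# Made-up rows standing in for a real results file.
example = io.StringIO(
    "q1\tseqA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t80.0\n"
    "q1\tseqB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-70\t250\n"
    "q1\tseqC\t70.0\t50\t15\t0\t1\t50\t1\t50\t0.001\t40.0\n"
)
best = top_hits(example, n=1)
print(best["q1"][0]["sseqid"])
```

With a real results file, the `io.StringIO` object would be replaced by an open file handle.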

18 months ago lieven.sterck 5.1k
0

(solved) I couldn't reproduce the problem of max_target_seqs

I could reproduce the problem, and my conclusion is that it is not caused by the behaviour described in the paper but by what NCBI staff explained in 2015.

17 months ago fishgolden • 390
