Question

Why do blastn results differ when you swap query and subject FASTAs?

1

7.5 years ago

shokin ▴ 30

I am completely mystified by this very basic asymmetry in blastn results. I have two FASTA files, which share a 15-mer. However, that common 15-mer only results in a hit if I have one file as subject and the other as query; if I reverse the files, no hit:

$ blastn -task blastn-short -outfmt 6 -ungapped -strand plus -perc_identity 100 -word_size 15 \
-query Phvul.007G125800.5.0kb.upstream.fasta \
-subject Phvul.009G200800.5.0kb.upstream.fasta 
Phvul.007G125800    Phvul.009G200800    100.00  15  0   0   3698    3712    1965    1979    0.020   30.2
$ blastn -task blastn-short -outfmt 6 -ungapped -strand plus -perc_identity 100 -word_size 15 \
-query Phvul.009G200800.5.0kb.upstream.fasta \
-subject Phvul.007G125800.5.0kb.upstream.fasta 
$

This is highly reproducible and has some consistency. I've got five 5000-nt sequences, all of which share this same 15-mer. When I use two of them as query sequences, I get the 15-mer hit in all cases. When I use the other three as query sequences, I never get the 15-mer hit. If I blast just the 15-mer by itself against the sequences, I get the hit on all sequences.

Any ideas what's going on? This behavior is independent of word_size, by the way, all the way down to 8. I find this very disconcerting, since I thought that blastn would be symmetric w.r.t. query and subject, at least when they're the same size.

blast • 2.2k views

0

And since you were about to ask, this behavior is reproduced in versions 2.2.31+ (commonly found in distros) and the latest from NCBI, 2.5.0+.

Reply • 7.5 years ago by shokin ▴ 30

0

Quite interesting. I have not tested this and am only guessing, but the only thing that differs between the two runs is the database size, and with it the e-value. The default cutoff is 10, which is high to begin with, so this is unlikely to be the reason. Still, could you please check with an e-value cutoff greater than 10 specified? I still use legacy BLAST rather than the new one, so again this is only a guess.
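A quick way to sanity-check the e-value idea is the textbook relation E = m * n * 2^(-S), where S is the bit score and m and n are the query and subject lengths. The sketch below is only a rough check under stated assumptions: it takes both sequences to be 5000 nt (as described in the question), uses the bit score 30.2 from the reported hit, and ignores the effective-length adjustment that BLAST applies internally. The helper name `evalue` is made up for illustration.

```shell
# Approximate e-value from query length m, subject length n and bit score s:
# E = m * n * 2^(-s). Raw lengths are used; BLAST's effective-length
# correction is deliberately ignored (an assumption of this sketch).
evalue() {
    awk -v m="$1" -v n="$2" -v s="$3" \
        'BEGIN { printf "%.3f\n", m * n * exp(-s * log(2)) }'
}

# Two 5000 nt sequences and the bit score of the reported hit.
evalue 5000 5000 30.2   # prints 0.020
```

This lands on the 0.020 shown in the tabular output, and because the formula only uses the product m * n, swapping query and subject does not change the result. Under these assumptions, the e-value cutoff alone would not explain a hit appearing in one direction only.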

Reply • 7.5 years ago by microfuge ★ 1.9k

score 1 · Answer 1 · 2016-10-13

1

7.5 years ago

biofalconch ★ 1.1k

This thread explains it quite nicely

Query Vs. Target Using Blast2

1

Thanks! It's definitely not an evalue issue but, rather, the somewhat obtuse asymmetry of blast itself, it appears, as discussed in the above thread (which I didn't find earlier). I'll have to work around it. Much appreciated, biofalconch!

Reply • 7.5 years ago by shokin ▴ 30

score 1 · Answer 2 · 2016-10-14

It turns out that query size really matters in BLAST. I've done some testing with my 5000nt sequences, and in some cases I've had to narrow the query_loc range down to 56nt to capture a hit against a common 16-mer! Furthermore, in one case, there is only a range of five starting positions in query_loc range that will result in the hit. This is very restrictive!

To put my example in numbers: I have a 5000 nt query sequence with a 16-mer motif at positions 1965-1980, which is shared by another 5000 nt subject sequence. With the experimentally determined maximum window length of 56, the only query_loc ranges that produce a hit of this 16-mer against the subject are 1962-2017, 1963-2018, 1964-2019, 1965-2020 and 1966-2021. Outside those query_loc ranges, blast does not find the match.
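For anyone who wants to repeat this kind of scan, here is a minimal sketch of how the window sweep could be scripted. It is not necessarily how the numbers above were produced. The function names `query_windows` and `scan_query_loc` and the FASTA file names are placeholders; the blastn options are the ones from the question plus `-query_loc`, which takes a 1-based start-stop range.

```shell
# Print one start-stop range per line for every window of length $3
# whose start position runs from $1 to $2 (1-based, inclusive).
# The body runs in a subshell so the loop variable stays local.
query_windows() (
    start=$1
    while [ "$start" -le "$2" ]; do
        echo "${start}-$((start + $3 - 1))"
        start=$((start + 1))
    done
)

# Run blastn once per window and print "<range> <number of hit lines>".
# Arguments: query FASTA, subject FASTA, first start, last start, window length.
scan_query_loc() (
    query_windows "$3" "$4" "$5" | while read -r loc; do
        hits=$(blastn -task blastn-short -outfmt 6 -ungapped -strand plus \
                      -perc_identity 100 -word_size 15 \
                      -query "$1" -query_loc "$loc" -subject "$2" \
                      </dev/null | awk 'END { print NR }')
        echo "$loc $hits"
    done
)

# Example call (file names are placeholders, not from the thread):
# scan_query_loc query.fasta subject.fasta 1950 1980 56
```

Calling `query_windows 1962 1966 56` on its own lists the five ranges quoted above, from 1962-2017 through 1966-2021, which is a handy check that the window arithmetic is right before spending time on the blastn runs.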

I haven't been able to find any BLAST parameters that improve this situation, but if you know of something to try, please let me know! This narrow restriction on the size of the query is a show-stopper for doing the analysis with BLAST.