Odd max query limit in BLAT
1
1
Entering edit mode
5.9 years ago

I'm running BLAT with this command:

blat genome.fasta query.fasta blat_90.csv -minIdentity=90 -noHead

And I'm surprised that I've counted the results and order by count.

import pandas as pd
import sys
df_blat = pd.read_csv('blat_90.csv', index_col=False, sep='\t')
s = df_blat.qName.value_counts()
count = 1
for k in s.head(100).index.tolist():
    count += 1
    print(s[k],k)

The interesting thing is that the max results per query is 672 and this number is repeated about 30 times. Then I have queries that were repeated 671 times again many times.

672 MITE_T_71978|chr3D|340455280|340455443|AT|175|F2975
672 MITE_T_97305|chr7B|194283260|194283371|AT|29|F4298
672 MITE_T_110543|chr5D|34445518|34445608|TA|94|F5023
672 MITE_T_72475|chr3D|503092467|503092630|AT|174|F2988
...

So this makes me wonder if there's a limit in the number of times a query is search. I've repeated this script with no different results.

UPDATE: I've notice that running only one query with each wheat chromosome in a separate thread, the max results per search is 32.

This is how the results look like when I search only one sequence agains each chromosome at the time:

 wc -l b*
           32 b1A.csv
           32 b1B.csv
           32 b1D.csv
           32 b2A.csv
           32 b2B.csv
...

Max 32 hits per search (accordingly 32 x 21 chromosomes is 672)

There must be a limit in the code but I'm still not able to find it

blat • 1.7k views
ADD COMMENT
0
Entering edit mode

Considering Jim Kent is author of blat (and how long blat has been around) it seems very unlikely that you are running into some odd bug in blat. Do you get a different number if you increase the tile size (-tileSize=20)?

Have you tried taking a sequence from human genome and testing it against that? I will try it out later today.

ADD REPLY
0
Entering edit mode

default tileSize of 11 is what I need in this case, I'll give a try (I do not understand the reason, and max is 18). Why a sequence from human? I'm using wheat genome. Also I did not know who Jim Kent is till now, but I guess any programmer could make mistakes.

ADD REPLY
0
Entering edit mode

Human was a test. Just something other than wheat in case that genome has some strange replicative characteristic you were picking up in this search.

ADD REPLY
0
Entering edit mode

changing the tileSize to 17 gives 3.240.017 results vs the 32 in the original command.

ADD REPLY
0
Entering edit mode

So 672 is not a hard limit in blat. You are observing some characteristic in your data. I am not sure what kind of data this is.

ADD REPLY
0
Entering edit mode

Indeed. Oddly a larger tileSize gives more result that default. It is the wheat genome, a large and repetitive one.

ADD REPLY
0
Entering edit mode

I was just thinking that too. Increasing the tile size should have given fewer results. This result would indicate that the increased tile size is allowing more initial seed alignments?

ADD REPLY
0
Entering edit mode

Could be, sounds like an undocumented feature (or an issue) since 3M results vs just 32 does not makes sense.

ADD REPLY
0
Entering edit mode

I've realized that specifing even default value (11) gives results as excected. Thanks for the advise.

ADD REPLY
0
Entering edit mode

@Juan: Is there still a question that is outstanding? I assume it is "the max results per search is 32".

Are you using the latest blat available? Did you get the precompiled executables or compiled it yourself?

ADD REPLY
0
Entering edit mode

I've used the latest and compile it myself

https://users.soe.ucsc.edu/~kent/src/blatSrc35.zip

Indeed I've realized that specifying the tileSize does not solve the issue

Also tried with

http://hgwdev.cse.ucsc.edu/~kent/exe/linux/blatSuite.36.zip

nohup ./blat /media/crescentejuan/Data/iwgsc_by_chr/chr1A.fasta  ../blat/query.fa /media/crescentejuan/Data/blat_built_demo.csv  -minIdentity=95 -noHead -tileSize=11  &

and also got:

wc -l blat_built_demo.csv
32 blat_built_demo.csv
ADD REPLY
0
Entering edit mode

I am going to suggest that you either directly write to Jim Kent (you should be able to find his email at UCSC and as an aside you should know who he is) or post to the UCSC genome browser list about this problem.

ADD REPLY
3
Entering edit mode
5.9 years ago

As responded by UCSC itself:

The 32 results are indeed a consistent limitation where BLAT only allows 16 results per chromosome and per strand. BLAT is unable to exhaustively return results for queries with repetitive sequences that map to many places in the genome. There may be many other sequence aligning programs available that do not have such limits, but we do not have any specific recommendations about which to use.

ADD COMMENT
0
Entering edit mode

Thank you for this. Glad there is a logical explanation.

ADD REPLY

Login before adding your answer.

Traffic: 3101 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6