Question

UCSC blat command-line version

0

7.8 years ago

alessandro_d • 0

Hello, I need to align a large number of sequences (80000) against the whole genome using blat, but the web version only allows 25 sequences at a time, so I downloaded blat to run it locally. I am struggling a bit. I use the Mac Terminal. The usage is: blat database query [-ooc=11.ooc] output.psl, where database and query are .fa files. I put the database, query and output files in the blat directory, changed into that directory and typed "blat database query [-ooc=11.ooc] output.psl", but the response is "blat: command not found". Can anybody help? Thanks everyone, Alessandro

blat commandlineblat ucsc • 12k views

updated 7.8 years ago by Emily 23k • written 7.8 years ago by alessandro_d • 0

0

I'm (slightly) ignorant with regard to OS X installation, but assuming it behaves like other Unix-like systems, you also need to make sure that the blat executable is in your $PATH.

7.8 years ago by WouterDeCoster 47k

0

BLAT is best for aligning transcripts -> genome. It's not really designed for anything else, even though you can use it for other sequences by playing with the options. Are your 80,000 sequences really gene transcripts, cDNAs or ESTs?

7.7 years ago by Maximilian Haeussler ★ 1.6k

GenoMax · Accepted Answer · 2016-07-15

3

7.8 years ago

GenoMax 141k

Hopefully you downloaded the correct blat executable for OS X. You may need to add execute permissions for it. Do so with the following command (you will need to provide the password for an admin account):

sudo chmod u+x ./blat

Once that is done, you are ready to run blat with the following command (replace database and query with the names of your real FASTA files). You will need to run this command once for each query file you want to search with. If you want to parse the output later, choose an appropriate format (you may want to select blast8, as in the example below). Give the output file a different name for each search.

./blat database query -out=blast8 blat.out

Having said all that, what are the 80000 sequences from? Can they be compressed into a smaller number by clustering? This can make the job at hand a bit easier.

7.8 years ago by GenoMax 141k

0

Thank you very much. The command I was using was "blat database query [-ooc=11.ooc] output.psl", which I found on the UCSC website, but the response was "blat: command not found" every time, while it works with your command, so maybe the website is not up to date or something like that. The last problem now is the FASTA file format. In my file every record starts with a line beginning with >ENS, followed by the sequence on the next line, and so on for all the sequences, but the file is .txt while blat requires .fa, so I need a way to convert .txt into .fa. How can I do that? Anyway, the 80000 sequences were downloaded from BioMart and they are 3'-UTR sequences, so I can't compress them by clustering because I need the alignment of each of them. Can I use a single query file containing all 80000 query sequences? Or do you think I should divide them into smaller groups of about 20000 sequences each? Thank you very much!

7.7 years ago by alessandro_d • 0

1

The reason ./blat works is that the ./ prefix tells the Mac that the file is in the current directory. To add that directory to your $PATH variable, run export PATH=$PATH:/path_to_dir_with_blat in that terminal. I suggest that you spend some time at this site learning the basics of Unix.
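As a rough illustration of what the PATH change does, here is a minimal sketch that can be run without blat itself. The throwaway directory and the blat_demo script are hypothetical stand-ins for the real blat directory and executable.

```shell
# Hypothetical demo: a throwaway directory stands in for the real blat directory.
demo_dir="$(mktemp -d)"
printf '#!/bin/sh\necho "stand-in for blat"\n' > "$demo_dir/blat_demo"
chmod u+x "$demo_dir/blat_demo"

# Before: the shell cannot find the program by name alone.
command -v blat_demo || echo "blat_demo: not on PATH yet"

# Append the directory to PATH (this lasts for the current terminal session only).
export PATH="$PATH:$demo_dir"

# After: the program is found without the ./ prefix.
found="$(command -v blat_demo)"
echo "found at: $found"
blat_demo
```

The export only affects the terminal it is typed in, so a new tab or window would need the same line again.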

Blat should not care that the file is named .txt (as long as the file contents are text and in fasta format). You could do mv your_file.txt new_file.fa, if you want the .fa extension.
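If you want to reassure yourself that a .txt file really is FASTA before searching, a quick check along these lines may help. This is only a sketch: the file name demo_query.txt and its two records are made up for the example.

```shell
# Hypothetical two-record file standing in for a BioMart export saved as .txt.
work="$(mktemp -d)"
printf '>ENST_demo_1\nACGTACGTAC\n>ENST_demo_2\nGGCCTTAAGG\n' > "$work/demo_query.txt"

# Count header lines; in FASTA this equals the number of sequences.
n_records="$(grep -c '^>' "$work/demo_query.txt")"
echo "records: $n_records"

# The first character of a FASTA file should be '>'.
first_char="$(head -c 1 "$work/demo_query.txt")"
echo "first character: $first_char"

# Optional: rename to .fa. Only the name changes, not the contents.
mv "$work/demo_query.txt" "$work/demo_query.fa"
```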

Please select an output format that you will be able to parse (psl or blast8). A single file with 80K queries is OK, but what are you blatting against? Depending on that, you may need a lot of RAM.

7.7 years ago by GenoMax 141k

0

Thank you for your help and for the Terminal guide; it is very helpful since I'm using standalone blat for the first time. Now I don't get an error, but when I type the command in the terminal I get no response at all and the terminal stops responding: any other command I type has no effect. I thought it was busy with the blat search, so I tried using just one sequence as a query against only a few sequences, but the outcome is the same. Any idea what I'm doing wrong? Anyway, I'm blatting my sequences against the unspliced transcript sequences I got from BioMart (Ensembl). Thanks a lot

7.7 years ago by alessandro_d • 0

0

Can you show us the command you used? You should only need the following (if you need some other format, remember to add the -out= option as well):

blat database_file query_file output.psl

You could run the command in one terminal, then open a second one (as a new tab or a new window), change to the directory where blat is running, and watch whether the output file is growing.

To convince yourself that the process is working, use a single sequence in each of the database and query files; that search should finish quickly.

How big is the file you are searching against? With 80K queries this process may take on the order of day(s).

7.7 years ago by GenoMax 141k

0

I used: ./blat HGNC_database.txt HGNC_query.txt [-ooc=11.ooc] output.psl where HGNC_database and HGNC_query are my database and query. I tried to align one single sequence against another single sequence but got no response. I also changed to the same directory in another tab, but the output file isn't even created. In addition, I used the top command to display all processes and found that Terminal is "sleeping" after I enter the blat command. Anyway, my database is 1.1 GB while the query file is 100 MB.

7.7 years ago by alessandro_d • 0

0

In Unix usage messages, square brackets as in [-ooc=11.ooc] mark an optional argument. If you do use the option, the brackets [ ] are not typed in the actual command. In your case, if you are happy with the PSL output, the command needs to be just this: ./blat HGNC_database.txt HGNC_query.txt output.psl. Try that and let us know what happens.

7.7 years ago by GenoMax 141k

0

Here is what I do: First I open the Terminal, then I use the command cd blat to position myself in the blat directory, then I use the command ./blat HGNC_database.txt HGNC_query.txt output.psl (where in this case database and query are made up of one sequence only, just in order to check if it works). The problem is that the sign ">" appears in the following line and I can type any command but it wouldn't work, as if it was busy, but even if I wait for a long time (I've been waiting for hours) I get no result (the file output.psl is not even created). Since this all happens with just one sequence I didn't even try to align all the sequences I have.

7.7 years ago by alessandro_d • 0

0

If you just run the following command does it print some blat help to screen?

blat

If it does then can you run the following command and post the output of second command (ps one) here? Are the two HGNC* files in the blat directory?

./blat HGNC_database.txt HGNC_query.txt output.psl &

ps -ef | grep blat

7.7 years ago by GenoMax 141k

0

I get the blat help if I type the command ./blat. This is what I get when I use the commands you gave me:

AirdiAlessandro:blat alessandro$ ps -ef | grep blat
  501  4833  4810   0  1:42pm ttys000    0:00.84 ./blat HGNC_database.txt HGNC_query.txt output.psl
  501  4835  4810   0  1:42pm ttys000    0:00.01 grep blat
AirdiAlessandro:blat alessandro$ Loaded 1078834294 letters in 19071 sequences
AirdiAlessandro:blat alessandro$ Searched 2989 bases in 1 sequences

And yes, the two HGNC files are both in the blat directory.

7.7 years ago by alessandro_d • 0

0

Great, that means the program is running.

What do the following two commands show?

ls -lh output.psl

head output.psl

7.7 years ago by GenoMax 141k

0

ls -lh output.psl
-rw-r--r--  1 alessandro  staff   560B 28 Lug 13:43 output.psl


head output.psl
    psLayout version 3

match   mis-    rep.    N's Q gap   Q gap   T gap   T gap   strand  Q           Q       Q       Q   T           T       T   T   block   blockSizes  qStarts  tStarts
        match   match       count   bases   count   bases           name        size    start   end name        size    start   end count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
2989    0   0   0   0   0   0   0   +   ENSG00000003137|ENST00000001146 2989    0   2989    ENSG00000003137|ENST00000001146 18801   15812   18801   1   2989,   0,  15812,

updated 7.7 years ago by GenoMax 141k • written 7.7 years ago by alessandro_d • 0

0

Excellent. Your first successful blat search.

Have you thought about how you are going to handle 80K searches and the results?

7.7 years ago by GenoMax 141k

0

Great!! I'm so happy, thank you very much for your precious help! Shall I now repeat the same commands for the real files?

No, since this is my first time with blat, I don't know how to handle the 80K sequences. Any advice?

7.7 years ago by alessandro_d • 0

0

Now for the nitty gritty.

What are you trying to achieve? The 80K sequences are from 3'-UTRs, but what are you searching them against (or vice versa)? You need to start considering whether you want to keep only the top hit or also look at additional ones. Do you want to allow or disallow errors? What sort of coverage should the query have? In short, all the questions you would ask when doing any sequence similarity search. If you are only looking for perfect matches then your task may be simpler.

In any case, 80K queries are a large bunch and the analysis could take day(s). I notice that you have a laptop so plan on leaving it on. If you would rather do the searches in batches then split the queries into say 5K chunks and then run through the files one at a time.
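One possible way to do the splitting is with awk. The sketch below is not blat-specific and uses a made-up five-record file with a chunk size of 2 so it can be tried quickly; the file names, the chunk_ prefix and the chunk size are all placeholders.

```shell
work="$(mktemp -d)"
# Hypothetical 5-record query file; the real file would hold about 80K records.
for i in 1 2 3 4 5; do printf '>seq%s\nACGTACGT\n' "$i"; done > "$work/queries.fa"

chunk_size=2   # placeholder; use 5000 for the real data

# Start a new output file every chunk_size header lines.
awk -v n="$chunk_size" -v prefix="$work/chunk_" '
  /^>/ {
    if (count % n == 0) {
      if (out) close(out)
      out = sprintf("%s%03d.fa", prefix, ++file_no)
    }
    count++
  }
  { print > out }
' "$work/queries.fa"

# List the chunk files that were written.
ls "$work"/chunk_*.fa
```

Each chunk could then be searched in turn, for example with a loop such as for f in chunk_*.fa; do ./blat database.fa "$f" "${f%.fa}.psl"; done, where database.fa is a placeholder for your real database file.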

7.7 years ago by GenoMax 141k

0

Do I have to decide all this before I run the blat command or can I think about it during the analysis after I get the results of the search?

7.7 years ago by alessandro_d • 0

0

Yes and no.

If you decide to do it later then the results you get will be using default options for blat (which may or may not be suitable for what you need).

Since it may take a while to go through the query list, if you change your mind about the options you will need to either re-run the entire analysis or parse the results differently (provided the information you need was captured in the results of the first run).
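As an example of parsing existing results differently, the sketch below keeps one hit per query from a PSL file. It assumes that "best" means the highest value in the match column (column 1), which is only one possible definition; ties keep the first hit seen. The demo rows, query names and target names are invented.

```shell
work="$(mktemp -d)"

# Helper that writes one hypothetical 21-column PSL row:
# column 1 = matches, column 10 = query name, column 14 = target name.
psl_line() {
  printf '%s\t0\t0\t0\t0\t0\t0\t0\t+\t%s\t100\t0\t100\t%s\t5000\t0\t100\t1\t100,\t0,\t0,\n' "$1" "$2" "$3"
}
{
  psl_line 90  queryA target1
  psl_line 100 queryA target2
  psl_line 80  queryB target3
} > "$work/demo.psl"

# Keep, for each query name, the row with the largest match count.
# Rows whose first field is not a number (PSL header lines) are skipped.
awk -F '\t' '
  $1 ~ /^[0-9]+$/ {
    if (!($10 in best) || $1 + 0 > best[$10]) { best[$10] = $1 + 0; line[$10] = $0 }
  }
  END { for (q in line) print line[q] }
' "$work/demo.psl" | sort -t "$(printf '\t')" -k10,10 > "$work/top_hits.psl"

# Show query name and target name of the rows that were kept.
cut -f10,14 "$work/top_hits.psl"
```

A filter like this only works if the first run wrote PSL output; with blast8 output the column positions would be different.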

I suggest you use a small chunk of query sequences (say 500). Look at the results and then decide on a meaningful path forward.
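A small test set can be cut from the top of the query file, for instance as in this sketch. The five-record demo file and n_keep=3 are placeholders; for the real test n_keep would be 500.

```shell
work="$(mktemp -d)"
# Hypothetical query file with 5 records, each sequence spread over two lines.
for i in 1 2 3 4 5; do printf '>seq%s\nACGT\nTTGA\n' "$i"; done > "$work/queries.fa"

n_keep=3   # placeholder; use 500 for the real test

# Copy lines until the header of record n_keep + 1 is reached.
awk -v n="$n_keep" '/^>/ { if (++seen > n) exit } { print }' \
  "$work/queries.fa" > "$work/subset.fa"

# Number of records in the subset.
grep -c '^>' "$work/subset.fa"
```

Counting headers instead of lines keeps records intact even when sequences wrap over several lines.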

7.7 years ago by GenoMax 141k

0

I tried to align the first 500 sequences and it worked correctly. Then I used the pslPretty command to visualize the results, and the file pretty.out was created, but if I use the command open pretty.out it says no application is able to open it. What program should I use?

7.7 years ago by alessandro_d • 0

0

You should be able to look at the output file with more pretty.out or less pretty.out. It is a text file, so it can also be opened in TextEdit, TextWrangler, etc.

7.7 years ago by GenoMax 141k