Using 'blastn' to do a remote search on NCBI BLAST against the nt collection database
6.2 years ago
maciwuk • 0

I have 60,000 sequences that I want to BLAST against the default 'Nucleotide collection (nt/nr)' database.

Is it possible to do this without setting up a standalone, local BLAST database? I have BLAST+ 2.6.0 installed, but I wonder whether the search can be run remotely on NCBI's servers instead. This is the command I tried:

blastn -db nt -query input-sequences.fasta -remote -out blast_output.out

I get a long list of errors containing strings such as: Unavailable feature GNUTLS, Failed to initialize secure session, Service not found, stack is empty, etc.

QUESTIONS:

  1. Am I doing something wrong in my command?
  2. Is it faster to build a local database and search locally on my own computer for such a large number of sequences?
blast shell unix linux • 9.5k views

I don't recall whether v. 2.6.0 had already moved to https connections. NCBI now uses https for all connectivity, so upgrading to the latest BLAST+ (v. 2.7.1) may not be a bad idea.

If you need to BLAST 60K sequences remotely, submit them in chunks. You don't want to abuse your privileges at NCBI by sending a massive number of searches their way at once. Consider using a loop with sleep times built in.

If you have enough local resources available then doing the search locally will give you more control over things.
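The chunk-and-pause idea above could be sketched roughly as follows. This is only an illustration: the function names, the chunk file naming, and the default 60-second pause are assumptions made for the example, not NCBI recommendations, so check NCBI's current usage guidelines before running anything at scale.

```shell
# Split a FASTA file into chunks of at most N records each.
# Writes chunk_0001.fasta, chunk_0002.fasta, ... into the given directory.
split_fasta() {
    local input="$1" per_chunk="$2" outdir="$3"
    mkdir -p "$outdir"
    awk -v n="$per_chunk" -v dir="$outdir" '
        /^>/ {
            # Start a new chunk file every n header lines.
            if (count % n == 0) {
                if (out) close(out)
                out = sprintf("%s/chunk_%04d.fasta", dir, ++idx)
            }
            count++
        }
        out { print > out }
    ' "$input"
}

# Submit the chunks to NCBI one at a time, sleeping between submissions.
# The pause length (seconds) is an arbitrary placeholder.
blast_chunks_remote() {
    local outdir="$1" pause="${2:-60}"
    local chunk
    for chunk in "$outdir"/chunk_*.fasta; do
        [ -e "$chunk" ] || continue   # nothing matched the glob
        blastn -db nt -query "$chunk" -remote -out "${chunk%.fasta}.out"
        sleep "$pause"
    done
}
```

For example, `split_fasta input-sequences.fasta 100 chunks` followed by `blast_chunks_remote chunks 60` would send 100 sequences per request with a minute between requests.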


Great. I will update to 2.7.1 and will try again. I have access to a computer with 192 GB RAM and 12 physical CPU cores (each @2.2 GHz). Do you think BLASTing 60 thousand sequences will take a substantial amount of time?


What kind of sequences are these? NGS or regular fasta? You may want to use DIAMOND (since you have enough resources available locally) instead of blast. That can speed things up significantly.


These are short DNA sequences (all between 15-30 nt) extracted directly from the UCSC.hg38 and UCSC.mm10 FASTA (chromosome) files. They have some modifications introduced, where usually one nucleotide is replaced by either 'H' (not G) or 'N' (any nucleotide). Supposing a certain sequence is from chromosome 1 of hg38, I want to know whether my modified sequence can also be found on a chromosome other than chr1. I simply want to do a BLAST search to see whether any of these sequences match another chromosome with 100% similarity, where the matched hit is NOT on the chromosome my sequence was originally found on.

The reason BLAST fits this situation so well is that (1) it can optimize the alignment by cutting a few nucleotides from each end (which is exactly what I want, because I am also interested in shorter arms at both ends of the sequence), and (2) it is fine with the 'N' and 'H' nucleotides I have introduced and deals with them in a way that is highly applicable to my end goal. For these reasons, I thought BLAST would be even faster than a regular expression search. I am still not sure whether I should run it locally, though!
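Once tabular BLAST results are available, the "different chromosome, 100% match" filter described above might look something like this. It is a sketch under several assumptions: the hits come from `-outfmt 6` with its default columns (query ID, subject ID, percent identity, ...), each query ID starts with its chromosome of origin in a layout like `chr1:100-125` (a hypothetical naming scheme), subject IDs are plain chromosome names from a single genome, and `off_chromosome_hits` is a made-up helper name.

```shell
# Print only hits with 100% identity whose subject chromosome differs
# from the chromosome encoded in the query ID (assumed "chrN:start-end").
off_chromosome_hits() {
    awk -F'\t' '
        {
            split($1, parts, ":")   # parts[1] = chromosome of origin
            if ($3 == 100 && $2 != parts[1]) print
        }
    ' "$1"
}
```

Note that percent identity only covers the aligned region, so the alignment length column (column 4) would still need to be checked against the query length if full-length matches matter.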


Ah, sorry, then DIAMOND is not an option: it aligns against protein databases, not nucleotide ones. I suggest running the blastn search locally against a smaller subset (the mouse and human genomes) rather than the entire nt database. That will help speed things up.

Remember to use -task blastn-short since you have short sequences.

Edit: BLAT from UCSC may be very fast, but I am not sure whether it handles IUPAC codes. Look into it as well.
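A local search along the lines suggested above might be set up roughly like this. It is a minimal sketch: the function names, file names, and database name are placeholders, and the identity and thread settings are illustrative choices rather than tuned values.

```shell
# Build a nucleotide BLAST database from a genome FASTA file.
build_db() {
    local fasta="$1" name="$2"
    makeblastdb -in "$fasta" -dbtype nucl -out "$name"
}

# Search short queries against a local database.
# Writes tabular output (-outfmt 6), keeping only 100% identity alignments.
search_short() {
    local query="$1" db="$2" out="$3" threads="${4:-4}"
    blastn -task blastn-short -db "$db" -query "$query" \
        -perc_identity 100 -outfmt 6 -num_threads "$threads" -out "$out"
}
```

With the 12-core machine mentioned earlier, a run could be `build_db hg38.fa hg38` followed by `search_short input-sequences.fasta hg38 hits.tsv 12`. For very short queries the default e-value cutoff may discard wanted hits, so adding a looser `-evalue` is something to test on a small sample first.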
