How to download just the genomes I want for blast+: currently using update_blastdb.pl
6.9 years ago
Jacob ▴ 10

Right now I'm running

update_blastdb.pl --timeout 300 refseq_genomic

but this takes up hundreds of GB on my computer. Is there a way to get just the genomes I want? For example, if I only want the genomes for Gallus gallus, Mus musculus, and Homo sapiens, how can I do something similar to download just those?

Please explain things if you can. I'm pretty new to this and not very good at linking FTP databases to my BLAST searches.

blastn update_blastdb.pl refseq_genomic ftp • 3.3k views
6.9 years ago
GenoMax 141k

Get those genomes from the NCBI genomes FTP site (you can cat the chromosome files together) and build the BLAST index yourself using makeblastdb.

Otherwise, UCSC has full FASTA-format genome files (as single-file downloads, with all chromosomes already in one file) for human, mouse, and chicken. Making your own BLAST database works the same way as above and is explained in this manual.

Thank you very much. I've tried this method but cannot get it to work, and I do not know why.

server:database user$ ~/homebrew/bin/wget https://ftp.ncbi.nlm.nih.gov/genomes/Mus_musculus
--2017-05-18 15:35:48--  https://ftp.ncbi.nlm.nih.gov/genomes/Mus_musculus
Resolving ftp.ncbi.nlm.nih.gov... 2607:f220:41e:250::7, 130.14.250.12
Connecting to ftp.ncbi.nlm.nih.gov|2607:f220:41e:250::7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3656 (3.6K) [text/html]
Saving to: ‘Mus_musculus’

Mus_musculus                                      100%[==========================================================================================================>]   3.57K  --.-KB/s    in 0s      

2017-05-18 15:35:48 (63.4 MB/s) - ‘Mus_musculus’ saved [3656/3656]

I then follow this command with the following and get errors that I do not know how to tackle:

server:database user$ cd ..

option 1

server:BlastFolder user$ makeblastdb -in database/Mus_musculus -out database/mouse_genome -dbtype nucl
Building a new DB, current time: 05/18/2017 15:38:55
New DB name:   /Users/user/Desktop/BlastFolder/database/mouse_genome
New DB title:  database/Mus_musculus
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
BLAST options error: database/Mus_musculus does not match input format type, default input type is FASTA

option 2

server:BlastFolder user$ makeblastdb -in database/Mus_musculus -out database/mouse_genome -dbtype nucl -input_type blastdb
BLAST Database error: No alias or index file found for nucleotide database [database/Mus_musculus] in search path [/Users/user/Desktop/BlastFolder::]

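Your wget command is wrong because it points at a directory rather than a specific file. What you downloaded is the HTML directory listing (note the 3.6K size and [text/html] in the wget output), not a FASTA file, which is why makeblastdb rejects it.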
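
One quick way to spot this kind of problem before running makeblastdb is to check whether the downloaded file starts like FASTA. The sketch below is only an illustration: the helper name looks_like_fasta is made up, and it relies on nothing more than the fact that the first non-empty line of a FASTA file begins with '>', which an HTML directory listing does not.

```shell
# looks_like_fasta FILE   (hypothetical helper, not part of BLAST+)
# Succeeds (exit status 0) if the first non-empty line of FILE starts with '>'.
# Compressed input is not handled; uncompress the file first.
looks_like_fasta() {
    # awk prints the first line that contains any non-blank field, then stops
    first_line=$(awk 'NF { print; exit }' "$1" 2>/dev/null)
    case "$first_line" in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Example, using the file name from the post above:
#   looks_like_fasta database/Mus_musculus || echo "not FASTA; check the download URL"
```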

I suggest that you use the UCSC links I provided to make your life simpler. The commands in that case would be:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/galGal5/bigZips/galGal5.fa.gz

After you download the files you will need to gunzip them, or tar -avf them, to uncompress them. Then cat the three genome files together:

cat hg38.chromFa.fa mouse.fa chicken.fa > giant_genome.fa

Finally, run makeblastdb -i giant_genome.fa etc. to make the database.

Use the real file names when cat'ing, and the appropriate options for makeblastdb when you run the final command.

Note: If you want to make separate databases for the three genomes then don't do the cat step.
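
The steps above can be sketched as a small shell script. Treat it as a sketch under assumptions, not a tested recipe: unpack_genome and the output file names (human.fa, mouse.fa, chicken.fa) are made up for illustration, and the archive names are simply those from the wget commands. Two details differ from the text: a gzipped tarball is extracted with tar -xzf (x = extract, z = gunzip, f = archive file), because -avf names no extract operation, and makeblastdb takes -in plus a mandatory -dbtype.

```shell
# unpack_genome ARCHIVE OUT_FASTA   (hypothetical helper, not part of BLAST+)
# Writes the uncompressed FASTA content of ARCHIVE to OUT_FASTA.
#   *.tar.gz : tarball of per-chromosome .fa files, concatenated into one file
#   *.gz     : a single gzipped FASTA file
unpack_genome() {
    archive=$1
    out=$2
    case "$archive" in
        *.tar.gz)
            workdir=$(mktemp -d) || return 1
            # -C extracts into the scratch directory instead of the current one
            tar -xzf "$archive" -C "$workdir" || return 1
            # A tarball may unpack into a subdirectory, so search recursively
            find "$workdir" -type f -name '*.fa' | sort |
                while read -r f; do cat "$f"; done > "$out"
            rm -rf "$workdir"
            ;;
        *.gz)
            # -c writes to stdout and leaves the .gz file in place
            gunzip -c "$archive" > "$out"
            ;;
        *)
            echo "unpack_genome: unrecognised file type: $archive" >&2
            return 1
            ;;
    esac
}

# Intended use with the three downloads (commented out because it needs the
# downloaded files and an installed BLAST+):
#   unpack_genome hg38.chromFa.tar.gz human.fa
#   unpack_genome chromFa.tar.gz      mouse.fa
#   unpack_genome galGal5.fa.gz       chicken.fa
#   cat human.fa mouse.fa chicken.fa > giant_genome.fa
#   makeblastdb -in giant_genome.fa -dbtype nucl -out giant_genome
```

For three separate databases, skip the cat line and run makeblastdb once per FASTA file.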

Thank you. A few questions, though.

Main problem: I'm still getting an error with my makeblastdb command. I first ran

gunzip galGal5.fa.gz

I assume you meant to type -in, because my makeblastdb has no -i option. When I use -in, I get this error:

makeblastdb -in 'database/galGal5.fa'

USAGE
  makeblastdb [-h] [-help] [-in input_file] [-input_type type]
..
..
Error: Argument "dbtype". Mandatory value is missing:  `String, `nucl', `prot''
Error:  (CArgException::eNoArg) Argument "dbtype". Mandatory value is missing:  `String, `nucl', `prot''

When I add some of these mandatory values, I still get an error:

server:BlastFolder user$ makeblastdb -in 'database/galGal5.fa' -out database/chicken_genome -dbtype nucl -input_type blastdb -title "Chicken_genome"

Building a new DB, current time: 05/18/2017 17:35:26
New DB name:   /Users/user/Desktop/BlastFolder/database/chicken_genome
New DB title:  Chicken_genome
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Error: [makeblastdb] Unable to open input database/galGal5.fa as BLAST db
BLAST Database error: No alias or index file found for nucleotide database [database/galGal5.fa] in search path [/Users/user/Desktop/BlastFolder::]

Extra questions:

If I want to get the genomes from the NCBI link I posted, how can I get the specific link?

Is that supposed to be tar -avf? My tar has no -a option.

You will need to go into the individual chromosome directories and get the *fa.gz file for each (e.g. Chr1 for mouse).

Use the UCSC method above. It will save you a bunch of time. The sequence is identical no matter where you get it from.

If you need a primer on Unix, I suggest that you spend some time at this site.

Re: "If I want to get the genomes from the NCBI link I posted, how can I get the specific link"

Trying to extract the genomes you need from the BLAST index for nt or refseq_genomic would be a much more tedious undertaking. You can't do it on the fly, so to speak: you would need to download the entire index locally and then do the extraction. The method I described here is more straightforward.
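
For comparison, if the full refseq_genomic database is already on disk, pulling out selected organisms might look like the sketch below. It rests on assumptions: a recent BLAST+ release and a version 5 database, where blastdbcmd accepts -taxids; the wrapper name extract_by_taxid, the DRY_RUN switch, and the output file name are hypothetical. It still needs the full download first, which is why the UCSC route is simpler.

```shell
# NCBI taxonomy IDs: 9606 = Homo sapiens, 10090 = Mus musculus, 9031 = Gallus gallus
TAXIDS="9606,10090,9031"

# extract_by_taxid DB TAXIDS OUT_FASTA   (hypothetical wrapper around blastdbcmd)
# Dumps every sequence in DB that carries one of the taxonomy IDs as FASTA.
# With DRY_RUN=1 the command is printed instead of executed.
extract_by_taxid() {
    # Rebuild the positional parameters as the full blastdbcmd command line;
    # "$1", "$2", "$3" are expanded before they are overwritten.
    set -- blastdbcmd -db "$1" -taxids "$2" -outfmt '%f' -out "$3"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (prints the command without touching any database):
#   DRY_RUN=1
#   extract_by_taxid refseq_genomic "$TAXIDS" three_genomes.fa
```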

Thank you so much for your help. I edited the comment because it still wasn't working, but I think I just need to change the -input_type to fasta.
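
For reference, a minimal sketch of a makeblastdb call for this FASTA file, assuming the uncompressed file sits at database/galGal5.fa as in the post above. The wrapper name build_nucl_db is made up. Per the usage message quoted earlier, -dbtype only accepts nucl or prot; the option that takes fasta is -input_type, and since the first error message says FASTA is the default input type, it can simply be left out.

```shell
# build_nucl_db FASTA DB_PREFIX TITLE   (hypothetical wrapper around makeblastdb)
build_nucl_db() {
    if [ ! -s "$1" ]; then
        echo "build_nucl_db: input FASTA not found or empty: $1" >&2
        return 1
    fi
    if ! command -v makeblastdb >/dev/null 2>&1; then
        echo "build_nucl_db: makeblastdb is not on PATH" >&2
        return 127
    fi
    # -dbtype nucl : nucleotide database (the other accepted value is prot)
    # -input_type  : omitted, so the default (fasta) applies; 'blastdb' would
    #                mean the input is an existing BLAST database, not a FASTA file
    makeblastdb -in "$1" -dbtype nucl -out "$2" -title "$3"
}

# Example, using the paths from the post above:
#   build_nucl_db database/galGal5.fa database/chicken_genome "Chicken_genome"
```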
