Download All The Bacterial Genomes From NCBI
8
15
11.3 years ago
rehma.ar ▴ 290

Dear all!

I want to download all the bacterial genomes from NCBI. When I check the number of available genomes at NCBI at this link http://www.ncbi.nlm.nih.gov/genome/browse/ it shows the total number of bacterial genomes as 3791. But when I download them from the FTP site ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ using the command wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.fna.tar.gz I get fewer than 2300 genomes.

Can anyone tell me why that is, and how I can download all of them?

ncbi • 57k views
ADD COMMENT
4

A lot of genomes don't have any data. Look at the Chr column in the table: if there is no number, then no sequence is available.

ADD REPLY
14
8.1 years ago
kristjan ▴ 170

NCBI has moved the complete bacterial genomes files on their FTP site to ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/, where they are no longer updated. Does anyone know the reason? And how can the most recent complete genomes be downloaded as a single FASTA file?

ADD COMMENT
8

It's not possible to download the most recent complete bacterial genomes as one fasta file.

What you can do is:

  1. Get the list of assemblies: wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

  2. Parse the addresses of complete genomes from it (right now n = 4,804):

    awk -F '\t' '{if($12=="Complete Genome") print $20}' assembly_summary.txt > assembly_summary_complete_genomes.txt
    
  3. Make a dir for data

    mkdir GbBac
    
  4. Fetch data

    for next in $(cat assembly_summary_complete_genomes.txt); do wget -P GbBac "$next"/*genomic.fna.gz; done
    
  5. Extract data

    gunzip GbBac/*.gz
    
  6. Concatenate data

    cat GbBac/*.fna > all_complete_Gb_bac.fasta
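
Steps 2 and 4 above can also be combined so that each download address names the FASTA file explicitly instead of relying on a wildcard. The following is only a minimal sketch, assuming the column layout used above (assembly level in column 12, directory address in column 20) and that each directory holds a file named after the directory itself plus _genomic.fna.gz. The function name complete_genome_urls and the two sample rows are made up for illustration and are not real NCBI records.

```shell
#!/usr/bin/env bash
# Sketch: print one *_genomic.fna.gz URL per "Complete Genome" row.
# Assumes tab-separated input with the assembly level in column 12 and
# the assembly directory in column 20 (layout taken from the steps above).
complete_genome_urls() {
    awk -F '\t' '
        /^#/ { next }                       # skip header and comment lines
        $12 == "Complete Genome" {
            n = split($20, parts, "/")      # last path component = assembly name
            print $20 "/" parts[n] "_genomic.fna.gz"
        }' "$1"
}

# Hypothetical two-row table for a dry run (placeholder host and names).
make_sample() {
    printf '# header\n'
    printf 'GCA_1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\tComplete Genome\t13\t14\t15\t16\t17\t18\t19\tftp://example.org/all/GCA_1_asmA\n'
    printf 'GCA_2\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\tContig\t13\t14\t15\t16\t17\t18\t19\tftp://example.org/all/GCA_2_asmB\n'
}

sample=$(mktemp)
make_sample > "$sample"
complete_genome_urls "$sample"
```

With the real assembly_summary.txt, the output could be saved to a file and handed to wget -i, as in the other recipes in this thread.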
    

edit. Where can I read about the recent changes to post formatting @ biostars?

ADD REPLY
3

Slightly different reply from 5heikki above -- this includes all bacterial sequences, complete and incomplete.

Here is my recipe, adapted from Case 1 in this document: ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf

-- at Bash/Mac OSX prompt in the desired directory:

curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | awk '{FS="\t"} \!/^#/ {print $20}' | sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA_.+)|\1\2/\2_genomic.fna.gz|' > genomic_file

-- final command, in the same directory, where you want to install the files:

wget -i genomic_file

GenBank is where all current complete and incomplete sequences have been stored and updated since Dec 12, 2015. Note that if you want a different taxonomic branch, you have to look at the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank) and replace "bacteria" in the FTP address above with the folder you'd like. To get only RefSeq genomes (complete and canonically curated sequences), replace "genbank" with "refseq". (This is also described in the factsheet, which states the reverse case.) "NOTE: if you need the assembly submitted to GenBank, you will need to change the curl command's "refseq" to "genbank". Since these assembly's accession initial are different, you will need change sed command's "GCF" to "GCA""

If copying and pasting the lines above, or from the factsheet, gives you an error, I would paste them into a text editor with syntax highlighting (Emacs, textedit, etc.) and re-type any weirdly colored characters, and quotes on principle (right-slanted double quotes are interpreted differently than left-slanted, etc.).

Certain Bash prompts may require an escape character ("\") for special characters used within AWK commands. If you still have errors, as a second round of treatment, try removing the escape \ in front of the !.

Finally, the folder content is not a final database for standalone BLAST. (I used blastn, part of the BLAST+ suite of tools provided by NCBI.) You will have to use makeblastdb (included if you downloaded the suite to get blastn) to reformat and index the files for use by blastn. Furthermore (I had to write to NCBI about this too), the number of files in such a complete taxonomic download is too large for makeblastdb to handle. However, if you cat them into one file, it's fine!

cat *.fna > all_bacteria_fna_files.fna

makeblastdb -in all_bacteria_fna_files.fna -parse_seqids -dbtype nucl -title bacteria -out bacteria

Then, you have to make sure the BLASTDB environment variable includes the folder containing the new database, so that blastn can find it.

export BLASTDB="$BLASTDB:$HOME/genomes/bacteria/genbank_2_3_2016"

Then you can run blastn on your new bacterial database. (Or, as you can see, this should work with any complete taxonomic download.) Good luck!!

ADD REPLY
0

I got an error message with your command:

  0 24.0M    0 15928    0     0   9102      0  0:46:15  0:00:01  0:46:14  9101curl: (23) Failed writing body (0 != 2896)
ADD REPLY
0

Looks like you don't have write permission in the directory in which you're executing the curl command.

ADD REPLY
0

For those from the future: have a look at ctseto's answer further down regarding updates to the .fna links, and avoid spending hours of your precious time trying to fix it (like me =)). The above sed command is now out of date because NCBI changed the link addresses.

ADD REPLY
0

I had to experiment with rattus8's response above because I am working on a MacBook Pro, where the extended set of regular expressions requires GNU sed (I used Homebrew to brew install gsed) and there are some syntax differences. (NB: I also installed GNU awk using brew install gawk.) Here is the command for building a proper file for wget download of bacterial RefSeq as of today:

sudo curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt' | gawk 'BEGIN{FS="\t";} /^#/ {next} {print $20}' | gsed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/.+/)(GCF_.+)|\1\2\/\2_genomic.fna.gz|' > refseq_file

after which, all you need to do is:

wget -i refseq_file

as rattus8 described above...

Enjoy!

ADD REPLY
0

Hi, I used the method described above to download all bacterial genomes present in RefSeq as *.fna.gz files, but I also want to get the *.gff.gz files for RNA-seq analysis. Please help me in this regard.

ADD REPLY
1

Step 4 can be made faster with xargs and 8 parallel jobs, like this:

cat assembly_summary_complete_genomes.txt | xargs -I{} -n1 -P8 wget -P GbBac {}/*_genomic.fna.gz

Thanks @kristian for the very clear tutorial!
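
Before launching thousands of parallel downloads, it can help to check what xargs will actually run. Below is a minimal dry-run sketch, assuming a list with one address per line. Here echo stands in for wget, and the addresses and the function name preview_downloads are placeholders, not part of the recipe above.

```shell
#!/usr/bin/env bash
# Dry run: print the wget command lines that xargs would start,
# running up to 4 at a time. Replace "echo wget" with "wget" to download.
preview_downloads() {
    # -I{} already means one input line per command, so -n1 is not needed.
    xargs -I{} -P4 echo wget -P GbBac "{}" < "$1"
}

list=$(mktemp)
printf '%s\n' \
    'ftp://example.org/all/asmA/asmA_genomic.fna.gz' \
    'ftp://example.org/all/asmB/asmB_genomic.fna.gz' > "$list"

# Parallel jobs may finish in any order, so sort for stable output.
preview_downloads "$list" | sort
```

The number after -P is the maximum number of simultaneous jobs; a modest value is probably kinder to the server than a large one.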

ADD REPLY
0

This was very helpful. Thanks 5heikki.

ADD REPLY
0

It looks like it is still possible to download the most recent complete bacterial genomes as a small number of FASTA files from here:

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/bacteria.*.genomic.fna.gz

These should be straightforward to concatenate into one big FASTA file using zcat.
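
The concatenation could look roughly like the following. This is a minimal sketch that uses gzip -dc, which does the same job as zcat but does not depend on how zcat is set up on a given system. The two tiny input files, their names, and their sequences are made up for the demonstration and are not NCBI data.

```shell
#!/usr/bin/env bash
# Sketch: decompress several gzipped FASTA files into one plain FASTA file.
workdir=$(mktemp -d)

# Two tiny stand-in files (made-up sequences, hypothetical names).
printf '>seq1\nACGT\n' | gzip -c > "$workdir/bacteria.1.genomic.fna.gz"
printf '>seq2\nTTGA\n' | gzip -c > "$workdir/bacteria.2.genomic.fna.gz"

# gzip -dc writes the decompressed data to stdout and keeps the .gz files.
gzip -dc "$workdir"/bacteria.*.genomic.fna.gz > "$workdir/all_bacteria.fna"

# Count the sequences by counting FASTA header lines.
grep -c '^>' "$workdir/all_bacteria.fna"
```

Counting the header lines afterwards is a cheap check that no input file was skipped.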

It is not clear to me, though, what the difference is between the bacterial genomes from the above link and those at ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/

ADD REPLY
7
7.2 years ago

I know that this question is already 4 years old, but I hope that my answer might be useful to others anyway.

I implemented a standardized way to automate the genome retrieval process in R (see biomartr package).

To retrieve all bacterial reference genomes from several database sources one can simply type:

# download all bacterial reference genomes from NCBI RefSeq
biomartr::meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome")

or

# download all bacterial reference genomes from NCBI Genbank
biomartr::meta.retrieval(kingdom = "bacteria", db = "genbank", type = "genome")

Alternatively, you can also specify: type = "proteome", type = "CDS" (coding sequence) or type = "gff".

For more details about downloading specific genomes from specific kingdoms or subkingdoms of life please consult the Meta-Genome Retrieval vignette.

Please note that to promote computational reproducibility in genomics and metagenomics studies, biomartr stores log files for each downloaded genome, proteome, or CDS file.

An example log file looks as follows:

File Name: Escherichia_coli_genomic_refseq.fna.gz

Organism Name: Escherichia_coli

Database: NCBI refseq

URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

Download_Date: Wed Feb 15 15:17:50 2017

refseq_category: reference genome

assembly_accession: GCF_000005845.2

bioproject: PRJNA57779

biosample: SAMN02604091

taxid: 511145

infraspecific_name: strain=K-12 substr. MG1655

version_status: latest

release_type: Major

genome_rep: Full

seq_rel_date: 2013-09-26

submitter: Univ. Wisconsin

I hope this helps.

ADD COMMENT
0

Hi, So what if I want to specifically download all genomes available for a bacterial family (Pasteurellaceae)? Regards Ahmed

ADD REPLY
0

Many thanks for pointing out to me that this functionality might be useful. I sat down and extended the functionality of the meta.retrieval() function which now allows you to specify the "group" argument in addition to the "kingdom" argument. This way, you can download subgroups of species. Unfortunately, NCBI does not provide the family information in their assembly report files that I parse to automatically retrieve the download paths for particular species (only kingdom, group, and subgroup information are available). However, if I am not mistaken, then Pasteurellaceae are members of the class "Gammaproteobacteria". Thus, with biomartr you could now retrieve all bacterial genomes, proteomes, CDS, and gff files that belong to the class "Gammaproteobacteria" as follows:

# retrieve all genomes belonging to Gammaproteobacteria from NCBI RefSeq
meta.retrieval(kingdom = "bacteria", group = "Gammaproteobacteria", db = "refseq", type = "genome")

Please note that for this new functionality of biomartr you need to install the developer version from GitHub (e.g. via the devtools package). In the next CRAN submission, this new functionality will be available.

I also updated the Meta-Genome Retrieval vignette and added some examples of how to retrieve genomes from subgroups of kingdoms. So you might also consult this vignette for more details. I hope this helps you and I am always happy to receive feedback on potential extensions or new features that I could implement into biomartr.

ADD REPLY
3
11.3 years ago
Rahul Sharma ▴ 660

Hi,

How many sequences are you getting with this wget command? On the mentioned link, only 2379 bacterial species have genomic DNA. Click on "Download selected records" and use awk -F"\t" '$5>0' genomes_overview.txt | wc -l.
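
A slightly more defensive variant of that count is sketched here, assuming the downloaded table is tab-separated, has one header row, and keeps the chromosome count in column 5, as the command above implies. In awk, a non-numeric cell such as a header label is compared as a string and can pass a plain $5>0 test, so the sketch forces a numeric comparison. The table contents and the function name count_with_sequence are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count rows whose column 5 is a positive number.
count_with_sequence() {
    # NR > 1 skips a header row; $5 + 0 forces a numeric comparison, so a
    # "-" or empty cell counts as 0 instead of being compared as a string.
    awk -F '\t' 'NR > 1 && $5 + 0 > 0 { n++ } END { print n + 0 }' "$1"
}

table=$(mktemp)
# Hypothetical rows: name, three filler columns, chromosome count.
printf 'Name\tc2\tc3\tc4\tChrs\n'   > "$table"
printf 'Organism A\tx\tx\tx\t1\n'  >> "$table"
printf 'Organism B\tx\tx\tx\t-\n'  >> "$table"
printf 'Organism C\tx\tx\tx\t2\n'  >> "$table"

count_with_sequence "$table"
```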

Best wishes, Rahul

ADD COMMENT
1

Thanks for responding. Yes, that's right, it gives 2379, but I can only download 2258 with the above-mentioned command.

ADD REPLY
3
11.3 years ago
Josh Herr 5.8k

Just adding to what is already here: You are probably able to download "all" of the bacterial genome data that has been released by NCBI.

While NCBI may list 3791 bacterial genomes, these genomes are in various states of completion (in fact, most genomes remain "drafts" for many years, if they are ever designated as non-draft). It's my understanding that NCBI-listed bacterial genome projects may be recorded during any stage of production (intent to sequence, sequencing in progress, or a stage of assembly, annotation, etc.), and you may not be able to download "all" of the "available" genomes in a draft state. Try searching NCBI or elsewhere for contigs of genomes that are not yet fully released. The number of available genomes can change from day to day as NCBI updates genome drafts, updates servers, or moves data from one server to another, so the number is in a constant state of flux: if you wget from the FTP site, the file you download may differ from day to day.

I've found that the GOLD database is a good place to check on the status of a specific genome sequencing project.

ADD COMMENT
2
9.5 years ago
Denise CS ★ 5.2k

Ensembl Bacteria has >15,000 bacterial genomes that are annotated in the INSDC assembly database as complete. There are gene models too. In the next release of Ensembl Genomes, the number will go up to 20,000.

ADD COMMENT
2
7.5 years ago
ctseto ▴ 310

Slight change to the syntax required for those pulling from bacteria.

From the example output, the directory structure:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000149845.2_SJ5/GCF_000149845.2_SJ5_genomic.fna.gz

An example of column 20 from the bacteria assembly summary:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/183/245/GCA_000183245.1_ASM18324v1/GCA_000183245.1_ASM18324v1_genomic.fna.gz

To account for the change from /all/GCF...../GCF...._genomic.fna.gz to GCA/[...]/[...]/[...]/, the new proposed version of the one-liner to construct the URLs for the genomic.fna files is:

curl 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt' | awk '{FS="\t"} !/^#/ {print $20}' | sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA/)([0-9]{3}/)([0-9]{3}/)([0-9]{3}/)(GCA_.+)|\1\2\3\4\5\6/\6_genomic.fna.gz|' > genomic_file

Having never really used sed like this before, some headscratching took place before I got things working.
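
If NCBI changes the directory depth again, a pattern that does not spell out the intermediate folders may be less fragile. The following is a minimal sketch of that idea, not a replacement tested against the live FTP site: it assumes only that the file name is the last path component of the directory address plus a suffix. The function name dir_to_file_url is made up, and any suffix other than _genomic.fna.gz is an assumption about what the directory contains.

```shell
#!/usr/bin/env bash
# Sketch: turn assembly directory addresses (one per line on stdin) into
# file addresses by repeating the last path component before a suffix.
# The default suffix is the one used above; other suffixes are assumptions.
dir_to_file_url() {
    local suffix=${1:-_genomic.fna.gz}
    awk -v suffix="$suffix" '{
        sub(/\/+$/, "")                 # drop any trailing slash
        n = split($0, parts, "/")
        print $0 "/" parts[n] suffix
    }'
}

# Directory part of the example address shown above.
echo 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/183/245/GCA_000183245.1_ASM18324v1' |
    dir_to_file_url
```

In the pipeline above it would take the place of the sed step. Passing a different suffix, for example dir_to_file_url _genomic.gff.gz, would build addresses for annotation files, provided a file with that suffix exists in each directory.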

ADD COMMENT
1

Oh my!!! So many thanks! Spent hours trying to figure it out.

ADD REPLY
0
9.5 years ago
Darko.K • 0

Hi everybody,

I'm looking to download all complete bacterial genomes. There's an option at http://www.ncbi.nlm.nih.gov/genome/browse/ to show only complete prokaryotic genomes (3243), and I'm interested in downloading just these. ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/ provides the possibility to download everything, but that's not what I'm looking for. Does someone know a way to do that? Thank you all!

ADD COMMENT
0
9.5 years ago
Darko.K • 0

How complete are these >15,000 genomes? And is there a way to download all genomes in FASTA (DNA) format with one click?

ADD COMMENT
0

They are annotated in the INSDC (e.g. ENA, the European Nucleotide Archive) as containing the full genome representation, with CDS annotations for example. You may want to contact ENA for further details on completeness. To download all in one go, try wget on ftp://ftp.ensemblgenomes.org/pub/current/bacteria/fasta.

ADD REPLY
