Question

Easiest way to download all Enterobacteria

1

6.1 years ago

Joe 21k

Does anyone have a simple solution for downloading all the RefSeq genomes for a particular taxon?

Using ncbi-genome-download it's possible to specify species or genus TaxIDs and download them, but apparently you can't go higher up the taxonomic ranks (even though Enterobacteria has a TaxID of 543, for instance).

If anyone knows of a way to download all the Enterobacteria I'm all ears.

Alternatively, if there is a method of extracting the species TaxIDs from the Enterobacterial TaxID in NCBI, such that I can pass them all directly to ncbi-genome-download, that would work too.

ncbi genome refseq • 2.7k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 6.1 years ago by Joe 21k

0

6.1 years ago

tdmurphy ▴ 190

You can easily do this from NCBI's Assembly resource: https://www.ncbi.nlm.nih.gov/assembly/?term=Enterobacteria%5Borgn%5D+latest_refseq%5Bfilter%5D

Click the blue "Download Assemblies" button, pick "refseq" and the file type you're after (e.g. genomic FASTA), and it should work. It might take a while for that many genomes.

0

Yeah, I should have specified I was after a command line tool, but this would work for sure.

ADD REPLY • link 6.1 years ago by Joe 21k

score 4 · Accepted Answer · 2018-03-07

4

6.1 years ago

Sej Modha 5.3k

Maybe this would do the trick!

esearch -db genome -query "txid543 [Organism]" | elink -target nuccore | efilter -query "RefSeq" | efetch -format fasta

score 4 · Accepted Answer · 2018-03-08

From speaking with a few other pros, this was the solution in the end (though only very rough at the mo):

Use the ete3 toolkit to get a list of IDs:

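from ete3 import NCBITaxa
import sys

# Taxon name or TaxID to expand (e.g. 543)
taxon_name = sys.argv[1]

ncbi = NCBITaxa()
# Refresh the local copy of the NCBI taxonomy (downloads the taxdump, so this is slow)
ncbi.update_taxonomy_database()
# Collect the TaxIDs of the taxa descending from the requested taxon
ebact = ncbi.get_descendant_taxa(taxon_name)

# Write one TaxID per line
with open('./taxids', 'w') as ofh:
    for i in ebact:
        ofh.write("%s\n" % i)

# At this point, one could import ncbi-genome-download as a Python module and continue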

Which gave me a list of IDs (though this includes ALL descendant taxa, even ones without complete genomes etc.).

I passed these to the latest version of ncbi-genome-download, which accepts a --taxid 12345,65890 format for specifying the IDs.
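
If only species-rank IDs are wanted (as in the original question), one option might be to look up ranks with ete3's get_rank and filter on them. This is a rough sketch rather than what was actually run: keep_rank and species_under are hypothetical helper names, and it assumes the local taxonomy database has already been built. Whether a species-level ID matches the assemblies you want depends on how the download tool matches IDs, which is not checked here.

```python
def keep_rank(rank_by_taxid, wanted="species"):
    """Return the sorted TaxIDs whose rank equals `wanted`."""
    return sorted(t for t, r in rank_by_taxid.items() if r == wanted)


def species_under(taxon):
    """Hypothetical helper: species-rank TaxIDs below `taxon` (name or TaxID)."""
    # Imported lazily so keep_rank can be used without ete3 installed.
    from ete3 import NCBITaxa

    ncbi = NCBITaxa()
    # intermediate_nodes=True also returns internal nodes, not only the leaves,
    # so species that have strain-level children are included.
    descendants = ncbi.get_descendant_taxa(taxon, intermediate_nodes=True)
    # get_rank maps each TaxID to its rank name, e.g. "species" or "genus".
    return keep_rank(ncbi.get_rank(descendants))
```

Calling species_under(543) would then give the species-level IDs to write to the taxids file in place of the full descendant list.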

So I just ran:

for file in *; do
    python ~/bin/ncbi-genome-download/ncbi-genome-download-runner.py -l complete -v -p 10 --taxid "$(paste -s -d ',' "$file")" bacteria
done

I had to split my taxids file up and run this iteratively on the many resulting files, as there is a limit to how many args can be passed to --taxid at once.
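
For anyone who would rather not split the file by hand, the batching could be sketched in Python along these lines. The helper names are made up for illustration, and the default batch_size of 500 is an arbitrary placeholder: the real limit isn't stated here, so adjust it to whatever works.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def taxid_arguments(taxids, batch_size=500):
    """Return one comma-separated --taxid value per batch.

    batch_size=500 is an assumed placeholder, not the tool's actual limit.
    """
    return [",".join(str(t) for t in batch) for batch in chunked(taxids, batch_size)]


def read_taxids(path="./taxids"):
    """Read one TaxID per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

Each string returned by taxid_arguments(read_taxids()) can then be passed as the value of --taxid in a separate run, much like the shell loop above does with one file per batch.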

EDIT Sept 2018:

I contributed a script to the ncbi-genome-download repo to make getting the TaxIDs nice and easy. It uses the approach above, but there’s no need to rewrite it for oneself now.