Question: Easiest way to download all Enterobacteria
1
20 months ago
jrj.healey 12k
United Kingdom

Does anyone have a simple solution to downloading all the refseq genomes for a particular taxon?

Using ncbi-genome-download, it's possible to specify species or genus TaxIDs and download them, but apparently you can't go higher up the taxonomic ranks (even though Enterobacteria has a TaxID of 543, for instance).

If anyone knows of a way to download all the Enterobacteria I'm all ears.

Alternatively, if there is a method of extracting the species TaxIDs from the Enterobacterial TaxID in NCBI, so that I can pass them all directly to ncbi-genome-download, that would work too.

20 months ago jrj.healey 12k • updated 20 months ago tdmurphy • 160
4
14 months ago
jrj.healey 12k
United Kingdom

From speaking with a few other pros, this was the solution in the end (though only very rough at the mo):

Use the ete3 toolkit to get a list of IDs:

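from ete3 import NCBITaxa
import sys

# Taxon name or TaxID given on the command line
taxon_name = sys.argv[1]

ncbi = NCBITaxa()
# Refresh the local copy of the NCBI taxonomy (slow, so only rerun when needed)
ncbi.update_taxonomy_database()
# Collect the descendant taxa of the given taxon
ebact = ncbi.get_descendant_taxa(taxon_name)

# Write one TaxID per line
with open('./taxids', 'w') as ofh:
    for i in ebact:
        ofh.write("%s\n" % i)

# At this point, one could import ncbi-genome-download as a Python module and continue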

Which gave me a list of IDs (though this includes ALL descendant taxa, even ones without complete genomes etc.).

I passed these to the latest version of ncbi-genome-download, which accepts a --taxid 12345,65890 format for specifying the IDs.

So I just ran:

for file in *; do
    python ~/bin/ncbi-genome-download/ncbi-genome-download-runner.py -l complete -v -p 10 --taxid "$(paste -s -d ',' "$file")" bacteria
done

I had to split my taxids file into many smaller files and run this loop over them, as there is a limit to how many IDs can be passed to --taxid at once.
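
For anyone who would rather do the batching in Python than split the file by hand, here is a minimal sketch. The function name chunk_taxids and the batch size are my own placeholders, not part of ncbi-genome-download, and the actual limit on --taxid isn't stated above, so pick a size by trial.

```python
def chunk_taxids(taxids, size):
    """Yield comma-joined batches of at most `size` TaxIDs."""
    for start in range(0, len(taxids), size):
        yield ",".join(str(t) for t in taxids[start:start + size])


if __name__ == "__main__":
    # Hypothetical IDs for illustration; in practice, read them from ./taxids
    example = [561, 562, 573, 590, 620]
    # size=2 is an arbitrary assumption, not the tool's real limit
    for batch in chunk_taxids(example, size=2):
        print(batch)
```

Each printed line is a comma-separated string that could be handed to --taxid in place of the $(paste ...) call in the loop above.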

EDIT Sept 2018:

I contributed a script to the ncbi-genome-download repo to make getting the TaxIDs nice and easy. It uses the approach above, but there’s no need to rewrite it for oneself now.

1

Wow, thanks for this answer. I've just learned about this useful ncbi.get_descendant_taxa() function. Funny, I use the same variable name ofh for an output file, and I always read it as "output file handle".

20 months ago
a.zielezinski
8.6k
0

That’s exactly the way I intend it! It’s quite possible I’ve picked up the habit from some of your answers!

20 months ago
jrj.healey
12k
3
20 months ago
Sej Modha 4.2k
Glasgow, UK

Maybe this would do the trick!

esearch -db genome -query "txid543 [Organism]"|elink -target nuccore|efilter -query "RefSeq"|efetch -format fasta
0
20 months ago
tdmurphy • 160

You can easily do this from NCBI's Assembly resource: https://www.ncbi.nlm.nih.gov/assembly/?term=Enterobacteria%5Borgn%5D+latest_refseq%5Bfilter%5D

Click the blue "Download Assemblies" button, pick "refseq" and the filetype you're after (e.g. genomic FASTA), and it should work. It might take a while for that many genomes.

0

Yeah, I should have specified I was after a command-line tool, but this would work for sure.

20 months ago
jrj.healey
12k
