7 months ago
From speaking with a few other pros, this was the solution in the end (though it is only very rough at the moment):
I used the ete3 toolkit to get a list of IDs:
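import sys
from ete3 import NCBITaxa

# Taxon name (or TaxID) passed as the first command-line argument
taxon_name = sys.argv[1]
ncbi = NCBITaxa()
# Collect the TaxIDs of the taxa descending from the given taxon
ebact = ncbi.get_descendant_taxa(taxon_name)
# Write one TaxID per line
with open('./taxids', 'w') as ofh:
    for i in ebact:
        ofh.write("%s\n" % i)
# At this point, one could import ncbi-genome-download as a python method and continue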
This gave me a list of IDs (though it includes ALL descendant taxa, even ones without complete genomes etc.).
I passed these to the latest version of ncbi-genome-download, which accepts a --taxid 12345,65890 format for specifying the IDs.
So I just ran:
for file in * ; do
    python ~/bin/ncbi-genome-download/ncbi-genome-download-runner.py -l complete -v -p 10 --taxid "$(paste -s -d ',' "$file")" bacteria
done
I had to run this iteratively over many files, having split my taxids file into smaller pieces first, as there is a limit to how many IDs can be passed to --taxid at once.
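For anyone who would rather do the splitting in Python than by hand, here is a rough sketch of the idea. The function names (chunk_taxids, read_taxids) are made up for illustration, and the chunk size is something you would have to pick yourself, since I never worked out the exact limit. It only builds the comma-separated strings for --taxid; it does not call ncbi-genome-download.

```python
from itertools import islice


def chunk_taxids(taxids, chunk_size):
    """Yield comma-joined strings holding at most chunk_size TaxIDs each.

    chunk_size is a value you choose; the real limit is not known here.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(taxids)
    while True:
        # Take the next chunk_size items (fewer on the final chunk)
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield ",".join(str(t) for t in chunk)


def read_taxids(path):
    """Read one TaxID per line from a file, skipping blank lines."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


# Small demonstration with made-up IDs and a chunk size of 2
for arg in chunk_taxids([561, 562, 620, 621, 622], 2):
    print(arg)
```

With the taxids file from the script above, something like chunk_taxids(read_taxids('./taxids'), 100) would give one string per batch, each of which can be passed to --taxid in a separate run. The 100 there is an arbitrary guess, not a documented limit.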
EDIT Sept 2018:
I contributed a script to the ncbi-genome-download repo to make getting the TaxIDs nice and easy. It uses the approach above, but there’s no need to rewrite it for oneself now.