Among other taxonomic groups, I want to download all Hemiptera proteins from NCBI using the CLI tool ncbi-datasets-cli v16.10.1, installed with conda v23.5.0.
I tried the following command, but it returns an error:
datasets download gene taxon 7524
Error: The taxonomy ID '7524' is valid for Hemiptera, but the command 'gene download by taxon' requires an at-or-below-species taxon
Alternatively, I can use the genome function instead of gene:
datasets download genome taxon 7524 --include protein
Whilst this works, it downloads only the proteins associated with genome assemblies, giving ~930,000 rather than the ~1,400,000 listed on NCBI Protein.
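For reference, this is roughly how the protein count from the download can be checked. It is a minimal sketch: the helper name count_fasta_records is my own, and the usage comment assumes the default output name ncbi_dataset.zip with one protein.faa per assembly under ncbi_dataset/data/, which may differ between versions.

```shell
# Count FASTA records read from stdin (each record header starts with '>').
# count_fasta_records is a hypothetical helper name, not part of any NCBI tool.
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so '|| true' keeps the function usable under 'set -e'.
    grep -c '^>' || true
}

# Assumed usage (archive name and internal layout are assumptions):
#   unzip -p ncbi_dataset.zip 'ncbi_dataset/data/*/protein.faa' | count_fasta_records
```

Note that this counts records, not unique sequences, so identical proteins annotated on several assemblies are counted more than once.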
I want to see if there is a significant difference in clustering and redundancy removal with MMseqs when constructing a database for these two similar datasets. I realise most of the additional proteins will be alleles of annotated genes. This is just a test dataset for a later larger project.
Regardless, is there a way to download all proteins for a taxon from NCBI using a CLI tool?