Question

NCBI CLI Download all proteins from Taxid

4 weeks ago

dthorbur ★ 1.9k

Among other taxonomic groups, I want to download all Hemiptera proteins from NCBI using the CLI tool ncbi-datasets-cli v16.10.1, installed with conda v23.5.0.

I tried the following command, but it returns an error:

datasets download gene taxon 7524

Error: The taxonomy ID '7524' is valid for Hemiptera, but the command 'gene download by taxon' requires an at-or-below-species taxon

Alternatively, I can use the genome function over gene:

datasets download genome taxon 7524 --include protein

This works, but it downloads only proteins associated with genome assemblies: ~930,000, rather than the ~1,400,000 listed in the NCBI Protein database.

I want to see if there is a significant difference in clustering and redundancy removal with MMseqs when constructing a database for these two similar datasets. I realise most of the additional proteins will be alleles of annotated genes. This is just a test dataset for a later larger project.

Regardless, is there a way to download all NCBI proteins for a given taxid using a CLI tool?

ncbi • 180 views

updated 4 weeks ago by GenoMax 142k • written 4 weeks ago by dthorbur ★ 1.9k

score 3 · Accepted Answer · 2024-04-02

You can use EntrezDirect as one option. This should fetch 1466558 sequences as of today.

$ esearch -db protein -query "hemiptera" | efetch -format fasta > file.fa
>sp|A0A7D0AGU9.1|TPS_MATON RecName: Full=Terpene synthase; Short=EoTPS
MEGLVNNSGDKDLDEKLLQPFTYILQVPGKQIRAKLAHAFNYWLKIPNDKLNIVGEIIQMLHNSSLLIDD
IQDNSILRRGIPVAHSIYGVASTINAANYVIFLAVEKVLRLEHPEATRVCIDQLLELHRGQGIEIYWRDN
FQCPSEDEYKLMTIRKTGGLFMLAIRLMQLFSESDADFTKLAGILGLYFQIRDDYCNLCLQEYSENKSFC
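
A free-text query such as "hemiptera" may also match records outside the taxon itself. Below is a minimal sketch of a taxonomy-restricted variant of the command above, assuming the standard Entrez `txid<N>[Organism:exp]` syntax. The helper names `taxon_query`, `count_fasta` and `fetch_taxon_proteins` are hypothetical (they are not part of EDirect), and the number of records returned may differ from the figure quoted above.

```shell
# Build an Entrez query restricted to a taxon and all of its descendants.
# taxon_query is a hypothetical helper, not an EDirect command.
taxon_query() {
    printf 'txid%s[Organism:exp]' "$1"
}

# Count FASTA records in a file by counting header lines.
count_fasta() {
    grep -c '^>' "$1"
}

# Fetch every protein record under a taxid into a FASTA file.
# Needs EDirect (esearch, efetch) on PATH and network access.
fetch_taxon_proteins() {
    esearch -db protein -query "$(taxon_query "$1")" | efetch -format fasta > "$2"
}
```

Possible usage: `fetch_taxon_proteins 7524 hemiptera.fa && count_fasta hemiptera.fa`. Comparing the count against the number shown on the NCBI Protein web page is a quick check that the download completed.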

Alternatively, you could get the species-level taxIDs using a utility script included in the BLAST+ distribution, which would then allow you to use datasets.

$ get_species_taxids.sh -t 7524 > taxidlist
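
One way to feed `taxidlist` to datasets is a simple loop, sketched below. It assumes the file holds one taxid per line and that `datasets` is on PATH. `download_proteins_by_taxid` and the `DRY_RUN` switch are hypothetical names introduced for illustration. Species with no records in NCBI Gene will likely make `datasets` exit with an error, so failures are logged rather than stopping the loop.

```shell
# Download protein sequences for each species-level taxid, one zip per taxid.
# Usage: download_proteins_by_taxid <taxid_list_file> <output_dir>
# Set DRY_RUN=1 to print the commands instead of running them.
download_proteins_by_taxid() (
    list="$1"
    outdir="$2"
    mkdir -p "$outdir"
    while IFS= read -r taxid || [ -n "$taxid" ]; do
        # Skip blank lines and anything that is not a plain integer
        case "$taxid" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "datasets download gene taxon $taxid --include protein --filename $outdir/$taxid.zip"
        else
            datasets download gene taxon "$taxid" --include protein \
                --filename "$outdir/$taxid.zip" || echo "failed: $taxid" >&2
        fi
    done < "$list"
)
```

Possible usage: `DRY_RUN=1 download_proteins_by_taxid taxidlist gene_zips` to preview the commands, then the same call without `DRY_RUN` to run them. Note that this route only covers proteins linked to NCBI Gene records, so the total may still fall short of the Protein database count.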