Biostar Beta. Not for public use.
Retrieve species name using taxaIDs of NCBI
18 months ago
chetana • 40
San Diego

Hi everyone,

I have a long list of taxaIDs that I want to map to scientific names (species) and also to lineages. I have looked at the names.dmp file, which maps taxaIDs to names. I tried to pull out the ones I wanted using Python, but names.dmp has multiple rows for some taxaIDs and I only need the scientific name, so I'm not sure how to proceed. I've also tried Entrez efetch, but I think it needs XML input, and I just have a .txt file with a list of taxaIDs. I'm quite new to bioinformatics; any help and suggestions are appreciated. Thanks in advance!


The file 'names.dmp' has four columns. The first column is the taxid, the second column is a name, and the fourth column is the class of the name. A taxid may have several names assigned, each labelled with a name 'class'. Every taxid has exactly one name of class 'scientific name', while the other classes are optional. Thus you can restrict your search to lines that have 'scientific name' in the fourth column. Please compare the output of these two awk searches:

awk -F '|' '$1==9606' names.dmp
awk -F '|' '$1==9606 && $4~/scientific name/' names.dmp

Unfortunately, 'names.dmp' is a bit nasty to parse because of the abundant and unnecessary whitespace in it.

21 months ago
Buenos Aires, Argentina

I'm adding a Python solution that uses Biopython. I feel that although wordier, it is more scalable and readable than a chain of pipes. You just need to specify the filename with your tax IDs; here, I've used human and cat IDs as an example:

(code block missing from the source)

The output can be dumped to a file and read as a CSV:

Homo sapiens,cellular organisms > Eukaryota > Opisthokonta > Metazoa > Eumetazoa > Bilateria > Deuterostomia > Chordata > Craniata > Vertebrata > Gnathostomata > Teleostomi > Euteleostomi > Sarcopterygii > Dipnotetrapodomorpha > Tetrapoda > Amniota > Mammalia > Theria > Eutheria > Boreoeutheria > Euarchontoglires > Primates > Haplorrhini > Simiiformes > Catarrhini > Hominoidea > Hominidae > Homininae > Homo
Felis catus,cellular organisms > Eukaryota > Opisthokonta > Metazoa > Eumetazoa > Bilateria > Deuterostomia > Chordata > Craniata > Vertebrata > Gnathostomata > Teleostomi > Euteleostomi > Sarcopterygii > Dipnotetrapodomorpha > Tetrapoda > Amniota > Mammalia > Theria > Eutheria > Boreoeutheria > Laurasiatheria > Carnivora > Feliformia > Felidae > Felinae > Felis

Cheers!

2.7 years ago
Prasad ♦ 1.6k
India

efetch does not need XML input. Here is a Linux command-line solution:

for i in $(cat file); do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=$i&rettype=docsum&retmode=text" | head -1 | sed -e 's/1. //g' | awk -F "\t" '{print '$i'"\t"$0}'; done


where file is a file containing all the taxa IDs, one per line.
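
For anyone who would rather stay in Python than loop over curl, here is a minimal sketch of the same efetch request using only the standard library. It assumes the E-utilities endpoint accepts a comma-separated id list, so several IDs can go into one request. The helper names (read_taxids, chunks, efetch_url, fetch_xml) and the batch size of 200 are my own choices for illustration, not anything from the posts above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Same endpoint as in the curl command above.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_taxids(path):
    """Return the non-empty lines of a one-ID-per-line text file."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def efetch_url(taxids):
    """Build an efetch URL for a batch of taxonomy IDs (default XML output)."""
    query = urlencode({"db": "taxonomy", "id": ",".join(taxids)}, safe=",")
    return f"{EFETCH}?{query}"


def fetch_xml(taxids):
    """Download the taxonomy records for one batch (needs network access)."""
    with urlopen(efetch_url(taxids)) as response:
        return response.read().decode("utf-8")


# Hypothetical usage, with "file" being the ID list described above:
#     for batch in chunks(read_taxids("file"), 200):  # 200 is an assumed batch size
#         xml_text = fetch_xml(batch)
```

NCBI asks clients to keep request rates low, so a short pause between batches is prudent if the list is long.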


Thanks for the reply Prasad, it worked. Is there a way I can get full lineage using TaxaIDs? Thank you.


Just remove the rettype and retmode parameters from the efetch link. The response is then XML, from which you can parse the full lineage.
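
A rough Python sketch of that parsing step, using the standard library's ElementTree. It assumes the XML has a TaxaSet root whose direct Taxon children carry TaxId, ScientificName and Lineage elements; the SAMPLE string is a hand-written, heavily abbreviated stand-in for a real response, and parse_taxa is a name made up for this example.

```python
import xml.etree.ElementTree as ET


def parse_taxa(xml_text):
    """Yield (taxid, scientific name, lineage) for each top-level Taxon."""
    root = ET.fromstring(xml_text)
    # findall("Taxon") matches direct children only, so the Taxon elements
    # nested inside LineageEx are not reported as separate records.
    for taxon in root.findall("Taxon"):
        yield (
            taxon.findtext("TaxId"),
            taxon.findtext("ScientificName"),
            taxon.findtext("Lineage", default=""),
        )


# Hand-written, abbreviated sample (assumption: real responses follow this shape).
SAMPLE = """<TaxaSet>
  <Taxon>
    <TaxId>9606</TaxId>
    <ScientificName>Homo sapiens</ScientificName>
    <Lineage>cellular organisms; Eukaryota</Lineage>
    <LineageEx>
      <Taxon>
        <TaxId>2759</TaxId>
        <ScientificName>Eukaryota</ScientificName>
      </Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>"""

for taxid, name, lineage in parse_taxa(SAMPLE):
    print(taxid, name, lineage, sep="\t")
```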


For example:

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=9606" | grep -iw lineage | perl -ne '{if(/.*?\>(.*?)\<\/Lineage\>/){print $1,"\n";}}'

16 months ago
China

Try TaxonKit (Cross-platform and Efficient NCBI Taxonomy Toolkit) with the lineage subcommand (usage), which queries the full lineage of given taxids from a file.

TaxonKit is a command-line tool written in the Go programming language. Executable binaries for the most popular operating systems are freely available on the download page. Just download the compressed executable for your operating system, uncompress it and run it. It's very fast!

NCBI taxonomy data is needed: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Example data:

$ cat t.taxid
349741
834


Query lineage:

$ taxonkit lineage --nodes nodes.dmp --names names.dmp t.taxid
349741  cellular organisms;cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
834     cellular organisms;cellular organisms;Bacteria;FCB group;Fibrobacteres;Fibrobacteria;Fibrobacterales;Fibrobacteraceae;Fibrobacter;Fibrobacter succinogenes;Fibrobacter succinogenes subsp. succinogenes

A QIIME-like format can be obtained with the -f flag:

$ taxonkit lineage --nodes nodes.dmp --names names.dmp -f t.taxid
349741  k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila
834     k__Bacteria;p__Fibrobacteres;c__Fibrobacteria;o__Fibrobacterales;f__Fibrobacteraceae;g__Fibrobacter;s__Fibrobacter succinogenes;S__Fibrobacter succinogenes subsp. succinogenes


You can also extract custom levels of rank with reformat (usage). The default format is {k};{p};{c};{o};{f};{g};{s}:

$ taxonkit lineage --nodes nodes.dmp --names names.dmp t.taxid | cut -f 2 | taxonkit reformat | cut -f 2
Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Bacteria;Fibrobacteres;Fibrobacteria;Fibrobacterales;Fibrobacteraceae;Fibrobacter;Fibrobacter succinogenes
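
To make the role of the two taxdump files concrete, here is a small Python sketch of the general idea behind a lineage lookup: take the scientific names from names.dmp and follow parent pointers from nodes.dmp up to the root. This is not TaxonKit's implementation, only an illustration. It assumes the usual taxdump layout (fields separated by tab, pipe, tab, with each row ending in tab, pipe; parent taxid in the second column of nodes.dmp; the root node being its own parent), and all function names are made up for the example.

```python
def split_dmp(line):
    """Split one taxdump row into fields, dropping the trailing '\\t|' marker."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_names(lines):
    """Map taxid -> scientific name, ignoring all other name classes."""
    names = {}
    for line in lines:
        fields = split_dmp(line)
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[fields[0]] = fields[1]
    return names


def load_parents(lines):
    """Map taxid -> parent taxid from nodes.dmp rows."""
    parents = {}
    for line in lines:
        fields = split_dmp(line)
        parents[fields[0]] = fields[1]
    return parents


def lineage(taxid, parents, names):
    """Return names from the root down to `taxid` (empty if taxid is unknown)."""
    path = []
    while taxid in parents:
        path.append(names.get(taxid, taxid))
        parent = parents[taxid]
        if parent == taxid:  # the root node points to itself
            break
        taxid = parent
    return list(reversed(path))


# Hypothetical usage with files unpacked from taxdump.tar.gz:
#     with open("names.dmp") as n, open("nodes.dmp") as p:
#         names, parents = load_names(n), load_parents(p)
#     print(";".join(lineage("349741", parents, names)))
```

Both dictionaries hold the whole taxonomy in memory, which is fine for a one-off script but is one reason a dedicated tool is more comfortable for very large jobs.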

16 months ago
-_- • 830