Retrieve species name using taxaIDs of NCBI
7.5 years ago
chetana ▴ 60

Hi everyone,

I have a long list of taxaID, I want to map them to get the scientific names (species) and also lineage. I have looked at the names.dmp file that maps the taxaID and names. Tried to pull the ones I wanted using python but the names.dmp file has multiple rows for particular taxaID and I only need scientific name. So I'm not sure how to proceed with this. I've even tried the Entrez efetch but I guess it needs an xml input, I just have .txt file with a list of TaxaIDs in there. I'm quite new to Bioinformatics any help and suggestions are appreciated. Thanks in advance!

python biopython • 8.0k views

The file 'names.dmp' has four columns. The first column is the taxid, the second column is a name, and the fourth column is the class of the name. A taxid may be assigned several names, each labelled with a class. Every taxid has exactly one name of class 'scientific name', while the other classes are optional. Thus you can restrict your search to lines that have 'scientific name' in the fourth column. Compare the output of these two awk searches:

awk -F '|' '$1==9606' names.dmp

awk -F '|' '$1==9606 && $4~/scientific name/' names.dmp

Unfortunately, 'names.dmp' is a bit nasty to parse because of the abundant and unnecessary white space in it.
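
The same filter can be sketched in Python, which also takes care of the padding around each field. This is only a minimal sketch: it assumes the usual taxdump layout (fields separated by '|' and padded with tabs), and the function name load_scientific_names is made up for illustration.

```python
def load_scientific_names(lines, wanted=None):
    """Map taxid -> scientific name from an iterable of names.dmp lines.

    `wanted` is an optional set of taxid strings; None keeps every taxid.
    """
    names = {}
    for line in lines:
        # Split on '|' and strip the tabs and spaces that pad each field.
        fields = [field.strip() for field in line.split("|")]
        # Column 4 is the name class; keep only the scientific name rows.
        if len(fields) < 4 or fields[3] != "scientific name":
            continue
        taxid = fields[0]
        if wanted is None or taxid in wanted:
            names[taxid] = fields[1]
    return names


# Small in-memory demo using the layout assumed above.
demo = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
]
print(load_scientific_names(demo))
```

With the real file you would pass an open file handle, for example load_scientific_names(open("names.dmp"), wanted), where wanted is the set of taxids read from your own list.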

7.4 years ago

Try TaxonKit (Cross-platform and Efficient NCBI Taxonomy Toolkit) with the lineage subcommand, which queries the full lineage of the given taxids from a file.

TaxonKit is a command-line tool written in the Go programming language. Executable binaries for the most popular operating systems are freely available on the download page. Just download the compressed executable for your operating system, uncompress it and run it.

It's very fast!

NCBI taxonomy data is needed: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Example data:

$ cat t.taxid
349741
834

Query lineage:

$ taxonkit lineage --nodes nodes.dmp --names names.dmp  t.taxid
349741  cellular organisms;cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
834     cellular organisms;cellular organisms;Bacteria;FCB group;Fibrobacteres;Fibrobacteria;Fibrobacterales;Fibrobacteraceae;Fibrobacter;Fibrobacter succinogenes;Fibrobacter succinogenes subsp. succinogenes

A QIIME-like format can be obtained with the -f flag:

$ taxonkit lineage --nodes nodes.dmp --names names.dmp -f t.taxid
349741  k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila
834     k__Bacteria;p__Fibrobacteres;c__Fibrobacteria;o__Fibrobacterales;f__Fibrobacteraceae;g__Fibrobacter;s__Fibrobacter succinogenes;S__Fibrobacter succinogenes subsp. succinogenes

You can also extract specific ranks with the reformat subcommand. The default format is {k};{p};{c};{o};{f};{g};{s}:

$ taxonkit lineage --nodes nodes.dmp --names names.dmp t.taxid | cut -f 2 | taxonkit reformat | cut -f 2
Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
Bacteria;Fibrobacteres;Fibrobacteria;Fibrobacterales;Fibrobacteraceae;Fibrobacter;Fibrobacter succinogenes
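
For anyone who would rather stay in Python, the idea behind a lineage lookup can be sketched by walking parent links in nodes.dmp. This is not how TaxonKit is implemented, just an illustration. It assumes the first two columns of nodes.dmp are the taxid and its parent taxid, that the root is its own parent, and that names is a dict from taxid strings to scientific names built from names.dmp. The function names are hypothetical.

```python
def load_parents(lines):
    """Map taxid -> parent taxid from an iterable of nodes.dmp lines."""
    parents = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2:
            parents[fields[0]] = fields[1]
    return parents


def lineage(taxid, parents, names):
    """Return the names from just below the root down to `taxid`."""
    path = []
    # Climb until the root, which points to itself, or an unknown taxid.
    while taxid in parents and parents[taxid] != taxid:
        path.append(names.get(taxid, taxid))
        taxid = parents[taxid]
    return list(reversed(path))


# Tiny in-memory demo: root (1) <- cellular organisms (131567) <- Bacteria (2).
demo_nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "131567\t|\t1\t|\tno rank\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\n",
]
demo_names = {"1": "root", "131567": "cellular organisms", "2": "Bacteria"}
print(";".join(lineage("2", load_parents(demo_nodes), demo_names)))
```

Loading the full nodes.dmp into a dict takes some memory, so for very long taxid lists a dedicated tool is likely the more comfortable option.
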
7.4 years ago

I'm adding a Python solution that uses Biopython. I feel that although wordier, it is more scalable and readable than the concatenation of pipes. You just need to specify the filename with your tax IDs; here, I've used human and cat IDs as an example:
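
The script itself is not reproduced in this copy of the thread. A minimal sketch of the approach, which is not the author's original code, could look like the following. It assumes Biopython is installed and that Entrez taxonomy records expose the "ScientificName" and "Lineage" keys; the function names, the file name taxids.txt and the e-mail address are placeholders.

```python
def format_record(record):
    """Turn one Entrez taxonomy record into a 'name,lineage' line."""
    # Entrez separates lineage levels with ';'; swap it for '>' so the
    # only comma on the line is the CSV separator.
    lineage = record["Lineage"].replace(";", " > ")
    return f'{record["ScientificName"]},{lineage}'


def fetch_records(taxids, email):
    """Fetch taxonomy records for a list of taxid strings (needs network)."""
    # Imported here so format_record can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(db="taxonomy", id=",".join(taxids), retmode="xml")
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def main(path, email):
    """Print one 'name,lineage' line per taxid listed in `path`."""
    with open(path) as fh:
        taxids = [line.strip() for line in fh if line.strip()]
    for record in fetch_records(taxids, email):
        print(format_record(record))


# Example call (placeholder values):
# main("taxids.txt", "you@example.org")
```

For a long list it is probably kinder to NCBI to send the IDs in batches rather than in one request.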

The output can be dumped to a file and read as a CSV:

Homo sapiens,cellular organisms >  Eukaryota >  Opisthokonta >  Metazoa >  Eumetazoa >  Bilateria >  Deuterostomia >  Chordata >  Craniata >  Vertebrata >  Gnathostomata >  Teleostomi >  Euteleostomi >  Sarcopterygii >  Dipnotetrapodomorpha >  Tetrapoda >  Amniota >  Mammalia >  Theria >  Eutheria >  Boreoeutheria >  Euarchontoglires >  Primates >  Haplorrhini >  Simiiformes >  Catarrhini >  Hominoidea >  Hominidae >  Homininae >  Homo
Felis catus,cellular organisms >  Eukaryota >  Opisthokonta >  Metazoa >  Eumetazoa >  Bilateria >  Deuterostomia >  Chordata >  Craniata >  Vertebrata >  Gnathostomata >  Teleostomi >  Euteleostomi >  Sarcopterygii >  Dipnotetrapodomorpha >  Tetrapoda >  Amniota >  Mammalia >  Theria >  Eutheria >  Boreoeutheria >  Laurasiatheria >  Carnivora >  Feliformia >  Felidae >  Felinae >  Felis

Cheers!

7.5 years ago
Prasad ★ 1.6k

efetch does not need XML input. Here is a Linux command-line solution:

while read -r i; do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=${i}&rettype=docsum&retmode=text" | head -1 | sed -e 's/^1\. //' | awk -v id="$i" '{print id "\t" $0}'; done < file

where file is a text file with one taxonomy ID per line.


Thanks for the reply, Prasad, it worked. Is there a way I can get the full lineage using the taxonomy IDs? Thank you.


Just remove the rettype and retmode parameters from the efetch URL. The response is then XML, from which you can parse the full lineage.


For example:

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=9606" | grep -iw lineage | perl -ne '{if(/.*?\>(.*?)\<\/Lineage\>/){print $1,"\n";}}'
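
The same extraction can also be sketched in Python with an XML parser instead of grep and a regular expression. This assumes the efetch response is a TaxaSet element whose direct Taxon children carry TaxId, ScientificName and Lineage elements; the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_lineages(xml_data):
    """Return {taxid: (scientific name, lineage)} from efetch taxonomy XML."""
    root = ET.fromstring(xml_data)
    result = {}
    # Look only at direct children of the root: the LineageEx element of
    # each record holds nested <Taxon> elements for the ancestors.
    for taxon in root.findall("Taxon"):
        result[taxon.findtext("TaxId")] = (
            taxon.findtext("ScientificName"),
            taxon.findtext("Lineage"),
        )
    return result


def fetch_lineages(taxids):
    """Download and parse records for a list of taxid strings (needs network)."""
    query = urlencode({"db": "taxonomy", "id": ",".join(taxids)})
    with urlopen(f"{EFETCH}?{query}") as response:
        return parse_lineages(response.read())


# Example call: fetch_lineages(["9606", "9685"])
```

Sending several IDs in one request avoids one HTTP round trip per taxid.
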
7.1 years ago
-_- ★ 1.1k

I converted the whole taxdump into a csv file of lineages, each identified by a tax id, https://github.com/zyxue/ncbitax2lin. You may find it helpful.
