Get taxonomy lineage from an organism's name
9.5 years ago
luanax85 ▴ 20

Hello everybody,

I would like to know whether there is a way to get the full taxonomy lineage of multiple organisms using just an array of their names. Do you know of any useful tool or script?

I know I should use the E-utilities from NCBI, but at the moment I'm not able to!

taxonomy • 9.3k views

That's very helpful! But can I get multiple lineages at the same time? I have 3000 organisms to look up, so it's impossible to run this command one by one!

9.5 years ago

Use the NCBI E-utilities: find the taxon ID of the organism, download its XML record and extract the 'Lineage' tag.

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&term=Homo%20sapiens%5BSCIN%5D" | xmllint --xpath "/eSearchResult/IdList/Id/text()" - | xargs -I TAXON curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=TAXON&retmode=xml&rettype=full" | xmllint --xpath "/TaxaSet/Taxon/Lineage/text()" -

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homininae; Homo
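
For a few thousand names, the same two E-utilities calls can be scripted. The following is only a sketch in Python, not a tested pipeline: the function names, the batch size of 100 and the 0.4 second pause are my own assumptions (the pause is there to stay under NCBI's limit of three requests per second without an API key). It looks up one taxon ID per name with esearch, then fetches the lineages for many IDs per efetch request.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(name):
    """Build the esearch URL that looks up one scientific name."""
    query = urllib.parse.urlencode({"db": "taxonomy", "term": name + "[SCIN]"})
    return EUTILS + "/esearch.fcgi?" + query


def efetch_url(tax_ids):
    """Build one efetch URL for a whole batch of taxon IDs."""
    query = urllib.parse.urlencode(
        {"db": "taxonomy", "id": ",".join(tax_ids), "retmode": "xml"}
    )
    return EUTILS + "/efetch.fcgi?" + query


def parse_ids(esearch_xml):
    """Extract taxon IDs from an esearch response (/eSearchResult/IdList/Id)."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


def parse_lineages(efetch_xml):
    """Map scientific name -> lineage string (top-level /TaxaSet/Taxon only)."""
    root = ET.fromstring(efetch_xml)
    return {
        taxon.findtext("ScientificName"): taxon.findtext("Lineage")
        for taxon in root.findall("./Taxon")
    }


def fetch(url, pause=0.4):
    """Download a URL, then pause (assumed value) to respect the rate limit."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    time.sleep(pause)
    return data


def lineages_for(names, batch_size=100):
    """Resolve every name to taxon IDs, then fetch lineages in batches."""
    ids = []
    for name in names:
        ids.extend(parse_ids(fetch(esearch_url(name))))
    lineages = {}
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        lineages.update(parse_lineages(fetch(efetch_url(batch))))
    return lineages
```

Calling `lineages_for(["Homo sapiens", "Caenorhabditis elegans"])` should return a dictionary keyed by scientific name. Names that esearch cannot resolve are simply absent from the result, so compare the keys against your input list.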
6.0 years ago
magic136 ▴ 20

Just try this NCBI website.

  1. input the list of multiple organism names
  2. select the option of full taxid lineage
  3. select "show on screen" or "save in file"
9.5 years ago
Chris S. ▴ 320

You could also use the REST URLs at the ENA. The value after Taxon: can be a name or an ID, for example:

http://www.ebi.ac.uk/ena/data/view/Taxon:Homo%20sapiens&display=xml

You can also paste a comma-separated list of names like

http://www.ebi.ac.uk/ena/data/view/Taxon:Homo%20sapiens,Taxon:Caenorhabditis%20elegans&display=xml

You'll have to test how many names you can pass in a single URL. In R you could try this:

library(XML)  # provides xmlParse, xpathSApply, getNodeSet and xmlGetAttr

tax <- c("Homo sapiens", "Caenorhabditis elegans")
x <- paste(paste("Taxon:", tax, sep=""), collapse=",")
x
# [1] "Taxon:Homo sapiens,Taxon:Caenorhabditis elegans"

# URLencode() turns the spaces in the names into %20
url <- URLencode(paste("http://www.ebi.ac.uk/ena/data/view/", x, "&display=xml", sep=""))
doc <- xmlParse(url)

xpathSApply(doc, "/ROOT/taxon", xmlGetAttr, "scientificName")
# [1] "Caenorhabditis elegans" "Homo sapiens"

taxa <- getNodeSet(doc, "/ROOT/taxon")
# full lineage (rev() puts the root first)
sapply(taxa, function(y) paste(rev(xpathSApply(y, ".//lineage/taxon", xmlGetAttr, "scientificName")), collapse="; "))

# ranked taxa only
sapply(taxa, function(y) paste(rev(xpathSApply(y, ".//lineage/taxon[@rank]", xmlGetAttr, "scientificName")), collapse="; "))
# [1] "Eukaryota; Metazoa; Nematoda; Chromadorea; Rhabditida; Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis"
# [2] "Eukaryota; Metazoa; Chordata; Craniata; Mammalia; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homininae; Homo"
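
Since the number of names per URL has to be found by testing, it can help to split the list into batches. Here is a rough Python sketch of that idea. The URL pattern and the XML layout (`/ROOT/taxon` with nested `lineage/taxon` elements carrying `scientificName` and optional `rank` attributes) are taken from the example above and may no longer be served by the ENA; the function names and the batch size of 50 are assumptions to adjust.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ENA_VIEW = "http://www.ebi.ac.uk/ena/data/view/"


def ena_urls(names, batch_size=50):
    """Yield one ENA URL per batch of names (batch_size is a guess to tune)."""
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        taxa = ",".join("Taxon:" + urllib.parse.quote(name) for name in batch)
        yield ENA_VIEW + taxa + "&display=xml"


def ena_lineages(xml_text, ranked_only=False):
    """Map each top-level taxon name to its lineage string, root first."""
    root = ET.fromstring(xml_text)
    result = {}
    for taxon in root.findall("./taxon"):
        ancestors = taxon.findall("./lineage/taxon")
        if ranked_only:
            # keep only ancestors that carry a rank attribute
            ancestors = [a for a in ancestors if "rank" in a.attrib]
        # the XML lists the closest ancestor first, so reverse the order
        names = [a.get("scientificName") for a in reversed(ancestors)]
        result[taxon.get("scientificName")] = "; ".join(names)
    return result
```

Each URL from `ena_urls(names)` can then be downloaded and its body passed to `ena_lineages()`. Merging the resulting dictionaries gives one lookup table for the whole list.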
7.7 years ago
-_- ★ 1.1k

The whole NCBI taxonomy database is not that big. I have written some code to convert the NCBI taxdump into lineages: https://github.com/zyxue/ncbitax2lin. You may find it useful.


Hi, I have been using the lineages.csv file that you generated as a lookup table to pull the lineages for my list of taxon IDs. Some taxon IDs don't return a lineage when I look them up with the script I wrote. But when I enter the same IDs on the NCBI Taxonomy website, they are mapped to another taxon ID, called the primary taxon ID. So I was wondering whether lineages.csv is out of date. I've tried generating a new lineages.csv, but the software gives this error:

Traceback (most recent call last):
  File "ncbitax2lin.py", line 224, in <module>
    main()
  File "ncbitax2lin.py", line 192, in main
    lineages_df.sort_values('tax_id', inplace=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2083, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'sort_values'
Makefile:2: recipe for target 'ncbitax2lin' failed
make: *** [ncbitax2lin] Error 1

I'm not that familiar with pandas and am a beginner in Python. What I actually want is the lineage for each taxon ID in my list. Any suggestions would be appreciated. Thanks.
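
One likely explanation for the missing IDs is that they were merged into another taxon: the NCBI taxdump ships a merged.dmp file that maps each retired ID to its current one. Below is a minimal Python sketch of a lookup that falls back to that mapping. It assumes the lineages CSV has a `tax_id` column (as the traceback above suggests); the function names are hypothetical and this is not part of ncbitax2lin.

```python
import csv


def load_merged(lines):
    """Parse merged.dmp lines (old_id | new_id |) into {old_id: new_id}."""
    merged = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and fields[0]:
            merged[fields[0]] = fields[1]
    return merged


def load_lineages(handle):
    """Read a lineages CSV into {tax_id: row}; assumes a 'tax_id' column."""
    return {row["tax_id"]: row for row in csv.DictReader(handle)}


def lookup(tax_id, lineages, merged):
    """Return the lineage row for tax_id, following a merge if needed."""
    tax_id = str(tax_id)
    if tax_id not in lineages:
        # retired ID: try the ID it was merged into
        tax_id = merged.get(tax_id, tax_id)
    return lineages.get(tax_id)  # None if the ID is unknown either way
```

With `merged = load_merged(open("merged.dmp"))` and `lineages = load_lineages(open("lineages.csv"))`, calling `lookup(some_id, lineages, merged)` returns the row as a dictionary, or `None` when the ID is in neither file.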


Sorry for the late reply. As for the error, your pandas version is probably too old: DataFrame.sort_values was only added in pandas 0.17.0. See http://stackoverflow.com/questions/34499728/dataframe-object-has-no-attribute-sort-values.


I have updated lineage.csv.gz and also made the regeneration process clearer. See the updated README file if interested.
