Question

Get taxonomy lineage from an organism's name

1

9.5 years ago

luanax85 ▴ 20

Hello everybody,

I would like to know if there's any way to get the whole taxonomy lineage of multiple organisms simply by using an array of their names. Do you know of any useful tool or script?

I know I should use the E-utilities from NCBI, but at the moment I'm not able to!

taxonomy • 9.3k views

Updated 2.2 years ago by Ram 43k • written 9.5 years ago by luanax85 ▴ 20

0

That's very helpful! But can I get multiple lineages at the same time? I have 3000 organisms to look up, so it's impossible to run this command one by one!

Reply • 9.5 years ago by luanax85 ▴ 20

0

Use a loop over the names, one per line: http://www.linuxquestions.org/questions/programming-9/bash-read-entire-file-line-in-for-loop-240016/

Reply • 9.5 years ago by Pierre Lindenbaum 161k

score 4 · Answer 1 · 2014-10-22

Use the NCBI E-utilities: find the taxon ID of the organism, download its XML record, and extract the Lineage element.

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=taxonomy&term=Homo%20sapiens%5BSCIN%5D" \
  | xmllint --xpath "/eSearchResult/IdList/Id/text()" - \
  | xargs -I TAXON curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=TAXON&retmode=xml&rettype=full" \
  | xmllint --xpath "/TaxaSet/Taxon/Lineage/text()" -

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homininae; Homo
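
For a long list of names, the same two E-utilities calls can be scripted. The following is a minimal Python sketch, not a tested pipeline: the helper names (`esearch_url`, `efetch_url`, `parse_ids`, `parse_lineages`, `lineages_for`) are made up for illustration, the 0.4 second pause assumes no NCBI API key is used, and the batch size of 200 IDs per efetch request is an assumption chosen to keep URLs short.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(name):
    """Build the esearch URL that looks up a scientific name in the taxonomy db."""
    query = urllib.parse.urlencode({"db": "taxonomy", "term": name + "[SCIN]"})
    return EUTILS + "/esearch.fcgi?" + query


def efetch_url(tax_ids):
    """Build one efetch URL for several taxon IDs (comma-separated)."""
    query = urllib.parse.urlencode(
        {"db": "taxonomy", "id": ",".join(tax_ids), "retmode": "xml"}
    )
    return EUTILS + "/efetch.fcgi?" + query


def parse_ids(esearch_xml):
    """Return the taxon IDs listed in an eSearchResult document."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


def parse_lineages(efetch_xml):
    """Map each top-level Taxon's ScientificName to its Lineage string."""
    root = ET.fromstring(efetch_xml)
    return {
        taxon.findtext("ScientificName"): taxon.findtext("Lineage")
        for taxon in root.findall("./Taxon")
    }


def fetch(url):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def lineages_for(names, pause=0.4, batch_size=200):
    """Look up every name, then fetch the lineages in batches of IDs."""
    tax_ids = []
    for name in names:
        tax_ids.extend(parse_ids(fetch(esearch_url(name))))
        time.sleep(pause)  # assumed pause to stay under NCBI's rate limit
    lineages = {}
    for start in range(0, len(tax_ids), batch_size):
        batch = tax_ids[start:start + batch_size]
        lineages.update(parse_lineages(fetch(efetch_url(batch))))
        time.sleep(pause)
    return lineages
```

Calling `lineages_for(["Homo sapiens", "Caenorhabditis elegans"])` would return a dictionary keyed by scientific name. Names that esearch does not recognise are silently skipped in this sketch, so a real script would probably want to report them.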

Ram · Answer 2 · 2014-10-22

2

9.5 years ago

jhc ★ 3.0k

Check this: https://github.com/jhcepas/ncbi_taxonomy#get-ncbi-topology-from-species-names

You can also make it work with fuzzy name matching:

https://github.com/jhcepas/ncbi_taxonomy#translate-names-using-fuzzy-search-queries

Updated 2.2 years ago by Ram 43k • written 9.5 years ago by jhc ★ 3.0k

Ram · Answer 3 · 2018-04-09

2

6.0 years ago

magic136 ▴ 20

Just try this NCBI website:

Input the list of organism names.
Select the option for the full taxid lineage.
Select "show on screen" or "save in file".

Updated 2.2 years ago by Ram 43k • written 6.0 years ago by magic136 ▴ 20

Ram · Answer 4 · 2014-10-22

You could also use the REST URLs at the ENA. The value after Taxon: can be a name or an ID, for example:

http://www.ebi.ac.uk/ena/data/view/Taxon:Homo%20sapiens&display=xml

You can also paste a comma-separated list of names like

http://www.ebi.ac.uk/ena/data/view/Taxon:Homo%20sapiens,Taxon:Caenorhabditis%20elegans&display=xml

You'll have to test how many names you can pass in a single URL. In R you could try this:

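library(XML)  # provides xmlParse, xpathSApply, getNodeSet and xmlGetAttr

tax <- c("Homo sapiens", "Caenorhabditis elegans")
x <- paste(paste("Taxon:", tax, sep=""), collapse=",")
x
[1] "Taxon:Homo sapiens,Taxon:Caenorhabditis elegans"

# URLencode() escapes the spaces in the names before the request is made
url <- URLencode(paste("http://www.ebi.ac.uk/ena/data/view/", x, "&display=xml", sep=""))
doc <- xmlParse(url)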

xpathSApply(doc, "/ROOT/taxon",  xmlGetAttr, "scientificName")
[1] "Caenorhabditis elegans" "Homo sapiens"

x <- getNodeSet(doc, "/ROOT/taxon")
# full lineage
sapply(x, function(y) paste(rev(xpathSApply(y, ".//lineage/taxon", xmlGetAttr, "scientificName")), collapse="; ") )

# ranks only
sapply(x, function(y) paste(rev(xpathSApply(y, ".//lineage/taxon[@rank]", xmlGetAttr, "scientificName")), collapse="; ") )

[1] "Eukaryota; Metazoa; Nematoda; Chromadorea; Rhabditida; Rhabditoidea; Rhabditidae; Peloderinae; Caenorhabditis"
[2] "Eukaryota; Metazoa; Chordata; Craniata; Mammalia; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homininae; Homo"
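
Since the number of names a single URL can carry has to be found by trial, one option is to split the list into batches and build one URL per batch. Here is a small Python sketch of that idea. The function name `ena_urls` and the default batch size of 50 are arbitrary assumptions, and the URL pattern is copied from the examples above without checking what the server currently accepts.

```python
from urllib.parse import quote

# URL pattern taken from the examples in this answer
ENA_VIEW = "http://www.ebi.ac.uk/ena/data/view/"


def ena_urls(names, batch_size=50):
    """Build one comma-separated Taxon: URL per batch of names.

    batch_size is a guess; lower it if the server rejects long URLs.
    """
    urls = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        # quote() turns the space in "Homo sapiens" into %20
        taxa = ",".join("Taxon:" + quote(name) for name in batch)
        urls.append(ENA_VIEW + taxa + "&display=xml")
    return urls
```

Each returned URL can then be downloaded and parsed in the same way as the single request in the R example.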

Ram · Answer 5 · 2016-08-15

0

7.7 years ago

-_- ★ 1.1k

The whole NCBI taxonomy database is not that big. I have written some code to convert the NCBI taxdump into lineages: https://github.com/zyxue/ncbitax2lin You may find it useful.
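
To illustrate the general idea of turning the taxdump into lineages (this is not the ncbitax2lin code, only a rough Python sketch), the parent links in nodes.dmp can be followed up to the root and translated with the scientific names from names.dmp. The function names are hypothetical, and the sketch assumes the usual taxdump layout in which fields are separated by a tab, a pipe and a tab.

```python
def parse_dmp(lines):
    """Yield the list of fields for each taxdump row."""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the row terminator
        yield line.split("\t|\t")


def load_nodes(lines):
    """Map tax_id -> parent tax_id from nodes.dmp rows."""
    return {row[0]: row[1] for row in parse_dmp(lines)}


def load_names(lines):
    """Map tax_id -> scientific name from names.dmp rows."""
    return {
        row[0]: row[1]
        for row in parse_dmp(lines)
        if row[3] == "scientific name"
    }


def lineage(tax_id, parents, names):
    """Walk parent links up to the root; return names ordered root first."""
    path = []
    while tax_id in parents:
        path.append(names.get(tax_id, tax_id))
        parent = parents[tax_id]
        if parent == tax_id:  # the root node (tax_id 1) is its own parent
            break
        tax_id = parent
    return list(reversed(path))


# Possible usage, assuming the taxdump archive was unpacked in the
# current directory:
#   with open("nodes.dmp") as fh:
#       parents = load_nodes(fh)
#   with open("names.dmp") as fh:
#       names = load_names(fh)
#   print("; ".join(lineage("9606", parents, names)))
```

Tax IDs are kept as strings throughout, so IDs read from another file need no conversion. An ID that is absent from nodes.dmp yields an empty list.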

Written 7.7 years ago by -_- ★ 1.1k

0

Hi, I have been using the lineages.csv file that you generated as a lookup table to pull the lineages for my list of taxa IDs. Some taxa IDs don't return a lineage when I try to extract them using the script I wrote. But when I submit the same list of IDs to the NCBI Taxonomy website, they are mapped to another taxa ID, called the primary taxa. So I was wondering whether lineages.csv is out of date. I've tried generating a new lineages.csv, but the software gives this error:

Traceback (most recent call last):
  File "ncbitax2lin.py", line 224, in <module>
    main()
  File "ncbitax2lin.py", line 192, in main
    lineages_df.sort_values('tax_id', inplace=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2083, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'sort_values'
Makefile:2: recipe for target 'ncbitax2lin' failed
make: *** [ncbitax2lin] Error 1

I'm not that familiar with pandas and I'm a beginner in Python. What I actually want is to get the lineages for my list of taxa IDs. Any suggestions would be appreciated. Thanks.

Reply • updated 2.2 years ago by Ram 43k • written 7.2 years ago by chetana ▴ 60

0

Sorry for the late reply. As for the error, your pandas version is probably not new enough (sort_values was added in pandas 0.17): http://stackoverflow.com/questions/34499728/dataframe-object-has-no-attribute-sort-values
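
Upgrading pandas is the simpler fix. If upgrading is not possible, a workaround along the following lines might help. It is a sketch only: `sort_by_column` is a hypothetical helper, not part of ncbitax2lin, and it relies on older pandas releases offering `DataFrame.sort(columns=...)` where newer ones offer `sort_values`.

```python
def sort_by_column(frame, column):
    """Return `frame` sorted by `column` on both old and new pandas releases."""
    if hasattr(frame, "sort_values"):
        # pandas 0.17 and later
        return frame.sort_values(column)
    # older pandas: DataFrame.sort(columns=...) was the equivalent call
    return frame.sort(columns=column)
```

With such a helper, the failing line in the traceback could be written as `lineages_df = sort_by_column(lineages_df, 'tax_id')`, which returns a sorted copy instead of sorting in place.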

Reply • 7.2 years ago by -_- ★ 1.1k

0

I have updated the lineage.csv.gz, and also made the regeneration process clearer. See the updated README file if interested.

Reply • 7.2 years ago by -_- ★ 1.1k
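
On the IDs that return no lineage but are mapped to a different ID on the NCBI website: one possible explanation, offered as an assumption rather than a diagnosis, is that those IDs have been merged into other taxa. The taxdump archive ships a merged.dmp file that lists old and new tax IDs, so retired IDs can be translated before the lookup. A minimal Python sketch, with hypothetical function names:

```python
def load_merged(lines):
    """Map old tax_id -> new tax_id from merged.dmp rows."""
    merged = {}
    for line in lines:
        # each row looks like: old_tax_id <tab>|<tab> new_tax_id <tab>|
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and fields[0]:
            merged[fields[0]] = fields[1]
    return merged


def current_tax_id(tax_id, merged):
    """Follow merge records until an ID that is not merged is reached."""
    seen = set()
    while tax_id in merged and tax_id not in seen:
        seen.add(tax_id)  # guards against an accidental cycle
        tax_id = merged[tax_id]
    return tax_id


# Possible usage, assuming merged.dmp sits in the current directory:
#   with open("merged.dmp") as fh:
#       merged = load_merged(fh)
#   tax_id = current_tax_id("12", merged)
```

IDs that are not listed in merged.dmp come back unchanged, so the function can be applied to the whole list before it is matched against the lineage table.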