Question: Automatic Data Extraction From Timetree
1

Does anyone know how to programmatically extract information from http://timetree.org/?

I have to build a 40x40 matrix of divergence times between species, and my wrist is starting to hurt because I have to enter all the pairwise combinations manually.

UPDATE: The solutions provided below have stopped working.

6.8 years ago Biojl ♦ 1.6k • updated 3.4 years ago Biostar 20
0

Any chance you have or know of a new solution to this problem? Would love to get some of the data off the site.

4.5 years ago
UnivStudent
• 380
0

No, sorry. I stopped using timetree.org, since without permission to extract data automatically it is of little use in science. It's just a curiosity to show friends on your phone.
You can give DateLife.org a try (see the last response). It didn't work for me and I don't know whether it's still in development. Test it and report your results!

4.5 years ago
Biojl
♦ 1.6k
4

Say you have a text file containing a list of organisms:

$ cat input.txt
Homo Sapiens
Drosophila melanogaster
Canis lupus familiaris
Escherichia coli

The following bash script sends one request per pair of organisms with curl and extracts the distance with xmllint/XPath:

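#!/bin/bash
# Split on newlines only, so multi-word species names stay intact.
IFS="
"
# Spaces become "+" so the names can go straight into the query string.
cat input.txt | tr " " "+" | while read O1
do
    cat input.txt | tr " " "+" | while read O2
    do
        # Query each unordered pair once and skip self-comparisons.
        if [[ "${O1}" < "${O2}" ]]
        then
            # Fetch the result page, pull the date out of the "panel year block"
            # span, and wrap it in an SQL insert statement.
            curl -s "http://timetree.org/index.php?taxon_a=${O1}&taxon_b=${O2}&submit=Search" |\
            xmllint --html --format --xpath 'concat("insert into SPECIES(org1,org2,dist) values (__QUOTE____A____QUOTE__,__QUOTE____B____QUOTE__,__QUOTE__",normalize-space(//span[@class="panel year block"][h1]),"__QUOTE__);#")' - 2> /dev/null |\
            tr "#" "\n" |
            sed -e "s/__A__/${O1}/g" |
            sed -e "s/__B__/${O2}/g" |
            sed -e "s/__QUOTE__/'/g" |
            tr "+" " "
        fi
    done
done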

Result:

~$ bash organisms.sh 
insert into SPECIES(org1,org2,dist) values ('Drosophila melanogaster','Homo Sapiens','782.7 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Drosophila melanogaster','Escherichia coli','2535.8 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Homo Sapiens','94.2 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Drosophila melanogaster','782.7 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Canis lupus familiaris','Escherichia coli','2535.8 Million Years Ago');
insert into SPECIES(org1,org2,dist) values ('Escherichia coli','Homo Sapiens','2535.8 Million Years Ago');
6.8 years ago Pierre Lindenbaum 120k
0

That's awesome! Unfortunately it's not working for me. I'm trying to figure out what's happening. I suspect it's the --xpath argument to xmllint: I don't see it in the manual, nor can I guess what it should be doing.

6.8 years ago
Biojl
♦ 1.6k
1
$ xmllint --version
xmllint: using libxml version 20708
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib
6.8 years ago
Pierre Lindenbaum
120k
0

Ok. Apparently I have version 20706. I'll update it!

6.8 years ago
Biojl
♦ 1.6k
0

I'm not sure that will fix it. I've seen some versions of xmllint that lack the '--xpath' argument. But there are many other ways to extract this information: XSLT, /usr/bin/xpath, a simple grep "Million Years", etc.
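
For anyone without a suitable xmllint, here is a minimal Python sketch of the grep "Million Years" idea. It assumes the result page still prints the date as a number followed by "Million Years Ago", as in the output above; the function name `extract_mya` and the sample HTML string are made up for illustration, and nothing here contacts the site.

```python
import re

# Assumption: the date is rendered as "<number> Million Years Ago",
# as in the script output shown earlier in this thread.
DATE_RE = re.compile(r"(\d+(?:\.\d+)?)\s+Million\s+Years\s+Ago", re.IGNORECASE)


def extract_mya(page_text):
    """Return the first divergence time found (in millions of years), or None."""
    match = DATE_RE.search(page_text)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical fragment shaped like the span the XPath expression targets.
    sample = '<span class="panel year block"><h1>782.7 Million Years Ago</h1></span>'
    print(extract_mya(sample))
```

Returning None when nothing matches makes it easy to spot pairs for which the page had no date (or the layout changed).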

6.8 years ago
Pierre Lindenbaum
120k
0

In the end I decided to implement it in Python. It might be slower, but the output is exactly what I want. Your solution was my inspiration, thank you!
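
The Python script itself was not posted, so the following is only a sketch of one way the pairwise results could be turned into the symmetric matrix asked for in the question. The function names `divergence_matrix` and `to_tsv` are hypothetical; the example values are the ones from the bash output above.

```python
def divergence_matrix(species, pair_times):
    """Build a symmetric matrix (list of lists) from {(a, b): time} results.

    Missing pairs stay None; the diagonal is set to 0.0.
    """
    index = {name: i for i, name in enumerate(species)}
    n = len(species)
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 0.0
    for (a, b), mya in pair_times.items():
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = mya  # fill both triangles
    return matrix


def to_tsv(species, matrix):
    """Render the matrix as tab-separated text with a header row and column."""
    lines = ["\t".join([""] + list(species))]
    for name, row in zip(species, matrix):
        cells = ["NA" if value is None else str(value) for value in row]
        lines.append("\t".join([name] + cells))
    return "\n".join(lines)


if __name__ == "__main__":
    species = ["Homo Sapiens", "Drosophila melanogaster", "Canis lupus familiaris"]
    # Times in millions of years, taken from the output posted above.
    pair_times = {
        ("Drosophila melanogaster", "Homo Sapiens"): 782.7,
        ("Canis lupus familiaris", "Homo Sapiens"): 94.2,
        ("Canis lupus familiaris", "Drosophila melanogaster"): 782.7,
    }
    print(to_tsv(species, divergence_matrix(species, pair_times)))
```

Only one query per unordered pair is needed, since the matrix is filled symmetrically.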

6.8 years ago
Biojl
♦ 1.6k
4

There is no official way to automate this process, but check out the URLs:

http://timetree.org/index.php?taxon_a=homo&taxon_b=pongo&submit=Search

It should be straightforward to pick your favourite scripting language, build URLs for each comparison and (maybe with a bit more difficulty) parse the dates out of the resulting pages.

It's just a matter of deciding whether the time spent writing the scripts is worth avoiding the pain in your wrist.
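
As a rough illustration of the URL-building step, here is a Python sketch. It assumes the index.php?taxon_a=...&taxon_b=...&submit=Search pattern quoted above is still what the site expects, which may no longer be true; `pairwise_urls` is a made-up helper name. It only builds the URLs and does not fetch anything.

```python
from itertools import combinations
from urllib.parse import urlencode

# Assumption: the query pattern quoted in this answer; the site may have changed.
BASE_URL = "http://timetree.org/index.php"


def pairwise_urls(species):
    """Return {(a, b): url} with one entry per unordered pair of names."""
    urls = {}
    for a, b in combinations(species, 2):
        # urlencode turns spaces into "+", e.g. "Homo sapiens" -> "Homo+sapiens".
        query = urlencode({"taxon_a": a, "taxon_b": b, "submit": "Search"})
        urls[(a, b)] = f"{BASE_URL}?{query}"
    return urls


if __name__ == "__main__":
    for pair, url in pairwise_urls(["homo", "pongo", "Canis lupus familiaris"]).items():
        print(pair, url)
```

For the 40 species in the question this gives 40 * 39 / 2 = 780 URLs rather than 1600, because each pair is only listed once and self-comparisons are skipped.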

6.8 years ago David W 4.7k
3

Whoops, my scant answer crossed with Pierre's much more complete one. Should change mine to "do what Pierre says" :-)

6.8 years ago
David W
4.7k
4

Note that TimeTree asks that you don't do this; from the bottom of their page: "Currently large scale, automated, data-mining is not permitted". I haven't tested to see if it's possible (I imagine it would be, though an easy thing to do on their end would be to block your IP eventually), but they don't want you to.

We've been building a more open alternative to TimeTree called DateLife.org. It still needs more trees (TimeTree is _much_ better populated) but we encourage scraping, downloading the source, downloading the set of trees, etc. Let me know if you have patches or more trees for it.

6.8 years ago omeara.brian • 50
0

Very good initiative, I'll take a look. I fail to see why TimeTree does not provide tools to mine their database. To me it's a terrible mistake that encourages researchers not to use it.

6.8 years ago
Biojl
♦ 1.6k
