Question

How Do People Go About Pubmed Text Mining

12.8 years ago

Lyco ★ 2.3k

I am asking this purely out of curiosity (I have no plans to actually do this), but I wonder how the various groups who analyse Pubmed abstracts for co-citation or gene interaction (or anything else) actually access the data.

Do they run a lot of batch Pubmed searches and parse the results, do they access the Pubmed database via some kind of web-service API, or is there a way of downloading the entire corpus and doing the analyses locally?

Assuming that there are several possibilities, what would be the preferred way?

pubmed text • 37k views

updated 12.8 years ago by Chris Evelo 10k • written 12.8 years ago by Lyco ★ 2.3k

Ram · Answer 1 · 2011-07-08

I think the answer depends on the scale of your problem. If you want to analyze hundreds or thousands of documents, then use NCBI eutils to fetch documents from PubMed. If you have to do hardcore text/data-mining on millions of documents, you'll need to get a local copy of MEDLINE and PubMed Central. For MEDLINE, this involves getting a license. For PubMed Central, you can download the Open Access subset by FTP without a license.
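For the small-scale route, a minimal sketch of building eutils `efetch` requests for a list of PubMed IDs could look like the following. The batch size and the PMIDs in the usage example are placeholders chosen for illustration, and the download itself is left as a comment so the sketch runs offline.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def efetch_urls(pmids, batch_size=200):
    """Build one efetch URL (XML output) per batch of PubMed IDs.

    batch_size=200 is an arbitrary illustrative default, not an NCBI rule.
    """
    urls = []
    for batch in chunked(list(pmids), batch_size):
        query = urlencode({
            "db": "pubmed",
            "id": ",".join(str(pmid) for pmid in batch),
            "retmode": "xml",
        })
        urls.append(f"{EFETCH_URL}?{query}")
    return urls


if __name__ == "__main__":
    # Placeholder PMIDs, for illustration only.
    for url in efetch_urls([11111111, 22222222, 33333333], batch_size=2):
        print(url)
        # To download: urllib.request.urlopen(url).read()
```

Batching keeps the number of requests down, which matters once you get beyond a few hundred documents.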

We use local copies of MEDLINE and PMC for our text-mining work, and access text either through SQL databases (MEDLINE, PMC) or sometimes the filesystem (PMC w/supplements). MEDLINE is fairly straightforward to work with, although a common gotcha is that many citations have no abstracts, so you need to account for this in any quantification of your results. PMC is much more difficult, since text comes in XML, plain text and PDF, and supplemental files come in a bewildering diversity of forms. A PMC gotcha is that not all PMC documents are in Pubmed, and quantification must extrapolate from the 1% of the literature that is PMC OA to the totality of Pubmed.
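To illustrate the missing-abstract gotcha, here is a rough sketch that parses PubMed-style XML and reports how many citations actually carry an abstract. The element names follow the PubMed XML layout (`MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`); the two records in `SAMPLE_XML` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Two made-up citations: one with an abstract, one without.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>A made-up title with an abstract</ArticleTitle>
        <Abstract><AbstractText>Made-up abstract text.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <ArticleTitle>A made-up title without an abstract</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_citations(xml_text):
    """Return one dict per citation; `abstract` is None when it is missing."""
    root = ET.fromstring(xml_text)
    records = []
    for citation in root.iter("MedlineCitation"):
        # Structured abstracts have several AbstractText sections; join them.
        parts = citation.findall("./Article/Abstract/AbstractText")
        abstract = " ".join("".join(p.itertext()).strip() for p in parts)
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("./Article/ArticleTitle", default=""),
            "abstract": abstract or None,
        })
    return records


def abstract_coverage(records):
    """Return (citations with an abstract, total citations)."""
    with_abstract = sum(1 for record in records if record["abstract"])
    return with_abstract, len(records)


if __name__ == "__main__":
    with_abstract, total = abstract_coverage(parse_citations(SAMPLE_XML))
    print(f"{with_abstract} of {total} citations have an abstract")
```

Reporting results per citation with an abstract, rather than per citation, avoids understating any frequency you compute from abstract text.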

There are published methods for transforming MEDLINE into a SQL database, but they are likely out of date. It would probably make a good BioStar question to ask about the best method to parse MEDLINE into a SQL DB (code golf, anyone?). Lars has reformatted PMC OA here, and it would probably be good to get his advice/code on how best to do this.
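As a starting point for the MEDLINE-to-SQL question, a minimal sketch with SQLite might look like this. The table name, the three columns and the sample records are assumptions made for the example; it is not one of the published methods mentioned above, and a real schema would need authors, journals, MeSH terms and so on.

```python
import sqlite3


def load_citations(conn, records):
    """Insert parsed citations (dicts with pmid, title, abstract) into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation ("
        "pmid INTEGER PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    # INSERT OR REPLACE lets a later, revised record overwrite an earlier one.
    conn.executemany(
        "INSERT OR REPLACE INTO citation (pmid, title, abstract) VALUES (?, ?, ?)",
        [(int(r["pmid"]), r["title"], r["abstract"]) for r in records],
    )
    conn.commit()


if __name__ == "__main__":
    # Made-up records; a missing abstract is stored as NULL.
    records = [
        {"pmid": "1", "title": "First made-up title", "abstract": "Some text."},
        {"pmid": "2", "title": "Second made-up title", "abstract": None},
    ]
    conn = sqlite3.connect(":memory:")
    load_citations(conn, records)
    total, with_abstract = conn.execute(
        "SELECT COUNT(*), COUNT(abstract) FROM citation"
    ).fetchone()
    print(total, with_abstract)
```

Storing missing abstracts as NULL means `COUNT(abstract)` gives the number of citations that have one, which helps with the quantification issue described earlier.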

score 4 · Answer 2 · 2011-07-08

You could try Textpresso:

http://gmod.org/wiki/Textpresso http://www.textpresso.org/

which is a tool for analyzing whatever corpus you feed it. It knows about biological terms, so you can search for things like "gene A suppress gene B"; it will do a semantic search of the corpus for that and then return the full sentences that support the result. You can look at running examples for E. coli here:

http://ecocyc.org/ecocyc/textpresso.shtml

and an older version for C. elegans here:

http://www.textpresso.org/celegans/

score 3 · Answer 3 · 2011-07-08

We actually published what we did here. You can also do less complicated but interesting things like what we did here (sorry, not open access).

Updated in response to the comments.

We downloaded content from Pubmed (using the license for the full content, although the abstracts alone would have done). Our approach is content-directed, and that is central to the way we build the corpus.

We first tokenised the text (you could say broke it into words) and then combined the words into combined tokens that are lemmas. These are meaningful terms: glutathione S-transferase would be one token, not two or three, and a lemma should cover all synonyms. Being able to find the lemmas is one of the reasons semantic web approaches like the concept web are so important. We counted the occurrences of those in each individual abstract.
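A toy sketch of that tokenise-then-merge step is shown below. The `TERM_LEMMAS` dictionary is hypothetical and stands in for a real terminology resource; the longest-match-first lookup is an assumption made for the example, not necessarily how the published pipeline worked.

```python
import re
from collections import Counter

# Hypothetical term dictionary: every surface form maps to a single lemma.
TERM_LEMMAS = {
    ("glutathione", "s-transferase"): "glutathione_s_transferase",
    ("gst",): "glutathione_s_transferase",
    ("beta-carotene",): "beta_carotene",
    ("β-carotene",): "beta_carotene",
}
MAX_TERM_LEN = max(len(key) for key in TERM_LEMMAS)


def tokenise(text):
    """Lower-case the text and split it into word-like tokens (hyphens kept)."""
    return re.findall(r"[\w-]+", text.lower())


def lemmatise(tokens):
    """Replace known single- or multi-word terms by their lemma, longest match first."""
    out = []
    i = 0
    while i < len(tokens):
        for n in range(MAX_TERM_LEN, 0, -1):
            key = tuple(tokens[i:i + n])
            if key in TERM_LEMMAS:
                out.append(TERM_LEMMAS[key])
                i += len(key)
                break
        else:
            # No dictionary term starts here; keep the plain token.
            out.append(tokens[i])
            i += 1
    return out


def count_lemmas(text):
    """Count lemma occurrences in one abstract."""
    return Counter(lemmatise(tokenise(text)))


if __name__ == "__main__":
    print(count_lemmas("Glutathione S-transferase (GST) binds β-carotene."))
```

Because the full name and the abbreviation map to the same lemma, both mentions end up in one count, which is the point of treating synonyms as a single token.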

Next we used a set of publications that we knew to be relevant for the topic we were after (carotenoids). We assembled this initial corpus by asking experts and taking the references from the existing pathway, and we counted all tokens in that set of texts. We then created a vector of tokens that occurred typically in that corpus using the counts mentioned above. In essence, the difference between that vector and the vector describing all of Pubmed determines what is specific for your start corpus (the texts known to be about your topic). After that we compared that vector with the individual vector of every abstract in Pubmed to find the ones whose descriptive vector of tokens is close to the one describing our start corpus. Matching texts were added to the corpus. In other words, we added to our corpus the papers that had the same distribution of words as the ones we already had. So we did not use all of Pubmed, only the relevant papers, but we found the relevant ones using an automated procedure.
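The corpus-expansion idea can be sketched roughly as follows. Two things here are assumptions for illustration, since the description above does not pin them down: "specific for the start corpus" is taken to mean a positive difference in relative term frequency against the background, and "close" is taken to mean cosine similarity above an arbitrary threshold. The term counts in the usage example are made up.

```python
import math
from collections import Counter


def relative_freq(counts):
    """Turn raw term counts into relative frequencies."""
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}


def topic_profile(seed_docs, all_docs):
    """Keep terms that are relatively more frequent in the seed corpus than overall."""
    seed = relative_freq(sum(seed_docs, Counter()))
    background = relative_freq(sum(all_docs, Counter()))
    profile = {}
    for term, freq in seed.items():
        excess = freq - background.get(term, 0.0)
        if excess > 0:
            profile[term] = excess
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_corpus(seed_docs, all_docs, threshold=0.3):
    """Return indices of documents whose term vector is close to the topic profile.

    threshold=0.3 is an arbitrary illustrative cut-off.
    """
    profile = topic_profile(seed_docs, all_docs)
    return [i for i, doc in enumerate(all_docs) if cosine(profile, doc) >= threshold]


if __name__ == "__main__":
    # Made-up lemma counts per abstract.
    seed = [Counter(carotenoid=3, lycopene=2, the=5)]
    everything = seed + [
        Counter(carotenoid=2, lycopene=1, the=4),
        Counter(kinase=3, the=6),
    ]
    print(expand_corpus(seed, everything))
```

Subtracting the background frequencies removes common words from the profile, so a document is only pulled in when it shares the topic-specific terms.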

Next we used these same lists of lemmatized tokens to find terms that were over-represented in our new extended corpus and that we didn't already have in the pathway we wanted to extend, and we had the result judged by experts.
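One simple way to sketch this last step is a frequency ratio between the extended corpus and the background, skipping terms already in the pathway. The ratio score, the smoothing, both cut-offs and the counts in the usage example are assumptions for illustration; the output is only a candidate list for experts to judge.

```python
from collections import Counter


def overrepresented_terms(corpus_counts, background_counts, known_terms,
                          min_ratio=2.0, min_count=2):
    """Rank terms that are enriched in the corpus and not yet in the pathway.

    min_ratio and min_count are arbitrary illustrative cut-offs.
    """
    corpus_total = sum(corpus_counts.values())
    background_total = sum(background_counts.values())
    candidates = []
    for term, count in corpus_counts.items():
        if term in known_terms or count < min_count:
            continue
        corpus_freq = count / corpus_total
        # Add-one smoothing so terms absent from the background do not divide by zero.
        background_freq = (background_counts.get(term, 0) + 1) / (background_total + 1)
        ratio = corpus_freq / background_freq
        if ratio >= min_ratio:
            candidates.append((term, ratio))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up counts for the extended corpus and for the background.
    corpus = Counter(lycopene=10, zeaxanthin=6, the=50, cell=4)
    background = Counter(lycopene=12, zeaxanthin=7, the=5000, cell=900, kinase=800)
    for term, ratio in overrepresented_terms(corpus, background, {"lycopene"}):
        print(term, round(ratio, 1))
```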

Ram · Answer 4 · 2011-07-08

Hi!

If you're interested in Pubmed/citation information around proteins, you can have a look at the Bio4j open-source project.

Regarding citations, you can find the information that UniProt (Swiss-Prot + TrEMBL) provides, which basically covers:

Articles
Online articles
Thesis
Books
Submissions
Unpublished observations

Here you can see a model of the entities implemented for citations (if you click on the shapes/links you'll be redirected to the corresponding classes).

Besides, since much more information is included in Bio4j and everything's linked together (it's a graph DB), you can take advantage of any extra information connected to any of the entities involved in your query/study.

Cheers,
Pablo

score 3 · Answer 5 · 2011-07-08

12.8 years ago

Gareth Palidwor ★ 1.6k

I recommend contacting NCBI for a copy of the XML data rather than screen scraping the site. The license is quite reasonable and quick to get for academic use.

To do their analysis, most groups I've seen grind through the XML with scripts to extract/preprocess what they need.
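A script of that kind might be sketched as below: it streams through a PubMed XML file one citation at a time instead of loading the whole file into memory. The element names follow the PubMed XML layout; which fields to extract, and the inline sample data, are choices made for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_citations(source):
    """Yield (pmid, title) per citation from a file path or binary file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        citation = elem.find("MedlineCitation")
        if citation is not None:
            yield (
                citation.findtext("PMID"),
                citation.findtext("./Article/ArticleTitle", default=""),
            )
        # Release the subtree we just processed to keep memory use low.
        elem.clear()


if __name__ == "__main__":
    # Made-up data standing in for a downloaded XML file.
    sample = (
        b"<PubmedArticleSet>"
        b"<PubmedArticle><MedlineCitation><PMID>1</PMID>"
        b"<Article><ArticleTitle>Made-up title</ArticleTitle></Article>"
        b"</MedlineCitation></PubmedArticle>"
        b"</PubmedArticleSet>"
    )
    for pmid, title in iter_citations(io.BytesIO(sample)):
        print(pmid, title)
```

For compressed files, passing a file object opened with `gzip.open(path, "rb")` as `source` should work the same way.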

Recently I've done some messing about with Apache Lucene to index the data for lightning-fast searching and extraction. Lucene is very fast at text searching (for example, http://www.ogic.ca/mltrends/).
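To give a feel for why an index makes searching fast, here is a toy inverted index in Python. This is not Lucene and does not use its API; it only illustrates the general idea of mapping each term to the documents that contain it, so a query touches a few posting sets rather than every abstract. The class name and sample texts are made up.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy term-to-documents index supporting AND queries."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index one document under each distinct lower-cased word."""
        for token in set(re.findall(r"\w+", text.lower())):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every word of the query."""
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


if __name__ == "__main__":
    index = InvertedIndex()
    index.add(1, "Carotenoid biosynthesis in tomato")
    index.add(2, "Tomato kinase signalling")
    print(index.search("tomato carotenoid"))
```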

I've heard of people using BioRuby to specifically do this.

Reply written 12.8 years ago by Burlappsack ▴ 690

Ram · Answer 6 · 2011-07-09

12.8 years ago

Yogesh Pandit ▴ 520

You can try this

http://alias-i.com/lingpipe/demos/tutorial/db/read-me.html

updated 5.6 years ago by Ram 43k • written 12.8 years ago by Yogesh Pandit ▴ 520