9.6 years ago
@Chris Evelo
We actually published what we did here. You can also do less complicated but interesting things like what we did here (sorry, not open access).
Updated in response to the comments.
We downloaded content from PubMed (using the full-text license, although the abstracts alone would actually have done). Our approach is really content-directed; that is central to the way we build the corpus.
We first tokenised the text (you could say: broke it into words) and then combined words into combined tokens that are lemmas. These are meaningful terms: glutathione S-transferase would be one token, not two or three, and a lemma should cover all synonyms. Being able to find the lemmas is one of the reasons semantic web approaches like the concept web are so important. We counted the occurrences of those lemmas in each individual abstract.
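As a rough sketch of that tokenise-and-count step (not our actual pipeline; the lemma dictionary here is a made-up two-entry example, whereas a real one would come from a terminology resource):

```python
import re
from collections import Counter

# Hypothetical lemma dictionary mapping surface forms (including multi-word
# terms and synonyms) to one canonical lemma token.
LEMMAS = {
    "glutathione s-transferase": "glutathione_s_transferase",
    "gst": "glutathione_s_transferase",
}

def tokenize(text):
    """Lowercase, fold known terms into single lemma tokens, then split."""
    text = text.lower()
    for surface, lemma in LEMMAS.items():
        text = text.replace(surface, lemma)
    return re.findall(r"[a-z0-9_\-]+", text)

def count_tokens(abstract):
    """Per-abstract lemma occurrence counts."""
    return Counter(tokenize(abstract))

counts = count_tokens("Glutathione S-transferase (GST) activity was measured.")
# The term and its synonym both count toward the same lemma.
```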
Next we used a set of publications that we knew to be relevant for the topic we were after (carotenoids); we collected this initial corpus by asking experts and by taking the references from the existing pathway. We counted all tokens in that set of texts and created a vector of the tokens that occurred typically in that corpus, using the counts mentioned above. In essence, the difference between that vector and the vector describing all of PubMed determines what is specific for your start corpus (the texts known to be about your topic).

After that we compared that vector with the individual vector of every abstract in PubMed to find the ones whose descriptive token vector is close to the one describing our start corpus. Matching texts were added to the corpus. In other words, we added to our corpus the papers that contained the same distribution of words as the ones we already had. So we did not use all of PubMed, only the relevant papers, but we found the relevant ones using an automated procedure.
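The compare-vectors idea can be sketched like this (a toy illustration under my own assumptions, with invented counts and a simple frequency ratio plus cosine similarity; the published method may well weight tokens differently):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Toy token counts (made up for illustration): the start corpus of known
# carotenoid papers versus the background of all PubMed abstracts.
corpus = Counter({"carotenoid": 30, "lycopene": 12, "the": 500})
background = Counter({"carotenoid": 40, "lycopene": 15,
                      "the": 100000, "cell": 5000})

corpus_total = sum(corpus.values())
background_total = sum(background.values())

# Specificity of a token: its relative frequency in the start corpus divided
# by its relative frequency in the background (+1 smoothing for rare tokens).
# Common words like "the" end up near 1; topic words score far higher.
specificity = {
    t: (corpus[t] / corpus_total) / ((background[t] + 1) / background_total)
    for t in corpus
}

# Score one candidate abstract by comparing its token counts against the
# specificity vector; high-scoring abstracts would be added to the corpus.
abstract = Counter({"carotenoid": 3, "lycopene": 1, "cell": 2})
score = cosine(specificity, abstract)
```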
Next we used these same lists of lemmatized tokens to find terms over-represented in our new, extended corpus that were not already in the pathway we wanted to extend, and had the results judged by experts.
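That over-representation step could be sketched the same way: rank tokens of the extended corpus by how much their frequency exceeds the background, skipping terms already on the pathway. All counts and the enrichment measure below are invented for illustration:

```python
from collections import Counter

# Invented counts for the extended corpus, plus pre-computed background
# relative frequencies; pathway_terms are lemmas already on the pathway.
extended = Counter({"carotenoid": 50, "phytoene": 8, "cell": 40})
background_freq = {"carotenoid": 4e-4, "phytoene": 1e-6, "cell": 5e-2}
pathway_terms = {"carotenoid"}

extended_total = sum(extended.values())

# Enrichment ratio: relative frequency in the extended corpus over the
# background frequency; the highest-ranking new terms go to the experts.
candidates = sorted(
    ((t, (extended[t] / extended_total) / background_freq[t])
     for t in extended if t not in pathway_terms),
    key=lambda pair: -pair[1],
)
```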
With regard to databases, it's very easy to parse PubMed XML into a hash and just drop it into a document-oriented database, such as MongoDB. This is the approach I used for my PubMed retractions project: https://github.com/neilfws/PubMed.
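A minimal sketch of that parse step in Python (the element names follow the real PubMed XML format; the MongoDB call itself is only indicated in a comment, since it needs a running server):

```python
import xml.etree.ElementTree as ET

def article_to_doc(article_xml):
    """Flatten one PubmedArticle record into a plain dict,
    ready to drop into a document-oriented database."""
    root = ET.fromstring(article_xml)
    return {
        "pmid": root.findtext(".//PMID"),
        "title": root.findtext(".//ArticleTitle"),
        "abstract": root.findtext(".//AbstractText"),
    }

sample = """<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <ArticleTitle>A retracted paper</ArticleTitle>
      <Abstract><AbstractText>Some text.</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

doc = article_to_doc(sample)
# With pymongo installed and a server running, insertion is one line:
# MongoClient().pubmed.articles.insert_one(doc)
```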
Great, thanks for the thoughtful answer. I learned a lot.