Monday, July 27, 2009

pubmedpy: a simple module for fetching and tf-idf encoding biomedical texts

Recently, I've been working on biomedical text classification. A necessary but tedious part of this is encoding the data into a format suitable for classification algorithms. Typically, I have the IDs for some curated set of PubMed documents, and need to:
  • Fetch these documents from PubMed
  • Encode them, e.g., in Term Frequency / Inverse Document Frequency (TF-IDF) format, for a classification library
In the past, I had various scripts for these steps. Here, I've aggregated them into a coherent library. You hand pubmedpy a dictionary mapping PubMed IDs to labels. It pulls the abstracts and titles for these documents (the default; additional fields can also be retrieved), cleans them (e.g., stripping stop words such as "the"), and then TF-IDF encodes them. (Binary encoding is also possible.)
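
To make the cleaning and encoding step concrete, here is a minimal standalone sketch of stop-word stripping followed by TF-IDF weighting. It is not pubmedpy's actual code: the function names, the tiny stop-word list, and the particular weighting (term frequency normalized by document length, times log(N / document frequency)) are assumptions chosen for illustration, and real implementations vary in these details.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_encode(docs):
    """Return (word -> 1-based index, list of sparse {index: weight} dicts)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i + 1 for i, w in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    encoded = []
    for toks in tokenized:
        vec = {}
        for w, count in Counter(toks).items():
            tf = count / len(toks)
            idf = math.log(n_docs / df[w])
            vec[index[w]] = tf * idf
        encoded.append(vec)
    return index, encoded

docs = ["the gene regulates growth", "the gene is expressed"]
index, encoded = tfidf_encode(docs)
```

With this weighting, a word that occurs in every document (here, "gene") gets a weight of zero, while words unique to one document get the largest weights.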

For example, let's use the example_usage module to demonstrate the functionality.

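import pubmedpy
# get_pmid_to_label_dict() is the helper from the example_usage module
pubmedpy.set_email("your@email.com")  # the address must be a quoted string
lbl_dict = get_pmid_to_label_dict()
pubmedpy.fetch_and_encode(lbl_dict.keys(), "output", lbl_dict=lbl_dict)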

The example_usage module builds a dictionary mapping PubMed IDs to labels (what the labels mean is immaterial here). This dictionary is passed to the fetch_and_encode method in pubmedpy. That's it. Encoding takes a while, but if all goes well, the "output" directory will contain "AB" and "TI" directories (abstracts and titles, respectively). Under each of these, there should be "cleaned" and "encoded" folders. The "encoded" folder should contain a single file with your data encoded in the "libsvm" sparse format, shown below:

1 163:0.0847798892434 297:0.263181728 ..
Each line is one abstract. So above, the abstract has label '1', and the 163rd word (which, incidentally, is 'positively', as we can see by looking at the "words_index.txt" file generated in the "cleaned" directory) has a tf-idf value of about 0.085. Note that other file formats, e.g., Weka's ARFF, can be generated with very little tweaking of the code (indeed, there's already a method in the tfidf2 library to dump to ARFF).
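
If you want to inspect the encoded file yourself, a line in this sparse format is easy to pick apart. The sketch below is a hypothetical helper, not part of pubmedpy; it assumes whitespace-separated fields, with the label first and then index:value pairs, as in the sample line above.

```python
def parse_libsvm_line(line):
    """Split one sparse-format line into (label, {feature index: value})."""
    parts = line.split()
    label = parts[0]
    features = {}
    for item in parts[1:]:
        idx, value = item.split(":")
        features[int(idx)] = float(value)
    return label, features

# The first two features from the sample line above.
label, features = parse_libsvm_line("1 163:0.0847798892434 297:0.263181728")
```

Here label is the string "1" and features maps 163 and 297 to their tf-idf weights; any index that is absent from the line is implicitly zero.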



The library requires Biopython and is available on GitHub: http://github.com/bwallace/pubmedpy/