Sunday, December 13, 2009

Meta-Analyst

Not machine learning related, but pertinent to software and biomedicine: Meta-Analyst, our software for conducting meta-analyses, is now publicly available. See our publication here.

Monday, July 27, 2009

pubmedpy: a simple module for fetching and tf-idf encoding biomedical texts

Recently, I've been working on biomedical text classification. A necessary but tedious task here is encoding the data in a format suitable for classification algorithms. This often means I have the IDs for some curated set of PubMed documents, and need to:
  • Fetch these documents from PubMed
  • Encode them for a classification library, e.g., in Term Frequency / Inverse Document Frequency (TF-IDF) format
In the past, I had various scripts to complete these steps. Here, I've aggregated them into a coherent library. You can hand pubmedpy a dictionary mapping PubMed IDs to labels. The library will pull the abstracts and titles for these documents (by default; additional fields can also be retrieved), clean them (stripping stop words such as "the", etc.), and then TF-IDF encode them. (Alternatively, binary encoding is also possible.)
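
As an aside, the standard TF-IDF weighting is simple to state; here is a minimal sketch of one common formulation (pubmedpy's exact weighting and normalization may differ in the details):

import math

def tf_idf(term_count, doc_length, doc_freq, num_docs):
    # term frequency: occurrences of the term in this document,
    # normalized by document length
    tf = term_count / float(doc_length)
    # inverse document frequency: terms appearing in fewer
    # documents across the corpus get a higher weight
    idf = math.log(num_docs / float(doc_freq))
    return tf * idf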

For example, let's use the example_usage module to demonstrate the functionality.

import pubmedpy
from example_usage import get_pmid_to_label_dict

pubmedpy.set_email("your@email.com")  # your email, as a string, to identify you to NCBI
lbl_dict = get_pmid_to_label_dict()   # maps PubMed IDs to labels
pubmedpy.fetch_and_encode(lbl_dict.keys(), "output", lbl_dict=lbl_dict)

The example_usage module builds a dictionary mapping PubMed IDs to labels (what these labels mean is immaterial here). This dictionary is passed to the fetch_and_encode method in pubmedpy. That's it. Encoding takes a while, but if all goes well, under the "output" directory there will be "AB" and "TI" directories (abstracts and titles, respectively). Under these, there should be "cleaned" and "encoded" folders. The "encoded" folder should contain a single file with your data encoded in the "libsvm" sparse format, shown below:

1 163:0.0847798892434 297:0.263181728 ..
The format is one abstract per line. So above, the abstract has label '1', and the word with index 163 -- which, incidentally, is 'positively', as we can see by looking at the "words_index.txt" file generated in the "cleaned" directory -- has a tf-idf value of about .085. Note that other formats can be generated, e.g., Weka's ARFF, with very little tweaking of the code (indeed, there's already a method in the tfidf2 library to dump to ARFF).
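
If you ever need to consume this format yourself, parsing a line is trivial. A minimal sketch (parse_libsvm_line is my own illustrative function, not part of pubmedpy):

def parse_libsvm_line(line):
    # "1 163:0.0848 297:0.2632 ..." -> (label, {word index: tf-idf weight})
    pieces = line.split()
    label = int(pieces[0])
    features = {}
    for pair in pieces[1:]:
        index, weight = pair.split(":")
        features[int(index)] = float(weight)
    return label, features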



The library requires BioPython and can be fetched at Github: http://github.com/bwallace/pubmedpy/

Monday, April 6, 2009

active learning with python & libsvm, part 2

Active learning is a hugely useful framework for efficiently exploiting the resources of the 'expert' during training (i.e., for reducing the amount of hand labeling that a human must do to train a classifier). Active learning works by allowing the model, or learner, to select the instances the expert will label from a pool of unlabeled examples, rather than having him or her label instances chosen at random. This is advantageous because unlabeled data is often copious and cheaply available, while the time of a human expert is an immensely valuable resource. Burr Settles has written an excellent survey of the active learning literature.

In a previous post I explored hacking the popular libsvm C++ library to expose additional fields to the Python interface. In particular, I made |w| -- the norm of the vector orthogonal to the separating hyperplane found over the training data -- available from the Python side of things. The upshot is that access to this value makes implementing the most popular variant of uncertainty sampling for SVMs, Tong and Koller's SIMPLE, a breeze in Python.

I have updated the repository with an implementation (please note that those not on OS X will likely need to recompile). In particular, I have created a learner module, wherein the active_learn method accepts a query_function parameter; this is assumed to be a function that returns the numbers of the instances to "label" (labeling is simulated here, as the true labels are assumed to be known). By default, the query strategy is SIMPLE, but other strategies can easily be "plugged in" to the framework.

Sample use:
1 datasets = [dataset.build_dataset_from_file("my_data")]
2 active_learner = learner.learner(datasets)
3 active_learner.pick_initial_training_set(200)
4 active_learner.rebuild_models(undersample_first=True)
5 active_learner.active_learn(40, num_to_label_at_each_iteration=1)

In 1, "my_data" is assumed to be a sparse, libsvm-formatted file. The learner wants a list of datasets so that ensembles can be built -- i.e., different classifiers built using different types of data. In 5, the first argument is the total number of examples to be labeled; the second is the number to label at each iteration of active learning (i.e., the 'batch' size). I hope this is useful to anyone interested in playing with active learning.
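
To give a flavor of what "plugging in" a strategy entails, here is a minimal sketch of a SIMPLE-style query function. The names used (distance_to_hyperplane, instance ids) are illustrative stand-ins, not the learner module's actual API:

def simple_query_function(model, unlabeled_instances, k=1):
    # SIMPLE (Tong & Koller): return the ids of the k unlabeled
    # instances closest to the separating hyperplane, i.e., those
    # the current model is least certain about
    scored = [(model.distance_to_hyperplane(inst), inst.id) for inst in unlabeled_instances]
    scored.sort()
    return [inst_id for (distance, inst_id) in scored[:k]]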

Sunday, March 29, 2009

retrieving pubmed articles with BioPython

I needed to retrieve the abstract text and titles for all of the articles returned by a PubMed query. Fortunately, this proved even easier than I had expected, thanks to the wonderful BioPython toolkit. Indeed, I basically needed only to borrow from the examples in this tutorial.

A sample use of the completed script is as follows:
python pubmed_fetchr.py -e youremail@gmail.com -s "phylogenetic trees"
Note that the email argument is to let NCBI know who is querying the system, which is good practice. This will create a parent directory with the query name ("phylogenetic trees", in this case) and two subdirectories underneath, corresponding to the abstract and title texts, respectively. In the abstracts directory, there will be n plain text files (where n is the number of articles returned by the query) with names corresponding to their PubMed IDs, each containing the text of the corresponding article abstract. Likewise, article title text will be saved in files named after the corresponding PubMed IDs in the titles directory.

Here is the code. It would be straightforward to write out additional fields (i.e., fields other than title and abstract).
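
For reference, the core of the fetching logic can be written in just a few lines with BioPython's Entrez and Medline modules. A minimal sketch (the function name and retmax value are my own choices, not necessarily what pubmed_fetchr.py does):

from Bio import Entrez, Medline

def fetch_abstracts_and_titles(query, email, retmax=100):
    Entrez.email = email  # tell NCBI who you are
    # get the PubMed IDs of articles matching the query
    search = Entrez.read(Entrez.esearch(db="pubmed", term=query, retmax=retmax))
    # fetch the corresponding records in MEDLINE format
    handle = Entrez.efetch(db="pubmed", id=search["IdList"], rettype="medline", retmode="text")
    # map each PubMed ID to its (title, abstract) pair
    return dict((r["PMID"], (r.get("TI", ""), r.get("AB", ""))) for r in Medline.parse(handle))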


Friday, March 13, 2009

active learning with python & libsvm (or; on hacking libsvm)

Arguably the most popular Support Vector Machine (SVM) library is libsvm. It is widely used, cross-platform and open-source. Moreover, while written in C++, there are myriad interfaces that provide access to the library in every language from Python to Matlab to brainf*ck. OK, not brainf*ck.

Here I'll focus on the bridge from Python. In particular, I was interested in using active learning, a useful framework for interactively training classifiers. In active learning, an expert provides labels for training examples, and the model (e.g., an SVM) iteratively selects new points from the remaining unlabeled data for labeling at each step. This differs from the canonical learning framework, wherein you train your classifier on a random set of labeled data. The intuition is that the classifier can select maximally informative examples for the expert to label at each iteration of training (in other words, active learning better exploits the expert). The most popular active learning strategy (i.e., method of selecting informative examples for labeling) for SVMs is known as SIMPLE; it picks the (unlabeled) point closest to the separating hyperplane for labeling at each step. Computing this distance requires access to the norm of w, the vector orthogonal to the separating hyperplane.

Unfortunately, libsvm doesn't automatically compute this. The libsvm folks do, however, outline how one might go about this in their FAQ:
The distance is |decision_value| / |w|. We have |w|^2 = w^Tw = alpha^T Q alpha = 2*(dual_obj + sum alpha_i). Thus in svm.cpp please find the place where we calculate the dual objective value (i.e., the subroutine Solve()) and add a statement to print w^Tw.
Thus the following steps were required to extract |w| from the library: (1) hack the C++ code as outlined above, and (2) provide access to the computed value via Python. The first step was simple enough. After some modifications to various structs (e.g., svm_model) to keep hold of the computed |w| value, I ultimately added the following method to svm.cpp:
// get |w|^2
double svm_get_model_w2(struct svm_model *model)
{
    return model->w_2;
}
Next, the header (svm.h) needed to be updated to reflect this change, and libsvm rebuilt. On *nix (including OS X) this can be done with make. On Windows, I managed it with Visual Studio by first running the vcvars32.bat file to set up the build environment:
>"C:\Program Files\Microsoft Visual Studio 8\VC\bin\vcvars32.bat"
Then:
> nmake -f Makefile.win clean
> nmake -f Makefile.win all
Next we need to modify the Python interface. In the "Python" directory, add the method header to the svmc.i interface file:
double svm_get_model_w2(struct svm_model *model);
Finally, rebuild the interface:
>swig -python svmc.i
>python setup.py build build_ext --inplace
Now the added method will be available through svm.svmc.svm_get_model_w2. If you're interested, the source is available @ http://code.google.com/p/libsvm288fork/
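
Finally, a quick sketch of how the new accessor might be used from Python. I'm assuming here (as in the wrapper version I worked from) that the svm_model wrapper object keeps its underlying C pointer in a .model attribute:

import svm

# a tiny toy dataset: two positive and two negative examples
labels = [1, 1, -1, -1]
examples = [[1.0, 1.0], [0.9, 1.1], [-1.0, -1.0], [-0.8, -1.2]]
param = svm.svm_parameter(kernel_type=svm.LINEAR, C=10)
model = svm.svm_model(svm.svm_problem(labels, examples), param)

# |w|^2 via the newly exposed accessor; take the square root for |w|
w_norm = svm.svmc.svm_get_model_w2(model.model) ** 0.5
print "|w| =", w_norm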