Sunday, March 29, 2009

retrieving pubmed articles with BioPython

I needed to retrieve the abstract text and titles for all of the articles returned by pubmed query. Fortunately, this proved even easier than I had expected, thanks to the wonderful BioPython toolkit. Indeed, I basically needed only to borrow from the examples in this tutorial.

A sample use of the completed script is as follows:
python pubmed_fetchr.py -e youremail@gmail.com -s "phylogenetic trees"
Note that the email argument is to let NCBI know who is querying the system, which is good practice. This will create a parent directory with the query name ("phylongenetic trees", in this case) and two subdirectories underneath, corresponding to the abstract and title texts, respectively. In the abstracts directory, there will be n plain text files (where n is the number of articles returned by the query) with names corresponding to their PubMed IDs, each containing the text of the corresponding article abstract. Likewise, article title text will be saved in files named after the corresponding PubMed IDs in the titles directory.

Here is the code
. It would be straight forward to write out additional fields (i.e., fields other than title and abstract).


Friday, March 13, 2009

active learning with python & libsvm (or; on hacking libsvm)

Arguably the most popular Support Vector Machine (SVM) library is libsvm. It is widely used, cross-platform and open-source. Moreover, while written in C++, there are myriad interfaces that provide access to the library in every language from Python to Matlab to brainf*ck. OK, not brainf*ck.

Here I'll focus on the bridge from Python. In particular, I was interested in using active learning, a useful framework for interactively training classifiers. In active learning, an expert provides labeled training examples for a model (e.g., an SVM), and the model iteratively selects new points from the remaining unlabeled data for labeling at each step. This differs from the canonical learning framework wherein you train your classifier on a random set of labeled data. The intuition is that the classifier can select informative examples for the expert to label at each iteration in the training process (in other words, active learning better exploits the expert). The upshot is that the most popular active learning strategy (i.e., method of selecting maximally informative examples for labeling) for SVMs is known as SIMPLE, which picks the (unlabeled) point closest to the separating hyperplane for labeling at each step. This requires access to the norm of w, the vector orthogonal to the separating hyperplane.

Unfortunately, libsvm doesn't automatically compute this. The libsvm folks do, however, outline how one might go about this in their FAQ:
The distance is |decision_value| / |w|. We have |w|^2 = w^Tw = alpha^T Q alpha = 2*(dual_obj + sum alpha_i). Thus in svm.cpp please find the place where we calculate the dual objective value (i.e., the subroutine Solve()) and add a statement to print w^Tw.
Thus the following steps were required to extract |w| from the library: (1) hack the C++ code as outlined above and (2) provide access to the computed value via Python. The first step was simple enough. After some modifications to various structs (e.g., svm_model) to keep hold of the computed|w| value, I ultimately added the following method to svm.cpp:
// get |w|^2
double svm_get_model_w2(struct svm_model *model)
{
return model->w_2;
}
Next, the header (svm.h) needed be updated to reflect this change. libsvm next needs to be rebuilt. On *nix (including OS X) this can be done with make. On Windows, I managed to do it with Visual Studio by first making sure the vcvar.bat file was in my path:
>"C:\Program Files\Microsoft Visual Studio 8\VC\bin\vcvar.bat"
Then:
> nmake -f Makefile.win clean
> nmake -f Makefile.win all
Next we need to modify the Python interface. In the "Python" directory, add the method header to the svmc.i interface files:
double svm_get_model_w2(struct svm_model *model);
Finally, rebuild the interface:
>swig -python svmci.i
>python setup.py build build_ext --inplace
Now the added method will be available through svm.svmc.svm_get_model_w2. If you're interested, the source is available @ http://code.google.com/p/libsvm288fork/