Monday, April 6, 2009

active learning with python & libsvm, part 2

Active learning is a framework for making efficient use of the 'expert' during training (i.e., for reducing the amount of hand labeling that a human must do in order to train a classifier). It works by allowing the model, or learner, to select instances for the expert to label from a pool of unlabeled examples, rather than having him or her label instances chosen at random. This is advantageous because unlabeled data is often plentiful and cheap, whereas the time of a human expert is a valuable resource. Burr Settles has written an excellent survey of the active learning literature.

In a previous post I explored hacking the popular libsvm C++ library to expose additional fields to the python interface. In particular, I made |w|, the norm of the weight vector defining the separating hyperplane found over the training data, available from the python side of things. The upshot is that access to this field makes the most popular variant of uncertainty sampling for SVMs, Tong and Koller's SIMPLE, a breeze to implement in python. SIMPLE asks the expert to label the unlabeled instances that lie closest to the hyperplane, and computing that distance requires |w|.
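To make the idea concrete, here is a minimal sketch of the SIMPLE selection rule. It is not the code from the repository: the function name simple_query and its arguments are hypothetical, and it assumes you already have the raw SVM decision value f(x) for each unlabeled instance along with |w|.

```python
def simple_query(decision_values, w_norm, k=1):
    """Return the ids of the k pool instances closest to the hyperplane.

    decision_values: dict mapping instance id -> raw decision value f(x)
                     (hypothetical input; obtain it from your SVM however you like)
    w_norm:          |w|, the norm of the hyperplane's weight vector
    """
    if w_norm <= 0:
        raise ValueError("w_norm must be positive")
    # Geometric distance from x to the hyperplane is |f(x)| / |w|.
    distances = {i: abs(f) / w_norm for i, f in decision_values.items()}
    # The most 'uncertain' instances are the ones with the smallest distance.
    return sorted(distances, key=distances.get)[:k]


if __name__ == "__main__":
    pool = {0: 2.0, 1: -0.1, 2: 0.5}
    print(simple_query(pool, w_norm=2.0, k=2))
```

For a single model, dividing by |w| does not change the ranking; it matters once distances from differently scaled models are compared, as would be the case in an ensemble.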

I have updated the repository with an implementation (please note that those not on OS X will likely need to recompile). In particular, I have created a learner module whose active_learn method accepts a query_function parameter, which is assumed to be a function that returns the instance numbers to "label" (labeling is simulated here, as the true labels are assumed to be known). By default, the query strategy is SIMPLE, but other query strategies could easily be "plugged in" to the framework.
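The pluggable design can be sketched roughly as follows. This is an illustration only, not the learner module itself: the names active_learn, random_query, and retrain, and the signature of the query function, are assumptions made for the example. Labeling is simulated by looking up the known true label.

```python
import random


def random_query(unlabeled_ids, k):
    """Baseline strategy: pick k unlabeled instances uniformly at random."""
    return random.sample(sorted(unlabeled_ids), k)


def active_learn(true_labels, labeled_ids, total_to_label,
                 batch_size=1, query_function=random_query, retrain=None):
    """Simulated pool-based active learning loop (hypothetical sketch).

    true_labels:    dict mapping instance id -> known label
    labeled_ids:    ids in the initial training set
    query_function: callable(unlabeled_ids, k) -> list of ids to label next
    retrain:        optional callable(revealed_labels), called after each batch
    """
    revealed = {i: true_labels[i] for i in labeled_ids}
    unlabeled = set(true_labels) - set(revealed)
    remaining = total_to_label
    while remaining > 0 and unlabeled:
        k = min(batch_size, remaining, len(unlabeled))
        picked = query_function(unlabeled, k)
        if not picked:
            break  # the strategy declined to pick anything; stop early
        for i in picked:
            revealed[i] = true_labels[i]  # the simulated 'expert' labels it
            unlabeled.discard(i)
        remaining -= len(picked)
        if retrain is not None:
            retrain(revealed)
    return revealed


if __name__ == "__main__":
    labels = {i: i % 2 for i in range(10)}
    result = active_learn(labels, [0, 1], total_to_label=4, batch_size=2)
    print(len(result))
```

Any function with the same signature, such as one that ranks instances by distance to the hyperplane, can be passed in place of random_query.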

Sample use:
1 datasets = [dataset.build_dataset_from_file("my_data")]
2 active_learner = learner.learner(datasets)
3 active_learner.pick_initial_training_set(200)
4 active_learner.rebuild_models(undersample_first=True)
5 active_learner.active_learn(40, num_to_label_at_each_iteration=1)

In 1, "my_data" is assumed to be a file in sparse libsvm format. The learner wants a list of datasets so that ensembles can be built, i.e., different classifiers built using different types of data. In 5, the first argument is the total number of examples to be labeled; the second is the number to label at each iteration of active learning (i.e., the 'batch' size). I hope this is useful to anyone interested in playing with active learning.
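For readers unfamiliar with the sparse libsvm format mentioned above: each line holds a label followed by index:value pairs for the non-zero features. The small parser below is only a sketch to show the layout; parse_libsvm_line is a hypothetical helper and not part of libsvm or of the repository.

```python
def parse_libsvm_line(line):
    """Parse one line of sparse libsvm data into (label, features).

    Expected layout: "<label> <index>:<value> <index>:<value> ..."
    Returns the label as a float and the features as a dict of
    feature index -> value (only non-zero features are listed).
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


if __name__ == "__main__":
    print(parse_libsvm_line("+1 1:0.5 3:1.2"))
```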