In a previous post I explored hacking the popular libsvm C++ library to expose additional fields to the python interface. In particular, I made |w|, the norm of the weight vector defining the separating hyperplane found over the training data, available from the python side of things. The upshot is that access to this field makes implementing the most popular variant of uncertainty sampling for SVMs, Tong and Koller's SIMPLE, a breeze in python.
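To make the idea concrete, here is a minimal, self-contained sketch of the SIMPLE selection rule: rank unlabeled instances by their distance to the hyperplane w·x + b = 0 and query the closest ones (the names `simple_query`, `w`, `b`, and `unlabeled` are mine, not part of libsvm or the repository's API):

```python
import math

def simple_query(w, b, unlabeled, k=1):
    """Sketch of Tong & Koller's SIMPLE: return the indices of the k
    unlabeled instances closest to the hyperplane w.x + b = 0, i.e.,
    the instances the current model is least certain about."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))

    def distance(x):
        # unsigned distance from point x to the hyperplane
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

    # rank instance indices by distance, nearest (most uncertain) first
    ranked = sorted(range(len(unlabeled)), key=lambda i: distance(unlabeled[i]))
    return ranked[:k]
```

This is exactly why exposing |w| matters: without it, the distances cannot be normalized, and the ranking above cannot be computed from the python side.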
I have updated the repository with an implementation (please note that those not on OS X will likely need to recompile). In particular, I have created a learner module, wherein the "active_learn" method accepts a query_function parameter, which is assumed to be a function that returns instance numbers to "label" (labeling is simulated here, as the true labels are assumed to be known). By default, the learning strategy is SIMPLE, but other query strategies could easily be "plugged in" to the framework.
Sample use:
1 datasets = [dataset.build_dataset_from_file("my_data")]
2 active_learner = learner.learner(datasets)
3 active_learner.pick_initial_training_set(200)
4 active_learner.rebuild_models(undersample_first=True)
5 active_learner.active_learn(40, num_to_label_at_each_iteration=1)
In 1, it is assumed that "my_data" is a file in sparse libsvm format. The learner expects a list of datasets so that ensembles can be built -- i.e., different classifiers trained on different types of data. In 5, the first argument is the total number of examples to be labeled; the second is the number to label at each iteration of active learning (i.e., the 'batch' size). I hope this is useful to anyone interested in playing with active learning.
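As a sketch of how an alternative strategy might be plugged in, here is a random-sampling baseline (a common point of comparison for active learning). The function name and its exact signature are hypothetical; the signature the repository's active_learn actually expects may differ:

```python
import random

def random_query(unlabeled_ids, k=1):
    """Hypothetical baseline query strategy: pick k instance numbers
    uniformly at random from the currently unlabeled pool, rather than
    ranking by distance to the hyperplane as SIMPLE does."""
    return random.sample(list(unlabeled_ids), k)
```

Something along these lines could then, in principle, be handed to the learner in place of the default SIMPLE strategy, which is the point of making query_function a parameter.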