- Fetch these documents from PubMed
- Encode them, e.g., in Term Frequency / Inverse Document Frequency format, for a classification library
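To make the encoding step concrete, here is a minimal sketch of one common TF-IDF variant (relative term frequency times the log of inverse document frequency). It is an illustration only: the `tfidf` function and the toy documents are made up for this example, and the weighting used by pubmedpy itself may differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted


# Toy corpus (hypothetical data, not real abstracts).
docs = [
    "aspirin reduces fever".split(),
    "aspirin reduces pain".split(),
]
weights = tfidf(docs)
```

Terms that occur in every document (here "aspirin" and "reduces") get a weight of zero under this variant, while terms unique to one document get a positive weight.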
For example, let's use the example_usage module to demonstrate the functionality:

    import pubmedpy

    pubmedpy.set_email("your@email.com")  # the address must be passed as a string
    lbl_dict = get_pmid_to_label_dict()   # maps PubMed ids to labels
    pubmedpy.fetch_and_encode(lbl_dict.keys(), "output", lbl_dict=lbl_dict)

The example_usage module builds a dictionary mapping PubMed ids to labels (what the labels mean is immaterial here). This dictionary is passed to the fetch_and_encode method in pubmedpy. That's it. Encoding takes a while, but if all goes well, the "output" directory will contain "AB" and "TI" directories (abstracts and titles, respectively). Under each of these, there should be "cleaned" and "encoded" folders. The "encoded" folder should contain a single file with your data encoded in the libsvm sparse format, shown below:
    1 163:0.0847798892434 297:0.263181728 ..

This is one abstract per line. In the line above, the abstract has label '1', and the 163rd word (which, incidentally, is 'positively', as we can see by looking at the "words_index.txt" file generated in the "cleaned" directory) has a tf-idf value of about 0.085. Note that other formats, e.g., Weka's ARFF, can be generated with very little tweaking of the code (indeed, there's already a method in the tfidf2 library to dump to ARFF).
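If you want to read such a file back in without a classification library, a line in this sparse format can be split into its label and its index:value pairs. The following is a small sketch; `parse_libsvm_line` is a hypothetical helper written for this post, not part of pubmedpy.

```python
def parse_libsvm_line(line):
    """Split one libsvm-style line into (label, {feature_index: value})."""
    tokens = line.split()
    label = tokens[0]
    features = {}
    for token in tokens[1:]:
        if ":" not in token:
            # Skip anything that is not an index:value pair.
            continue
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# The first two features of the sample line from this post.
label, features = parse_libsvm_line("1 163:0.0847798892434 297:0.263181728")
```

Only the non-zero features are stored, which is why the format stays compact even when the vocabulary is large.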
The library requires Biopython and is available on GitHub: http://github.com/bwallace/pubmedpy/