Sunday, March 29, 2009

Retrieving PubMed articles with BioPython

I needed to retrieve the abstract text and titles for all of the articles returned by a PubMed query. Fortunately, this proved even easier than I had expected, thanks to the wonderful BioPython toolkit. Indeed, I basically needed only to borrow from the examples in this tutorial.

A sample use of the completed script is as follows:
python pubmed_fetchr.py -e youremail@gmail.com -s "phylogenetic trees"
Note that the email argument lets NCBI know who is querying the system, which is good practice. The script creates a parent directory named after the query ("phylogenetic trees", in this case) with two subdirectories underneath, one for abstracts and one for titles. The abstracts directory will contain n plain text files (where n is the number of articles returned by the query), each named after a PubMed ID and containing the text of that article's abstract. Likewise, the titles directory will contain one file per article, named after its PubMed ID and containing the article title.
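To make the layout described above concrete, here is a rough sketch of how such a script might be put together. It is not the actual pubmed_fetchr.py; the function names fetch_records and write_records, the retmax default, and the directory names "abstracts" and "titles" are my assumptions for illustration. It uses Entrez.esearch, Entrez.efetch, and Medline.parse from BioPython, and relies on the MEDLINE record keys "PMID", "TI", and "AB".

```python
import os


def fetch_records(query, email, retmax=100):
    """Search PubMed and return the matching MEDLINE records as a list.

    Hypothetical helper; retmax=100 is an arbitrary cap chosen for this sketch.
    """
    # Imported here so write_records below can be used without BioPython.
    from Bio import Entrez, Medline

    Entrez.email = email  # lets NCBI know who is querying
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    id_list = Entrez.read(handle)["IdList"]
    handle.close()
    if not id_list:
        return []

    handle = Entrez.efetch(db="pubmed", id=",".join(id_list),
                           rettype="medline", retmode="text")
    records = list(Medline.parse(handle))
    handle.close()
    return records


def write_records(records, query, base_dir="."):
    """Write one abstract file and one title file per record.

    Files are named after the PubMed ID. Returns the number of records written.
    """
    abstracts_dir = os.path.join(base_dir, query, "abstracts")
    titles_dir = os.path.join(base_dir, query, "titles")
    os.makedirs(abstracts_dir, exist_ok=True)
    os.makedirs(titles_dir, exist_ok=True)

    written = 0
    for rec in records:
        pmid = rec.get("PMID")
        if not pmid:
            continue  # skip records without an ID to name the file after
        with open(os.path.join(abstracts_dir, pmid), "w", encoding="utf-8") as f:
            f.write(rec.get("AB", ""))  # some articles have no abstract
        with open(os.path.join(titles_dir, pmid), "w", encoding="utf-8") as f:
            f.write(rec.get("TI", ""))
        written += 1
    return written
```

With these two pieces, something like write_records(fetch_records("phylogenetic trees", "youremail@gmail.com"), "phylogenetic trees") should produce the directory tree described above.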

Here is the code. It would be straightforward to write out additional fields (i.e., fields other than title and abstract).
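As a rough illustration of writing out additional fields, the sketch below generalizes the idea to a mapping from MEDLINE tags to subdirectories. The function name write_fields and the directory names are hypothetical and not taken from the script; the tags "TI", "AB", "AU" (authors), and "JT" (journal title) are standard MEDLINE keys as parsed by Bio.Medline, where multi-valued fields such as "AU" come back as lists.

```python
import os

# MEDLINE tag -> subdirectory name. The directory names are illustrative only.
FIELD_DIRS = {
    "TI": "titles",
    "AB": "abstracts",
    "AU": "authors",
    "JT": "journals",
}


def write_fields(records, out_dir, field_dirs=FIELD_DIRS):
    """Write each requested field of each record to <out_dir>/<subdir>/<PMID>."""
    for subdir in field_dirs.values():
        os.makedirs(os.path.join(out_dir, subdir), exist_ok=True)

    for rec in records:
        pmid = rec.get("PMID")
        if not pmid:
            continue  # nothing to name the files after
        for tag, subdir in field_dirs.items():
            value = rec.get(tag, "")
            if isinstance(value, list):
                # Multi-valued fields (e.g. authors) go one entry per line.
                value = "\n".join(value)
            path = os.path.join(out_dir, subdir, pmid)
            with open(path, "w", encoding="utf-8") as f:
                f.write(value)
```

Adding another field is then just a matter of adding one more entry to the mapping.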

