Applying Active Learning to Biomedical Text Processing

Objective: Supervised machine learning methods have shown good performance in text classification tasks in the biomedical domain, but they often require large annotated corpora, which are costly to develop. Our goal is to assess whether active learning strategies can be integrated with supervised machine learning methods, thus reducing the annotation cost while keeping or improving the quality of classification models for biomedical text.
Methods: We applied active learning to two biomedical natural language processing (NLP) tasks: 1) the assertion classification task in the 2010 i2b2/VA Clinical NLP Challenge, which was to determine the assertion status of clinical concepts; and 2) a supervised word sense disambiguation (WSD) task, which was to disambiguate 197 ambiguous words and abbreviations in MEDLINE abstracts. We developed Support Vector Machine (SVM) based classifiers for both tasks. We then implemented several existing and newly developed active learning algorithms, integrated them with the SVM classifiers, and evaluated their performance on both tasks.
Results: In the assertion classification task, our results showed that, to achieve the same classification performance, active learning strategies required far fewer samples than the random sampling method. For example, to achieve an AUC of 0.79, the random sampling method used 32 samples, while our best active learning algorithm required only 12 samples, a reduction of 62.5% in manual annotation effort. In the WSD task, our results also demonstrated that active learners significantly outperformed the passive learner, showing better performance for 177 out of 197 (89.8%) ambiguous terms. Further analysis showed that to achieve an average accuracy of 90%, the passive learner needed 38 samples, while the active learners needed only 24 annotated samples, a 37% reduction in annotation effort. Moreover, we analyzed cases where active learning algorithms did not achieve superior performance and summarized three causes: (1) a poor model in the early learning stage; (2) easy WSD cases; and (3) difficult WSD cases. These findings provide useful insight for future improvements.
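The reduction figures quoted in the Results follow from (passive samples minus active samples) divided by passive samples. The short check below recomputes them from the sample counts given in the abstract; the helper name `annotation_reduction` is hypothetical and introduced only for illustration.

```python
def annotation_reduction(passive_samples, active_samples):
    """Percent of annotation effort saved relative to the passive (random) learner."""
    return (passive_samples - active_samples) / passive_samples * 100


# Assertion classification: 32 random samples vs. 12 actively selected samples.
assertion_saving = annotation_reduction(32, 12)   # 62.5

# WSD: 38 passive samples vs. 24 active samples (about 36.8, reported as 37%).
wsd_saving = annotation_reduction(38, 24)

# Share of ambiguous terms where active learners did better: 177 of 197.
terms_improved = 177 / 197 * 100                  # about 89.8

print(round(assertion_saving, 1), round(wsd_saving, 1), round(terms_improved, 1))
```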
Conclusion: Both studies demonstrated that integrating active learning strategies with supervised learning methods could effectively reduce annotation cost and improve the classification models in biomedical text processing.
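The abstract does not say which active learning algorithms were implemented, so the following is only a minimal sketch of one common pool-based strategy, uncertainty sampling, paired with a linear SVM. Everything here is an assumption made for illustration: the function names, the Pegasos-style subgradient training, the hyperparameters, and the synthetic two-dimensional toy pool are not taken from the thesis, which used its own SVM classifiers and features.

```python
import random


def train_linear_svm(X, y, epochs=200, lam=0.01, seed=0):
    """Linear SVM (hinge loss + L2 penalty) trained with Pegasos-style steps.

    Each x is expected to carry a constant 1.0 as its last feature,
    so the bias is learned as part of the weight vector.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [wj * (1 - eta * lam) for wj in w]  # L2 shrinkage
            if margin < 1:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def decision(w, x):
    """Signed score; its magnitude grows with distance from the hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x))


def most_uncertain(w, X, candidates):
    """Pick the unlabeled index whose point lies closest to the hyperplane."""
    return min(candidates, key=lambda i: abs(decision(w, X[i])))


def active_learn(X, oracle, seed_idx, budget):
    """Uncertainty sampling: query the annotator (oracle) one sample at a time."""
    labeled = list(seed_idx)
    labels = {i: oracle(i) for i in labeled}
    pool = [i for i in range(len(X)) if i not in labels]
    while len(labeled) < budget and pool:
        w = train_linear_svm([X[i] for i in labeled], [labels[i] for i in labeled])
        pick = most_uncertain(w, X, pool)
        pool.remove(pick)
        labels[pick] = oracle(pick)  # one simulated manual annotation
        labeled.append(pick)
    w = train_linear_svm([X[i] for i in labeled], [labels[i] for i in labeled])
    return w, labeled


def make_toy_pool(n, seed=0):
    """Synthetic two-class pool (NOT biomedical data); last feature is the bias term."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        X.append([rng.gauss(label, 1.0), rng.gauss(label, 1.0), 1.0])
        y.append(label)
    return X, y


X, y = make_toy_pool(40, seed=1)
w, labeled = active_learn(X, lambda i: y[i], seed_idx=[0, 1], budget=12)
print("queried indices:", labeled)
```

A passive baseline would replace `most_uncertain` with a random draw from `pool`; comparing the two learners at equal annotation budgets is the kind of evaluation the abstract describes.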

Identifier: oai:union.ndltd.org:VANDERBILT/oai:VANDERBILTETD:etd-07122013-162658
Date: 29 July 2013
Creators: Chen, Yukun
Contributors: Hua Xu, Joshua C. Denny, Thomas Lasko, Qiaozhu Mei
Publisher: VANDERBILT
Source Sets: Vanderbilt University Theses
Language: English
Detected Language: English
Type: text
Format: application/pdf
Source: http://etd.library.vanderbilt.edu/available/etd-07122013-162658/
Rights: unrestricted, I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to Vanderbilt University or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report.
