
Active Learning for Named Entity Recognition in Clinical Text

Named entity recognition (NER) is one of the fundamental tasks for building clinical natural language processing (NLP) systems. Machine learning (ML) based approaches can achieve good performance. However, they often require large numbers of annotated samples, which are expensive to build because annotation requires domain experts. Active learning (AL), a sample selection approach that can be integrated with supervised ML, has shown promising potential to minimize annotation cost while maximizing the performance of ML-based models in various NLP tasks. However, very few studies have investigated AL for clinical NER in a real-life setting.
In this dissertation research, I systematically studied AL in a clinical NER task to identify medical problems, treatments, and lab tests in clinical notes. Novel AL algorithms were developed to query the most informative and least costly sentences based on three properties: uncertainty, representativeness, and annotation time. I also developed the first AL-enabled annotation system for clinical NER. Using this system, I further conducted user studies to assess the performance of AL in real-world annotation processes for building clinical NER systems.
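To make the three querying properties concrete, here is a minimal Python sketch of one way they could be combined into a single score. It is not the dissertation's algorithm: the least-confidence uncertainty measure, the Jaccard-based representativeness, the linear time estimate with its `base` and `per_token` coefficients, and the `uncertainty * representativeness / cost` ranking are all assumptions chosen for illustration, as are the function names and the toy pool.

```python
def least_confidence(token_probs):
    """Uncertainty: 1 minus the product of per-token best-label probabilities."""
    confidence = 1.0
    for p in token_probs:
        confidence *= p
    return 1.0 - confidence


def jaccard(a, b):
    """Token-set overlap between two sentences, in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def representativeness(idx, pool):
    """Average similarity of sentence idx to every other sentence in the pool."""
    sims = [jaccard(pool[idx]["tokens"], s["tokens"])
            for j, s in enumerate(pool) if j != idx]
    return sum(sims) / len(sims) if sims else 0.0


def estimated_seconds(tokens, base=2.0, per_token=0.5):
    """Hypothetical linear cost model: fixed overhead plus time per token."""
    return base + per_token * len(tokens)


def rank_queries(pool, k=1):
    """Return indices of the k sentences with the highest benefit per second."""
    scored = []
    for i, s in enumerate(pool):
        benefit = least_confidence(s["token_probs"]) * representativeness(i, pool)
        scored.append((benefit / estimated_seconds(s["tokens"]), i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy pool: token_probs stand in for a model's per-token confidence (invented values).
pool = [
    {"tokens": ["patient", "denies", "chest", "pain"],
     "token_probs": [0.99, 0.99, 0.6, 0.6]},
    {"tokens": ["chest", "pain", "resolved"],
     "token_probs": [0.9, 0.9, 0.9]},
    {"tokens": ["follow", "up", "in", "two", "weeks"],
     "token_probs": [0.99, 0.99, 0.99, 0.99, 0.99]},
]
print(rank_queries(pool, k=2))
```

Dividing by estimated time means a long, uncertain sentence can rank below a shorter one that is nearly as informative, which is the intuition behind cost-aware querying.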
The initial user study showed that conventional AL methods, which do not consider annotation time, did not always perform better than random sampling across users. However, our newly developed AL algorithms with cost models for estimating annotation time were more promising in practice. To achieve an NER model with an F-measure of 0.70, simulated results show that the new AL method saved ~33.3% of estimated annotation time compared to random sampling. In the user study, the new AL algorithm achieved better performance than random sampling and saved up to ~26.5% of real annotation time for one of the users.
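A time saving at a target F-measure, as reported above, can be read off two learning curves. The sketch below shows one plausible way to compute it, assuming each curve is a list of (annotation time, F-measure) points and that linear interpolation between points is acceptable. The function names and the curve values are invented for illustration and are not data from the dissertation.

```python
def time_to_reach(curve, target):
    """Interpolated annotation time at which a learning curve first reaches target.

    curve: list of (time, f_measure) pairs sorted by time.
    Returns None if the curve never reaches the target.
    """
    prev_t, prev_f = None, None
    for t, f in curve:
        if f >= target:
            if prev_t is None:
                return float(t)  # target already met at the first point
            frac = (target - prev_f) / (f - prev_f)
            return prev_t + frac * (t - prev_t)
        prev_t, prev_f = t, f
    return None


def relative_saving(baseline_time, al_time):
    """Fraction of baseline annotation time saved by the AL method."""
    return (baseline_time - al_time) / baseline_time


# Invented toy curves: (minutes of annotation, F-measure).
random_curve = [(0, 0.40), (60, 0.60), (120, 0.80)]
al_curve = [(0, 0.40), (45, 0.60), (90, 0.80)]

t_random = time_to_reach(random_curve, 0.70)
t_al = time_to_reach(al_curve, 0.70)
print(f"saving: {relative_saving(t_random, t_al):.1%}")
```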
To the best of our knowledge, this is the first study examining practical AL systems for clinical NER. Our study demonstrates that AL has the potential to save annotation time and improve model quality for building ML-based NER systems when novel querying algorithms are implemented. Our future work includes developing better querying algorithms and evaluating the system with a larger number of users.

Identifier: oai:union.ndltd.org:VANDERBILT/oai:VANDERBILTETD:etd-06122015-162419
Date: 25 June 2015
Creators: Chen, Yukun
Contributors: Joshua C. Denny, Hua Xu, Thomas A. Lasko, Qiaozhu Mei, Qingxia Chen
Publisher: VANDERBILT
Source Sets: Vanderbilt University Theses
Language: English
Detected Language: English
Type: text
Format: application/pdf
Source: http://etd.library.vanderbilt.edu/available/etd-06122015-162419/
Rights: restrictone, I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to Vanderbilt University or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report.
