1 |
Natural language processing techniques for the purpose of sentinel event information extraction
Barrett, Neil, 23 November 2012
One approach to biomedical language processing is to apply existing natural language processing (NLP) solutions to biomedical texts. Existing NLP solutions often perform worse on biomedical text than on the non-biomedical text they were built for (e.g., newspaper text). Biomedical NLP is likely best served by methods, information and tools that account for its particular challenges. In this thesis, I describe an NLP system specifically engineered for sentinel event extraction from clinical documents. The NLP system's design accounts for several biomedical NLP challenges. The specific contributions are as follows.
- Biomedical tokenizers differ, lack consensus over output tokens and are difficult to extend. I developed an extensible tokenizer, providing a tokenizer design pattern and implementation guidelines. In evaluation, it performed equivalently to a leading biomedical tokenizer (MedPost).
- Biomedical part-of-speech (POS) taggers are often trained on non-biomedical corpora and applied to biomedical corpora, which decreases tagging accuracy. I built a token-centric POS tagger, TcT, that is more accurate than three existing POS taggers (mxpost, TnT and Brill) when trained on a non-biomedical corpus and evaluated on biomedical corpora. TcT achieves this increase in tagging accuracy by ignoring previously assigned POS tags and restricting the tagger's scope to the current token, previous token and following token.
- Two parsers, MST and Malt, have previously been evaluated using perfect POS tag input. Given that perfect input is unlikely in biomedical NLP tasks, I evaluated these two parsers on imperfect POS tag input and compared their results. MST was the more affected of the two by imperfectly POS-tagged biomedical text. I attributed MST's drop in performance to verbs and adjectives, where MST had more potential for performance loss than Malt. I attributed Malt's resilience to POS tagging errors to its use of a rich feature set and a local scope in decision making.
- Previous automated clinical coding (ACC) research focuses on mapping narrative phrases to terminological descriptions (e.g., concept descriptions). These methods make little or no use of the additional semantic information available through topology. I developed a token-based ACC approach that encodes tokens and manipulates token-level encodings by mapping linguistic structures to topological operations in SNOMED CT. My ACC method recalled most concepts given their descriptions and performed significantly better than MetaMap.
I extended my contributions for the purpose of sentinel event extraction from clinical letters. The extensions account for negation in text, use medication brand names during ACC and model (coarse) temporal information. My software system's performance is similar to state-of-the-art results. Given all of the above, my thesis is a blueprint for building a biomedical NLP system. Furthermore, my contributions likely apply to NLP systems in general. / Graduate
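The token-centric scope described for TcT above can be illustrated with a small feature extractor. This is a minimal sketch, not the thesis implementation: the function name `token_centric_features`, the boundary markers and the suffix, capitalization and digit features are assumptions made for illustration. The only properties taken from the abstract are that features come from the previous, current and following token, and that previously assigned POS tags are not used.

```python
from typing import Dict, List


def token_centric_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Build features for tokens[i] from a three-token window only.

    No previously predicted POS tag is consulted, so a tagging error on
    one token cannot propagate to the next. The surface-form features
    below are illustrative assumptions, not the thesis feature set.
    """
    # Hypothetical boundary markers for the start and end of a sentence.
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    cur = tokens[i]
    return {
        "prev": prev_tok.lower(),
        "cur": cur.lower(),
        "next": next_tok.lower(),
        "suffix3": cur[-3:].lower(),
        "is_capitalized": str(cur[:1].isupper()),
        "has_digit": str(any(ch.isdigit() for ch in cur)),
    }


if __name__ == "__main__":
    sentence = ["Patient", "denies", "chest", "pain"]
    for idx in range(len(sentence)):
        print(token_centric_features(sentence, idx))
```

Feature dictionaries of this kind could be passed to any per-token classifier; the choice of classifier is left open here because the abstract does not specify one.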
|
2 |
Deep Neural Networks for Inverse De-Identification of Medical Case Narratives in Reports of Suspected Adverse Drug Reactions / Djupa neuronnät för omvänd avidentifiering av medicinska fallbeskrivningar i biverkningsrapporter
Meldau, Eva-Lisa, January 2018
Medical research requires detailed and accurate information on individual patients. This is especially so in the context of pharmacovigilance, which, among other things, seeks to identify previously unknown adverse drug reactions. Here, the clinical stories are often the starting point for assessing whether there is a causal relationship between the drug and the suspected adverse reaction. Reliable automatic de-identification of medical case narratives could allow this patient data to be shared without compromising the patient's privacy. Research on de-identification has so far focused on labelling the tokens in a narrative with the class of sensitive information they belong to. In this Master's thesis project, we explore an inverse approach to the task of de-identification. This means that de-identification of medical case narratives is instead understood as identifying tokens which do not need to be removed from the text in order to ensure patient confidentiality. Our results show that this approach can lead to a more reliable method in terms of higher recall. We achieve a recall of sensitive information of 99.1% while the precision is kept above 51% for the 2014-i2b2 benchmark data set. The model was also fine-tuned on case narratives from reports of suspected adverse drug reactions, where a recall of sensitive information of more than 99% was achieved. Although the precision was only at a level of 55%, which is lower than in comparable systems, an expert could still identify information which would be useful for causality assessment in pharmacovigilance in most of the case narratives which were de-identified with our method. In more than 50% of the case narratives no information useful for causality assessment was missing at all. / Access to detailed clinical data is a prerequisite for conducting medical research and, by extension, for helping patients.
Reliable de-identification of medical case narratives can make it possible to share such information without endangering the protection of patients' personal data. Previous research in the area has approached the problem by labelling the words in a text with the type of sensitive information they convey. In this thesis project, we explore the possibility of approaching the problem inversely, by identifying the words that do not need to be removed in order to protect sensitive patient information. Our results show that this can de-identify a larger share of the sensitive information: 99.1% of all sensitive information is de-identified with our method, while 51% of all excluded words actually convey sensitive information, as evaluated on the 2014-i2b2 benchmark data set. The algorithm was also adapted to case narratives from adverse drug reaction reports, and in this case 99.1% of all sensitive information was de-identified, while 55% of all excluded words convey sensitive information. Although this latter share is lower than for comparable systems, an expert could find information useful for causality assessment in the majority of the de-identified reports; in more than half of the de-identified case narratives, no information of value for causality assessment was missing.
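The inverse framing and the two reported metrics can be made concrete with a short sketch. It is an illustration under stated assumptions, not the thesis model: the keep/remove decisions are supplied by hand instead of coming from a neural network, the function names and the `[REMOVED]` mask are hypothetical, and the metrics are computed per token, which may differ from how the benchmark is scored. Recall here is the share of sensitive tokens that were removed, and precision is the share of removed tokens that were actually sensitive.

```python
from typing import List, Tuple


def inverse_deidentify(tokens: List[str], keep_flags: List[bool],
                       mask: str = "[REMOVED]") -> List[str]:
    """Keep only tokens explicitly marked safe; mask everything else."""
    return [tok if keep else mask for tok, keep in zip(tokens, keep_flags)]


def removal_recall_precision(is_sensitive: List[bool],
                             keep_flags: List[bool]) -> Tuple[float, float]:
    """Token-level recall and precision of sensitive-information removal."""
    removed_sensitive = sum(
        1 for sens, keep in zip(is_sensitive, keep_flags) if sens and not keep
    )
    total_sensitive = sum(1 for sens in is_sensitive if sens)
    total_removed = sum(1 for keep in keep_flags if not keep)
    # Convention assumed here: an empty denominator yields a score of 1.0.
    recall = removed_sensitive / total_sensitive if total_sensitive else 1.0
    precision = removed_sensitive / total_removed if total_removed else 1.0
    return recall, precision


if __name__ == "__main__":
    tokens = ["John", "took", "aspirin", "on", "Monday"]
    is_sensitive = [True, False, False, False, True]  # hand-made gold labels
    keep_flags = [False, True, False, True, False]    # hand-made decisions
    print(inverse_deidentify(tokens, keep_flags))
    print(removal_recall_precision(is_sensitive, keep_flags))
```

In this toy case every sensitive token is removed, but one harmless token ("aspirin") is removed as well. That is the trade-off the abstract describes: very high recall at the cost of lower precision, with the lost tokens potentially mattering for causality assessment.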
|