
Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions

Indiana University-Purdue University Indianapolis (IUPUI)

Phenotyping definitions are essential in cohort identification when conducting
clinical research, but they become an obstacle when they are not readily available.
Developing new definitions manually requires expert involvement that is labor-intensive,
time-consuming, and unscalable. Moreover, automated approaches rely mostly on
electronic health records’ data that suffer from bias, confounding, and incompleteness.
Few efforts have used text-mining and data-driven approaches to automate the
extraction and literature-based knowledge discovery of phenotyping definitions and to
support their scalability. In this dissertation, we proposed a text-mining pipeline combining
rule-based and machine-learning methods to automate retrieval, classification, and
extraction of phenotyping definitions’ information from literature. To achieve this, we first
developed an annotation guideline with ten dimensions to annotate sentences with evidence
of phenotyping definitions' modalities, such as phenotypes and laboratories. Two
annotators manually annotated a corpus of sentences (n=3,971) extracted from full-text
observational studies’ methods sections (n=86). Percent agreement and Kappa statistics showed high
inter-annotator agreement on sentence-level annotations. Second, we constructed two
validated text classifiers using our annotated corpora: abstract-level and full-text sentence-level.
We applied the abstract-level classifier to a large-scale biomedical literature corpus of over
20 million abstracts published between 1975 and 2018 to classify positive abstracts
(n=459,406). After retrieving their full texts (n=120,868), we extracted sentences from
their methods sections and used the full-text sentence-level classifier to extract positive
sentences (n=2,745,416). Third, we performed a literature-based discovery utilizing the
positively classified sentences. Lexicon-based methods were used to recognize medical
concepts in these sentences (n=19,423). Co-occurrence and association methods were used
to identify and rank phenotype candidates that are associated with a phenotype of interest.
We derived 12,616,465 associations from our large-scale corpus. Our literature-based
associations and large-scale corpus contribute to building new data-driven phenotyping
definitions and expanding existing definitions with minimal expert involvement.
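
The co-occurrence and association step described above can be illustrated with a minimal sketch. The abstract does not say which association measure was used, so pointwise mutual information (PMI) is an assumption here. The function name `rank_associations`, the input shape (one set of recognized concepts per sentence), and the sample concepts are also hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def rank_associations(sentences, target):
    """Rank concepts that co-occur with `target` in the same sentence.

    `sentences` is a list of sets of concept names (hypothetical input shape).
    Returns (concept, co-occurrence count, PMI) tuples, strongest first.
    PMI is an assumed measure; the source does not name the one it used.
    """
    n = len(sentences)
    concept_counts = Counter()
    pair_counts = Counter()
    for concepts in sentences:
        unique = set(concepts)
        # Count each concept once per sentence.
        concept_counts.update(unique)
        # Count each unordered concept pair once per sentence.
        pair_counts.update(
            frozenset(pair) for pair in combinations(sorted(unique), 2)
        )

    scores = []
    for pair, joint in pair_counts.items():
        if target not in pair:
            continue
        (other,) = pair - {target}
        # PMI = log2( P(target, other) / (P(target) * P(other)) )
        pmi = math.log2(
            (joint * n) / (concept_counts[target] * concept_counts[other])
        )
        scores.append((other, joint, pmi))

    # Sort by PMI, then by raw co-occurrence count, then by name.
    scores.sort(key=lambda item: (-item[2], -item[1], item[0]))
    return scores


# Toy input: each set stands for the concepts recognized in one sentence.
sample = [
    {"diabetes", "hba1c"},
    {"diabetes", "hba1c"},
    {"diabetes", "hypertension"},
    {"hypertension", "asthma"},
    {"asthma"},
]
ranked = rank_associations(sample, "diabetes")
for concept, count, pmi in ranked:
    print(f"{concept}\t{count}\t{pmi:.3f}")
```

On a real corpus, a minimum co-occurrence threshold would likely be needed, since PMI tends to overrate rare pairs.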

Identifier: oai:union.ndltd.org:IUPUI/oai:scholarworks.iupui.edu:1805/20201
Date: 07 1900
Creators: Binkheder, Samar Hussein
Contributors: Jones, Josette, Li, Lang, Quinney, Sara Kay, Wu, Huanmei, Zhang, Chi
Source Sets: Indiana University-Purdue University Indianapolis
Language: en_US
Detected Language: English
Type: Dissertation
