Programmatic extraction of information from unstructured clinical data and the assessment of potential impacts on epidemiological research

Background: For epidemiological research, structured data provide identifiable and immediate access to the information that has been recorded. However, many quantitative recordings in electronic medical records are unstructured, which means researchers have to identify and extract the information of interest manually. This is costly in time and money and, as larger volumes of electronically stored data become accessible, increasingly impractical.

Method: Two programmatic methods were developed to extract and classify numeric quantities and identify attributes from unstructured dosage instructions and clinical comments in The Health Improvement Network (THIN) database. Both methods are based on frequently occurring patterns of recording, from which models were formed.

Dosage instructions: Automated coding was achieved through the interpretation of a representative set of language phrases with identifiable traits. The dosage data table was automatically recoded and assessed for accuracy and coverage of a daily dosage value, then assessed in the context of 146 commonly prescribed medications.

Clinical comments: Automated coding was achieved through the identification of a representative set of text and/or Read code qualifications. The model was initially trained on THIN data for a wide range of numeric health indicators, then tested for generalizability using comments from an alternative source and assessed for accuracy, sensitivity, and specificity using a subset of 12 commonly recorded health indicators.

Results: Dosage instructions: The coverage of a daily dosage value within the dosage data table increased from 42.1% to 84.8%, with an accuracy of 84.6%. For the 146 medications assessed, on a per-unique-instruction basis, the coverage was 79.7% on average with an accuracy of 95.4%. On an all-recorded-instructions basis, the weighted coverage was 65.9% on average with an accuracy of 99.3%.

Clinical comments: For all 12 of the health indicators assessed, the automated extraction achieved a specificity of >98% and an accuracy of >99%. The sensitivity was >96% for 8 of the indicators and between 52% and 88% for the other indicators.

Conclusion: Dosage instructions: The automated coding has improved the quantitative and qualitative summary for dosage instructions within THIN, resulting in a substantial increase in the quantity of data available for pharmaco-epidemiological research.

Clinical comments: The sensitivity of the extraction method depends on the consistency of recording patterns, which in turn depended on the ability to identify the differing patterns of qualification during training.
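As an illustration of the kind of pattern-based coding described for dosage instructions, the following is a minimal Python sketch and not the method used in the thesis. The phrase tables, the function names daily_dose and coverage, and the sample instructions are all hypothetical; the abstract does not list the actual phrase set or model. The sketch maps a free-text instruction to a daily dosage value where a quantity and a frequency phrase can both be recognised, and reports coverage as the share of instructions that received a value.

```python
import re
from typing import Iterable, Optional

# Hypothetical lookup tables (assumptions for illustration only).
NUMBER_WORDS = {"half": 0.5, "one": 1, "two": 2, "three": 3, "four": 4}
FREQUENCY_PER_DAY = {
    "once daily": 1, "once a day": 1, "daily": 1, "od": 1,
    "twice daily": 2, "twice a day": 2, "bd": 2,
    "three times daily": 3, "three times a day": 3, "tds": 3,
    "four times daily": 4, "four times a day": 4, "qds": 4,
}
QUANTITY_RE = re.compile(r"\b(\d+(?:\.\d+)?|half|one|two|three|four)\b")


def daily_dose(instruction: str) -> Optional[float]:
    """Return quantity x frequency per day, or None if the text cannot be coded."""
    text = " ".join(instruction.lower().split())
    # Try longer phrases first so "twice daily" is preferred over "daily".
    for phrase in sorted(FREQUENCY_PER_DAY, key=len, reverse=True):
        pattern = rf"\b{re.escape(phrase)}\b"
        if re.search(pattern, text):
            # Drop the frequency phrase so "three" in "three times a day"
            # is not mistaken for the quantity.
            remainder = re.sub(pattern, " ", text, count=1)
            match = QUANTITY_RE.search(remainder)
            if match is None:
                return None
            token = match.group(1)
            quantity = NUMBER_WORDS[token] if token in NUMBER_WORDS else float(token)
            return quantity * FREQUENCY_PER_DAY[phrase]
    return None


def coverage(instructions: Iterable[str]) -> float:
    """Fraction of instructions for which a daily dosage value was derived."""
    coded = [daily_dose(text) for text in instructions]
    return sum(value is not None for value in coded) / len(coded)


if __name__ == "__main__":
    samples = ["Take 2 tablets twice daily", "One to be taken three times a day", "As directed"]
    for text in samples:
        print(text, "->", daily_dose(text))
    print("coverage:", coverage(samples))
```

Passing a list of unique instructions to coverage corresponds loosely to the per-unique-instruction basis, while passing every recorded instruction, duplicates included, corresponds loosely to the all-recorded-instructions basis. Accuracy would still have to be checked against manually coded values.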

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:677934
Date: January 2015
Creators: Cochrane, Nicholas J. K.
Publisher: University of Nottingham
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
Source: http://eprints.nottingham.ac.uk/30582/
