To increase medicine safety, researchers use adverse event reports to assess causal relationships between drugs and suspected adverse reactions. VigiBase, the world's largest database of such reports, collects data from numerous sources, introducing the risk of several records referring to the same case. These duplicates negatively affect the quality of data and its analysis. Thus, efforts should be made to detect and clean them automatically. Today, VigiBase holds more than 3.8 million COVID-19 vaccine adverse event reports, making deduplication a challenging problem for existing solutions employed in VigiBase. This thesis project explores methods for this task, explicitly focusing on records with a COVID-19 vaccine. We implement Jaccard similarity, TF-IDF, and BERT to leverage the abundance of information contained in the free-text narratives of the reports. Mean-pooling is applied to create sentence embeddings from word embeddings produced by a pre-trained SapBERT model fine-tuned to maximise the cosine similarity between narratives of duplicate reports. Narrative similarity is quantified by the cosine similarity between sentence embeddings. We apply a Gradient Boosted Decision Tree (GBDT) model for classifying report pairs as duplicates or non-duplicates. For a more calibrated model, logistic regression fine-tunes the leaf values of the GBDT. In addition, the model successfully implements a ruleset to find reports whose narratives mention a unique identifier of its duplicate. The best performing model achieves 73.3% recall and zero false positives on a controlled testing dataset for an F1-score of 84.6%, vastly outperforming VigiBase’s previously implemented model's F1-score of 60.1%. Further, when manually annotated by three reviewers, it reached an average 87% precision when fully deduplicating 11756 reports amongst records relating to hearing disorders.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-477319 |
Date | January 2022 |
Creators | Turesson, Erik |
Publisher | Uppsala universitet, Avdelningen för systemteknik |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Relation | UPTEC F, 1401-5757 ; 22022 |
Page generated in 0.0025 seconds