Extracting Known Side Effects from Summaries of Product Characteristics (SmPCs) Provided in PDF Format by the European Medicines Agency (EMA) using BERT and Python
Buakhao, Rinyarat, January 2024
Medicines and vaccines have revolutionized disease prevention and treatment, offering numerous benefits. However, they also raise concerns about Adverse Drug Reactions (ADRs), which can have severe consequences. Summaries of Product Characteristics (SmPCs), provided by the European Medicines Agency (EMA), and Structured Product Labelings (SPLs), provided by the Food and Drug Administration (FDA), are valuable sources of information on drug-ADR relations. Understanding these relations is crucial because it contributes to establishing labeled datasets of known ADRs and to advancing statistical assessment methods. Uppsala Monitoring Centre (UMC) has developed a text mining pipeline to extract known ADRs from SPLs. While the pipeline works effectively with SPLs, it faces challenges with SmPCs provided in PDF format. This study explores extending the scanner component of the pipeline in two ways: using Python PDF extraction libraries to extract text from SmPCs, and fine-tuning domain-specific pre-trained BERT-based models for Named Entity Recognition (NER), a Natural Language Processing (NLP) task, to identify known ADRs in SmPCs. The investigation finds pypdfium2 [1] to be the best of the Python PDF extraction libraries examined, and finds that a fine-tuned PubMedBERT, a domain-specific language model pre-trained from scratch [2], achieves the best performance on the NER task of identifying ADRs in SmPCs. Evaluated with entity-level Exact, Covering, and Overlap match metrics, the model achieves F1-scores of 0.9138, 0.9268, and 0.9671, respectively, indicating strong performance. Consequently, the extension investigated in this study will be integrated into the existing pipeline by UMC professionals.
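To make the three entity-level match criteria concrete, the following is a minimal Python sketch of how Exact, Covering, and Overlap F1-scores could be computed over character spans. The definitions used here are assumptions for illustration, not taken from the thesis: Exact requires identical boundaries, Covering requires the predicted span to fully contain the gold span, and Overlap requires any shared character. The function names and the toy spans are hypothetical.

```python
from typing import Callable, List, Tuple

# A span is (start, end) in character offsets, with end exclusive.
Span = Tuple[int, int]


def exact(pred: Span, gold: Span) -> bool:
    """Assumed definition: boundaries are identical."""
    return pred == gold


def covering(pred: Span, gold: Span) -> bool:
    """Assumed definition: the predicted span fully contains the gold span."""
    return pred[0] <= gold[0] and pred[1] >= gold[1]


def overlap(pred: Span, gold: Span) -> bool:
    """Assumed definition: the two spans share at least one character."""
    return pred[0] < gold[1] and gold[0] < pred[1]


def entity_f1(
    pred_spans: List[Span],
    gold_spans: List[Span],
    match: Callable[[Span, Span], bool],
) -> float:
    """Entity-level F1 under a given match criterion.

    Precision counts predicted spans that match some gold span;
    recall counts gold spans that are matched by some predicted span.
    """
    matched_pred = sum(any(match(p, g) for g in gold_spans) for p in pred_spans)
    matched_gold = sum(any(match(p, g) for p in pred_spans) for g in gold_spans)
    precision = matched_pred / len(pred_spans) if pred_spans else 0.0
    recall = matched_gold / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three gold ADR mentions and three predictions.
gold = [(0, 8), (20, 28), (40, 45)]
pred = [(0, 8), (18, 30), (43, 50)]

for name, criterion in [("Exact", exact), ("Covering", covering), ("Overlap", overlap)]:
    print(name, round(entity_f1(pred, gold, criterion), 4))
```

Under these assumed definitions each criterion is more lenient than the one before it, so the scores can only stay equal or rise from Exact to Covering to Overlap, which is the same ordering as the reported F1-scores.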