Extracting Known Side Effects from Summaries of Product Characteristics (SmPCs) Provided in PDF Format by the European Medicines Agency (EMA) using BERT and Python

Medicines and vaccines have revolutionized disease prevention and treatment, but they also raise concerns about Adverse Drug Reactions (ADRs), which can have severe consequences. Summaries of Product Characteristics (SmPCs), provided by the European Medicines Agency (EMA), and Structured Product Labelings (SPLs), provided by the Food and Drug Administration (FDA), are valuable sources of information on drug-ADR relations. Understanding these relations is crucial because it contributes to establishing labeled datasets of known ADRs and to advancing statistical assessment methods. Uppsala Monitoring Centre (UMC) has developed a text mining pipeline to extract known ADRs from SPLs. While the pipeline works effectively with SPLs, it faces challenges with SmPCs provided in PDF format. This study explores extending the scanner component of the pipeline in two ways: using Python PDF extraction libraries to extract text from SmPCs, and fine-tuning domain-specific pre-trained BERT-based models for Named Entity Recognition (NER), a Natural Language Processing (NLP) task, to identify known ADRs in SmPCs. The investigation finds pypdfium2 [1] to be the most suitable Python PDF extraction library, and finds that PubMedBERT, a domain-specific language model pre-trained from scratch [2], achieves the best performance when fine-tuned for the NER task of identifying ADRs in SmPCs. Evaluated with the entity-level Exact, Covering, and Overlap match metrics, the model achieves F1-scores of 0.9138, 0.9268, and 0.9671, respectively, indicating strong performance. Consequently, the extension investigated in this study will be integrated into the existing pipeline by UMC professionals.
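
The abstract names three entity-level match criteria (Exact, Covering, Overlap) without defining them. The sketch below shows one plausible reading, not the thesis's own evaluation code. It assumes that Exact requires identical span boundaries, that Covering requires the predicted span to fully contain the gold span, and that Overlap requires the two spans to share at least one character. The span representation, function names, and sample spans are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical span format: (start_offset, end_offset_exclusive, label).
Span = Tuple[int, int, str]


def exact_match(gold: Span, pred: Span) -> bool:
    """Assumed definition: same label and identical boundaries."""
    return gold == pred


def covering_match(gold: Span, pred: Span) -> bool:
    """Assumed definition: same label and the prediction contains the gold span."""
    return gold[2] == pred[2] and pred[0] <= gold[0] and pred[1] >= gold[1]


def overlap_match(gold: Span, pred: Span) -> bool:
    """Assumed definition: same label and the spans share at least one character."""
    return gold[2] == pred[2] and pred[0] < gold[1] and gold[0] < pred[1]


def entity_f1(gold: List[Span], pred: List[Span],
              match: Callable[[Span, Span], bool]) -> float:
    """Entity-level F1 under the given match criterion."""
    # A prediction counts toward precision if it matches any gold span;
    # a gold span counts toward recall if any prediction matches it.
    matched_pred = sum(any(match(g, p) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up character offsets for illustration only.
gold_spans = [(0, 8, "ADR"), (20, 35, "ADR"), (50, 58, "ADR")]
pred_spans = [(0, 8, "ADR"), (18, 35, "ADR"), (52, 60, "ADR"), (70, 75, "ADR")]

f1_exact = entity_f1(gold_spans, pred_spans, exact_match)
f1_covering = entity_f1(gold_spans, pred_spans, covering_match)
f1_overlap = entity_f1(gold_spans, pred_spans, overlap_match)
```

Under these assumed definitions each criterion is looser than the one before it, so the scores can only stay equal or rise from Exact to Covering to Overlap, which agrees with the ordering of the F1-scores reported in the abstract.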

http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-528716

Datavetenskap (datalogi)

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-528716
Date	January 2024
Creators	Buakhao, Rinyarat
Publisher	Uppsala universitet, Institutionen för informationsteknologi
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	IT ; mDV 24 006
