Pre-training a knowledge enhanced model in biomedical domain for information extraction

While recent years have seen a rise in research on knowledge graph enriched pre-trained language models (PLMs), few studies have tried to transfer this work to the biomedical domain. This thesis is a first attempt to pre-train a large-scale biological knowledge enriched pre-trained language model (KPLM). Under the framework of CoLAKE (T. Sun et al., 2020), a general-domain KPLM, the model in this study is pre-trained on PubMed abstracts (a large-scale medical text dataset) and BIKG (AstraZeneca’s biological knowledge graph). We first obtain abstracts from PubMed together with their entity linking results. We then connect the entities in the abstracts to BIKG to form sub-graphs. These sub-graphs and the sentences from the PubMed abstracts are then passed to CoLAKE for pre-training. By training the model on three objectives (masking word nodes, masking entity nodes and masking relation nodes), this research aims not only to enhance the model’s capacity for modeling natural language but also to infuse in-depth knowledge. The model is then fine-tuned on named entity recognition (NER) and relation extraction tasks on three benchmark datasets: ChemProt (Kringelum et al., 2016), DrugProt (from the Text mining drug-protein/gene interactions shared task) and DDI (Segura-Bedmar et al., 2013). Empirical results show that the model outperforms state-of-the-art models on the relation extraction task on the DDI dataset, with an F1 score of 91.2%. On DrugProt and ChemProt, the model also shows improvement over the SciBERT baseline.
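The pre-training input described in the abstract (a sentence joined with a knowledge sub-graph, then masked by node type) can be illustrated with a minimal Python sketch. This is not the thesis code or the CoLAKE implementation: the names `Node`, `build_graph_sequence` and `mask_nodes`, the entity identifiers, and the toy triple are all hypothetical, and the uniform masking probability is an assumption made for brevity.

```python
import random
from dataclasses import dataclass

MASK = "[MASK]"


@dataclass
class Node:
    text: str
    kind: str  # "word", "entity", or "relation"


def build_graph_sequence(tokens, linked_entities, triples):
    """Flatten a sentence and its linked sub-graph into one node sequence.

    tokens: words of the sentence
    linked_entities: token index -> knowledge graph entity id (hypothetical ids)
    triples: (head_id, relation, tail_id) edges taken from the knowledge graph
    """
    nodes = []
    for i, tok in enumerate(tokens):
        if i in linked_entities:
            # A linked mention is represented by its entity node.
            nodes.append(Node(linked_entities[i], "entity"))
        else:
            nodes.append(Node(tok, "word"))
    anchors = set(linked_entities.values())
    for head, rel, tail in triples:
        if head in anchors:
            # Attach the relation and tail entity of each matching triple.
            nodes.append(Node(rel, "relation"))
            nodes.append(Node(tail, "entity"))
    return nodes


def mask_nodes(nodes, prob, rng):
    """Mask word, entity and relation nodes alike with probability `prob`.

    Returns the masked sequence and per-position labels (None if unmasked).
    """
    masked, labels = [], []
    for node in nodes:
        if rng.random() < prob:
            masked.append(Node(MASK, node.kind))
            labels.append(node.text)
        else:
            masked.append(node)
            labels.append(None)
    return masked, labels


# Toy example with made-up identifiers, for illustration only.
tokens = ["aspirin", "inhibits", "COX1"]
linked = {0: "ENT_aspirin", 2: "ENT_COX1"}
triples = [("ENT_aspirin", "REL_example", "ENT_other")]
nodes = build_graph_sequence(tokens, linked, triples)
masked, labels = mask_nodes(nodes, prob=0.15, rng=random.Random(0))
```

Keeping the node kind on each masked position lets a training loop compute a separate prediction loss per node type, which corresponds to the three objectives named in the abstract.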

http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-467146

bio-nlp

knowledge enhanced language model

pre-training

information extraction

transformer

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-467146
Date	January 2022
Creators	Yan, Xi
Publisher	Uppsala universitet, Institutionen för lingvistik och filologi
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
