The recently developed state-of-the-art models for Named Entity Recognition are heavily dependent upon huge amounts of available annotated data. Consequently, it is extremely challenging for data-scarce languages to obtain significant result. Several approaches have been proposed to circumvent this issue, including cross-lingual transfer learning, which is the leveraging of knowledge obtained by available resources in the source language and transfer it to a target low-resource language. Maltese is one of the many majorly underresourced languages. The main purpose of this project is to research how recently developed transformer multilingual models (Multilingual BERT and XLM-RoBERTa) perform and to ultimately set up an evaluation benchmark in zero-shot cross-lingual transfer learning for Maltese Named Entity Recognition. The models are fine-tuned on Arabic, English, Italian, Spanish and Dutch. The experiments evaluated the efficacy of the source languages and the use of multilingual data in both the training and validation stages. The experiments demonstrated that feeding multilingual data to both the training and the validation phases was mostly beneficial to the performance. However, adding it to the validation phase only was generally detrimental. Furthermore, XLM-R achieved overall better scores however, employing mBERT and English as the source language yielded the best performance.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-479629 |
Date | January 2022 |
Creators | Farrugia, Kris |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0017 seconds