Low-Resource Domain Adaptation for Jihadi Discourse: Tackling Low-Resource Domain Adaptation for Neural Machine Translation Using Real and Synthetic Data

In this thesis, I explore the problem of low-resource domain adaptation for jihadi discourse. Because annotated parallel data is scarce, developing accurate and effective models for this domain is challenging. To address this issue, I propose a method that combines a small, manually created in-domain corpus with a synthetic corpus created from monolingual data using back-translation. I evaluate the approach by fine-tuning a pre-trained language model on different proportions of real and synthetic data and measuring its performance on a held-out test set. My experiments show that fine-tuning a model on one-fifth real parallel data and synthetic parallel data effectively reduces occurrences of over-translation and improves the model's ability to translate in-domain terminology. My findings suggest that synthetic data can be a valuable resource for low-resource domain adaptation, especially when real parallel data is difficult to obtain. The proposed method can be extended to other low-resource domains where annotated data is scarce, potentially leading to more accurate models and better translation in these domains.
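
The data setup described in the abstract (back-translated synthetic pairs mixed with real pairs at a chosen proportion) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual pipeline: the names `back_translate`, `mix_corpora`, and `reverse_translate` are hypothetical, the reverse translation model is represented by an arbitrary callable, and reading "one-fifth" as the share of real pairs in the training mix is an assumption.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],  # hypothetical target-to-source model
) -> List[Pair]:
    """Build synthetic pairs: machine-generated source, original monolingual target."""
    return [(reverse_translate(t), t) for t in target_sentences]


def mix_corpora(
    real: List[Pair],
    synthetic: List[Pair],
    real_fraction: float,
    size: int,
    seed: int = 0,
) -> List[Pair]:
    """Sample `size` training pairs, of which `real_fraction` come from the real corpus."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    n_real = round(size * real_fraction)
    n_synthetic = size - n_real
    if n_real > len(real) or n_synthetic > len(synthetic):
        raise ValueError("not enough pairs to build a mix of the requested size")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)  # interleave real and synthetic pairs
    return mixed
```

A call such as `mix_corpora(real, synthetic, real_fraction=0.2, size=10)` would return ten shuffled pairs, two of them drawn from the real corpus; the resulting list could then be passed to whatever fine-tuning routine is in use.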

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-503371
Date: January 2023
Creators: Tollersrud, Thea
Publisher: Uppsala universitet, Institutionen för lingvistik och filologi
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
