
Monolingual and Cross-Lingual Survey Response Annotation

Multilingual natural language processing (NLP) is increasingly recognized for its potential in processing diverse text types, including social media posts, reviews, and technical reports. Multilingual language models such as mBERT and XLM-RoBERTa (XLM-R) play a pivotal role in multilingual NLP. Notwithstanding their capabilities, the performance of these models largely depends on the availability of annotated training data. This thesis employs the pre-trained multilingual model XLM-R and examines its efficacy in sequence labelling of open-ended responses to questions on democracy across multilingual surveys. Traditional annotation practices have been labour-intensive and time-consuming, with few attempts at automation. Previous studies have often translated multilingual data into English, bypassing the challenges and nuances of the native languages. Our study explores automatic multilingual annotation at the token level for democracy survey responses in five languages: Hungarian, Italian, Polish, Russian, and Spanish. The results reveal promising F1 scores, indicating the feasibility of using multilingual models for such tasks. However, the performance of these models is closely tied to the quality and nature of the training set. This research paves the way for future experiments and model adjustments, underscoring the importance of refining the training data and optimizing modelling techniques for improved classification accuracy.
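As context for the approach described above, the following is a minimal sketch of what token-level sequence labelling with XLM-R typically looks like using the Hugging Face transformers library. The label set, model checkpoint, and example sentence are illustrative assumptions, not details taken from the thesis; in practice the classification head would first be fine-tuned on the annotated survey responses.

```python
# Minimal sketch: token classification with XLM-R (assumptions noted in comments).
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Hypothetical label set for marking annotated spans in responses about democracy.
labels = ["O", "B-ASPECT", "I-ASPECT"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)  # the token-classification head is randomly initialized until fine-tuned

# Example response (Spanish); the shared multilingual tokenizer covers all five languages.
text = "La libertad de expresión es esencial para la democracia."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)
predictions = logits.argmax(dim=-1)[0]

# Map predicted label ids back onto the subword tokens (special tokens kept for brevity).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, labels[pred])
```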

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-516594
Date: January 2023
Creators: Zhao, Yahui
Publisher: Uppsala universitet, Institutionen för lingvistik och filologi
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess