Cross-Lingual and Genre-Supervised Parsing and Tagging for Low-Resource Spoken Data

Working with low-resource languages is challenging because there is too little data to train machine-learning models to make predictions for them. One way to address this problem is to use data from higher-resource languages, which allows learning to be transferred from those languages to the low-resource targets. The present study focuses on dependency parsing and part-of-speech tagging of low-resource languages in the spoken genre, i.e., languages whose treebank data is transcribed speech: Beja, Chukchi, Komi-Zyrian, Frisian-Dutch, and Cantonese. Our approach investigates different types of transfer languages using MACHAMP, a state-of-the-art parser and tagger that uses contextualized word embeddings, in particular mBERT and XLM-R. The main idea is to explore how genre, language similarity, neither of the two, or a combination of both affects model performance on these downstream tasks for our selected target treebanks. Our findings suggest that capturing speech-specific dependency relations requires incorporating at least some genre-matching source data, whereas source data matched on language similarity is the better choice for part-of-speech tagging. We also explore the impact of multi-task learning in one of our proposed methods, but observe only minor differences in model performance.

http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-505707

dependency parsing

part-of-speech tagging

low-resource languages

transcribed speech

large language models

cross-lingual learning

transfer learning

multi-task learning

Universal Dependencies

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-505707
Date	January 2023
Creators	Fosteri, Iliana
Publisher	Uppsala universitet, Institutionen för lingvistik och filologi
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
