This thesis investigates the development of an automatic speech recognition (ASR) system for Northern Sotho, a low-resource language of South Africa. Low-resource languages face challenges such as limited linguistic data and insufficient computational resources. To alleviate these challenges, the multilingual Wav2Vec2-XLSR model is fine-tuned on Northern Sotho speech data using two main strategies to improve ASR performance: adding background noise during training, and semi-supervised learning with additional generated labels. An additional dataset compiled from Northern Sotho news is used to evaluate the models. The experiments demonstrate that moderate levels of background noise can enhance model robustness, whereas excessive noise degrades performance, particularly on clean data. Semi-supervised learning with generated labels proves beneficial, especially when working with smaller labelled datasets, though the best results are always achieved with large, in-domain labelled datasets. The last finding is confirmed by the additional news dataset, which proved extremely challenging: models trained on clean data produced high error rates, and noise augmentation brought only limited benefits.
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-533377 |
Date | January 2024 |
Creators | Przezdziak, Agnieszka |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |