11

Cross-Lingual and Genre-Supervised Parsing and Tagging for Low-Resource Spoken Data

Fosteri, Iliana, January 2023
Dealing with low-resource languages is challenging because of the absence of sufficient data to train machine-learning models to make predictions on these languages. One way to deal with this problem is to use data from higher-resource languages, which enables the transfer of learning from those languages to the low-resource targets. The present study focuses on dependency parsing and part-of-speech tagging of low-resource languages belonging to the spoken genre, i.e., languages whose treebank data is transcribed speech: Beja, Chukchi, Komi-Zyrian, Frisian-Dutch, and Cantonese. Our approach involves investigating different types of transfer languages, employing MaChAmp, a state-of-the-art parser and tagger that uses contextualized word embeddings, mBERT and XLM-R in particular. The main idea is to explore how genre matching, language similarity, neither of the two, or the combination of both affects model performance on the aforementioned downstream tasks for our selected target treebanks. Our findings suggest that capturing speech-specific dependency relations requires incorporating at least a small amount of genre-matching source data, while language-similarity-matching source data are the better candidate when the task at hand is part-of-speech tagging. We also explore the impact of multi-task learning in one of our proposed methods, but we observe only minor differences in model performance.
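As an illustration of the transfer setup this abstract describes, below is a minimal sketch of assembling a genre-matched versus language-similarity-matched training mixture from CoNLL-U treebanks, assuming the Python conllu package; the file paths, groupings, and sentence limit are hypothetical, not the thesis's actual configuration.

from conllu import parse_incr

def load_sentences(path, limit=None):
    # Read CoNLL-U sentences as (word forms, UPOS tags) pairs,
    # skipping multiword-token ranges (non-integer ids).
    sentences = []
    with open(path, encoding="utf-8") as f:
        for i, sent in enumerate(parse_incr(f)):
            if limit is not None and i >= limit:
                break
            words = [t["form"] for t in sent if isinstance(t["id"], int)]
            tags = [t["upos"] for t in sent if isinstance(t["id"], int)]
            sentences.append((words, tags))
    return sentences

# Hypothetical source treebanks grouped along the two axes the study varies:
# genre-matched (spoken data, unrelated language) and similarity-matched
# (related language, written genre).
genre_matched = "ud/xx_spoken-ud-train.conllu"
similarity_matched = "ud/fi_tdt-ud-train.conllu"

# Mix a small amount of genre-matching data into a similarity-matched set,
# the kind of combination the findings above favour for dependency parsing.
train_data = load_sentences(similarity_matched)
train_data += load_sentences(genre_matched, limit=500)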
12

Interactive Machine Assistance: A Case Study in Linking Corpora and Dictionaries

Black, Kevin P, 01 November 2015
Machine learning can provide assistance to humans in making decisions, including linguistic decisions such as determining the part of speech of a word. Supervised machine learning methods derive patterns indicative of possible labels (decisions) from annotated example data. For many problems, including most language analysis problems, acquiring annotated data requires human annotators who are trained to understand the problem and to disambiguate among multiple possible labels. Hence, the availability of experts can limit the scope and quantity of annotated data. Machine-learned pre-annotation assistance, which suggests probable labels for unannotated items, can enable expert annotators to work more quickly and thus to produce broader and larger annotated resources more cost-efficiently. Yet, because annotated data is required to build the pre-annotation model, bootstrapping is an obstacle to utilizing pre-annotation assistance, especially for low-resource problems where little or no annotated data exists. Interactive pre-annotation assistance can mitigate bootstrapping costs, even for low-resource problems, by continually refining the pre-annotation model with new annotated examples as the annotators work. In practice, continual refinement has seldom been done except for the simplest of models, which can be trained quickly. As a case study in developing sophisticated, interactive, machine-assisted annotation, this work employs the task of corpus-dictionary linkage (CDL): linking each word token in a corpus to its correct dictionary entry. CDL resources, such as machine-readable dictionaries and concordances, are essential aids in many tasks, including language learning and corpus studies. We employ a pipeline model to provide CDL pre-annotations, with one model per CDL sub-task. We evaluate different models for lemmatization, the most significant CDL sub-task, since dictionary entry headwords are usually lemmas. The best-performing lemmatization model is a hybrid that uses a maximum entropy Markov model (MEMM) to handle unknown (novel) word tokens and other component models to handle known word tokens. We extend the hybrid model design to the other CDL sub-tasks in the pipeline. We develop an incremental training algorithm for the MEMM which avoids discarding previous computation, as simply retraining from scratch would. The incremental training algorithm facilitates the addition of new dictionary entries (i.e., new labels) over time and also facilitates learning from partially annotated sentences, which allows annotators to annotate words in any order. We validate that the hybrid model attains high accuracy and can be trained quickly enough to provide interactive pre-annotation assistance by simulating CDL annotation on Quranic Arabic and classical Syriac data.
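A minimal sketch of the hybrid dispatch idea described above, not the thesis's implementation: a lexicon handles known tokens, a learned backoff handles novel ones, and annotator confirmations are folded back in incrementally. The class and method names, and the toy backoff standing in for the MEMM, are assumptions for illustration.

# Toy stand-in for the MEMM backoff, just so the sketch runs end to end;
# a real MEMM would score candidate lemmas from character-level features.
class IdentityBackoff:
    def predict(self, token):
        return token           # naive guess: the surface form itself

    def partial_fit(self, token, lemma):
        pass                   # an incremental weight update would go here

class HybridLemmatizer:
    def __init__(self, lexicon, backoff):
        self.lexicon = dict(lexicon)   # {word form: lemma}, from annotated data
        self.backoff = backoff

    def lemmatize(self, token):
        if token in self.lexicon:              # known token: trust the lexicon
            return self.lexicon[token]
        return self.backoff.predict(token)     # novel token: statistical model

    def confirm(self, token, lemma):
        # Interactive refinement: each annotator-confirmed label improves
        # both components without retraining from scratch.
        self.lexicon[token] = lemma
        self.backoff.partial_fit(token, lemma)

lemmatizer = HybridLemmatizer({"running": "run"}, IdentityBackoff())
print(lemmatizer.lemmatize("running"))   # "run", via the lexicon
print(lemmatizer.lemmatize("walked"))    # backoff guess for a novel token
lemmatizer.confirm("walked", "walk")     # fold the correction back in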
13

Language identification for typologically similar low-resource languages: A case study of Meänkieli, Kven and Finnish

Larsson, Jacob, January 2024
This study examines different methods of language identification for the languages Meänkieli, Kven, and Finnish. The methods explored are two n-gram-based classifiers, naive Bayes and TextCat, and one word-embedding-based classifier, fastText. These models were trained on approximately 100 000 sentences taken from the three languages, further divided into four separate datasets to examine how the availability of data impacts the final performance of the trained models. The study found that the best model for the examined dataset was the fastText classifier, but for languages with less available material a naive Bayes classifier might be more appropriate.
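A minimal sketch of the two classifier families the study compares, assuming training sentences in fastText's __label__ format with the ISO 639-3 codes fit (Meänkieli), fkv (Kven), and fin (Finnish); the file name, hyperparameters, and toy sentences below are illustrative, not the thesis's setup.

import fasttext
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# fastText: a supervised classifier over character n-gram (subword) features.
# Lines in train.txt look like: "__label__fin tämä on suomenkielinen lause"
ft_model = fasttext.train_supervised(input="train.txt", minn=2, maxn=5, epoch=10)
labels, probabilities = ft_model.predict("tämä on esimerkkilause")

# Naive Bayes over character n-grams: the lighter-weight alternative that the
# findings above suggest may suit languages with very little material.
nb_model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
train_sentences = ["tämä on suomenkielinen lause", "meänkielinen esimerkkilause"]
train_labels = ["fin", "fit"]            # toy data; real training needs far more
nb_model.fit(train_sentences, train_labels)
print(nb_model.predict(["tämä on esimerkkilause"]))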
