1 |
Hybrid Machine Translation Approaches for Low-Resource Languages. Kamran, Amir. January 2011.
In recent years, corpus-based machine translation systems have produced significant results for a number of language pairs. However, for low-resource languages like Urdu, purely statistical or purely example-based methods do not perform well. On the other hand, rule-based approaches require a huge amount of time and resources to develop the rules, which makes them impractical in most scenarios. Hybrid machine translation systems may be one way to overcome these problems, combining the best of different approaches to achieve quality translation. The goal of the thesis is to explore different combinations of approaches and to evaluate their performance against the standard corpus-based methods currently in use. This includes: 1. Use of syntax-based and dependency-based reordering rules with statistical machine translation. 2. Automatic extraction of lexical and syntactic rules using statistical methods to facilitate transfer-based machine translation. The novel element of the proposed work is an algorithm that learns reordering rules automatically for English-to-Urdu statistical machine translation. Moreover, this approach can be extended to learn lexical and syntactic rules for building a rule-based machine translation system.
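As a purely illustrative example of what a single dependency-based reordering rule can do, the sketch below moves an English main verb after its direct object to approximate Urdu's SOV constituent order before the sentence would be passed to an SMT system. The rule, the toy dependency encoding, and the example sentence are all hypothetical; the thesis learns such rules automatically rather than writing them by hand.

```python
# Toy illustration of dependency-based pre-reordering for English -> Urdu SMT.
# This hand-written rule is only a hypothetical example of what one learned
# reordering rule might do; it is not the thesis's algorithm.

def reorder_sov(tokens, heads, deprels):
    """Move the main verb after its object to mimic Urdu SOV order.

    tokens  : list of surface words
    heads   : head index for each token (-1 for the root)
    deprels : dependency relation of each token to its head
    """
    order = list(range(len(tokens)))
    try:
        verb = heads.index(-1)                      # root verb
        obj = next(i for i, (h, r) in enumerate(zip(heads, deprels))
                   if h == verb and r == "obj")     # direct object of the verb
    except (ValueError, StopIteration):
        return [tokens[i] for i in order]           # rule does not apply
    order.remove(verb)
    order.insert(order.index(obj) + 1, verb)        # place verb right after object
    return [tokens[i] for i in order]

# "John reads books" -> "John books reads", matching Urdu constituent order.
print(reorder_sov(["John", "reads", "books"],
                  [1, -1, 1],
                  ["nsubj", "root", "obj"]))
```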
|
2 |
Neural Dependency Parsing of Low-resource Languages: A Case Study on Marathi. Zhang, Wenwen. January 2022.
Cross-lingual transfer has been shown to be effective for dependency parsing of some low-resource languages, but it typically requires closely related high-resource languages. Pre-trained deep language models significantly improve model performance in cross-lingual tasks. We evaluate cross-lingual model transfer for parsing Marathi, a low-resource language that does not have a closely related high-resource language, and investigate monolingual modeling for comparison. We experiment with two state-of-the-art language models: mBERT and XLM-R. Our experimental results illustrate that the cross-lingual transfer approach still holds with distantly related source languages, and that models benefit most from XLM-R. We also evaluate the impact of multi-task learning by training all UD tasks simultaneously, and find that it yields mixed results for dependency parsing and degrades the transfer performance of the best-performing source language, Ancient Greek.
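A minimal sketch of the cross-lingual transfer recipe described above: fine-tune a multilingual encoder on a source-language UD task and apply it zero-shot to Marathi. POS tagging stands in here for one UD task, the two toy sentences stand in for real treebanks, and only the Hugging Face checkpoint name (xlm-roberta-base) is a real identifier; everything else is an assumption for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tagset = ["NOUN", "VERB", "PRON", "ADP", "PUNCT"]
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(tagset))

def encode(words, tags=None):
    enc = tok(words, is_split_into_words=True, return_tensors="pt", truncation=True)
    if tags is None:
        return enc, None
    aligned, prev = [], None
    for wid in enc.word_ids(0):
        # Label only the first sub-token of each word; the rest are ignored (-100).
        aligned.append(-100 if wid is None or wid == prev else tagset.index(tags[wid]))
        prev = wid
    return enc, torch.tensor([aligned])

# Toy source-language sentence standing in for a real source treebank
# (the abstract reports Ancient Greek as the best source for Marathi).
src_words = ["She", "reads", "books", "."]
src_tags = ["PRON", "VERB", "NOUN", "PUNCT"]

opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
enc, y = encode(src_words, src_tags)
for _ in range(3):                       # a few toy gradient steps
    loss = model(**enc, labels=y).loss
    loss.backward(); opt.step(); opt.zero_grad()

# Zero-shot prediction on a Marathi sentence never seen during fine-tuning.
mr_words = ["ती", "पुस्तक", "वाचते", "."]
enc, _ = encode(mr_words)
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)[0].tolist()
seen = set()
for i, wid in enumerate(enc.word_ids(0)):
    if wid is not None and wid not in seen:
        seen.add(wid)
        print(mr_words[wid], tagset[pred[i]])
```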
|
3 |
Pivot-Based Bilingual Dictionary Creation for Low-Resource Languages. Mairidan, Wushouer. 23 March 2015.
Kyoto University / 0048 / New-system doctorate by coursework / Doctor of Informatics / 甲第19117号 / 情博第563号 / 新制||情||99 (University Library) / 32068 / Kyoto University, Graduate School of Informatics, Department of Social Informatics / (Chief examiner) Professor Toru Ishida, Professor Masatoshi Yoshikawa, Professor Tatsuya Kawahara / Qualifies under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
|
4 |
Flexible Structured Prediction in Natural Language Processing with Partially Annotated Corpora. Xiao Zhang. 29 April 2020.
Structured prediction makes coherent decisions over structured objects that capture the interrelations of the predicted variables. It has been widely used in many areas, such as bioinformatics, computer vision, speech recognition, and natural language processing. Machine learning with reduced supervision aims to alleviate the laborious and error-prone annotation effort and to benefit low-resource languages. In this dissertation we study structured prediction with reduced supervision for two sets of problems, sequence labeling and dependency parsing, both of which are representative structured prediction problems in NLP. We investigate three different approaches.

The first approach is learning with a modular architecture by task decomposition. By decomposing the labels into a location sub-label and a type sub-label, we designed neural modules to tackle these sub-labels separately, with an additional module to fuse their information. Experiments on benchmark datasets show that the modular architecture outperforms existing models and can use partially labeled data together with fully labeled data to improve on the performance of using fully labeled data alone.

The second approach builds the neural CRF autoencoder (NCRFAE) model, which combines a discriminative component and a generative component for semi-supervised sequence labeling. The model has a unified structure with shared parameters, using different loss functions for labeled and unlabeled data. We developed a variant of the EM algorithm for optimizing the model with tractable inference. Experiments on several languages in the POS tagging task show that the model outperforms existing systems in both the supervised and the semi-supervised setup.

The third approach builds two models for semi-supervised dependency parsing, namely the local autoencoding parser (LAP) and the global autoencoding parser (GAP). LAP assumes the chain-structured sentence has a latent representation and uses this representation to construct the dependency tree, while GAP treats the dependency tree itself as a latent variable. Both models have unified structures for sentences with and without an annotated parse tree. Experiments on several languages show that both parsers can use unlabeled sentences to improve on the performance with labeled sentences alone; LAP is faster, while GAP outperforms existing models.
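A highly simplified sketch of the labeled/unlabeled loss split behind an autoencoder-style semi-supervised tagger such as the NCRFAE. It keeps only the idea of shared parameters with a discriminative loss on labeled batches and a reconstruction loss on unlabeled batches; the CRF transition structure, the EM variant, and all sizes and data here are omitted or invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTaggerAE(nn.Module):
    def __init__(self, vocab=1000, emb=64, hidden=64, n_tags=10):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.enc = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, n_tags)    # discriminative component
        self.decode = nn.Linear(n_tags, vocab)        # generative component

    def forward(self, x):
        h, _ = self.enc(self.emb(x))
        return self.score(h)                          # tag logits per position

    def loss_labeled(self, x, y):
        return F.cross_entropy(self(x).transpose(1, 2), y)

    def loss_unlabeled(self, x):
        # Reconstruct each word from the (soft) predicted tag distribution.
        recon = self.decode(self(x).softmax(-1))
        return F.cross_entropy(recon.transpose(1, 2), x)

model = TinyTaggerAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x_lab = torch.randint(0, 1000, (4, 12)); y_lab = torch.randint(0, 10, (4, 12))
x_unl = torch.randint(0, 1000, (4, 12))

for _ in range(3):                                    # toy training steps
    loss = model.loss_labeled(x_lab, y_lab) + model.loss_unlabeled(x_unl)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```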
|
5 |
Bilingual Lexicon Induction Framework for Closely Related Languages. Arbi, Haza Nasution. 25 September 2018.
Kyoto University / 0048 / New-system doctorate by coursework / Doctor of Informatics / 甲第21395号 / 情博第681号 / 新制||情||117 (University Library) / Kyoto University, Graduate School of Informatics, Department of Social Informatics / (Chief examiner) Professor Toru Ishida, Professor Masatoshi Yoshikawa, Professor Tatsuya Kawahara / Qualifies under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
|
6 |
Leveraging Degree of Isomorphism to Improve Cross-Lingual Embedding Space for Low-Resource Languages. Bhowmik, Kowshik. January 2022.
No description available.
|
7 |
Breaking Language Barriers: Enhancing Multilingual Representation for Sentence Alignment and Translation. Mao, Zhuoyuan. 25 March 2024.
Kyoto University / New-system doctorate by coursework / Doctor of Informatics / 甲第25420号 / 情博第858号 / 新制||情||144 (University Library) / Kyoto University, Graduate School of Informatics, Department of Intelligence Science and Technology / (Chief examiner) Program-Specific Professor Sadao Kurohashi, Professor Tatsuya Kawahara, Professor Hisashi Kashima / Qualifies under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
|
8 |
Unsupervised word discovery for computational language documentation. Godard, Pierre. 16 April 2019.
Language diversity is under considerable pressure: half of the world's languages could disappear by the end of this century. This realization has sparked many initiatives in documentary linguistics in the past two decades, and 2019 was proclaimed the International Year of Indigenous Languages by the United Nations, to raise public awareness of the issue and foster initiatives for language documentation and preservation. Yet documentation and preservation are time-consuming processes, and the supply of field linguists is limited. Consequently, the emerging field of computational language documentation (CLD) seeks to assist linguists by providing them with automatic processing tools. The Breaking the Unwritten Language Barrier (BULB) project, for instance, constitutes one of the efforts defining this new field, bringing together linguists and computer scientists. This thesis examines the particular problem of discovering words in an unsegmented stream of characters, or phonemes, transcribed from speech in a very-low-resource setting. This primarily involves a segmentation procedure, which can also be paired with an alignment procedure when a translation is available. Using two realistic Bantu corpora for language documentation, one in Mboshi (Republic of the Congo) and the other in Myene (Gabon), we benchmark various monolingual and bilingual unsupervised word discovery methods. We then show that using expert linguistic knowledge within the Adaptor Grammar framework can vastly improve segmentation results, and we indicate ways to use this framework as a decision tool for the linguist. We also propose a tonal variant of a strong nonparametric Bayesian segmentation algorithm, making use of a modified backoff scheme designed to capture tonal structure. To leverage the weak supervision given by a translation, we finally propose and extend an attention-based neural segmentation method, significantly improving the segmentation performance of an existing bilingual method.
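To make the word-discovery task concrete, here is a toy dynamic-programming segmenter that recovers word boundaries from an unsegmented character stream under a fixed unigram lexicon. It is not the nonparametric Bayesian or attention-based models of the thesis, and the lexicon and utterance below are invented.

```python
import math

def segment(stream, lexicon, max_len=8):
    """Return the highest-probability segmentation of `stream` under a unigram model."""
    n = len(stream)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = stream[i:j]
            if word in lexicon and best[i][0] > -math.inf:
                score = best[i][0] + math.log(lexicon[word])
                if score > best[j][0]:
                    best[j] = (score, i)
    if best[n][0] == -math.inf:
        return None                              # no full segmentation found
    words, j = [], n
    while j > 0:
        i = best[j][1]
        words.append(stream[i:j])
        j = i
    return list(reversed(words))

# Hypothetical lexicon and utterance; real inputs would be phoneme transcriptions.
lexicon = {"mbo": 0.2, "shi": 0.2, "lendo": 0.3, "a": 0.3}
print(segment("amboshilendo", lexicon))          # -> ['a', 'mbo', 'shi', 'lendo']
```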
|
9 |
Head-to-head Transfer Learning Comparisons made Possible: A Comparative Study of Transfer Learning Methods for Neural Machine Translation of the Baltic Languages. Stenlund, Mathias. January 2023.
The difficulty of training adequate MT models with data-hungry NMT frameworks for low-resource language pairs has created a need to alleviate the scarcity of sufficiently large parallel corpora. Different transfer learning methods have been introduced as possible solutions to this problem, where a new model for a target task is initialized using parameters learned from some other high-resource task. Many of these methods are claimed to increase the translation quality of NMT systems in some low-resource environments; however, these claims are usually demonstrated with different parent and child language pairs, data sizes, NMT frameworks, and training hyperparameters, which makes direct comparison impossible. In this thesis project, three such transfer learning methods are put head-to-head in a controlled environment where the target task is to translate from the under-resourced Baltic languages Lithuanian and Latvian to English. In this controlled environment, the same parent language pairs, data sizes, data domains, transformer framework, and training parameters are used to ensure fair comparisons between the three transfer learning methods. The experiments involve training and testing models using all combinations of transfer learning methods, parent language pairs, and either in-domain or out-of-domain data, in an extensive study where different strengths and weaknesses are observed. The results show that multi-round transfer learning improves overall translation quality the most but also requires by far the longest training time. The parameter-freezing method provides a marginally smaller overall improvement in translation quality but requires only half the training time, while trivial transfer learning improves quality the least. Both Polish and Russian work well as parents for the Baltic languages, while web-crawled data improves out-of-domain translations the most. The results suggest that all three transfer learning methods are effective in a simulated low-resource environment; however, none of them can compete with simply having a larger data set for the target language pair, as none of them overcomes the strong higher-resource baseline.
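A small PyTorch sketch of the two simpler setups compared above: trivial transfer (initialize the child model from the parent's parameters and keep training) and parameter freezing (fix part of the transferred network during child training). The toy model, the choice of which parameters to freeze, and the assumption of a shared subword vocabulary are illustrative, not the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class ToyNMT(nn.Module):
    """Tiny stand-in for a transformer NMT model with a joint subword vocabulary."""
    def __init__(self, vocab=8000, d=256):
        super().__init__()
        self.src_emb = nn.Embedding(vocab, d)
        self.tgt_emb = nn.Embedding(vocab, d)
        self.transformer = nn.Transformer(d_model=d, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, src, tgt):
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt))
        return self.out(h)

# Pretend the parent model (e.g. Polish->English or Russian->English) is already trained.
parent = ToyNMT()

# Trivial transfer: initialise the child (Lithuanian/Latvian->English) model with
# the parent's parameters and simply continue training on the child data.
child = ToyNMT()
child.load_state_dict(parent.state_dict())

# Parameter freezing: additionally keep part of the transferred network fixed
# (here the encoder and source embeddings, as one possible choice) and update
# only the remaining parameters during child training.
for p in child.transformer.encoder.parameters():
    p.requires_grad = False
for p in child.src_emb.parameters():
    p.requires_grad = False

opt = torch.optim.Adam((p for p in child.parameters() if p.requires_grad), lr=5e-4)
# ...the usual NMT training loop on the Baltic child data would follow here...
```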
|
10 |
Exploring source languages for Faroese in single-source and multi-source transfer learning using language-specific and multilingual language models. Fischer, Kristóf. January 2024.
Cross-lingual transfer learning has been the driving force of low-resource natural language processing in recent years, relying on massively multilingual language models with hopes of solving the data scarcity issue for languages with a limited digital presence. However, this "one-size-fits-all" approach is not equally applicable to all low-resource languages, suggesting limitations of such models in cross-lingual transfer. Moreover, known similarities and phylogenetic relationships between source and target languages are often overlooked. In this work, the emphasis is placed on Faroese, a low-resource North Germanic language with several closely related resource-rich sibling languages. The cross-lingual transfer potential from these strong Scandinavian source candidates, as well as from additional genetically related, geographically proximate, and syntactically similar source languages, is studied in single-source and multi-source experiments on Faroese syntactic parsing and part-of-speech tagging. In addition, the effect of task-specific fine-tuning on monolingual, linguistically informed smaller multilingual, and massively multilingual pre-trained language models is explored. The results suggest Icelandic as a strong source candidate, but only when fine-tuning a monolingual model. With multilingual models, task-specific fine-tuning in Norwegian and Swedish seems even more beneficial. Although they do not surpass fully Scandinavian fine-tuning, models trained on genetically related and syntactically similar languages produce good results. Additionally, the findings indicate that multilingual models outperform models pre-trained on a single language, and that even better results can be achieved with a smaller, linguistically informed model than with a massively multilingual one.
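The single-source versus multi-source distinction studied here comes down to which source-language treebanks feed the fine-tuning data loader; the sketch below shows that difference with placeholder datasets standing in for the Icelandic, Norwegian, and Swedish treebanks (the tensors are random toy data, not UD annotations).

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def toy_treebank(n_sentences, seq_len=16, n_tags=17):
    """Stand-in for an encoded UD treebank (token ids and tag ids)."""
    return TensorDataset(torch.randint(0, 100, (n_sentences, seq_len)),
                         torch.randint(0, n_tags, (n_sentences, seq_len)))

icelandic, norwegian, swedish = toy_treebank(50), toy_treebank(80), toy_treebank(80)

# Single-source fine-tuning: one related source language.
single_source = DataLoader(icelandic, batch_size=8, shuffle=True)

# Multi-source fine-tuning: concatenate several related source treebanks.
multi_source = DataLoader(ConcatDataset([icelandic, norwegian, swedish]),
                          batch_size=8, shuffle=True)

# Fine-tuning then proceeds identically with either loader; Faroese data is used
# only for evaluation (or, optionally, for a final fine-tuning stage).
```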
|