• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 28
  • 2
  • 2
  • 2
  • Tagged with
  • 40
  • 40
  • 19
  • 14
  • 13
  • 12
  • 10
  • 10
  • 10
  • 9
  • 9
  • 9
  • 7
  • 7
  • 7
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
31

Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision

Täckström, Oscar January 2013 (has links)
Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties. The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings. Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared to both unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a wide number of target languages, in the setting where no annotated training data is available in the target language.
32

Koreference z mezijazykové perspektivy / Coreference from the Cross-lingual Perspective

Novák, Michal January 2018 (has links)
Coreference from the Cross-lingual Perspective Michal Nov'ak The subject of this thesis is to study properties of coreference using cross- lingual approaches. The work is motivated by the research on coreference-related linguistic typology. Another motivation is to explore whether differences in the ways how languages express coreference can be exploited to build better models for coreference resolution. We design two cross-lingual methods: the bilingually informed coreference resolution and the coreference projection. The results of our experiments with the methods carried out on Czech-English data suggest that with respect to coreference English is more informative for Czech than vice versa. Furthermore, the bilingually informed resolution applied on parallel texts has managed to outperform the monolingual resolver on both languages. In the experiments, we employ the monolingual coreference resolver and an improved method for alignment of coreferential expressions, both of which we also designed within the thesis. 1
33

Modèles exponentiels et contraintes sur les espaces de recherche en traduction automatique et pour le transfert cross-lingue / Log-linear Models and Search Space Constraints in Statistical Machine Translation and Cross-lingual Transfer

Pécheux, Nicolas 27 September 2016 (has links)
La plupart des méthodes de traitement automatique des langues (TAL) peuvent être formalisées comme des problèmes de prédiction, dans lesquels on cherche à choisir automatiquement l'hypothèse la plus plausible parmi un très grand nombre de candidats. Malgré de nombreux travaux qui ont permis de mieux prendre en compte la structure de l'ensemble des hypothèses, la taille de l'espace de recherche est généralement trop grande pour permettre son exploration exhaustive. Dans ce travail, nous nous intéressons à l'importance du design de l'espace de recherche et étudions l'utilisation de contraintes pour en réduire la taille et la complexité. Nous nous appuyons sur l'étude de trois problèmes linguistiques — l'analyse morpho-syntaxique, le transfert cross-lingue et le problème du réordonnancement en traduction — pour mettre en lumière les risques, les avantages et les enjeux du choix de l'espace de recherche dans les problèmes de TAL.Par exemple, lorsque l'on dispose d'informations a priori sur les sorties possibles d'un problème d'apprentissage structuré, il semble naturel de les inclure dans le processus de modélisation pour réduire l'espace de recherche et ainsi permettre une accélération des traitements lors de la phase d'apprentissage. Une étude de cas sur les modèles exponentiels pour l'analyse morpho-syntaxique montre paradoxalement que cela peut conduire à d'importantes dégradations des résultats, et cela même quand les contraintes associées sont pertinentes. Parallèlement, nous considérons l'utilisation de ce type de contraintes pour généraliser le problème de l'apprentissage supervisé au cas où l'on ne dispose que d'informations partielles et incomplètes lors de l'apprentissage, qui apparaît par exemple lors du transfert cross-lingue d'annotations. Nous étudions deux méthodes d'apprentissage faiblement supervisé, que nous formalisons dans le cadre de l'apprentissage ambigu, appliquées à l'analyse morpho-syntaxiques de langues peu dotées en ressources linguistiques.Enfin, nous nous intéressons au design de l'espace de recherche en traduction automatique. Les divergences dans l'ordre des mots lors du processus de traduction posent un problème combinatoire difficile. En effet, il n'est pas possible de considérer l'ensemble factoriel de tous les réordonnancements possibles, et des contraintes sur les permutations s'avèrent nécessaires. Nous comparons différents jeux de contraintes et explorons l'importance de l'espace de réordonnancement dans les performances globales d'un système de traduction. Si un meilleur design permet d'obtenir de meilleurs résultats, nous montrons cependant que la marge d'amélioration se situe principalement dans l'évaluation des réordonnancements plutôt que dans la qualité de l'espace de recherche. / Most natural language processing tasks are modeled as prediction problems where one aims at finding the best scoring hypothesis from a very large pool of possible outputs. Even if algorithms are designed to leverage some kind of structure, the output space is often too large to be searched exaustively. This work aims at understanding the importance of the search space and the possible use of constraints to reduce it in size and complexity. We report in this thesis three case studies which highlight the risk and benefits of manipulating the seach space in learning and inference.When information about the possible outputs of a sequence labeling task is available, it may seem appropriate to include this knowledge into the system, so as to facilitate and speed-up learning and inference. A case study on type constraints for CRFs however shows that using such constraints at training time is likely to drastically reduce performance, even when these constraints are both correct and useful at decoding.On the other side, we also consider possible relaxations of the supervision space, as in the case of learning with latent variables, or when only partial supervision is available, which we cast as ambiguous learning. Such weakly supervised methods, together with cross-lingual transfer and dictionary crawling techniques, allow us to develop natural language processing tools for under-resourced languages. Word order differences between languages pose several combinatorial challenges to machine translation and the constraints on word reorderings have a great impact on the set of potential translations that is explored during search. We study reordering constraints that allow to restrict the factorial space of permutations and explore the impact of the reordering search space design on machine translation performance. However, we show that even though it might be desirable to design better reordering spaces, model and search errors seem yet to be the most important issues.
34

Training parsers for low-resourced languages : improving cross-lingual transfer with monolingual knowledge / Apprentissage d'analyseurs syntaxiques pour les langues peu dotées : amélioration du transfert cross-lingue grâce à des connaissances monolingues

Aufrant, Lauriane 06 April 2018 (has links)
Le récent essor des algorithmes d'apprentissage automatique a rendu les méthodes de Traitement Automatique des Langues d'autant plus sensibles à leur facteur le plus limitant : la qualité des systèmes repose entièrement sur la disponibilité de grandes quantités de données, ce qui n'est pourtant le cas que d'une minorité parmi les 7.000 langues existant au monde. La stratégie dite du transfert cross-lingue permet de contourner cette limitation : une langue peu dotée en ressources (la cible) peut être traitée en exploitant les ressources disponibles dans une autre langue (la source). Les progrès accomplis sur ce plan se limitent néanmoins à des scénarios idéalisés, avec des ressources cross-lingues prédéfinies et de bonne qualité, de sorte que le transfert reste inapplicable aux cas réels de langues peu dotées, qui n'ont pas ces garanties. Cette thèse vise donc à tirer parti d'une multitude de sources et ressources cross-lingues, en opérant une combinaison sélective : il s'agit d'évaluer, pour chaque aspect du traitement cible, la pertinence de chaque ressource. L'étude est menée en utilisant l'analyse en dépendance par transition comme cadre applicatif. Le cœur de ce travail est l'élaboration d'un nouveau méta-algorithme de transfert, dont l'architecture en cascade permet la combinaison fine des diverses ressources, en ciblant leur exploitation à l'échelle du mot. L'approche cross-lingue pure n'étant en l'état pas compétitive avec la simple annotation de quelques phrases cibles, c'est avant tout la complémentarité de ces méthodes que souligne l'analyse empirique. Une série de nouvelles métriques permet une caractérisation fine des similarités cross-lingues et des spécificités syntaxiques de chaque langue, de même que de la valeur ajoutée de l'information cross-lingue par rapport au cadre monolingue. L'exploitation d'informations typologiques s'avère également particulièrement fructueuse. Ces contributions reposent largement sur des innovations techniques en analyse syntaxique, concrétisées par la publication en open source du logiciel PanParser, qui exploite et généralise la méthode dite des oracles dynamiques. Cette thèse contribue sur le plan monolingue à plusieurs autres égards, comme le concept de cascades monolingues, pouvant traiter par exemple d'abord toutes les dépendances faciles, puis seulement les difficiles. / As a result of the recent blossoming of Machine Learning techniques, the Natural Language Processing field faces an increasingly thorny bottleneck: the most efficient algorithms entirely rely on the availability of large training data. These technological advances remain consequently unavailable for the 7,000 languages in the world, out of which most are low-resourced. One way to bypass this limitation is the approach of cross-lingual transfer, whereby resources available in another (source) language are leveraged to help building accurate systems in the desired (target) language. However, despite promising results in research settings, the standard transfer techniques lack the flexibility regarding cross-lingual resources needed to be fully usable in real-world scenarios: exploiting very sparse resources, or assorted arrays of resources. This limitation strongly diminishes the applicability of that approach. This thesis consequently proposes to combine multiple sources and resources for transfer, with an emphasis on selectivity: can we estimate which resource of which language is useful for which input? This strategy is put into practice in the frame of transition-based dependency parsing. To this end, a new transfer framework is designed, with a cascading architecture: it enables the desired combination, while ensuring better targeted exploitation of each resource, down to the level of the word. Empirical evaluation dampens indeed the enthusiasm for the purely cross-lingual approach -- it remains in general preferable to annotate just a few target sentences -- but also highlights its complementarity with other approaches. Several metrics are developed to characterize precisely cross-lingual similarities, syntactic idiosyncrasies, and the added value of cross-lingual information compared to monolingual training. The substantial benefits of typological knowledge are also explored. The whole study relies on a series of technical improvements regarding the parsing framework: this work includes the release of a new open source software, PanParser, which revisits the so-called dynamic oracles to extend their use cases. Several purely monolingual contributions complete this work, including an exploration of monolingual cascading, which offers promising perspectives with easy-then-hard strategies.
35

Cross-Lingual and Genre-Supervised Parsing and Tagging for Low-Resource Spoken Data

Fosteri, Iliana January 2023 (has links)
Dealing with low-resource languages is a challenging task, because of the absence of sufficient data to train machine-learning models to make predictions on these languages. One way to deal with this problem is to use data from higher-resource languages, which enables the transfer of learning from these languages to the low-resource target ones. The present study focuses on dependency parsing and part-of-speech tagging of low-resource languages belonging to the spoken genre, i.e., languages whose treebank data is transcribed speech. These are the following: Beja, Chukchi, Komi-Zyrian, Frisian-Dutch, and Cantonese. Our approach involves investigating different types of transfer languages, employing MACHAMP, a state-of-the-art parser and tagger that uses contextualized word embeddings, mBERT, and XLM-R in particular. The main idea is to explore how the genre, the language similarity, none of the two, or the combination of those affect the model performance in the aforementioned downstream tasks for our selected target treebanks. Our findings suggest that in order to capture speech-specific dependency relations, we need to incorporate at least a few genre-matching source data, while language similarity-matching source data are a better candidate when the task at hand is part-of-speech tagging. We also explore the impact of multi-task learning in one of our proposed methods, but we observe minor differences in the model performance.
36

Multilingual Speech Emotion Recognition using pretrained models powered by Self-Supervised Learning / Flerspråkig känsloigenkänning från tal med hjälp av förtränade tal-modeller baserat på själv-övervakad Inlärning

Luthman, Felix January 2022 (has links)
Society is based on communication, for which speech is the most prevalent medium. In day to day interactions we talk to each other, but it is not only the words spoken that matters, but the emotional delivery as well. Extracting emotion from speech has therefore become a topic of research in the area of speech tasks. This area as a whole has in recent years adopted a Self- Supervised Learning approach for learning speech representations from raw speech audio, without the need for any supplementary labelling. These speech representations can be leveraged for solving tasks limited by the availability of annotated data, be it for low-resource language, or a general lack of data for the task itself. This thesis aims to evaluate the performances of a set of pre-trained speech models by fine-tuning them in different multilingual environments, and evaluating their performance thereafter. The model presented in this paper is based on wav2vec 2.0 and manages to correctly classify 86.58% of samples over eight different languages and four emotional classes when trained on those same languages. Experiments were conducted to garner how well a model trained on seven languages would perform on the one left out, which showed that there is quite a large margin of similarity in how different cultures express vocal emotions, and further investigations showed that as little as just a few minutes of in-domain data is able to increase the performance substantially. This shows promising results even for niche languages, as the amount of available data may not be as large of a hurdle as one might think. With that said, increasing the amount of data from minutes to hours does still garner substantial improvements, albeit to a lesser degree. / Hela vårt samhälle är byggt på kommunikation mellan olika människor, varav tal är det vanligaste mediet. På en daglig basis interagerar vi genom att prata med varandra, men det är inte bara orden som förmedlar våra intentioner, utan även hur vi uttrycker dem. Till exempel kan samma mening ge helt olika intryck beroende på ifall den sägs med ett argt eller glatt tonfall. Talbaserad forskning är ett stort vetenskapligt område i vilket talbaserad känsloigenkänning vuxit fram. Detta stora tal-område har under de senaste åren sett en tendens att utnyttja en teknik kallad själv-övervakad inlärning för att utnyttja omärkt ljuddata för att lära sig generella språkrepresentationer, vilket kan liknas vid att lära sig strukturen av tal. Dessa representationer, eller förtränade modeller, kan sedan utnyttjas som en bas för att lösa problem med begränsad tillgång till märkt data, vilket kan vara fallet för sällsynta språk eller unika uppgifter. Målet med denna rapport är att utvärdera olika applikationer av denna representations inlärning i en flerspråkig miljö genom att finjustera förtränade modeller för känsloigenkänning. I detta syfte presenterar vi en modell baserad på wav2vec 2.0 som lyckas klassifiera 86.58% av ljudklipp tagna från åtta olika språk över fyra olika känslo-klasser, efter att modellen tränats på dessa språk. För att avgöra hur bra en modell kan klassifiera data från ett språk den inte tränats på skapades modeller tränade på sju språk, och evaluerades sedan på det språk som var kvar. Dessa experiment visar att sättet vi uttrycker känslor mellan olika kulturer är tillräckligt lika för att modellen ska prestera acceptabelt även i det fall då modellen inte sett språket under träningsfasen. Den sista undersökningen utforskar hur olika mängd data från ett språk påverkar prestandan på det språket, och visar att så lite som endast ett par minuter data kan förbättra resultet nämnvärt, vilket är lovande för att utvidga modellen för fler språk i framtiden. Med det sagt är ytterligare data att föredra, då detta medför fortsatta förbättringar, om än i en lägre grad.
37

Zero-Shot Cross-Lingual Domain Adaptation for Neural Machine Translation : Exploring The Interplay Between Language And Domain Transferability

Shahnazaryan, Lia January 2024 (has links)
Within the field of neural machine translation (NMT), transfer learning and domain adaptation techniques have emerged as central solutions to overcome the data scarcity challenges faced by low-resource languages and specialized domains. This thesis explores the potential of zero-shot cross-lingual domain adaptation, which integrates principles of transfer learning across languages and domain adaptation. By fine-tuning a multilingual pre-trained NMT model on domain-specific data from one language pair, the aim is to capture domain-specific knowledge and transfer it to target languages within the same domain, enabling effective zero-shot cross-lingual domain transfer. This study conducts a series of comprehensive experiments across both specialized and mixed domains to explore the feasibility and influencing factors of zero-shot cross-lingual domain adaptation. The results indicate that fine-tuned models generally outperform the pre-trained baseline in specialized domains and most target languages. However, the extent of improvement depends on the linguistic complexity of the domain, as well as the transferability potential driven by the linguistic similarity between the pivot and target languages. Additionally, the study examines zero-shot cross-lingual cross-domain transfer, where models fine-tuned on mixed domains are evaluated on specialized domains. The results reveal that while cross-domain transfer is feasible, its effectiveness depends on the characteristics of the pivot and target domains, with domains exhibiting more consistent language being more responsive to cross-domain transfer. By examining the interplay between language-specific and domain-specific factors, the research explores the dynamics influencing zero-shot cross-lingual domain adaptation, highlighting the significant role played by both linguistic relatedness and domain characteristics in determining the transferability potential.
38

Large-Context Question Answering with Cross-Lingual Transfer

Sagen, Markus January 2021 (has links)
Models based around the transformer architecture have become one of the most prominent for solving a multitude of natural language processing (NLP)tasks since its introduction in 2017. However, much research related to the transformer model has focused primarily on achieving high performance and many problems remain unsolved. Two of the most prominent currently are the lack of high performing non-English pre-trained models, and the limited number of words most trained models can incorporate for their context. Solving these problems would make NLP models more suitable for real-world applications, improving information retrieval, reading comprehension, and more. All previous research has focused on incorporating long-context for English language models. This thesis investigates the cross-lingual transferability between languages when only training for long-context in English. Training long-context models in English only could make long-context in low-resource languages, such as Swedish, more accessible since it is hard to find such data in most languages and costly to train for each language. This could become an efficient method for creating long-context models in other languages without the need for such data in all languages or pre-training from scratch. We extend the models’ context using the training scheme of the Longformer architecture and fine-tune on a question-answering task in several languages. Our evaluation could not satisfactorily confirm nor deny if transferring long-term context is possible for low-resource languages. We believe that using datasets that require long-context reasoning, such as a multilingual TriviaQAdataset, could demonstrate our hypothesis’s validity.
39

Measuring Semantic Distance using Distributional Profiles of Concepts

Mohammad, Saif 01 August 2008 (has links)
Semantic distance is a measure of how close or distant in meaning two units of language are. A large number of important natural language problems, including machine translation and word sense disambiguation, can be viewed as semantic distance problems. The two dominant approaches to estimating semantic distance are the WordNet-based semantic measures and the corpus-based distributional measures. In this thesis, I compare them, both qualitatively and quantitatively, and identify the limitations of each. This thesis argues that estimating semantic distance is essentially a property of concepts (rather than words) and that two concepts are semantically close if they occur in similar contexts. Instead of identifying the co-occurrence (distributional) profiles of words (distributional hypothesis), I argue that distributional profiles of concepts (DPCs) can be used to infer the semantic properties of concepts and indeed to estimate semantic distance more accurately. I propose a new hybrid approach to calculating semantic distance that combines corpus statistics and a published thesaurus (Macquarie Thesaurus). The algorithm determines estimates of the DPCs using the categories in the thesaurus as very coarse concepts and, notably, without requiring any sense-annotated data. Even though the use of only about 1000 concepts to represent the vocabulary of a language seems drastic, I show that the method achieves results better than the state-of-the-art in a number of natural language tasks. I show how cross-lingual DPCs can be created by combining text in one language with a thesaurus from another. Using these cross-lingual DPCs, we can solve problems in one, possibly resource-poor, language using a knowledge source from another, possibly resource-rich, language. I show that the approach is also useful in tasks that inherently involve two or more languages, such as machine translation and multilingual text summarization. The proposed approach is computationally inexpensive, it can estimate both semantic relatedness and semantic similarity, and it can be applied to all parts of speech. Extensive experiments on ranking word pairs as per semantic distance, real-word spelling correction, solving Reader's Digest word choice problems, determining word sense dominance, word sense disambiguation, and word translation show that the new approach is markedly superior to previous ones.
40

Measuring Semantic Distance using Distributional Profiles of Concepts

Mohammad, Saif 01 August 2008 (has links)
Semantic distance is a measure of how close or distant in meaning two units of language are. A large number of important natural language problems, including machine translation and word sense disambiguation, can be viewed as semantic distance problems. The two dominant approaches to estimating semantic distance are the WordNet-based semantic measures and the corpus-based distributional measures. In this thesis, I compare them, both qualitatively and quantitatively, and identify the limitations of each. This thesis argues that estimating semantic distance is essentially a property of concepts (rather than words) and that two concepts are semantically close if they occur in similar contexts. Instead of identifying the co-occurrence (distributional) profiles of words (distributional hypothesis), I argue that distributional profiles of concepts (DPCs) can be used to infer the semantic properties of concepts and indeed to estimate semantic distance more accurately. I propose a new hybrid approach to calculating semantic distance that combines corpus statistics and a published thesaurus (Macquarie Thesaurus). The algorithm determines estimates of the DPCs using the categories in the thesaurus as very coarse concepts and, notably, without requiring any sense-annotated data. Even though the use of only about 1000 concepts to represent the vocabulary of a language seems drastic, I show that the method achieves results better than the state-of-the-art in a number of natural language tasks. I show how cross-lingual DPCs can be created by combining text in one language with a thesaurus from another. Using these cross-lingual DPCs, we can solve problems in one, possibly resource-poor, language using a knowledge source from another, possibly resource-rich, language. I show that the approach is also useful in tasks that inherently involve two or more languages, such as machine translation and multilingual text summarization. The proposed approach is computationally inexpensive, it can estimate both semantic relatedness and semantic similarity, and it can be applied to all parts of speech. Extensive experiments on ranking word pairs as per semantic distance, real-word spelling correction, solving Reader's Digest word choice problems, determining word sense dominance, word sense disambiguation, and word translation show that the new approach is markedly superior to previous ones.

Page generated in 0.0757 seconds