1. Improving Multilingual Models for the Swedish Language: Exploring Cross-Lingual Transferability and Stereotypical Biases. Katsarou, Styliani, January 2021.
The best-performing Transformer-based Language Models are monolingual and mainly focus on high-resource languages such as English. In an attempt to extend their usage to more languages, multilingual models have been introduced. Nevertheless, multilingual models still underperform on a specific language when compared to a similarly sized monolingual model that has been trained solely on that language. The main objective of this thesis project is to explore how a multilingual model can be improved for Swedish, which is a low-resource language. We study whether a multilingual model can benefit from further pre-training on Swedish, or on a mix of English and Swedish text, before fine-tuning. Our results on the task of semantic text similarity show that further pre-training increases the Pearson correlation score by 5% for specific cross-lingual language settings. Taking into account the responsibilities that arise from the increased use of Language Models in real-world applications, we supplement our work with additional experiments that measure stereotypical biases associated with gender. We use a new dataset that we designed specifically for that purpose. Our systematic study compares Swedish to English as well as various model sizes. The insights from our exploration indicate that the Swedish language carries less gender-related bias than English, and that gender bias manifests more strongly in larger Language Models.
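The evaluation pipeline described above can be illustrated with a short sketch: encode each sentence of an English-Swedish pair with a multilingual encoder, score the pair by cosine similarity, and report the Pearson correlation against gold similarity judgements. The model name, the mean-pooling strategy, and the toy sentence pairs are assumptions for illustration, not details taken from the thesis.

```python
import torch
from scipy.stats import pearsonr
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # assumed multilingual encoder, not necessarily the thesis's exact model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def embed(sentences):
    """Mean-pool the last hidden states into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def sts_pearson(sentence_pairs, gold_scores):
    """Cosine-similarity predictions vs. gold similarity scores, as a Pearson correlation."""
    left = embed([a for a, _ in sentence_pairs])
    right = embed([b for _, b in sentence_pairs])
    preds = torch.nn.functional.cosine_similarity(left, right).tolist()
    return pearsonr(preds, gold_scores)[0]

# Toy cross-lingual (English-Swedish) pairs with made-up gold scores on a 0-5 scale.
pairs = [("A man is playing guitar.", "En man spelar gitarr."),
         ("The weather is cold today.", "Det är kallt ute idag."),
         ("A child is reading a book.", "Katten sover på soffan.")]
print(f"Pearson correlation: {sts_pearson(pairs, [5.0, 4.5, 0.5]):.3f}")
```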
2. Exploring Cross-Lingual Transfer Learning for Swedish Named Entity Recognition: Fine-tuning of English and Multilingual Pre-trained Models. Lai Wikström, Daniel; Sparr, Axel, January 2023.
Named Entity Recognition (NER) is a critical task in Natural Language Processing (NLP), and recent advances in language model pre-training have significantly improved its performance. However, this improvement is not universally applicable, due to a lack of large pre-training datasets or computational budget for smaller languages. This study explores the viability of fine-tuning an English and a multilingual model on a Swedish NER task, compared to a model trained solely on Swedish. We trained these models and measured their performance with the F1-score. Despite fine-tuning, the Swedish model outperformed both the English and multilingual models, by 3.0 and 9.0 percentage points respectively. The performance gap between the English and Swedish models decreased from 19.8 to 9.0 percentage points during fine-tuning. This suggests that while the Swedish model achieved the best performance, fine-tuning can substantially enhance the performance of English and multilingual models on Swedish NER tasks.
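As a concrete illustration of the metric used in this comparison, the short sketch below computes entity-level F1 over BIO-tagged sequences with the seqeval library; the tag sequences are toy examples, not data from the study.

```python
from seqeval.metrics import classification_report, f1_score

# Gold and predicted BIO tags for two hypothetical Swedish sentences.
y_true = [["B-PER", "I-PER", "O", "B-LOC"],
          ["O", "B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],
          ["O", "B-ORG", "I-ORG", "O"]]

# f1_score counts an entity as correct only if both its span and its type match.
print(f"F1: {f1_score(y_true, y_pred):.3f}")
print(classification_report(y_true, y_pred))
```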
3. Monolingual and Cross-Lingual Survey Response Annotation. Zhao, Yahui, January 2023.
Multilingual natural language processing (NLP) is increasingly recognized for its potential in processing diverse types of text data, including social media posts, reviews, and technical reports. Multilingual language models like mBERT and XLM-RoBERTa (XLM-R) play a pivotal role in multilingual NLP. Notwithstanding their capabilities, the performance of these models largely relies on the availability of annotated training data. This thesis employs the multilingual pre-trained model XLM-R and examines its efficacy in sequence labelling of responses to open-ended questions on democracy across multilingual surveys. Traditional annotation practices have been labour-intensive and time-consuming, with limited attempts at automation. Previous studies often translated multilingual data into English, bypassing the challenges and nuances of the native languages. Our study explores automatic multilingual annotation at the token level for democracy survey responses in five languages: Hungarian, Italian, Polish, Russian, and Spanish. The results reveal promising F1 scores, indicating the feasibility of using multilingual models for such tasks. However, the performance of these models is closely tied to the quality and nature of the training set. This research paves the way for future experiments and model adjustments, underscoring the importance of refining training data and optimizing modelling techniques for enhanced classification accuracy.
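One practical detail behind token-level annotation with XLM-R is that its subword tokenizer splits words into several pieces, so word-level labels have to be aligned to subwords before training a token-classification head. The sketch below shows one common alignment scheme; the tag set and the Spanish example are hypothetical and not taken from the thesis.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def align_labels(words, word_labels, label2id, ignore_index=-100):
    """Tokenize pre-split words and copy each word's label to its first subword only."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None:                 # special tokens (<s>, </s>)
            labels.append(ignore_index)
        elif word_id != previous:           # first subword of a word carries the label
            labels.append(label2id[word_labels[word_id]])
        else:                               # continuation subwords are ignored in the loss
            labels.append(ignore_index)
        previous = word_id
    enc["labels"] = labels
    return enc

label2id = {"O": 0, "B-ARG": 1, "I-ARG": 2}   # hypothetical tag set
example = align_labels(["La", "democracia", "protege", "derechos"],
                       ["O", "B-ARG", "O", "B-ARG"], label2id)
print(example["labels"])
```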
4. Task-agnostic knowledge distillation of mBERT to Swedish. Kina, Added, January 2022.
Large transformer models have shown great performance on multiple natural language processing tasks. However, slow inference, a strong dependency on powerful hardware, and large energy consumption limit their availability. Furthermore, the best-performing models use high-resource languages such as English, which increases the difficulty of using these models for low-resource languages. Research into compressing large transformer models, using methods such as knowledge distillation, has been successful. In this thesis, an existing task-agnostic knowledge distillation method is applied with Swedish data to distil mBERT models that have been further pre-trained on different amounts of Swedish data, in order to obtain a smaller multilingual model whose performance in Swedish is competitive with a monolingual student-model baseline. It is shown that none of the models distilled from a multilingual model outperforms the distilled Swedish monolingual model on Swedish named entity recognition or on Swedish-translated natural language understanding benchmark tasks. It is also shown that further pre-training mBERT does not significantly affect the performance of the multilingual teacher or student models on downstream tasks. The results corroborate previously published findings showing that no student model outperforms its teacher.
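For readers unfamiliar with the technique, the sketch below shows the soft-target objective that typically drives knowledge distillation: the student is trained to match the teacher's temperature-softened output distribution. It illustrates the general idea only and is not the exact task-agnostic recipe used in the thesis.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy usage with random logits over a small vocabulary.
student = torch.randn(4, 10, requires_grad=True)   # (batch, vocab)
teacher = torch.randn(4, 10)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```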
5. Log-linear Models and Search Space Constraints in Statistical Machine Translation and Cross-lingual Transfer. Pécheux, Nicolas, 27 September 2016.
Most natural language processing tasks are modeled as prediction problems in which one aims at finding the best-scoring hypothesis from a very large pool of possible outputs. Even if algorithms are designed to leverage some kind of structure, the output space is often too large to be searched exhaustively. This work aims at understanding the importance of the search space and the possible use of constraints to reduce it in size and complexity. We report in this thesis three case studies which highlight the risks and benefits of manipulating the search space in learning and inference. When information about the possible outputs of a sequence labeling task is available, it may seem appropriate to incorporate this knowledge into the system, so as to facilitate and speed up learning and inference. A case study on type constraints for CRFs shows, however, that using such constraints at training time is likely to drastically reduce performance, even when these constraints are both correct and useful at decoding time. On the other hand, we also consider possible relaxations of the supervision space, as in the case of learning with latent variables, or when only partial supervision is available, which we cast as ambiguous learning. Such weakly supervised methods, together with cross-lingual transfer and dictionary crawling techniques, allow us to develop natural language processing tools for under-resourced languages. Word order differences between languages pose several combinatorial challenges to machine translation, and the constraints on word reorderings have a great impact on the set of potential translations that is explored during search. We study reordering constraints that restrict the factorial space of permutations and explore the impact of the reordering search space design on machine translation performance. However, we show that even though it might be desirable to design better reordering spaces, model and search errors still seem to be the most important issues.
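The effect of reordering constraints on the size of the search space can be made concrete with a toy computation: the sketch below counts the permutations of n source positions that respect a simple distortion limit and compares them with the unconstrained factorial space. The distortion-limit constraint is a standard textbook example, not necessarily one of the constraint sets studied in the thesis.

```python
from itertools import permutations
from math import factorial

def constrained_reorderings(n, d):
    """Count permutations of range(n) in which every word moves at most d positions."""
    return sum(
        1
        for perm in permutations(range(n))
        if all(abs(new_pos - old_pos) <= d for new_pos, old_pos in enumerate(perm))
    )

for n in range(2, 9):
    print(f"n={n}: unconstrained {factorial(n):>6} permutations, "
          f"distortion limit 2 keeps {constrained_reorderings(n, 2)}")
```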
6. Training parsers for low-resourced languages: improving cross-lingual transfer with monolingual knowledge. Aufrant, Lauriane, 6 April 2018.
As a result of the recent blossoming of machine learning techniques, the natural language processing field faces an increasingly thorny bottleneck: the most efficient algorithms entirely rely on the availability of large amounts of training data. These technological advances consequently remain unavailable for the 7,000 languages in the world, most of which are low-resourced. One way to bypass this limitation is the approach of cross-lingual transfer, whereby resources available in another (source) language are leveraged to help build accurate systems in the desired (target) language. However, despite promising results in research settings, standard transfer techniques lack the flexibility regarding cross-lingual resources that is needed to be fully usable in real-world scenarios, such as exploiting very sparse resources or assorted arrays of resources. This limitation strongly diminishes the applicability of the approach. This thesis consequently proposes to combine multiple sources and resources for transfer, with an emphasis on selectivity: can we estimate which resource of which language is useful for which input? This strategy is put into practice in the framework of transition-based dependency parsing. To this end, a new transfer framework is designed, with a cascading architecture: it enables the desired combination while ensuring better-targeted exploitation of each resource, down to the level of the word. Empirical evaluation indeed dampens the enthusiasm for the purely cross-lingual approach (it generally remains preferable to annotate just a few target sentences), but also highlights its complementarity with other approaches. Several metrics are developed to characterize precisely cross-lingual similarities, syntactic idiosyncrasies, and the added value of cross-lingual information compared to monolingual training. The substantial benefits of typological knowledge are also explored. The whole study relies on a series of technical improvements to the parsing framework: this work includes the release of new open-source software, PanParser, which revisits the so-called dynamic oracles to extend their use cases. Several purely monolingual contributions complete this work, including an exploration of monolingual cascading, which offers promising perspectives with easy-then-hard strategies.
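As background for the parsing setting, the sketch below implements a bare-bones greedy arc-standard transition system, the family of algorithms to which transition-based dependency parsing (and hence PanParser) belongs. It is a generic illustration, not code from PanParser, and the scoring function is a placeholder where a trained classifier would normally sit.

```python
def parse(words, score):
    """Greedy arc-standard parsing; returns a dict mapping word position -> head (0 = root)."""
    buffer = list(range(1, len(words) + 1))   # 1-based positions of the input words
    stack = [0]                                # position 0 is the artificial root
    heads = {}
    while buffer or len(stack) > 1:
        legal = []
        if buffer:
            legal.append("SHIFT")
        if len(stack) > 2:                     # LEFT-ARC must not make the root a dependent
            legal.append("LEFT-ARC")
        if len(stack) > 1:
            legal.append("RIGHT-ARC")
        action = max(legal, key=lambda a: score(stack, buffer, a))
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":             # second-from-top becomes a dependent of the top
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        else:                                  # RIGHT-ARC: top becomes a dependent of second-from-top
            dependent = stack.pop()
            heads[dependent] = stack[-1]
    return heads

# Placeholder scorer standing in for a trained classifier: always prefers attaching over shifting.
toy_score = lambda stack, buffer, action: {"SHIFT": 0.0, "LEFT-ARC": 1.0, "RIGHT-ARC": 2.0}[action]
print(parse(["She", "reads", "books"], toy_score))   # {1: 0, 2: 0, 3: 0} with this toy scorer
```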
7. Zero-Shot Cross-Lingual Domain Adaptation for Neural Machine Translation: Exploring the Interplay Between Language and Domain Transferability. Shahnazaryan, Lia, January 2024.
Within the field of neural machine translation (NMT), transfer learning and domain adaptation techniques have emerged as central solutions to overcome the data scarcity challenges faced by low-resource languages and specialized domains. This thesis explores the potential of zero-shot cross-lingual domain adaptation, which integrates principles of transfer learning across languages and domain adaptation. By fine-tuning a multilingual pre-trained NMT model on domain-specific data from one language pair, the aim is to capture domain-specific knowledge and transfer it to target languages within the same domain, enabling effective zero-shot cross-lingual domain transfer. This study conducts a series of comprehensive experiments across both specialized and mixed domains to explore the feasibility and influencing factors of zero-shot cross-lingual domain adaptation. The results indicate that fine-tuned models generally outperform the pre-trained baseline in specialized domains and most target languages. However, the extent of improvement depends on the linguistic complexity of the domain, as well as the transferability potential driven by the linguistic similarity between the pivot and target languages. Additionally, the study examines zero-shot cross-lingual cross-domain transfer, where models fine-tuned on mixed domains are evaluated on specialized domains. The results reveal that while cross-domain transfer is feasible, its effectiveness depends on the characteristics of the pivot and target domains, with domains exhibiting more consistent language being more responsive to cross-domain transfer. By examining the interplay between language-specific and domain-specific factors, the research explores the dynamics influencing zero-shot cross-lingual domain adaptation, highlighting the significant role played by both linguistic relatedness and domain characteristics in determining the transferability potential.
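The zero-shot evaluation step described above can be sketched as follows: a multilingual NMT model, after fine-tuning on in-domain data for one pivot pair, is asked to translate the same domain into an unseen target language simply by switching the target-language token. The model name, language pair, and example sentence are assumptions for illustration, and the fine-tuning loop itself is omitted.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

MODEL = "facebook/m2m100_418M"   # assumed multilingual NMT backbone, not necessarily the thesis's model
tokenizer = M2M100Tokenizer.from_pretrained(MODEL)
model = M2M100ForConditionalGeneration.from_pretrained(MODEL)
# ... fine-tune `model` on in-domain English-German pairs here (omitted) ...

def translate(text, src_lang, tgt_lang):
    """Translate into `tgt_lang`; only the forced target-language token changes per target."""
    tokenizer.src_lang = src_lang
    batch = tokenizer(text, return_tensors="pt")
    generated = model.generate(**batch,
                               forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

medical_src = "The patient was administered 5 mg of the drug twice daily."   # hypothetical example
print(translate(medical_src, "en", "de"))   # supervised (pivot) direction after fine-tuning
print(translate(medical_src, "en", "sv"))   # zero-shot cross-lingual domain transfer
```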