  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
111

Unsupervised Morphological Segmentation and Part-of-Speech Tagging for Low-Resource Scenarios

Eskander, Ramy January 2021
With the high cost of manually labeling data and the increasing interest in low-resource languages, for which human annotators might not even be available, unsupervised approaches have become essential for processing a typologically diverse set of languages, whether high-resource or low-resource. In this work, we propose new fully unsupervised approaches for two tasks in morphology: unsupervised morphological segmentation and unsupervised cross-lingual part-of-speech (POS) tagging, which have been two essential subtasks for several downstream NLP applications, such as machine translation, speech recognition, information extraction and question answering. We propose a new unsupervised morphological-segmentation approach that utilizes Adaptor Grammars (AGs), nonparametric Bayesian models that generalize probabilistic context-free grammars (PCFGs), where a PCFG models word structure in the task of morphological segmentation. We implement the approach as a publicly available morphological-segmentation framework, MorphAGram, that enables unsupervised morphological segmentation through the use of several proposed language-independent grammars. In addition, the framework allows for the use of scholar knowledge, when available, in the form of affixes that can be seeded into the grammars. The framework handles cases where the scholar-seeded knowledge is either generated from language resources, possibly by someone who does not know the language, as weak linguistic priors, or generated by an expert in the underlying language as strong linguistic priors. Another form of linguistic priors is the design of a grammar that models language-dependent specifications. We also propose a fully unsupervised learning setting that approximates the effect of scholar-seeded knowledge through self-training. 
Moreover, since there is no single grammar that works best across all languages, we propose an approach that picks a nearly optimal configuration (a learning setting and a grammar) for an unseen language, that is, a language that is not part of development. Finally, we examine multilingual learning for unsupervised morphological segmentation in low-resource setups. For unsupervised POS tagging, two cross-lingual approaches have been widely adopted: 1) annotation projection, where POS annotations are projected across an aligned parallel text from a source language for which a POS tagger is accessible to the target one prior to training a POS model; and 2) zero-shot model transfer, where a model of a source language is applied directly to texts in the target language. We propose an end-to-end architecture for unsupervised cross-lingual POS tagging via annotation projection in truly low-resource scenarios that do not assume access to parallel corpora that are large in size or represent a specific domain. We integrate and expand the best practices in alignment and projection and design a rich neural architecture that exploits non-contextualized and transformer-based contextualized word embeddings, affix embeddings and word-cluster embeddings. Additionally, since parallel data might be available between the target language and multiple source ones, as in the case of the Bible, we propose different approaches for learning from multiple sources. Finally, we combine our work on unsupervised morphological segmentation and unsupervised cross-lingual POS tagging by conducting unsupervised stem-based cross-lingual POS tagging via annotation projection. This approach relies on the stem as the core unit of abstraction for alignment and projection, which benefits low-resource, morphologically complex languages. 
We also examine morpheme-based alignment and projection, the use of linguistic priors towards better POS models and the use of segmentation information as learning features in the neural architecture. We conduct comprehensive evaluation and analysis to assess the performance of our approaches to unsupervised morphological segmentation and unsupervised POS tagging and show that they achieve state-of-the-art performance for the two morphology tasks when evaluated on a large set of languages of different typologies: analytic, fusional, agglutinative and synthetic/polysynthetic.
112

Textual Inference for Machine Comprehension / Inférence textuelle pour la compréhension automatique

Gleize, Martin 07 January 2016
Étant donnée la masse toujours croissante de texte publié, la compréhension automatique des langues naturelles est à présent l'un des principaux enjeux de l'intelligence artificielle. En langue naturelle, les faits exprimés dans le texte ne sont pas nécessairement tous explicites : le lecteur humain infère les éléments manquants grâce à ses compétences linguistiques, ses connaissances de sens commun ou sur un domaine spécifique, et son expérience. Les systèmes de Traitement Automatique des Langues (TAL) ne possèdent naturellement pas ces capacités. Incapables de combler les défauts d'information du texte, ils ne peuvent donc pas le comprendre vraiment. Cette thèse porte sur ce problème et présente notre travail sur la résolution d'inférences pour la compréhension automatique de texte. Une inférence textuelle est définie comme une relation entre deux fragments de texte : un humain lisant le premier peut raisonnablement inférer que le second est vrai. Beaucoup de tâches de TAL évaluent plus ou moins directement la capacité des systèmes à reconnaître l'inférence textuelle. Au sein de cette multiplicité de l'évaluation, les inférences elles-mêmes présentent une grande variété de types. Nous nous interrogeons sur les inférences en TAL d'un point de vue théorique et présentons deux contributions répondant à ces niveaux de diversité : une tâche abstraite contextualisée qui englobe les tâches d'inférence du TAL, et une taxonomie hiérarchique des inférences textuelles en fonction de leur difficulté. La reconnaissance automatique d'inférence textuelle repose aujourd'hui presque toujours sur un modèle d'apprentissage, entraîné à l'usage de traits linguistiques variés sur un jeu d'inférences textuelles étiquetées. Cependant, les données spécifiques aux phénomènes d'inférence complexes ne sont pour le moment pas assez abondantes pour espérer apprendre automatiquement la connaissance du monde et le raisonnement de sens commun nécessaires. 
Les systèmes actuels se concentrent plutôt sur l'apprentissage d'alignements entre les mots de phrases reliées sémantiquement, souvent en utilisant leur structure syntaxique. Pour étendre leur connaissance du monde, ils incluent des connaissances tirées de ressources externes, ce qui améliore souvent les performances. Mais cette connaissance est souvent ajoutée par dessus les fonctionnalités existantes, et rarement bien intégrée à la structure de la phrase.Nos principales contributions dans cette thèse répondent au problème précédent. En partant de l'hypothèse qu'un lexique plus simple devrait rendre plus facile la comparaison du sens de deux phrases, nous décrivons une méthode de récupération de passage fondée sur une expansion lexicale structurée et un dictionnaire de simplifications. Cette hypothèse est testée à nouveau dans une de nos contributions sur la reconnaissance d'implication textuelle : des paraphrases syntaxiques sont extraites du dictionnaire et appliquées récursivement sur la première phrase pour la transformer en la seconde. Nous présentons ensuite une méthode d'apprentissage par noyaux de réécriture de phrases, avec une notion de types permettant d'encoder des connaissances lexico-sémantiques. Cette approche est efficace sur trois tâches : la reconnaissance de paraphrases, d'implication textuelle, et le question-réponses. Nous résolvons son problème de passage à l'échelle dans une dernière contribution. Des tests de compréhension sont utilisés pour son évaluation, sous la forme de questions à choix multiples sur des textes courts, qui permettent de tester la résolution d'inférences en contexte. Notre système est fondé sur un algorithme efficace d'édition d'arbres, et les traits extraits des séquences d'édition sont utilisés pour construire deux classifieurs pour la validation et l'invalidation des choix de réponses. Cette approche a obtenu la deuxième place du challenge "Entrance Exams" à CLEF 2015. 
/ With the ever-growing mass of published text, natural language understanding stands as one of the most sought-after goals of artificial intelligence. In natural language, not every fact expressed in the text is necessarily explicit: human readers naturally infer what is missing through various intuitive linguistic skills, common sense or domain-specific knowledge, and life experiences. Natural Language Processing (NLP) systems do not have these initial capabilities. Unable to draw inferences to fill the gaps in the text, they cannot truly understand it. This dissertation focuses on this problem and presents our work on the automatic resolution of textual inferences in the context of machine reading. A textual inference is simply defined as a relation between two fragments of text: a human reading the first can reasonably infer that the second is true. Many different NLP tasks more or less directly evaluate systems on their ability to recognize textual inference. Among this multiplicity of evaluation frameworks, inferences themselves are not one and the same and also present a wide variety of types. We reflect on inferences for NLP from a theoretical standpoint and present two contributions addressing these levels of diversity: an abstract contextualized inference task encompassing most NLP inference-related tasks, and a novel hierarchical taxonomy of textual inferences based on their difficulty. Automatically recognizing textual inference currently almost always involves a machine learning model, trained to use various linguistic features on a labeled dataset of samples of textual inference. However, specific data on complex inference phenomena is not currently abundant enough for systems to directly learn world knowledge and commonsense reasoning. Instead, systems focus on learning how to use the syntactic structure of sentences to align the words of two semantically related sentences. 
To extend what systems know of the world, they include external background knowledge, often improving their results. But this addition is often made on top of other features, and rarely well integrated into sentence structure. The main contributions of our thesis address the previous concern, with the aim of solving complex natural language understanding tasks. With the hypothesis that a simpler lexicon should make it easier to compare the sense of two sentences, we present a passage retrieval method using structured lexical expansion backed up by a simplifying dictionary. This simplification hypothesis is tested again in a contribution on textual entailment: syntactical paraphrases are extracted from the same dictionary and repeatedly applied to the first sentence to turn it into the second. We then present a machine learning kernel-based method recognizing sentence rewritings, with a notion of types able to encode lexical-semantic knowledge. This approach is effective on three tasks: paraphrase identification, textual entailment and question answering. We address its lack of scalability while keeping most of its strengths in our last contribution. Reading comprehension tests are used for evaluation: these multiple-choice questions on short text constitute the most practical way to assess textual inference within a complete context. Our system is founded on an efficient tree edit algorithm, and the features extracted from edit sequences are used to build two classifiers for the validation and invalidation of answer candidates. This approach reached second place at the "Entrance Exams" CLEF 2015 challenge.
113

Znalec encyklopedie / Encyclopedia Expert

Krč, Martin January 2009
This project focuses on a system that answers questions formulated in natural language. Firstly, the report discusses problems associated with question answering systems and some commonly employed approaches. Emphasis is laid on shallow methods, which do not require many linguistic resources. The second part describes our work on a system that answers factoid questions, utilizing Czech Wikipedia as a source of information. Answer extraction is partly based on specific features of Wikipedia and partly on pre-defined patterns. Results show that for answering simple questions, the system provides significant improvements in comparison with a standard search engine.
114

Neural Network Models for Tasks in Open-Domain and Closed-Domain Question Answering

Chen, Charles L. 01 June 2020
No description available.
115

On Advancing Natural Language Interfaces: Data Collection, Model Development, and User Interaction

Yao, Ziyu January 2021
No description available.
116

Using Bidirectional Encoder Representations from Transformers for Conversational Machine Comprehension / Användning av BERT-språkmodell för konversationsförståelse

Gogoulou, Evangelina January 2019
Bidirectional Encoder Representations from Transformers (BERT) is a recently proposed language representation model, designed to pre-train deep bidirectional representations, with the goal of extracting context-sensitive features from an input text [1]. One of the challenging problems in the field of Natural Language Processing is Conversational Machine Comprehension (CMC). Given a context passage, a conversational question and the conversational history, the system should predict the answer span of the question in the context passage. The main challenge in this task is how to effectively encode the conversational history into the prediction of the next answer. In this thesis work, we investigate the use of the BERT language model for the CMC task. We propose a new architecture, named BERT-CMC, using the BERT model as a base. This architecture includes a new module for encoding the conversational history, inspired by the Transformer-XL model [2]. This module serves the role of memory throughout the conversation. The proposed model is trained and evaluated on the Conversational Question Answering dataset (CoQA) [3]. Our hypothesis is that the BERT-CMC model will effectively learn the underlying context of the conversation, leading to better performance than the baseline model proposed for CoQA. Our results of evaluating BERT-CMC on the CoQA dataset show that the model performs poorly (44.7% F1 score) compared to the CoQA baseline model (66.2% F1 score). In light of model explainability, we also perform a qualitative analysis of the model behavior in questions with various linguistic phenomena, e.g., coreference and pragmatic reasoning. Additionally, we motivate the critical design choices made by performing an ablation study of the effect of these choices on the model performance. The results suggest that fine-tuning the BERT layers boosts the model performance. 
Moreover, it is shown that increasing the number of extra layers on top of BERT leads to bigger capacity of the conversational memory. / Bidirectional Encoder Representations from Transformers (BERT) är en nyligen föreslagen språkrepresentationsmodell, utformad för att förträna djupa dubbelriktade representationer, med målet att extrahera kontextkänsliga särdrag från en inmatningstext [1]. Ett utmanande problem inom området naturligtspråkbehandling är konversationsförståelse (förkortat CMC). Givet en bakgrundstext, en fråga och konversationshistoriken ska systemet förutsäga vilken del av bakgrundstexten som utgör svaret på frågan. Den viktigaste utmaningen i denna uppgift är hur man effektivt kan kodifiera konversationshistoriken i förutsägelsen av nästa svar. I detta examensarbete undersöker vi användningen av BERT-språkmodellen för CMC-uppgiften. Vi föreslår en ny arkitektur med namnet BERT-CMC med BERT-modellen som bas. Denna arkitektur innehåller en ny modul för kodning av konversationshistoriken, inspirerad av Transformer-XL-modellen [2]. Den här modulen tjänar minnets roll under hela konversationen. Den föreslagna modellen tränas och utvärderas på en datamängd för samtalsfrågesvar (CoQA) [3]. Vår hypotes är att BERT-CMC-modellen effektivt kommer att lära sig det underliggande sammanhanget för konversationen, vilket leder till bättre resultat än basmodellen som har föreslagits för CoQA. Våra resultat av utvärdering av BERT-CMC på CoQA-datasetet visar att modellen fungerar dåligt (44.7% F1 resultat), jämfört med CoQAbasmodellen (66.2% F1 resultat). För att bättre kunna förklara modellen utför vi också en kvalitativ analys av modellbeteendet i frågor med olika språkliga fenomen, t.ex. koreferens, pragmatiska resonemang. Dessutom motiverar vi de kritiska designvalen som gjorts genom att utföra en ablationsstudie av effekten av dessa val på modellens prestanda. Resultaten tyder på att finjustering av BERT-lager ökar modellens prestanda. 
Dessutom visas att ökning av antalet extra lager ovanpå BERT leder till större konversationsminne.
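
The F1 scores quoted in this abstract (44.7% and 66.2%) are token-overlap scores between a predicted answer span and a reference answer. The following is a minimal sketch of that kind of metric, assuming plain lowercasing and whitespace tokenization; the function name `token_f1` is hypothetical, and the official CoQA evaluation applies further answer normalization and handles multiple references, which this sketch omits.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted span and one reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score would then be the mean of `token_f1` over all questions, so a single extra or missing token in a short answer moves the average noticeably.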
117

Synthetic data generation for domain adaptation of a retriever-reader Question Answering system for the Telecom domain : Comparing dense embeddings with BM25 for Open Domain Question Answering / Syntetisk data genering för domänadaptering av ett retriever-readerbaserat frågebesvaringssystem för telekomdomänen : En jämförelse av dense embeddings med BM25 för Öpen Domän frågebesvaring

Döringer Kana, Filip January 2023
Having computer systems capable of answering questions has been a goal within Natural Language Processing research for many years. Machine Learning systems have recently become increasingly proficient at this task, with large language models obtaining state-of-the-art performance. Retriever-reader architectures have become a powerful approach for building systems that enable users to enter questions and get factual answers from a corpus of documents. This architecture uses a retriever component that fetches the most relevant documents and a reader which in turn extracts the answer from the documents. These systems commonly use transformer-based models for both components, which have been fine-tuned on a general domain of documents, such as Wikipedia. However, the performance of such systems on new domains, with different vocabularies, can be lacking. Furthermore, new domains of, for instance, company-specific documents often lack annotated data, which makes training new models cumbersome. This thesis investigated how a retriever-reader-based architecture can be adapted to a corpus of Telecom documents by generating question-answer data using a large generative language model, GPT3.5. It also compared the usage of a dense retriever using BERT to a BM25-based retriever on the domain. Findings suggest that generating training data can be an effective approach for fine-tuning a dense retriever, increasing the Top-K retrieval accuracy by 20 points for k = 10, compared to a dense retriever fine-tuned on Wikipedia. Additionally, it is found that the sparse retriever outperforms the best dense retriever, although there is reason to believe that the structure of the test dataset could influence this. Finally, the results also indicate that the performance of the reader is not improved by the generated data, although future work is needed to draw better conclusions. 
/ Datorsystem som kan svara på frågor har varit ett mål inom forskningsfältet naturlig språkbehandling i många år. System som använder sig av maskininlärning, så som stora språkmodeller har under de senaste åren uppnått hög prestanda. Att använda sig av en så kallad retriever-reader arkitektur har blivit ett kraftfullt tillvägagångssätt för att bygga system som gör det möjligt för användare att ställa frågor och få faktabaserade svar hämtade från en korpus av dokument. Denna arkitektur använder en retriever som hämtar den mest relevanta informationen och en reader som sedan extraherar ett svar från den hämtade informationen. Dessa system använder vanligtvis transformer-baserade modeller för båda komponenterna, som har tränats på en allmän domän som t.ex., Wikipedia. Dock kan prestandan hos dessa system vara bristfällig när de appliceras på mer specifika domäner med andra ordförråd. Dessutom saknas ofta annoterad data för mer specifika domäner, som exempelvis företagsdokument, vilket gör det svårt att träna modeller på dessa områden. I denna avhandling undersöktes hur en retriever-reader arkitektur kan appliceras på en korpus telekomdokument genom att generera data bestående av frågor och tillhörande svar, genom att använda en stor generativ språkmodell, GPT3.5. Rapporten jämförde även användandet av en BERT-baserad retriever med en BM25-baserad retriever för denna domän. Resultaten tyder på att generering av träningsdata kan vara ett effektivt tillvägagångssätt för att träna en BERT-baserad retriever. Den tränade modellen hade 20 poäng högre noggranhet för måttet Top-K retrieval vid k = 10 jämfört med samma model tränad på data från Wikipedia. Resultaten visade även att en BM25-baserad retriever hade högre noggranhet än den bästa BERT-baserade retrievern som tränats. Dock kan detta bero på datasetets utformning. 
Slutligen visade resultaten även att prestandan hos en tränad reader inte blev bättre genom att träna på genererad data men denna slutsats kräver framtida arbete för att undersökas mer noggrant.
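
To make the comparison in this abstract concrete, the sketch below shows a sparse BM25 scorer and the Top-K retrieval accuracy measure it is judged by. It is an illustration under stated assumptions, not the thesis's implementation: tokenization is plain whitespace splitting, the parameters `k1=1.5` and `b=0.75` are common defaults rather than values from the source, and the function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k_accuracy(ranked_ids_per_query, gold_ids, k):
    """Fraction of queries whose gold document appears in the top k results."""
    hits = sum(
        1 for ranked, gold in zip(ranked_ids_per_query, gold_ids)
        if gold in ranked[:k]
    )
    return hits / len(gold_ids)
```

A dense retriever would replace `bm25_scores` with a similarity between learned query and document embeddings; `top_k_accuracy` stays the same, which is what makes the two retrievers directly comparable at k = 10.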
118

A Study on Effective Approaches for Exploiting Temporal Information in News Archives / ニュースアーカイブの時制情報活用のための有効な手法に関する研究

Wang, Jiexin 26 September 2022
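Kyoto University / Doctorate by coursework (new system) / Doctor of Informatics / Kō No. 24259 / Jōhaku No. 803 / 新制||情||135 (University Library) / Department of Social Informatics, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Masatoshi Yoshikawa, Professor Keishi Tajima, Professor Sadao Kurohashi, Program-Specific Associate Professor LIN Donghui / Conferred under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM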
119

Investigating the Effect of Complementary Information Stored in Multiple Languages on Question Answering Performance : A Study of the Multilingual-T5 for Extractive Question Answering / Vad är effekten av kompletterande information lagrad i flera språk på frågebesvaring : En undersökning av multilingual-T5 för frågebesvaring

Aurell Hansson, Björn January 2021
Extractive question answering is a popular domain in the field of natural language processing, where machine learning models are tasked with answering questions given a context. Historically, the field has been centered on monolingual models, but recently more and more multilingual models have been developed, such as Google’s MT5 [1]. Because of this, machine translations of English have been used when training and evaluating these models, but machine translations can be degraded and do not always reflect their target language fairly. This report investigates whether complementary information stored in other languages can improve monolingual QA performance for languages where only machine translations are available. It also investigates whether exposure to more languages can improve zero-shot cross-lingual QA performance (i.e. when the question and answer do not have matching languages) by providing complementary information. We fine-tune 3 different MT5 models on QA datasets consisting of machine translations, as well as one model on the datasets together in combination with 3 other datasets that are not translations. We then evaluate the different models on the MLQA and XQuAD datasets. The results show that for 2 out of the 3 languages evaluated, complementary information stored in other languages had a positive effect on the QA performance of the MT5. For zero-shot cross-lingual QA, the complementary information offered by the fused model led to improved performance compared to 2/3 of the MT5 models trained only on translated data, indicating that complementary information from other languages does not offer any improvement in this regard. / Frågebesvaring (QA) är en populär domän inom naturlig språkbehandling, där maskininlärningsmodeller har till uppgift att svara på frågor. Historiskt har fältet varit inriktat på enspråkiga modeller, men nyligen har fler och fler flerspråkiga modeller utvecklats, till exempel Googles MT5 [1]. 
På grund av detta har maskinöversättningar av engelska använts vid träning och utvärdering av dessa modeller, men maskinöversättningar kan vara försämrade och speglar inte alltid deras målspråk rättvist. Denna rapport undersöker om kompletterande information som lagras i andra språk kan förbättra enspråkig QA-prestanda för språk där endast maskinöversättningar är tillgängliga. Den undersöker också om exponering för fler språk kan förbättra QA-prestanda på zero-shot cross-lingual QA (dvs. där frågan och svaret inte har matchande språk) genom att tillhandahålla kompletterande information. Vi finjusterar 3 olika modeller på QA-datamängder som består av maskinöversättningar, samt en modell på datamängderna tillsammans i kombination med 3 andra datamängder som inte är översättningar. Vi utvärderar sedan de olika modellerna på MLQA- och XQuAD-datauppsättningarna. Resultaten visar att för 2 av de 3 utvärderade språken hade kompletterande information som lagrats i andra språk en positiv effekt på QA-prestanda. För zero-shot cross-lingual QA leder den kompletterande informationen som erbjuds av den sammansmälta modellen till förbättrad prestanda jämfört med 2/3 av modellerna som tränats endast på översättningar, vilket indikerar att kompletterande information från andra språk inte ger någon förbättring i detta avseende.
120

Retrieving Definitions from Scientific Text in the Salmon Fish Domain by Lexical Pattern Matching

Gabbay, Igal 01 1900
While an information retrieval system takes as input a user query and returns a list of relevant documents chosen from a large collection, a question answering system attempts to produce an exact answer. Recent research, motivated by the question answering track of the Text REtrieval Conference (TREC), has focused mainly on answering ‘factoid’ questions concerned with names, places, dates, etc. in the news domain. However, questions seeking definitions of terms are common in the logs of search engines. The objective of this project was therefore to investigate methods of retrieving definitions from scientific documents. The subject domain was salmon, and an appropriate test collection of articles was created, pre-processed and indexed. Relevant terms were obtained from salmon researchers and a fish database. A system was built which accepted a term as input, retrieved relevant documents from the collection using a search engine, identified definition phrases within them using a vocabulary of syntactic patterns and associated heuristics, and produced as output phrases explaining the term. Four experiments were carried out which progressively extended and refined the patterns. The performance of the system, measured using an appropriate form of precision, improved over the experiments from 8.6% to 63.6%. The main findings of the research were: (1) Definitions were diverse despite the documents’ homogeneity and were found not only in the Introduction and Abstract sections but also in the Methods and References; (2) Nevertheless, syntactic patterns were a useful starting point in extracting them; (3) Three patterns accounted for 90% of candidate phrases; (4) Statistically, the ordinal number of the instance of the term in a document was a better indicator of the presence of a definition than either sentence position and length, or the number of sentences in the document. 
Next steps include classifying terms, using information extraction-like templates, resolving basic anaphors, ranking answers, exploiting the structure of scientific papers, and refining the evaluation process.
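
The pipeline described in this abstract (take a term, scan retrieved sentences with lexical patterns, return explanatory phrases, then measure precision) can be sketched as follows. The three regular expressions are illustrative assumptions only; the thesis's actual pattern vocabulary and heuristics are not given in the abstract, and the function names are hypothetical.

```python
import re


def find_definitions(term, sentences):
    """Return candidate definition phrases for `term` found in `sentences`."""
    t = re.escape(term)
    # Hypothetical patterns: copula ("X is a ..."), apposition
    # ("X, or ..."), and parenthetical gloss ("X (...)").
    patterns = [
        re.compile(rf"\b{t}\s+(?:is|are)\s+(?:an?|the)\s+(?P<definition>[^.;]+)",
                   re.IGNORECASE),
        re.compile(rf"\b{t}\s*,\s*(?:or|i\.e\.,?|also known as)\s+(?P<definition>[^,.;]+)",
                   re.IGNORECASE),
        re.compile(rf"\b{t}\s*\((?P<definition>[^)]+)\)", re.IGNORECASE),
    ]
    found = []
    for sentence in sentences:
        for pattern in patterns:
            match = pattern.search(sentence)
            if match:
                found.append(match.group("definition").strip())
                break  # keep at most one candidate per sentence
    return found


def precision(returned, relevant):
    """Share of returned phrases that a judge marked as relevant."""
    if not returned:
        return 0.0
    return sum(1 for phrase in returned if phrase in relevant) / len(returned)
```

For example, given the sentences "A smolt is a juvenile salmon ready to migrate to sea." and "Smolt (juvenile salmon) were tagged in spring.", the first is caught by the copula pattern and the second by the parenthetical pattern. Extending or tightening the pattern list and re-measuring `precision` mirrors the iterative refinement across experiments that the abstract reports.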
