51 |
Surface Realization Using a Featurized Syntactic Statistical Language Model
Packer, Thomas L. 13 March 2006
An important challenge in natural language surface realization is the generation of grammatical sentences from incomplete sentence plans. Realization can be broken into a two-stage process consisting of an over-generating rule-based module followed by a ranker that outputs the most probable candidate sentence based on a statistical language model. Thus far, an n-gram language model has been evaluated in this context. More sophisticated syntactic knowledge is expected to improve such a ranker. In this thesis, a new language model based on featurized functional dependency syntax was developed and evaluated. Generation accuracies and cross-entropy for the new language model did not beat the comparison bigram language model.
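The second stage of the pipeline described above, a ranker that picks the most probable candidate under a bigram language model, can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy corpus, the add-one smoothing, and the function names `train_bigram`, `log_prob`, and `rank` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability (smoothing choice is an assumption)."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def rank(candidates, unigrams, bigrams):
    """Return the candidate realization with the highest model score."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))


# Toy training data standing in for a real corpus.
corpus = [
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["a", "dog", "sleeps"],
]
unigrams, bigrams = train_bigram(corpus)

# Word-order variants such as an over-generating module might emit.
candidates = [
    ["dog", "the", "sleeps"],
    ["the", "dog", "sleeps"],
    ["sleeps", "dog", "the"],
]
best = rank(candidates, unigrams, bigrams)
```

A syntactic language model like the one evaluated in the thesis would replace `log_prob` with a score over dependency structures; the ranking step itself would stay the same.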
52 |
Génération de résumés par abstraction
Genest, Pierre-Étienne 05 1900
Cette thèse présente le résultat de plusieurs années de recherche dans le domaine de la génération automatique de résumés. Trois contributions majeures, présentées sous la forme d'articles publiés ou soumis pour publication, en forment le coeur. Elles retracent un cheminement qui part des méthodes par extraction en résumé jusqu'aux méthodes par abstraction.
L'expérience HexTac, sujet du premier article, a d'abord été menée pour évaluer le niveau de performance des êtres humains dans la rédaction de résumés par extraction de phrases. Les résultats montrent un écart important entre la performance humaine sous la contrainte d'extraire des phrases du texte source par rapport à la rédaction de résumés sans contrainte. Cette limite à la rédaction de résumés par extraction de phrases, observée empiriquement, démontre l'intérêt de développer d'autres approches automatiques pour le résumé.
Nous avons ensuite développé un premier système selon l'approche Fully Abstractive Summarization, qui se situe dans la catégorie des approches semi-extractives, comme la compression de phrases et la fusion de phrases. Le développement et l'évaluation du système, décrits dans le second article, ont permis de constater le grand défi de générer un résumé facile à lire sans faire de l'extraction de phrases. Dans cette approche, le niveau de compréhension du contenu du texte source demeure insuffisant pour guider le processus de sélection du contenu pour le résumé, comme dans les approches par extraction de phrases.
Enfin, l'approche par abstraction basée sur des connaissances nommée K-BABS est proposée dans un troisième article. Un repérage des éléments d'information pertinents est effectué, menant directement à la génération de phrases pour le résumé. Cette approche a été implémentée dans le système ABSUM, qui produit des résumés très courts mais riches en contenu. Ils ont été évalués selon les standards d'aujourd'hui et cette évaluation montre que des résumés hybrides formés à la fois de la sortie d'ABSUM et de phrases extraites ont un contenu informatif significativement plus élevé qu'un système provenant de l'état de l'art en extraction de phrases. / This Ph.D. thesis is the result of several years of research on automatic text summarization. Three major contributions are presented in the form of published and submitted papers. They follow a path that moves away from extractive summarization and toward abstractive summarization.
The first article describes the HexTac experiment, which was conducted to evaluate the performance of humans summarizing text by extracting sentences. Results show a wide gap of performance between human summaries written by sentence extraction and those written without restriction. This empirical performance ceiling to sentence extraction demonstrates the need for new approaches to text summarization.
We then developed and implemented a system, which is the subject of the second article, using the Fully Abstractive Summarization approach. Though the name suggests otherwise, this approach is better categorized as semi-extractive, along with sentence compression and sentence fusion. Building and evaluating this system brought to light the great challenge associated with generating easily readable summaries without extracting sentences. In this approach, text understanding is not deep enough to provide help in the content selection process, as is the case in extractive summarization.
As the third contribution, a knowledge-based approach to abstractive summarization called K-BABS was proposed. Relevant content is identified by pattern matching on an analysis of the source text, and rules are applied to directly generate sentences for the summary. This approach is implemented in a system called ABSUM, which generates very short and content-rich summaries. An evaluation was performed according to today's standards. The evaluation shows that hybrid summaries generated by adding extracted sentences to ABSUM's output have significantly more content than a state-of-the-art extractive summarizer.
53 |
Intégration de VerbNet dans un réalisateur profond
Galarreta-Piquette, Daniel 08 1900
No description available.
54 |
Recurrent neural network language generation for dialogue systems
Wen, Tsung-Hsien January 2018
Language is the principal medium for ideas, while dialogue is the most natural and effective way for humans to interact with and access information from machines. Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact on usability and perceived quality. Many commonly used NLG systems employ rules and heuristics, which tend to generate inflexible and stylised responses without the natural variation of human language. The frequent repetition of identical output forms can quickly make dialogue tedious for most real-world users. Additionally, these rules and heuristics are not scalable and hence not trivially extensible to other domains or languages. A statistical approach to language generation can learn language decisions directly from data without relying on hand-coded rules or heuristics, which brings scalability and flexibility to NLG. Statistical models also provide an opportunity to learn in-domain human colloquialisms and cross-domain model adaptations. A robust and quasi-supervised NLG model is proposed in this thesis. The model leverages a Recurrent Neural Network (RNN)-based surface realiser and a gating mechanism applied to input semantics. The model is motivated by the Long Short-Term Memory (LSTM) network. The RNN-based surface realiser and gating mechanism use a neural network to learn end-to-end language generation decisions from input dialogue act and sentence pairs; the model also integrates sentence planning and surface realisation into a single optimisation problem. The single optimisation not only bypasses the costly intermediate linguistic annotations but also generates more natural and human-like responses. Furthermore, a domain adaptation study shows that the proposed model can be readily adapted and extended to new dialogue domains via a proposed recipe.
Continuing the success of end-to-end learning, the second part of the thesis speculates on building an end-to-end dialogue system by framing it as a conditional generation problem. The proposed model encapsulates a belief tracker with a minimal state representation and a generator that takes the dialogue context to produce responses. These features suggest comprehension and fast learning. The proposed model is capable of understanding requests and accomplishing tasks after training on only a few hundred human-human dialogues. A complementary Wizard-of-Oz data collection method is also introduced to facilitate the collection of human-human conversations from online workers. The results demonstrate that the proposed model can talk to human judges naturally, without any difficulty, for a sample application domain. In addition, the results also suggest that the introduction of a stochastic latent variable can help the system model intrinsic variation in communicative intention much better.
55 |
Sobre o uso da gramática de dependência extensível na geração de língua natural: questões de generalidade, instanciabilidade e complexidade / On the application of extensible dependency grammar to natural language generation: generality, instantiability and complexity issues
Jorge Marques Pelizzoni 29 August 2008
A Geração de Língua Natural (GLN) ocupa-se de atribuir forma lingüística a dados em representação não-lingüística (Reiter & Dale, 2000); a Realização Lingüística (RL), por sua vez, reúne as subtarefas da GLN estritamente dependentes das especificidades da língua-alvo. Este trabalho objetiva a investigação em RL, uma de cujas aplicações mais proeminentes é a construção de módulos geradores de língua-alvo na tradução automática baseada em transferência semântica. Partimos da identificação de três requisitos fundamentais para modelos de RL - quais sejam generalidade, instanciabilidade e complexidade - e da tensão entre esses requisitos no estado da arte. Argumentamos pela relevância da avaliação formal dos modelos da literatura contra esses critérios e focalizamos em modelos baseados em restrições (Schulte, 2002) como promissores para reconciliar os três requisitos. Nesta classe de modelos, identificamos o recente modelo de Debusmann (2006) - Extensible Dependency Grammar (XDG) - e sua implementação - o XDG Development Toolkit (XDK) - como uma plataforma especialmente promissora para o desenvolvimento em RL, apesar de jamais utilizada para tal. Nossas contribuições práticas se resumem ao esforço de tornar o XDK mais eficiente e uma formulação da disjunção inerente à lexicalização adequada à XDG, demonstrando suas potenciais vantagens num sistema de GLN mais completo. / Natural Language Generation (NLG) concerns assigning linguistic form to data in nonlinguistic representation (Reiter & Dale, 2000); Linguistic Realization (LR), in turn, comprises all strictly target language-dependent NLG tasks. This work looks into LR systems from the perspective of three fundamental requirements - namely generality, instantiability, and complexity - and the tension between them in the state of the art. We argue for the formal evaluation of models against these criteria and focus on constraint-based models (Schulte, 2002) as tools to reconcile them.
In this class of models we identify the recent development of Debusmann (2006) - Extensible Dependency Grammar (XDG) - and its implementation - the XDG Development Toolkit (XDK) - as an especially promising platform for LR work, in spite of never having been used as such. Our practical contributions comprise a successful effort to make the XDK more efficient and a formulation of lexicalization disjunction suitable to XDG, illustrating its potential advantages in a full-fledged NLG system.
56 |
A Comparative Study of the Quality between Formality Style Transfer of Sentences in Swedish and English, leveraging the BERT model / En jämförande studie av kvaliteten mellan överföring av formalitetsstil på svenska och engelska meningar, med hjälp av BERT-modellen
Lindblad, Maria January 2021
Formality Style Transfer (FST) is the task of automatically transforming a piece of text from one level of formality to another. Previous research has investigated different methods of performing FST on text in English, but at the time of this project there were, to the author’s knowledge, no previous studies analysing the quality of FST on text in Swedish. The purpose of this thesis was to investigate how a model trained for FST in Swedish performs. This was done by comparing the quality of a model trained on text in Swedish for FST to an equivalent model trained on text in English for FST. Both models were implemented as encoder-decoder architectures, warm-started using two pre-existing Bidirectional Encoder Representations from Transformers (BERT) models, pre-trained on Swedish and English text respectively. The two FST models were fine-tuned for both the informal-to-formal task and the formal-to-informal task, using Grammarly’s Yahoo Answers Formality Corpus (GYAFC). The Swedish version of GYAFC was created through automatic machine translation of the original English version. The Swedish corpus was then evaluated on the three criteria meaning preservation, formality preservation and fluency preservation. The results of the study indicated that the Swedish model had the capacity to match the quality of the English model but was held back by the inferior quality of the Swedish corpus. The study also highlighted the need for task-specific corpora in Swedish. / Överföring av formalitetsstil syftar på uppgiften att automatiskt omvandla ett stycke text från en nivå av formalitet till en annan. Tidigare forskning har undersökt olika metoder för att utföra uppgiften på engelsk text men vid tiden för detta projekt fanns det enligt författarens vetskap inga tidigare studier som analyserat kvaliteten för överföring av formalitetsstil på svensk text.
Syftet med detta arbete var att undersöka hur en modell tränad för överföring av formalitetsstil på svensk text presterar. Detta gjordes genom att jämföra kvaliteten på en modell tränad för överföring av formalitetsstil på svensk text, med en motsvarande modell tränad på engelsk text. Båda modellerna implementerades som kodnings-avkodningsmodeller, vars vikter initierats med hjälp av två befintliga Bidirectional Encoder Representations from Transformers (BERT)-modeller, förtränade på svensk respektive engelsk text. De två modellerna finjusterades för omvandling både från informell stil till formell och från formell stil till informell. Under finjusteringen användes en svensk och en engelsk version av korpusen Grammarly’s Yahoo Answers Formality Corpus (GYAFC). Den svenska versionen av GYAFC skapades genom automatisk maskinöversättning av den ursprungliga engelska versionen. Den svenska korpusen utvärderades sedan med hjälp av de tre kriterierna betydelse-bevarande, formalitets-bevarande och flödes-bevarande. Resultaten från studien indikerade att den svenska modellen hade kapaciteten att matcha kvaliteten på den engelska modellen men hölls tillbaka av den svenska korpusens sämre kvalitet. Studien underströk också behovet av uppgiftsspecifika korpusar på svenska.
57 |
Le traitement des locutions en génération automatique de texte multilingue
Dubé, Michaelle 08 1900
La locution est peu étudiée en génération automatique de texte (GAT). Syntaxiquement, elle forme un syntagme, alors que sémantiquement, elle ne constitue qu’une seule unité. Le présent mémoire propose un traitement des locutions en GAT multilingue qui permet d’isoler les constituants de la locution tout en conservant le sens global de celle-ci. Pour ce faire, nous avons élaboré une solution flexible à base de patrons universels d’arbres de dépendances syntaxiques vers lesquels pointent des patrons de locutions propres au français (Pausé, 2017). Notre traitement a été effectué dans le réalisateur de texte profond multilingue GenDR à l’aide des données du Réseau lexical du français (RL-fr). Ce travail a abouti à la création de 36 règles de lexicalisation par patron (indépendantes de la langue) et à un dictionnaire lexical pour les locutions du français. Notre implémentation couvre 2 846 locutions du RL-fr (soit 97,5 %), avec une précision de 97,7 %.
Le mémoire se divise en cinq chapitres, qui décrivent : 1) l’architecture classique en GAT et le traitement des locutions par différents systèmes symboliques ; 2) l’architecture de GenDR (principalement sa grammaire, ses dictionnaires, son interface sémantique-syntaxe et ses stratégies de lexicalisation) ; 3) la place des locutions dans la phraséologie selon la théorie Sens-Texte, ainsi que le RL-fr et ses patrons syntaxiques linéarisés ; 4) notre implémentation de la lexicalisation par patron des locutions dans GenDR, et 5) notre évaluation de la couverture et de la précision de notre implémentation. / Idioms are rarely studied in natural language generation (NLG). Syntactically, they form a phrase, while semantically, they correspond to a single unit. In this master’s thesis, we propose a treatment of idioms in multilingual NLG that enables us to isolate their constituents while preserving their global meaning. To do so, we developed a flexible solution based on universal templates of syntactic dependency trees, onto which we map French-specific idiom patterns (Pausé, 2017). Our work was implemented in Generic Deep Realizer (GenDR) using data from the Réseau lexical du français (RL-fr). This resulted in the creation of 36 template-based lexicalization rules (independent of language) and of a lexical dictionary for French idioms. Our implementation covers 2846 idioms of the RL-fr (i.e., 97.5%), with an accuracy of 97.7%.
We divided our analysis into five chapters, which describe: 1) the classical NLG architecture and the handling of idioms by different symbolic systems; 2) the architecture of GenDR (mainly its grammar, its dictionaries, its semantic-syntactic interface, and its lexicalization strategies); 3) the place of idioms in phraseology according to Meaning-Text Theory (théorie Sens-Texte), the RL-fr and its linearized syntactic patterns; 4) our implementation of the template lexicalization of idioms in GenDR; and 5) our evaluation of the coverage and the precision of our implementation.
58 |
Automatic Question Paraphrasing in Swedish with Deep Generative Models / Automatisk frågeparafrasering på svenska med djupa generativa modeller
Lindqvist, Niklas January 2021
Paraphrase generation refers to the task of automatically generating a paraphrase given an input sentence or text. Paraphrase generation is a fundamental yet challenging natural language processing (NLP) task and is utilized in a variety of applications such as question answering, information retrieval, conversational systems etc. In this study, we address the problem of paraphrase generation of questions in Swedish by evaluating two different deep generative models that have shown promising results on paraphrase generation of questions in English. The first model is a Conditional Variational Autoencoder (C-VAE) and the other model is an extension of the first one where a discriminator network is introduced into the model to form a Generative Adversarial Network (GAN) architecture. In addition to these models, a method not based on machine-learning was implemented to act as a baseline. The models were evaluated using both quantitative and qualitative measures including grammatical correctness and equivalence to source question. The results show that the deep generative models outperformed the baseline across all quantitative metrics. Furthermore, from the qualitative evaluation it was shown that the deep generative models outperformed the baseline at generating grammatically correct sentences, but there was no noticeable difference in terms of equivalence to the source question between the models. / Parafrasgenerering syftar på uppgiften att, utifrån en given mening eller text, automatiskt generera en parafras, det vill säga en annan text med samma betydelse. Parafrasgenerering är en grundläggande men ändå utmanande uppgift inom naturlig språkbehandling och används i en rad olika applikationer som informationssökning, konversionssystem, att besvara frågor givet en text etc. 
I den här studien undersöker vi problemet med parafrasgenerering av frågor på svenska genom att utvärdera två olika djupa generativa modeller som visat lovande resultat på parafrasgenerering av frågor på engelska. Den första modellen är en villkorsbaserad variationsautokodare (C-VAE). Den andra modellen är också en C-VAE men introducerar även en diskriminator vilket gör modellen till ett generativt motståndarnätverk (GAN). Förutom modellerna presenterade ovan, implementerades även en icke maskininlärningsbaserad metod som en baslinje. Modellerna utvärderades med både kvantitativa och kvalitativa mått inklusive grammatisk korrekthet och likvärdighet mellan parafras och originalfråga. Resultaten visar att de djupa generativa modellerna presterar bättre än baslinjemodellen på alla kvantitativa mätvärden. Vidare visade den kvalitativa utvärderingen att de djupa generativa modellerna kunde generera grammatiskt korrekta frågor i större utsträckning än baslinjemodellen. Det var däremot ingen större skillnad i semantisk ekvivalens mellan parafras och originalfråga för de olika modellerna.
59 |
Forensic Analysis of Images and Documents
Ruiting Shao (18018187) 23 February 2024
This thesis involves three topics related to forensic analysis of media data. The first topic is the analysis of images and documents that have been created with a scanner. The goal is to detect and identify the scanner model from the scanned images/documents. We propose a deep learning system that can automatically learn the inherent features of the scanned images. This system will produce a scanner model identification and a reliability map for a scanned image. The proposed system has shown promising results in the forensic analysis of scanned images. The second topic is related to the forensic integrity of scientific papers. The project is divided into multiple tasks: data collection, image extraction, and manipulation detection. We have constructed a dataset of retracted scientific papers that have been verified to have issues with integrity. We design and maintain a web-based Scientific Integrity System for forensic analysis of the images within scientific publications. The third topic is related to media document analysis. Our goal is to identify the publication style of media documents, aiding in the detection of potential document manipulation. We mainly focus on image-text consistency checking and synthetic tweet analysis. For the image-text consistency check, we describe a system that can examine an image in a document and the corresponding text caption (or other text associated with the image) to check image/text consistency. For synthetic tweet analysis, we propose a system to detect and identify the text generation models and paraphrase attack models.
60 |
Fine-Tuning Pre-Trained Language Models for CEFR-Level and Keyword Conditioned Text Generation : A comparison between Google’s T5 and OpenAI’s GPT-2 / Finjustering av förtränade språkmodeller för CEFR-nivå och nyckelordsbetingad textgenerering : En jämförelse mellan Googles T5 och OpenAIs GPT-2
Roos, Quintus January 2022
This thesis investigates the possibilities of conditionally generating English sentences based on keywords-framing content and different difficulty levels of vocabulary. It aims to contribute to the field of Conditional Text Generation (CTG), a type of Natural Language Generation (NLG), where the process of creating text is based on a set of conditions. These conditions include words, topics, content or perceived sentiments. Specifically, it compares the performances of two well-known model architectures: Sequence-to-Sequence (Seq2Seq) and Autoregressive (AR). These are applied to two different tasks, individual and combined. The Common European Framework of Reference (CEFR) is used to assess the vocabulary level of the texts. In the absence of openly available CEFR-labelled datasets, the author has developed a new methodology with the host company to generate suitable datasets. The generated texts are evaluated on accuracy of the vocabulary levels and readability using readily available formulas. The analysis combines four established readability metrics, and assesses classification accuracy. Both models show a high degree of accuracy when classifying texts into different CEFR-levels. However, the same models are weaker when generating sentences based on a desired CEFR-level. This study contributes empirical evidence suggesting that: (1) Seq2Seq models have a higher accuracy than AR models in generating English sentences based on a desired CEFR-level and keywords; (2) combining Multi-Task Learning (MTL) with instruction-tuning is an effective way to fine-tune models on text-classification tasks; and (3) it is difficult to assess the quality of computer-generated language using only readability metrics. / I den här studien undersöks möjligheterna att villkorligt generera engelska meningar på så-kallad “naturligt” språk, som baseras på nyckelord, innehåll och vokabulärnivå.
Syftet är att bidra till området betingad textgenerering, en underkategori av naturlig textgenerering, vilket är en metod för att skapa text givet vissa ingångsvärden, till exempel ämne, innehåll eller uppfattning. I synnerhet jämförs prestandan hos två välkända modellarkitekturer: sekvens-till-sekvens (Seq2Seq) och autoregressiv (AR). Dessa tillämpas på två uppgifter, såväl individuellt som kombinerat. Den europeiska gemensamma referensramen (CEFR) används för att bedöma texternas vokabulärnivå. I och med avsaknaden av öppet tillgängliga CEFR-märkta dataset har författaren tillsammans med värdföretaget utvecklat en ny metod för att generera lämpliga dataset. De av modellerna genererade texterna utvärderas utifrån vokabulärnivå och läsbarhet samt hur väl de uppfyller den sökta CEFR-nivån. Båda modellerna visade en hög träffsäkerhet när de klassificerar texter i olika CEFR-nivåer. Dock uppvisade samma modeller en sämre förmåga att generera meningar utifrån en önskad CEFR-nivå. Denna studie bidrar med empiriska bevis som tyder på: (1) att Seq2Seq-modeller har högre träffsäkerhet än AR-modeller när det gäller att generera engelska meningar utifrån en önskad CEFR-nivå och nyckelord; (2) att kombinera inlärning av multipla uppgifter med instruktionsjustering är ett effektivt sätt att finjustera modeller för textklassificering; (3) att man inte kan bedöma kvaliteten av datorgenererade meningar genom att endast använda läsbarhetsmått.
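To give a sense of what a formula-based readability score looks like, the sketch below computes the Automated Readability Index (ARI). The abstract does not name the four readability metrics it combines, so the choice of ARI here is an assumption made purely for illustration, and the tokenization rules and the function name `automated_readability_index` are likewise hypothetical.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    # Naive sentence and word splitting; a real evaluation would use a proper tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    # Count letters and digits only, ignoring apostrophes.
    chars = sum(len(w.replace("'", "")) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


simple = "The cat sat. The dog ran."
harder = "Conditional generation necessitates sophisticated architectural considerations."
simple_score = automated_readability_index(simple)
harder_score = automated_readability_index(harder)
```

Scores like this depend only on surface counts such as word and sentence length, which is consistent with the abstract's third finding that readability metrics alone are a weak proxy for the quality of generated language.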