31

Compositional entity-level sentiment analysis

Moilanen, Karo January 2010
This thesis presents a computational text analysis tool called AFFECTiS (Affect Interpretation/Inference System) which focuses on the task of interpreting natural language text based on its subjective, non-factual, affective properties that go beyond the 'traditional' factual, objective dimensions of meaning that have so far been the main focus of Natural Language Processing and Computational Linguistics. The thesis presents a fully compositional uniform wide-coverage computational model of sentiment in text that builds on a number of fundamental compositional sentiment phenomena and processes discovered by detailed linguistic analysis of the behaviour of sentiment across key syntactic constructions in English. Driven by the Principle of Semantic Compositionality, the proposed model breaks sentiment interpretation down into strictly binary combinatory steps each of which explains the polarity of a given sentiment expression as a function of the properties of the sentiment carriers contained in it and the grammatical and semantic context(s) involved. An initial implementation of the proposed compositional sentiment model is described which attempts direct logical sentiment reasoning rather than basing computational sentiment judgements on indirect data-driven evidence. Together with deep grammatical analysis and large hand-written sentiment lexica, the model is applied recursively to assign sentiment to all (sub)sentential structural constituents and to concurrently equip all individual entity mentions with gradient sentiment scores. The system was evaluated on an extensive multi-level and multi-task evaluation framework encompassing over 119,000 test cases from which detailed empirical experimental evidence is drawn.
The results across entity-, phrase-, sentence-, word-, and document-level data sets demonstrate that AFFECTiS is capable of human-like sentiment reasoning and can interpret sentiment in a way that is not only coherent syntactically but also defensible logically - even in the presence of the many ambiguous extralinguistic, paralogical, and mixed sentiment anomalies that so tellingly characterise the challenges involved in non-factual classification.
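The strictly binary combinatory view of sentiment composition can be illustrated with a small sketch. This is not the AFFECTiS implementation: the lexicon, the polarity tags, and the dominance ordering below are invented for illustration, and a simple right-branching fold stands in for a real grammatical analysis.

```python
# Hedged sketch of strictly binary sentiment composition (illustrative
# only, not the actual AFFECTiS model): the polarity of a phrase is a
# function of its two constituents and the combining context.

LEXICON = {           # hypothetical prior polarities
    "brilliant": "POS", "failure": "NEG", "film": "NEU",
    "not": "NEG_SHIFT", "lack": "REVERSE",
}

def combine(left, right):
    """One binary combinatory step: merge two constituent polarities."""
    if left in ("NEG_SHIFT", "REVERSE"):   # negators/reversers flip
        return {"POS": "NEG", "NEG": "POS", "NEU": "NEU"}[right]
    if "NEG" in (left, right):             # toy dominance ordering
        return "NEG"
    if "POS" in (left, right):
        return "POS"
    return "NEU"

def phrase_polarity(tokens):
    """Right-branching fold over the tokens, one binary step at a time."""
    polarity = LEXICON.get(tokens[-1], "NEU")
    for tok in reversed(tokens[:-1]):
        polarity = combine(LEXICON.get(tok, "NEU"), polarity)
    return polarity
```

On this toy lexicon, `phrase_polarity(["not", "brilliant"])` yields `"NEG"` and `phrase_polarity(["lack", "failure"])` yields `"POS"`, mirroring the kind of polarity reversal the compositional model is designed to capture.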
32

Acquiring syntactic and semantic transformations in question answering

Kaisser, Michael January 2010
One and the same fact in natural language can be expressed in many different ways, using different words and/or different syntax. This phenomenon, commonly called paraphrasing, is the main reason why Natural Language Processing (NLP) is such a challenging task. This becomes especially obvious in Question Answering (QA), where the task is to automatically answer a question posed in natural language, usually against a text collection also consisting of natural language texts. It cannot be assumed that an answer sentence to a question uses the same words as the question, or that these words are combined in the same way using the same syntactic rules. In this thesis we describe methods that can help to address this problem. Firstly, we explore how lexical resources, namely FrameNet, PropBank and VerbNet, can be used to recognize a wide range of syntactic realizations that an answer sentence to a given question can have. We find that our methods based on these resources work well for web-based Question Answering. However, we identify two problems: 1) all three resources as yet have significant coverage issues; 2) these resources are not suitable for identifying answer sentences that show some form of indirect evidence. While the first problem currently hinders performance, it is not a theoretical problem that renders the approach unsuitable; rather, it shows that more effort has to be made to produce more complete resources. The second problem is more persistent. Many valid answer sentences, especially in small, journalistic corpora, do not provide direct evidence for a question; rather, they strongly suggest an answer without logically implying it. Semantically motivated resources like FrameNet, PropBank and VerbNet cannot easily be employed to recognize such forms of indirect evidence.
In order to investigate ways of dealing with indirect evidence, we used Amazon’s Mechanical Turk to collect over 8,000 manually identified answer sentences from the AQUAINT corpus to the over 1,900 TREC questions from the 2002 to 2006 QA tracks. The pairs of answer sentences and their corresponding questions form the QASP corpus, which we released to the public in April 2008. In this dissertation, we use the QASP corpus to develop an approach to QA based on matching dependency relations between answer candidates and question constituents in the answer sentences. By acquiring knowledge about syntactic and semantic transformations from dependency relations in the QASP corpus, additional answer candidates can be identified that could not be linked to the question with our first approach.
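The core idea of matching dependency relations between a question and a candidate answer sentence can be sketched as follows. The triple format, relation names and matching rule are simplifying assumptions for illustration, not the actual QASP-based approach.

```python
# Hypothetical sketch of dependency-relation matching for QA: find the
# word in the answer sentence that fills the dependency slot occupied
# by the wh-word in the question. Relation labels are illustrative.

def deps_match(question_deps, answer_deps, wh_slot="WHO"):
    """Each dependency is a (head, relation, dependent) triple.
    Returns answer words occupying the question's wh-slot."""
    candidates = []
    for head, rel, dep in question_deps:
        if dep != wh_slot:
            continue
        for a_head, a_rel, a_dep in answer_deps:
            # same governing word and same relation -> the dependent
            # is a plausible answer candidate
            if a_head == head and a_rel == rel:
                candidates.append(a_dep)
    return candidates

# "Who invented the telephone?" vs. "Bell invented the telephone."
question = [("invented", "nsubj", "WHO"), ("invented", "dobj", "telephone")]
answer = [("invented", "nsubj", "Bell"), ("invented", "dobj", "telephone")]
```

Here `deps_match(question, answer)` returns `["Bell"]`. The transformations learned from the QASP corpus go well beyond such exact-match cases, linking question and answer even when the dependency structures differ.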
33

Temporal processing of news: annotation of temporal expressions, verbal events and temporal relations

Marsic, Georgiana January 2011
The ability to capture the temporal dimension of a natural language text is essential to many natural language processing applications, such as Question Answering, Automatic Summarisation, and Information Retrieval. Temporal processing is a field of Computational Linguistics which aims to access this dimension and derive a precise temporal representation of a natural language text by extracting time expressions, events and temporal relations, and then representing them according to a chosen knowledge framework. This thesis focuses on the investigation and understanding of the different ways time is expressed in natural language, on the implementation of a temporal processing system in accordance with the results of this investigation, on the evaluation of the system, and on the extensive analysis of the errors and challenges that appear during system development. The ultimate goal of this research is to develop the ability to automatically annotate temporal expressions, verbal events and temporal relations in a natural language text. Temporal expression annotation involves two stages: temporal expression identification, concerned with determining the textual extent of a temporal expression, and temporal expression normalisation, which finds the value that the temporal expression designates and represents it using an annotation standard. The research presented in this thesis approaches these tasks with a knowledge-based methodology that tackles temporal expressions according to their semantic classification. Several knowledge sources and normalisation models are experimented with to allow an analysis of their impact on system performance. The annotation of events expressed using either finite or non-finite verbs is addressed with a method that overcomes the drawback of existing methods, which associate an event with the class that is most frequently assigned to it in a corpus and are limited in coverage by the small number of events present in the corpus.
This limitation is overcome in this research by annotating each WordNet verb with an event class that best characterises that verb. This thesis also describes an original methodology for the identification of temporal relations that hold among events and temporal expressions. The method relies on sentence-level syntactic trees and a propagation of temporal relations between syntactic constituents, by analysing syntactic and lexical properties of the constituents and of the relations between them. The detailed evaluation and error analysis of the methods proposed for solving different temporal processing tasks form an important part of this research. Various corpora widely used by researchers studying different temporal phenomena are employed in the evaluation, thus enabling comparison with the state of the art in the field. The detailed error analysis targeting each temporal processing task helps identify not only problems of the implemented methods, but also reliability problems of the annotated resources, and encourages potential reexaminations of some temporal processing tasks.
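The normalisation stage can be illustrated with a minimal rule-based sketch that resolves a few expression classes against the document creation time (DCT) and emits an ISO-style value; the rules and their coverage are illustrative assumptions, not the thesis's actual rule set.

```python
import datetime

# Hedged sketch of knowledge-based temporal expression normalisation:
# resolve a handful of expression classes against the document creation
# time (DCT). The rules below are illustrative only.

def normalise(expression, dct):
    expr = expression.lower()
    if expr == "today":
        return dct.isoformat()
    if expr == "yesterday":
        return (dct - datetime.timedelta(days=1)).isoformat()
    if expr == "tomorrow":
        return (dct + datetime.timedelta(days=1)).isoformat()
    if expr.startswith("next "):
        weekdays = ["monday", "tuesday", "wednesday", "thursday",
                    "friday", "saturday", "sunday"]
        target = weekdays.index(expr.split()[1])
        # days until the next occurrence of the target weekday
        delta = (target - dct.weekday() - 1) % 7 + 1
        return (dct + datetime.timedelta(days=delta)).isoformat()
    return None  # expression class not covered by this sketch

dct = datetime.date(2011, 3, 15)   # a Tuesday, used as DCT below
```

Given this DCT, `normalise("yesterday", dct)` resolves to `2011-03-14` and `normalise("next monday", dct)` to `2011-03-21`, showing how the same surface rule yields different values under different anchoring times.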
34

Domain independent generation from RDF instance data

Sun, Xiantang January 2008
The next generation of the web, the Semantic Web, integrates distributed web resources from various domains by allowing data (instantial and ontological) to be shared and reused across application, enterprise and community boundaries, based on the Resource Description Framework (RDF). Nevertheless, RDF was not designed for casual users who are unfamiliar with it but are interested in the data it represents. Natural Language Generation (NLG) may offer a way to bridge the gap between casual users and RDF data, but the cost of separately applying fine-grained NLG techniques to every domain in the Semantic Web would be extremely high, and hence unrealistic.
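To make the gap concrete, here is a minimal sketch of template-based verbalisation of RDF-style triples. The predicates and templates are invented for illustration, and the abstract's point is precisely that such hand-written, per-domain resources do not scale across the Semantic Web.

```python
# Hedged sketch of per-domain, template-based verbalisation of
# (subject, predicate, object) triples. Everything here is invented
# to illustrate why hand-written templates are costly to maintain.

TEMPLATES = {  # hypothetical mapping from predicate to sentence frame
    "hasCapital": "{s} has {o} as its capital.",
    "locatedIn": "{s} is located in {o}.",
}

def verbalise(triples):
    """Render each triple as an English sentence via its template."""
    sentences = []
    for s, p, o in triples:
        template = TEMPLATES.get(p, "{s} {p} {o}.")  # crude fallback
        sentences.append(template.format(s=s, p=p, o=o))
    return " ".join(sentences)
```

For example, `verbalise([("France", "hasCapital", "Paris")])` produces "France has Paris as its capital." Every new domain would need its own template inventory, which is the cost the thesis seeks to avoid.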
35

Shared cross-modal associations and the emergence of the lexicon

Cuskley, Christine F. January 2013
This thesis centres around a sensory theory of protolanguage emergence, or STP. The STP proposes that shared biases to make associations between sensory modalities provided the basis for the emergence of a shared protolinguistic lexicon. Crucially, this lexicon would have been grounded in our perceptual systems, and thus fundamentally non-arbitrary. The foundation of such a lexicon lies in shared cross-modal associations: biases shared among language users to map properties in one modality (e.g., visual size) onto another (e.g., vowel sounds). While there is broad evidence that we make associations between a variety of modalities (Spence, 2011), this thesis focuses specifically on associations involving linguistic sound, arguing that these associations would have been most important in language emergence. Early linguistic utterances, by virtue of their grounding in shared cross-modal associations, could be formed and understood with high mutual intelligibility. The first chapter of the thesis will outline this theory in detail, addressing the nature of the proposed protolanguage system, arguing for the utility of non-arbitrariness at the point of language emergence, and proposing evidence for the likely transition from a non-arbitrary protolanguage to the predominantly arbitrary language systems we observe today. The remainder of the thesis will focus on providing empirical evidence to support this theory in two ways: (i) presenting experimental data showing evidence of shared associations between linguistic sound and other modalities, and (ii) providing evidence that such associations are evident cross-linguistically, despite the predominantly arbitrary nature of modern languages. Chapter two will examine well-documented associations between vowel quality and physical size (e.g., /i/ is small, and /a/ is large; Sapir, 1929).
This chapter presents a new experimental approach which fails to find robust associations between vowel quality and size in the absence of a forced-choice paradigm. Chapter three turns to associations between linguistic sound and shape angularity, taking a critical perspective on the classic takete/maluma experiment (Köhler, 1929). New empirical evidence shows that the acquisition of visual word forms plays a highly influential role in mediating associations between linguistic sound and angularity, but that associations between linguistic sound and visual form also play a minor role in auditory tasks. Chapter four will examine a relatively unexplored modality: taste. A simple survey which asks participants to choose non-words to match representative tastes shows that certain linguistic sounds are preferred for certain food items. In a more detailed study, we use a more direct perceptual matching task with actual tastants and synthesised speech sounds, further showing that people make robust shared associations between linguistic sound and taste. Chapter five returns to the visual modality, considering previously unexamined associations between linguistic sound and motion, specifically the feature of speed. This study demonstrates that people do make robust associations between the two modalities, particularly for vowel quality. Chapter six will take a different empirical approach, considering non-arbitrariness in natural language. Motivated by the experimental data from the previous chapters, we turn to corpus analyses to assess the presence of non-arbitrariness in natural language which concurs with behavioural data showing linguistic cross-modal associations. First, a corpus analysis of taste synonyms in English shows small but significant correlations between form and meaning.
With the goal of addressing the universality of specific sound-meaning associations, we examine cross-linguistic corpora of taste and motion terms, showing that particular phonological features tend to connect to certain tastes and types of motion across genetically and geographically distinct languages. Lastly, the thesis will conclude by considering the STP in light of the empirical evidence presented, and suggesting possible future empirical directions to explore the theory more broadly.
36

Measuring the homogeneity and similarity of language corpora

Cavaglia, Gabriela Maria Chiara January 2005
Corpus-based methods are now dominant in Natural Language Processing (NLP). Creating large corpora is no longer difficult, and the technology to analyze them is becoming faster, more robust and more accurate. However, when an NLP application performs well on one corpus, it is unclear whether this level of performance will be maintained on others. To make progress on this question, we need methods for comparing corpora. This thesis investigates comparison methods based on the notions of corpus homogeneity and similarity.
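One simple family of corpus-comparison measures works on word-frequency profiles. The sketch below uses cosine similarity over relative frequencies purely as an illustration; the homogeneity and similarity measures investigated in the thesis may differ.

```python
from collections import Counter
import math

# Hedged sketch of a frequency-profile corpus-similarity measure
# (illustrative, not necessarily the thesis's own measures): corpora
# are reduced to relative word frequencies and compared by cosine.

def freq_profile(tokens):
    """Relative frequency of each word in a tokenised corpus."""
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()}

def similarity(corpus_a, corpus_b):
    """Cosine similarity between the two frequency profiles."""
    p, q = freq_profile(corpus_a), freq_profile(corpus_b)
    vocab = set(p) | set(q)
    dot = sum(p.get(w, 0.0) * q.get(w, 0.0) for w in vocab)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm
```

Identical corpora score 1.0 and corpora with disjoint vocabularies score 0.0; a homogeneity measure can then be defined, for instance, as the average similarity between random halves of a single corpus.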
37

Automatic generation of factual questions from video documentaries

Skalban, Yvonne January 2013
Questioning sessions are an essential part of teachers' daily instructional activities. Questions are used to assess students' knowledge and comprehension and to promote learning. The manual creation of such learning material is a laborious and time-consuming task. Research in Natural Language Processing (NLP) has shown that Question Generation (QG) systems can be used to efficiently create high-quality learning materials to support teachers in their work and students in their learning process. A number of successful QG applications for education and training have been developed, but these focus mainly on supporting reading materials. However, digital technology is always evolving; there is an ever-growing amount of multimedia content available, and more and more delivery methods for audio-visual content are emerging and easily accessible. At the same time, research provides empirical evidence that multimedia use in the classroom has beneficial effects on student learning. Thus, there is a need to investigate whether QG systems can be used to assist teachers in creating assessment materials from the different types of media that are being employed in classrooms. This thesis explores how NLP tools and techniques can be harnessed to generate questions from non-traditional learning materials, in particular videos. A QG framework which allows the generation of factual questions from video documentaries has been developed, and a number of evaluations to analyse the quality of the produced questions have been performed. The developed framework uses several readily available NLP tools to generate questions from the subtitles accompanying a video documentary. The reason for choosing video documentaries is two-fold: firstly, they are frequently used by teachers and, secondly, their factual nature lends itself well to question generation, as will be explained within the thesis.
The questions generated by the framework can be used as a quick way of testing students' comprehension of what they have learned from the documentary. As part of this research project, the characteristics of documentary videos and their subtitles were analysed and the methodology was adapted to exploit these characteristics. An evaluation of the system output by domain experts showed promising results, but also revealed that generating even shallow questions is a task which is far from trivial. To this end, the evaluation and subsequent error analysis contribute to the literature by highlighting the challenges QG from documentary videos can face. In a user study, it was investigated whether questions generated automatically by the system developed as part of this thesis and by a state-of-the-art system can successfully be used to assist multimedia-based learning. Using a novel evaluation methodology, the feasibility of using a QG system's output as 'pre-questions' was examined, with different types of pre-questions (text-based and with images) used. The psychometric parameters of the questions generated automatically by the two systems and of those generated manually were compared. The results indicate that the presence of pre-questions (preferably with images) improves the performance of test-takers, and they highlight that the psychometric parameters of the questions generated by the system are comparable to, if not better than, those of the state-of-the-art system. In another experiment, the productivity of questions in terms of time taken to generate questions manually vs. time taken to post-edit system-generated questions was analysed. A post-editing tool which allows for the tracking of several statistics, such as edit distance measures and editing time, was used. The quality of questions before and after post-editing was also analysed.
Not only did the experiments provide quantitative data about automatically and manually generated questions, but qualitative data in the form of user feedback, which provides an insight into how users perceived the quality of questions, was also gathered.
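A minimal sketch of shallow factual question generation from a subtitle line, where a single hand-written subject-verb-object pattern stands in for the full NLP pipeline (parsing, named-entity recognition) that the thesis describes; the verbs and the pattern are illustrative assumptions.

```python
import re

# Hedged sketch of shallow factual QG from one subtitle sentence: a
# single SVO pattern replaces the real parsing/NER pipeline. The verb
# list and pattern are invented for illustration.

PATTERN = re.compile(
    r"^(?P<subj>[A-Z][\w ]*?) "
    r"(?P<verb>discovered|invented|built) "
    r"(?P<obj>.+?)\.$"
)

def generate_question(sentence):
    """Turn 'X discovered Y.' into ('Who discovered Y?', 'X')."""
    m = PATTERN.match(sentence)
    if not m:
        return None  # sentence does not fit the toy pattern
    question = f"Who {m.group('verb')} {m.group('obj')}?"
    return question, m.group("subj")   # question plus its answer key
```

Applied to the subtitle line "Marie Curie discovered radium.", the sketch yields the pre-question "Who discovered radium?" with answer key "Marie Curie"; sentences outside the pattern are simply skipped, which mirrors the recall problems the error analysis in the thesis highlights.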
38

Textual Entailment and Rewriting

Bedaride, Paul 18 October 2010
This thesis makes several contributions to the task of recognising textual entailment (RTE): given two texts, deciding whether the meaning of the second can be deduced from the meaning of the first. One contribution is a hybrid RTE system that takes the output of an existing stochastic parser, labels it with semantic roles, transforms the resulting structures into logical formulas by means of rewrite rules, and finally tests entailment with theorem-proving tools. Another contribution of this thesis is the generation of finely annotated test suites with a uniform distribution of phenomena, coupled with a new methodology for evaluating systems that uses the error-mining techniques developed by the parsing community, allowing better identification of a system's limitations. To this end, we create a set of semantic formulas and generate the corresponding annotated syntactic realisations using an existing generation system. We then test whether or not entailment holds between each possible pair of syntactic realisations. Finally, using an algorithm we developed, we select a subset of this problem set of a given size that satisfies a number of constraints.
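The rewrite-then-prove idea behind such hybrid systems can be caricatured in a few lines: facts derived from the text are closed under rewrite rules, and entailment is approximated as inclusion of the hypothesis facts in that closure. The predicates and rules below are invented; the real system uses semantic role labelling and full theorem-proving tools.

```python
# Hedged sketch of rewrite-then-prove entailment checking. Facts are
# (predicate, arg1, arg2) triples; RULES is a toy set of one-step
# inferences. All predicates here are invented for illustration.

RULES = {  # premise predicate -> conclusion predicate
    "buy": "own",     # buying something implies owning it
    "own": "have",
}

def closure(facts):
    """Close a fact set under the rewrite rules until fixpoint."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for pred, cpred in RULES.items():
            for f in list(facts):
                if f[0] == pred and (cpred, f[1], f[2]) not in facts:
                    facts.add((cpred, f[1], f[2]))
                    changed = True
    return facts

def entails(text_facts, hypothesis_facts):
    """Entailment approximated as inclusion in the closure."""
    return set(hypothesis_facts) <= closure(text_facts)
```

Under these toy rules, "John bought a car" entails "John has a car" via the chain buy -> own -> have, while no chain licenses the reverse direction.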
39

Neural-Symbolic Learning for Semantic Parsing

Xiao, Chunyang 14 December 2017
Our goal in this thesis is to build a system that answers a natural language (NL) question by representing its semantics as a logical form (LF) and then computing the answer by executing the LF over a knowledge base. The core part of such a system is the semantic parser, which maps questions to logical forms. Our focus is on how to build high-performance semantic parsers by learning from (NL, LF) pairs. We propose to combine recurrent neural networks (RNNs) with symbolic prior knowledge expressed through context-free grammars (CFGs) and automata.
By integrating CFGs over LFs into the RNN training and inference processes, we guarantee that the generated logical forms are well-formed; by integrating, through weighted automata, prior knowledge about the presence of certain entities in the LF, we further enhance the performance of our models. Experimentally, we show that our approach achieves better performance than previous semantic parsers that do not use neural networks, as well as RNNs not informed by such prior knowledge.
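The grammar-masking idea can be sketched without a neural network: at each decoding step, only the tokens the grammar allows as a continuation may compete, whatever the model's scores say. The toy grammar, token set and score table below are invented for illustration and approximate a CFG with a simple state-to-successors table.

```python
import math

# Hedged sketch of grammar-constrained greedy decoding. The "model" is
# a toy score table rather than an RNN, and the grammar is a toy
# successor table approximating a CFG over logical forms.

GRAMMAR = {               # which tokens may follow each state
    "START": ["answer("],
    "answer(": ["capital_of("],
    "capital_of(": ["France", "Italy"],
    "France": ["))"], "Italy": ["))"],
    "))": ["<eos>"],
}

def constrained_decode(scores, max_len=10):
    """Greedy decoding with a hard grammar mask at every step."""
    state, output = "START", []
    for _ in range(max_len):
        allowed = GRAMMAR.get(state, ["<eos>"])
        # mask: only grammar-licensed tokens compete on model score
        tok = max(allowed, key=lambda t: scores.get(t, -math.inf))
        if tok == "<eos>":
            break
        output.append(tok)
        state = tok
    return "".join(output)

# toy scores: left unconstrained, the model would emit ill-formed output
scores = {"France": 0.9, "Italy": 0.2, "answer(": 0.1,
          "capital_of(": 0.3, "))": 0.5}
```

Because the mask removes ill-formed continuations before the argmax, the decoder necessarily produces a well-formed logical form, here `answer(capital_of(France))`, which is the guarantee the CFG integration provides.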
40

Unsupervised induction of semantic roles

Lang, Joel January 2012
In recent years, a considerable amount of work has been devoted to the task of automatic frame-semantic analysis. Given the relative maturity of syntactic parsing technology, which is an important prerequisite, frame-semantic analysis represents a realistic next step towards broad-coverage natural language understanding and has been shown to benefit a range of natural language processing applications such as information extraction and question answering. Due to the complexity which arises from variations in syntactic realization, data-driven models based on supervised learning have become the method of choice for this task. However, the reliance on large amounts of semantically labeled data, which is costly to produce for every language, genre and domain, presents a major barrier to the widespread application of the supervised approach. This thesis therefore develops unsupervised machine learning methods, which automatically induce frame-semantic representations without making use of semantically labeled data. If successful, unsupervised methods would render manual data annotation unnecessary and therefore greatly benefit the applicability of automatic frame-semantic analysis. We focus on the problem of semantic role induction, in which all the argument instances occurring together with a specific predicate in a corpus are grouped into clusters according to their semantic role. Our hypothesis is that semantic roles can be induced without human supervision from a corpus of syntactically parsed sentences, by leveraging the syntactic relations conveyed through parse trees together with lexical-semantic information. We argue that semantic role induction can be guided by three linguistic principles. The first is the well-known constraint that semantic roles are unique within a particular frame. The second is that the arguments occurring in a specific syntactic position within a specific linking (a mapping between semantic roles and syntactic positions) all bear the same semantic role.
The third principle is that the (asymptotic) distribution over argument heads is the same for two clusters which represent the same semantic role. We consider two approaches to semantic role induction based on two fundamentally different perspectives on the problem. Firstly, we develop feature-based probabilistic latent structure models which capture the statistical relationships that hold between the semantic role and other features of an argument instance. Secondly, we conceptualize role induction as the problem of partitioning a graph whose vertices represent argument instances and whose edges express similarities between these instances. The graph thus represents all the argument instances for a particular predicate occurring in the corpus. The similarities with respect to different features are represented on different edge layers, and accordingly we develop algorithms for partitioning such multi-layer graphs. We empirically validate our models and the principles they are based on and show that our graph partitioning models have several advantages over the feature-based models. In a series of experiments on both English and German the graph partitioning models outperform the feature-based models and yield significantly better scores over a strong baseline which directly identifies semantic roles with syntactic positions. In sum, we demonstrate that relatively high-quality shallow semantic representations can be induced without human supervision, and we foreground a promising direction of future research aimed at overcoming the problem of acquiring large amounts of lexical-semantic knowledge.
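The graph view of role induction can be sketched with a single-layer graph and the simplest possible partitioning: connected components over a similarity threshold. The multi-layer graphs and partitioning algorithms developed in the thesis are considerably richer; the feature tuples and similarity function below are invented for illustration.

```python
from collections import defaultdict

# Hedged sketch of graph-based role induction: argument instances are
# vertices, similarities are edges, and induced roles are the connected
# components above a threshold. Features here are toy (position, head)
# tuples; real models use multi-layer graphs and richer algorithms.

def induce_roles(instances, similarity, threshold=0.5):
    """Cluster argument instances into candidate semantic roles."""
    adj = defaultdict(set)
    for i in range(len(instances)):
        for j in range(i + 1, len(instances)):
            if similarity(instances[i], instances[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    seen, clusters = set(), []
    for i in range(len(instances)):
        if i in seen:
            continue
        stack, component = [i], set()
        while stack:                      # depth-first component search
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adj[v])
        seen |= component
        clusters.append({instances[k] for k in component})
    return clusters

def sim(a, b):
    """Toy similarity: Jaccard overlap of the instances' features."""
    return len(set(a) & set(b)) / len(set(a) | set(b))
```

With instances such as `("subj", "he")`, `("subj", "she")` and `("obj", "ball")`, the two subject instances fall into one cluster and the object into another, which is exactly the syntactic-position baseline that the thesis's models are shown to improve upon.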
