Global ETD Search

291	Wide-coverage parsing for Turkish Çakici, Ruket January 2009 (has links) Wide-coverage parsing is an area that attracts much attention in natural language processing research. This is because it is the first step to many other applications in natural language understanding, such as question answering. Supervised learning using human-labelled data is currently the best-performing method. Therefore, there is great demand for annotated data. However, human annotation is very expensive, and the amount of annotated data is always much less than is needed to train well-performing parsers. This is the motivation behind making the best use of the data available. Turkish presents a challenge both because syntactically annotated Turkish data is relatively scarce and because Turkish is highly agglutinative, hence unusually sparse at the whole-word level. The METU-Sabancı Treebank is a dependency treebank of 5620 sentences with surface dependency relations and morphological analyses for words. We show that including even the crudest forms of morphological information extracted from the data boosts the performance of both generative and discriminative parsers, contrary to received opinion concerning English. We induce word-based and morpheme-based CCG grammars from the Turkish dependency treebank. We use these grammars to train a state-of-the-art CCG parser that predicts long-distance dependencies in addition to the ones that other parsers are capable of predicting. We also use the correct CCG categories as simple features in a graph-based dependency parser and show that this improves the parsing results. We show that a morpheme-based CCG lexicon for Turkish is able to solve many problems, such as resolving conflicts of semantic scope, recovering long-range dependencies, and obtaining smoother statistics from the models. CCG handles linguistic phenomena, i.e. local and long-range dependencies, more naturally and effectively than other linguistic theories while potentially supporting semantic interpretation in parallel. Using morphological information and a morpheme-cluster-based lexicon improves the performance both quantitatively and qualitatively for Turkish. We also provide an improved version of the treebank, which will be released by kind permission of METU and Sabancı. 300.285
292	Active learning: an explicit treatment of unreliable parameters Becker, Markus January 2008 (has links) Active learning reduces annotation costs for supervised learning by concentrating labelling efforts on the most informative data. Most active learning methods assume that the model structure is fixed in advance and focus upon improving parameters within that structure. However, this is not appropriate for natural language processing, where the model structure and associated parameters are determined using labelled data. Applying traditional active learning methods to natural language processing can fail to produce the expected reductions in annotation cost. We show that one of the reasons for this problem is that active learning can only select examples which are already covered by the model. In this thesis, we better tailor active learning to the needs of natural language processing as follows. We formulate the Unreliable Parameter Principle: active learning should explicitly and additionally address unreliably trained model parameters in order to optimally reduce classification error. In order to do so, we should target both missing events and infrequent events. We demonstrate the effectiveness of such an approach for a range of natural language processing tasks: prepositional phrase attachment, sequence labelling, and syntactic parsing. For prepositional phrase attachment, the explicit selection of unknown prepositions significantly improves coverage and classification performance for all examined active learning methods. For sequence labelling, we introduce a novel active learning method which explicitly targets unreliable parameters by selecting sentences with many unknown words and a large number of unobserved transition probabilities. For parsing, targeting unparseable sentences significantly improves coverage and f-measure in active learning. 006.3
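The idea of selecting sentences with many unknown words can be illustrated with a minimal sketch. It covers only the unknown-word half of the criterion described in the abstract (unobserved transitions are ignored) and is not the thesis's actual method; the function names `unknown_word_count` and `select_for_annotation`, and the representation of sentences as lists of tokens, are assumptions made for illustration.

```python
def unknown_word_count(sentence, known_vocab):
    """Count tokens in the sentence that never occur in the labelled data."""
    return sum(1 for token in sentence if token not in known_vocab)


def select_for_annotation(pool, labelled, batch_size):
    """Pick the unlabelled sentences with the most unknown words.

    pool and labelled are lists of sentences; each sentence is a list of tokens.
    (Hypothetical interface, assumed for this sketch.)
    """
    # Vocabulary covered by the current labelled set.
    known_vocab = {token for sentence in labelled for token in sentence}
    # Rank the pool by unknown-word count, highest first; ties keep pool order.
    ranked = sorted(
        pool,
        key=lambda sentence: unknown_word_count(sentence, known_vocab),
        reverse=True,
    )
    return ranked[:batch_size]
```

A fuller version would presumably combine this score with a count of tag transitions not yet observed in the labelled data, as the abstract describes.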
293	Logarithmic opinion pools for conditional random fields Smith, Andrew January 2007 (has links) Since their recent introduction, conditional random fields (CRFs) have been successfully applied to a multitude of structured labelling tasks in many different domains. Examples include natural language processing (NLP), bioinformatics and computer vision. Within NLP itself we have seen many different application areas, like named entity recognition, shallow parsing, information extraction from research papers and language modelling. Most of this work has demonstrated the need, directly or indirectly, to employ some form of regularisation when applying CRFs in order to overcome the tendency for these models to overfit. To date a popular method for regularising CRFs has been to fit a Gaussian prior distribution over the model parameters. In this thesis we explore other methods of CRF regularisation, investigating their properties and comparing their effectiveness. We apply our ideas to sequence labelling problems in NLP, specifically part-of-speech tagging and named entity recognition. We start with an analysis of conventional approaches to CRF regularisation, and investigate possible extensions to such approaches. In particular, we consider choices of prior distribution other than the Gaussian, including the Laplacian and Hyperbolic; we look at the effect of regularising different features separately, to differing degrees, and explore how we may define an appropriate level of regularisation for each feature; we investigate the effect of allowing the mean of a prior distribution to take on non-zero values; and we look at the impact of relaxing the feature expectation constraints satisfied by a standard CRF, leading to a modified CRF model we call the inequality CRF. Our analysis leads to the general conclusion that although there is some capacity for improvement of conventional regularisation through modification and extension, this is quite limited. 
Conventional regularisation with a prior is in general hampered by the need to fit a hyperparameter or set of hyperparameters, which can be an expensive process. We then approach the CRF overfitting problem from a different perspective. Specifically, we introduce a form of CRF ensemble called a logarithmic opinion pool (LOP), where CRF distributions are combined under a weighted product. We show how a LOP has theoretical properties which provide a framework for designing new overfitting reduction schemes in terms of diverse models, and demonstrate how such diverse models may be constructed in a number of different ways. Specifically, we show that by constructing CRF models from manually crafted partitions of a feature set and combining them with equal weight under a LOP, we may obtain an ensemble that significantly outperforms a standard CRF trained on the entire feature set, and is competitive in performance with a standard CRF regularised with a Gaussian prior. The great advantage of the LOP approach is that, unlike the Gaussian prior method, it does not require us to search a hyperparameter space. Having demonstrated the success of LOPs in the simple case, we then move on to consider more complex uses of the framework. In particular, we investigate whether it is possible to further improve the LOP ensemble by allowing parameters in different models to interact during training in such a way that diversity between the models is encouraged. Lastly, we show how the LOP approach may be used as a remedy for a problem from which standard CRFs can sometimes suffer. In certain situations, negative effects may be introduced to a CRF by the inclusion of highly discriminative features. An example of this is provided by gazetteer features, which encode a word's presence in a gazetteer. We show how LOPs may be used to reduce these negative effects, and so provide some insight into how gazetteer features may be more effectively handled in CRFs, and log-linear models in general. 005.3
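For reference, the weighted product of CRF distributions described in this abstract is usually written as below. This is the standard general form of a logarithmic opinion pool rather than a formula quoted from the thesis; the symbols p_α (the α-th CRF), w_α (its weight) and Z_LOP (the normaliser) are notation assumed here.

```latex
p_{\mathrm{LOP}}(y \mid x)
  = \frac{1}{Z_{\mathrm{LOP}}(x)} \prod_{\alpha} p_{\alpha}(y \mid x)^{w_{\alpha}},
\qquad
Z_{\mathrm{LOP}}(x) = \sum_{y'} \prod_{\alpha} p_{\alpha}(y' \mid x)^{w_{\alpha}},
\qquad
w_{\alpha} \ge 0,\quad \sum_{\alpha} w_{\alpha} = 1
```

Under this form, the equal-weight case mentioned in the abstract corresponds to w_α = 1/M for M models, i.e. a renormalised geometric mean of the individual CRF distributions.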
294	Automation of summarization evaluation methods and their application to the summarization process Nahnsen, Thade January 2011 (has links) Summarization is the process of creating a more compact textual representation of a document or a collection of documents. In view of the vast increase in electronically available information sources in the last decade, filters such as automatically generated summaries are becoming ever more important to facilitate the efficient acquisition and use of required information. Different methods using natural language processing (NLP) techniques are being used to this end. One of the shallowest approaches is the clustering of available documents and the representation of the resulting clusters by one of the documents; an example of this approach is the Google News website. It is also possible to augment the clustering of documents with a summarization process, which would result in a more balanced representation of the information in the cluster, NewsBlaster being an example. However, while some systems are already available on the web, summarization is still considered a difficult problem in the NLP community. One of the major problems hampering the development of proficient summarization systems is the evaluation of the (true) quality of system-generated summaries. This is exemplified by the fact that the current state-of-the-art evaluation method to assess the information content of summaries, the Pyramid evaluation scheme, is a manual procedure. In this light, this thesis has three main objectives. 1. The development of a fully automated evaluation method. The proposed scheme is rooted in the ideas underlying the Pyramid evaluation scheme and makes use of deep syntactic information and lexical semantics. Its performance improves notably on previous automated evaluation methods. 2. The development of an automatic summarization system which draws on the conceptual idea of the Pyramid evaluation scheme and the techniques developed for the proposed evaluation system. The approach features the algorithm for determining the pyramid and bases importance on the number of occurrences of the variable-sized contributors of the pyramid as opposed to word-based methods exploited elsewhere. 3. The development of a text coherence component that can be used for obtaining the best ordering of the sentences in a summary. 621.382
295	Generating affective natural language for parents of neonatal infants Mahamood, Saad Ali January 2010 (has links) The thesis presented here describes original research in the field of Natural Language Generation (NLG). NLG is the subfield of artificial intelligence that is concerned with the automatic production of documents from underlying data. This thesis focuses in particular on developing novel methods for generating text that take the recipient’s level of stress into account when adapting the resulting textual output. Considering the recipient’s level of stress was particularly salient because of the domain in which this research was conducted: providing information for parents of pre-term infants in neonatal intensive care (NICU), a highly technical and stressful environment for parents, where emotional sensitivity must be shown in the nature of the information presented. We investigated the emotional and informational needs of these parents through an extensive literature review and two separate research studies with former and current NICU parents. The NLG system built for this research is called BabyTalk Family (BT-Family). It produces, for parents, a textual summary of the medical events that have occurred for a baby in NICU in the last twenty-four hours. The novelty of this system is that it is capable of estimating the recipient’s level of stress and, by using several affective NLG strategies, is able to tailor its output for a stressed audience, unlike traditional NLG systems, whose output remains unchanged regardless of the emotional state of the recipient. The key innovation in this system was the integration of several affective strategies in the Document Planner for tailoring textual output for stressed recipients. BT-Family’s output was evaluated with thirteen parents who had previously had a baby in neonatal care. We developed an evaluation methodology that involved a direct comparison between stressed and unstressed text for the same given medical scenario on variables such as preference, understandability, helpfulness, and emotional appropriateness. The results showed that the parents overwhelmingly preferred the stressed text for all of the variables measured. 006.3
296	Using natural language generation to provide access to semantic metadata Hielkema, Feikje January 2010 (has links) In recent years, the use of metadata to describe and share resources has grown in importance, especially in the context of the Semantic Web. However, access to metadata is difficult for users without experience with description logic or formal languages, and currently this description applies to most web users. There is a strong need for interfaces that provide easy access to semantic metadata, enabling novice users to browse, query and create it easily. This thesis describes a natural language generation interface to semantic metadata called LIBER (Language Interface for Browsing and Editing Rdf), driven by domain ontologies which are integrated with domain-specific linguistic information. LIBER uses the linguistic information to generate fluent descriptions and search terms through syntactic aggregation. The tool contains three modules to support metadata creation, querying and browsing, which implement the WYSIWYM (What You See Is What You Meant) natural language generation approach. Users can add and remove information by editing system-generated feedback texts. Two studies have been conducted to evaluate LIBER’s usability, and compare it to a different Semantic Web interface. The studies showed that subjects with no prior experience of the Semantic Web could use LIBER effectively to create, search and browse metadata, and were a useful source of ideas on how to improve LIBER’s usability. However, the results of these studies were less positive than we had hoped, and users actually preferred the other Semantic Web tool. This has raised questions about which user audience LIBER should aim for, and the extent to which the underlying ontologies influence the usability of the interface. 
LIBER’s portability to other domains is supported by a tool with which ontology developers without a background in linguistics can prepare their ontologies for use in LIBER by adding the necessary linguistic information. 020
297	Implication textuelle et réécriture / Textual Entailment and rewriting Bedaride, Paul 18 October 2010 (has links) This thesis presents several contributions on the theme of recognising textual entailment (RTE). RTE is the human capacity, given two texts, to determine whether the meaning of the second text can be deduced from that of the first. One of the contributions made to the field is a hybrid RTE system that takes the analyses of an existing stochastic parser, labels them with semantic roles, transforms the resulting structures into logical formulas using rewrite rules, and finally tests entailment using proof tools. The other contribution of this thesis is the generation of finely annotated test suites with a uniform distribution of phenomena, coupled with a new method for evaluating systems that uses error mining techniques developed by the parsing community, allowing better identification of the systems' limitations. For this, we create a set of semantic formulas, then generate the corresponding annotated syntactic realisations using an existing generation system. Next, we test whether or not there is entailment between each possible pair of syntactic realisations. Finally, using an algorithm that we developed, we select from this set of problems a subset of a given size that satisfies a certain number of constraints. Natural Language Processing Rewriting Representation Reasoning 410.285 006.35
298	Natural language processing of online propaganda as a means of passively monitoring an adversarial ideology Holm, Raven R. 03 1900 (has links) Approved for public release; distribution is unlimited / Reissued 30 May 2017 with Second Reader’s non-NPS affiliation added to title page. / Online propaganda embodies a potent new form of warfare; one that extends the strategic reach of our adversaries and overwhelms analysts. Foreign organizations have effectively leveraged an online presence to influence elections and distance-recruit. The Islamic State has also shown proficiency in outsourcing violence, proving that propaganda can enable an organization to wage physical war at very little cost and without the resources traditionally required. To augment new counter foreign propaganda initiatives, this thesis presents a pipeline for defining, detecting and monitoring ideology in text. A corpus of 3,049 modern online texts was assembled and two classifiers were created: one for detecting authorship and another for detecting ideology. The classifiers demonstrated 92.70% recall and 95.84% precision in detecting authorship, and detected ideological content with 76.53% recall and 95.61% precision. Both classifiers were combined to simulate how an ideology can be detected and how its composition could be passively monitored across time. Implementation of such a system could conserve manpower in the intelligence community and add a new dimension to analysis. Although this pipeline makes presumptions about the quality and integrity of input, it is a novel contribution to the fields of Natural Language Processing and Information Warfare. / Lieutenant, United States Coast Guard data mining natural language processing machine learning algorithm design information warfare propaganda
299	Disentangling Discourse: Networks, Entropy, and Social Movements Gallagher, Ryan 01 January 2017 (has links) Our daily online conversations with friends, family, colleagues, and strangers weave an intricate network of interactions. From these networked discussions emerge themes and topics that transcend the scope of any individual conversation. In turn, these themes direct the discourse of the network and continue to ebb and flow as the interactions between individuals shape the topics themselves. This rich loop between interpersonal conversations and overarching topics is a wonderful example of a complex system: the themes of a discussion are more than just the sum of its parts. Some of the most socially relevant topics emerging from these online conversations are those pertaining to racial justice issues. Since the shooting of Black teenager Michael Brown by White police officer Darren Wilson in Ferguson, Missouri, the protest hashtag #BlackLivesMatter has amplified critiques of extrajudicial shootings of Black Americans. In response to #BlackLivesMatter, other online users have adopted #AllLivesMatter, a counter-protest hashtag whose content argues that equal attention should be given to all lives regardless of race. Together these contentious hashtags each shape clashing narratives that echo previous civil rights battles and illustrate ongoing racial tension between police officers and Black Americans. These narratives have taken place on a massive scale with millions of online posts and articles debating the sentiments of "black lives matter" and "all lives matter." Since no one person could possibly read everything written in this debate, comprehensively understanding these conversations and their underlying networks requires us to leverage tools from data science, machine learning, and natural language processing. 
In Chapter 2, we utilize methodology from network science to measure to what extent #BlackLivesMatter and #AllLivesMatter are "slacktivist" movements, and the effect this has on the diversity of topics discussed within these hashtags. In Chapter 3, we precisely quantify the ways in which the discourse of #BlackLivesMatter and #AllLivesMatter diverge through the application of information-theoretic techniques, validating our results at the topic level from Chapter 2. These entropy-based approaches provide the foundation for powerful automated analysis of textual data, and we explore more generally how they can be used to construct a human-in-the-loop topic model in Chapter 4. Our work demonstrates that there is rich potential for weaving together social science domain knowledge with computational tools in the study of language, networks, and social movements. Black Lives Matter Information Theory Natural Language Processing Polarization Social Networks Topic Model Computer Sciences Mathematics
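As an illustration of how an entropy-based comparison of two bodies of text can work, the sketch below computes the Jensen-Shannon divergence between the word distributions of two sets of posts. The abstract does not name the specific information-theoretic measure used, so this choice is an assumption made for illustration, as are the function names and the naive whitespace tokenisation.

```python
import math
from collections import Counter


def word_distribution(texts):
    """Relative word frequencies over a list of strings (naive whitespace tokens)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def entropy(dist):
    """Shannon entropy in bits of a {word: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = H(m) - (H(p) + H(q)) / 2, where m is the equal-weight mixture."""
    vocab = set(p) | set(q)
    mixture = {w: 0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0) for w in vocab}
    return entropy(mixture) - 0.5 * entropy(p) - 0.5 * entropy(q)
```

With base-2 logarithms the value lies between 0 (identical word distributions) and 1 (no shared vocabulary), which makes it convenient for comparing how far two hashtags' discourse has diverged.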
300	Exploration des réseaux de neurones à base d'autoencodeur dans le cadre de la modélisation des données textuelles / Exploring autoencoder-based neural networks for modelling textual data Lauly, Stanislas January 2016 (has links) Since the mid-2000s, a new approach in machine learning, the training of deep networks (deep learning), has been gaining popularity. This approach has proved effective on a variety of problems, improving on the results obtained by other techniques that were then considered the state of the art. This is the case for object recognition as well as for speech recognition. The use of deep networks in Natural Language Processing (NLP; TALN in French) is therefore a logical next step. This thesis explores different neural network architectures for modelling written text, concentrating on models that are simple, powerful and fast to train. Deep learning Deep networks Neural network TALN Natural language processing NLP
