61

Text mining with information extraction

Nahm, Un Yong 28 August 2008 (has links)
Not available.
62

Generating reference to visible objects

Mitchell, Margaret January 2013 (has links)
In this thesis, I examine human-like language generation from a visual input head-on, exploring how people refer to visible objects in the real world. Using previous work and the studies from this thesis, I propose an algorithm that generates human-like reference to visible objects. Rather than introduce a general-purpose REG algorithm, as is tradition, I address the sorts of properties that visual domains in particular make available, and the ways that these must be processed in order to be used in a referring expression algorithm. This method uncovers several issues in generating human-like language that have not been thoroughly studied before. I focus on the properties of color, size, shape, and material, and address the issues of algorithm determinism and how speaker variation may be generated; unique identification of objects and whether this is an appropriate goal for generating human-like reference; atypicality and the role it plays in reference; and multi-featured values for visual attributes. Technical contributions from this thesis include (1) an algorithm for generating size modifiers from features in a visual scene; and (2) a referring expression generation algorithm that generates structures for varied, human-like reference.
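As a rough illustration of how a size modifier might be derived from scene features (a sketch in the spirit of contribution (1) above, not the thesis algorithm), one can compare a target object's bounding-box area against same-type distractors in the scene; the dataclass fields and the 25% margin below are assumptions made for the example.

```python
# Illustrative sketch (not Mitchell's algorithm): choose a size modifier for a
# target object by comparing its area to same-type distractors in the scene.
# The margin value and the SceneObject fields are assumptions for this example.
from dataclasses import dataclass
from statistics import mean

@dataclass
class SceneObject:
    name: str        # object type, e.g. "cup"
    width: float     # bounding-box width in pixels
    height: float    # bounding-box height in pixels

    @property
    def area(self) -> float:
        return self.width * self.height

def size_modifier(target: SceneObject, scene: list[SceneObject],
                  margin: float = 0.25) -> str | None:
    """Return 'big', 'small', or None if size is not distinctive."""
    distractors = [o for o in scene if o is not target and o.name == target.name]
    if not distractors:
        return None                              # nothing to compare against
    reference = mean(o.area for o in distractors)
    if target.area > reference * (1 + margin):
        return "big"
    if target.area < reference * (1 - margin):
        return "small"
    return None

scene = [SceneObject("cup", 40, 40), SceneObject("cup", 90, 95), SceneObject("plate", 120, 20)]
print(size_modifier(scene[1], scene))            # -> "big"
```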
63

A computational model of lexical incongruity in humorous text

Venour, Chris January 2013 (has links)
Many theories of humour claim that incongruity is an essential ingredient of humour. However, this idea is poorly understood and little work has been done in computational humour to quantify it. For example, classifiers which attempt to distinguish jokes from regular texts tend to look for secondary features of humorous texts rather than for incongruity. Similarly, most joke generators attempt to recreate structural patterns found in example jokes but do not deliberately endeavour to create incongruity. As in previous research, this thesis develops classifiers and a joke generator which attempt to automatically recognize and generate a type of humour. However, the systems described here differ from previous programs because they implement a model of a certain type of humorous incongruity. We focus on a type of register humour we call lexical register jokes, in which the tones of individual words are in conflict with each other. Our goal is to create a semantic space that reflects the kind of tone at play in lexical register jokes, so that words that are far apart in the space are not simply different but exhibit the kinds of incongruities seen in lexical jokes. This thesis attempts to develop such a space, and various classifiers are implemented to use it to distinguish lexical register jokes from regular texts. The best of these classifiers achieved high levels of accuracy when distinguishing between a test set of lexical register jokes and four different kinds of regular text. A joke generator which makes use of the semantic space to create original lexical register jokes is also implemented and described in this thesis. In a test of the generator, texts that were generated by the system were evaluated by volunteers, who considered them not as humorous as human-made lexical register jokes but significantly more humorous than a set of control (i.e. non-joke) texts. This was an encouraging result which suggests that the vector space is somewhat successful in discovering lexical differences in tone and in modelling lexical register jokes.
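The core idea of treating tonal incongruity as distance in a semantic space can be sketched as follows. This is illustrative only: the toy tone vectors, the two dimensions, and the threshold are invented for the example and are not the space or classifier developed in the thesis.

```python
# Illustrative sketch (not Venour's model): flag a text as a potential lexical
# register joke when two of its words are unusually far apart in a "tone" space.
import math
from itertools import combinations

TONE = {   # toy 2-d tone vectors: (formality, archaism) -- assumed values
    "forsooth": (0.9, 0.95), "thy": (0.8, 0.9),
    "sneakers": (0.1, 0.05), "awesome": (0.05, 0.1),
    "the": (0.5, 0.4), "are": (0.5, 0.4),
}

def max_tone_gap(words: list[str]) -> float:
    """Largest pairwise distance between the tone vectors of known words."""
    vecs = [TONE[w] for w in words if w in TONE]
    if len(vecs) < 2:
        return 0.0
    return max(math.dist(a, b) for a, b in combinations(vecs, 2))

def looks_like_register_joke(text: str, threshold: float = 0.9) -> bool:
    return max_tone_gap(text.lower().split()) > threshold

print(looks_like_register_joke("forsooth thy sneakers are awesome"))  # True
print(looks_like_register_joke("the sneakers are awesome"))           # False
```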
64

Automatic multi-document summarization for digital libraries

Ou, Shiyan, Khoo, Christopher S.G., Goh, Dion H. January 2006 (has links)
With the rapid growth of the World Wide Web and online information services, more and more information is available and accessible online. Automatic summarization is an indispensable solution to reduce the information overload problem. Multi-document summarization is useful to provide an overview of a topic and allow users to zoom in for more details on aspects of interest. This paper reports three types of multi-document summaries generated for a set of research abstracts, using different summarization approaches: a sentence-based summary generated by a MEAD summarization system that extracts important sentences using various features, another sentence-based summary generated by extracting research objective sentences, and a variable-based summary focusing on research concepts and relationships. A user evaluation was carried out to compare the three types of summaries. The evaluation results indicated that the majority of users (70%) preferred the variable-based summary, while 55% of the users preferred the research objective summary, and only 25% preferred the MEAD summary.
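A hedged sketch of feature-based sentence extraction in the spirit of the first summary type (a simplification, not the MEAD system or the paper's implementation): score each sentence on centroid-word overlap, position, and length, then keep the top-ranked sentences. The feature weights and the centroid cutoff below are assumptions.

```python
# Minimal sketch of MEAD-style feature-based sentence extraction (illustrative
# only): rank sentences by centroid overlap, position, and length features.
from collections import Counter

def summarize(sentences: list[str], k: int = 2) -> list[str]:
    tokenized = [s.lower().split() for s in sentences]
    # "Centroid" words: the most frequent longer words across the input
    counts = Counter(w for toks in tokenized for w in toks if len(w) > 3)
    centroid = {w for w, _ in counts.most_common(10)}

    def score(i: int) -> float:
        toks = tokenized[i]
        overlap = len(centroid.intersection(toks)) / max(len(centroid), 1)
        position = 1.0 - i / len(sentences)      # earlier sentences score higher
        length = min(len(toks) / 20.0, 1.0)      # penalize very short fragments
        return 2.0 * overlap + 1.0 * position + 0.5 * length

    ranked = sorted(range(len(sentences)), key=score, reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]  # keep original order
```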
65

Statistical methods for spoken dialogue management

Thomson, Blaise Roger Marie January 2010 (has links)
No description available.
66

Acquiring syntactic and semantic transformations in question answering

Kaisser, Michael January 2010 (has links)
One and the same fact in natural language can be expressed in many different ways by using different words and/or a different syntax. This phenomenon, commonly called paraphrasing, is the main reason why Natural Language Processing (NLP) is such a challenging task. This becomes especially obvious in Question Answering (QA), where the task is to automatically answer a question posed in natural language, usually in a text collection also consisting of natural language texts. It cannot be assumed that an answer sentence to a question uses the same words as the question and that these words are combined in the same way by using the same syntactic rules. In this thesis we describe methods that can help to address this problem. Firstly, we explore how lexical resources, i.e. FrameNet, PropBank and VerbNet, can be used to recognize a wide range of syntactic realizations that an answer sentence to a given question can have. We find that our methods based on these resources work well for web-based Question Answering. However, we identify two problems: 1) all three resources as yet have significant coverage issues; 2) these resources are not suitable for identifying answer sentences that show some form of indirect evidence. While the first problem hinders performance currently, it is not a theoretical problem that renders the approach unsuitable; it rather shows that more effort has to be made to produce more complete resources. The second problem is more persistent. Many valid answer sentences, especially in small, journalistic corpora, do not provide direct evidence for a question; rather, they strongly suggest an answer without logically implying it. Semantically motivated resources like FrameNet, PropBank and VerbNet cannot easily be employed to recognize such forms of indirect evidence. In order to investigate ways of dealing with indirect evidence, we used Amazon's Mechanical Turk to collect over 8,000 manually identified answer sentences from the AQUAINT corpus for the over 1,900 TREC questions from the 2002 to 2006 QA tracks. The pairs of answer sentences and their corresponding questions form the QASP corpus, which we released to the public in April 2008. In this dissertation, we use the QASP corpus to develop an approach to QA based on matching dependency relations between answer candidates and question constituents in the answer sentences. By acquiring knowledge about syntactic and semantic transformations from dependency relations in the QASP corpus, additional answer candidates can be identified that could not be linked to the question with our first approach.
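The dependency-matching idea can be sketched roughly as follows. This is a simplification for illustration, not the thesis implementation: the triples are supplied by hand rather than by a parser, and the wh-slot convention is an assumption made for the example.

```python
# Hedged sketch of dependency-relation matching between a question and an
# answer candidate. A real system would obtain the triples from a parser.
Triple = tuple[str, str, str]   # (head, relation, dependent)

def match_score(question: list[Triple], candidate: list[Triple],
                answer_slot: str = "WHO") -> tuple[float, set[str]]:
    """Fraction of question triples matched in the candidate; the wh-slot
    matches anything and collects the words filling it (potential answers)."""
    fillers, matched = set(), 0
    for qh, qrel, qd in question:
        for ch, crel, cd in candidate:
            if qrel != crel:
                continue
            head_ok = qh == ch or qh == answer_slot
            dep_ok = qd == cd or qd == answer_slot
            if head_ok and dep_ok:
                matched += 1
                if qh == answer_slot:
                    fillers.add(ch)
                if qd == answer_slot:
                    fillers.add(cd)
                break
    return matched / max(len(question), 1), fillers

# "Who founded Acme?"  vs.  "Smith founded Acme in 1990."
q = [("founded", "nsubj", "WHO"), ("founded", "dobj", "Acme")]
c = [("founded", "nsubj", "Smith"), ("founded", "dobj", "Acme"), ("founded", "prep_in", "1990")]
print(match_score(q, c))   # -> (1.0, {'Smith'})
```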
67

Toward summarization of communicative activities in spoken conversation

Niekrasz, John Joseph January 2012 (has links)
This thesis is an inquiry into the nature and structure of face-to-face conversation, with a special focus on group meetings in the workplace. I argue that conversations are composed of episodes, each of which corresponds to an identifiable communicative activity such as giving instructions or telling a story. These activities are important because they are part of participants’ commonsense understanding of what happens in a conversation. They appear in natural summaries of conversations such as meeting minutes, and participants talk about them within the conversation itself. Episodic communicative activities therefore represent an essential component of practical, commonsense descriptions of conversations. The thesis objective is to provide a deeper understanding of how such activities may be recognized and differentiated from one another, and to develop a computational method for doing so automatically. The experiments are thus intended as initial steps toward future applications that will require analysis of such activities, such as an automatic minute-taker for workplace meetings, a browser for broadcast news archives, or an automatic decision mapper for planning interactions. My main theoretical contribution is to propose a novel analytical framework called participant relational analysis. The proposal argues that communicative activities are principally indicated through participant-relational features, i.e., expressions of relationships between participants and the dialogue. Participant-relational features, such as subjective language, verbal reference to the participants, and the distribution of speech activity amongst the participants, are therefore argued to be a principal means for analyzing the nature and structure of communicative activities. I then apply the proposed framework to two computational problems: automatic discourse segmentation and automatic discourse segment labeling. The first set of experiments test whether participant-relational features can serve as a basis for automatically segmenting conversations into discourse segments, e.g., activity episodes. Results show that they are effective across different levels of segmentation and different corpora, and indeed sometimes more effective than the commonly-used method of using semantic links between content words, i.e., lexical cohesion. They also show that feature performance is highly dependent on segment type, suggesting that human-annotated “topic segments” are in fact a multi-dimensional, heterogeneous collection of topic and activity-oriented units. Analysis of commonly used evaluation measures, performed in conjunction with the segmentation experiments, reveals that they fail to penalize substantially defective results due to inherent biases in the measures. I therefore preface the experiments with a comprehensive analysis of these biases and a proposal for a novel evaluation measure. A reevaluation of state-of-the-art segmentation algorithms using the novel measure produces substantially different results from previous studies. This raises serious questions about the effectiveness of some state-of-the-art algorithms and helps to identify the most appropriate ones to employ in the subsequent experiments. I also preface the experiments with an investigation of participant reference, an important type of participant-relational feature. 
I propose an annotation scheme with novel distinctions for vagueness, discourse function, and addressing-based referent inclusion, each of which is assessed for inter-coder reliability. The produced dataset includes annotations of 11,000 occasions of person-referring. The second set of experiments concerns the use of participant-relational features to automatically identify labels for discourse segments. In contrast to assigning semantic topic labels, such as topical headlines, the proposed algorithm automatically labels segments according to activity type, e.g., presentation, discussion, and evaluation. The method is unsupervised and does not learn from annotated ground truth labels. Rather, it induces the labels through correlations between discourse segment boundaries and the occurrence of bracketing meta-discourse, i.e., occasions when the participants talk explicitly about what has just occurred or what is about to occur. Results show that bracketing meta-discourse is an effective basis for identifying some labels automatically, but that its use is limited if global correlations to segment features are not employed. This thesis addresses important prerequisites to the automatic summarization of conversation. What I provide is a novel activity-oriented perspective on how summarization should be approached, and a novel participant-relational approach to conversational analysis. The experimental results show that analysis of participant-relational features is an effective basis for both segmentation and labeling.
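One participant-relational feature named above, the distribution of speech activity amongst participants, can be turned into a toy segmenter for illustration. The window size, the divergence measure, and the threshold below are assumptions for the example, not the method used in the thesis.

```python
# Illustrative sketch: hypothesize a segment boundary wherever the speaker
# distribution in adjacent windows of turns diverges sharply.
from collections import Counter
import math

def speaker_distribution(turns: list[str]) -> dict[str, float]:
    counts = Counter(turns)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence (base 2) between two speaker distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0) + q.get(k, 0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0) * math.log2(a.get(k, 0) / m[k]) for k in keys if a.get(k, 0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def boundaries(speakers: list[str], window: int = 4, threshold: float = 0.5) -> list[int]:
    cuts = []
    for i in range(window, len(speakers) - window + 1):
        left = speaker_distribution(speakers[i - window:i])
        right = speaker_distribution(speakers[i:i + window])
        if js_divergence(left, right) > threshold:
            cuts.append(i)
    return cuts

# A presentation by A followed by a discussion between B and C
turns = ["A"] * 8 + ["B", "C", "B", "C", "B", "C"]
print(boundaries(turns))   # -> [7, 8, 9]: candidate boundaries cluster at the shift
```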
68

Crawling, Collecting, and Condensing News Comments

Gobaan, Raveendran January 2013 (has links)
Traditionally, public opinion is gauged, and policy decided, by issuing surveys and performing censuses designed to measure what the public thinks about a certain topic. Within the past five years, social networks such as Facebook and Twitter have gained traction for the collection of public opinion about current events. Academic research on Facebook data proves difficult since the platform is generally closed. Twitter, on the other hand, restricts the conversation of its users, making it difficult to extract large-scale concepts from the microblogging infrastructure. News comments provide a rich source of discourse from individuals who are passionate about an issue. Due to the overhead of commenting, the population of commenters is necessarily biased towards individuals who have either strong opinions on a topic or in-depth knowledge of the given issue, and their comments are often a collection of insight derived from reading multiple articles on any given topic. Unfortunately, the commenting systems employed by news companies are not implemented by a single entity, and comments are often stored and generated using AJAX, which causes traditional crawlers to ignore them. To make matters worse, comments are often noisy, containing spam, poor grammar, and excessive typos. Furthermore, due to the anonymity of comment systems, conversations can often be derailed by malicious users or inherent biases in the commenters. In this thesis we discuss the design and creation of a crawler designed to extract comments from domains across the internet. For practical purposes we create a semi-automatic parser generator and describe how our system attempts to employ user feedback to predict which remote procedure calls are used to load comments. By reducing comment systems to remote procedure calls, we simplify the internet into a much simpler space, where we can focus on the data almost independently from its presentation. Thus we are able to quickly create high-fidelity parsers to extract comments from a web page. Once we have our system, we show its usefulness by attempting to extract meaningful opinions from the large collections we collect. Unfortunately, doing so in real time is shown to foil traditional summarization systems, which are designed to handle dozens of well-formed documents. In attempting to solve this problem, we create a new algorithm, KLSum+, that outperforms all its competitors in efficiency while generally scoring well against the ROUGE-SU4 metric. This algorithm factors in background models to boost accuracy, yet performs over 50 times faster than alternatives. Furthermore, using the summaries, we see that the data collected can provide useful insight into public opinion and even provide the key points of discourse.
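The abstract does not spell out KLSum+ itself, but the greedy KL-divergence summarization idea it builds on can be sketched as follows. The smoothing constant, the background-count cutoff used to drop common words, and the length budget are all assumptions made for this example, not details from the thesis.

```python
# Rough sketch of greedy KL-based summarization (illustrative only): repeatedly
# add the sentence that brings the summary's word distribution closest to the
# document's, after filtering words that are frequent in a background corpus.
import math
from collections import Counter

def distribution(tokens: list[str], vocab: set[str], smooth: float = 0.01) -> dict[str, float]:
    counts = Counter(tokens)
    total = len(tokens) + smooth * len(vocab)
    return {w: (counts[w] + smooth) / total for w in vocab}

def kl(p: dict[str, float], q: dict[str, float]) -> float:
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def klsum(sentences: list[str], background: Counter, max_words: int = 25) -> list[str]:
    tokenized = [s.lower().split() for s in sentences]
    doc_tokens = [w for toks in tokenized for w in toks if background[w] < 3]  # drop common words
    vocab = set(doc_tokens)
    p_doc = distribution(doc_tokens, vocab)
    summary, summary_tokens = [], []
    while sum(len(tokenized[i]) for i in summary) < max_words:
        best, best_kl = None, float("inf")
        for i, toks in enumerate(tokenized):
            if i in summary:
                continue
            candidate = summary_tokens + [w for w in toks if w in vocab]
            d = kl(p_doc, distribution(candidate, vocab))
            if d < best_kl:               # greedily minimize KL(doc || summary)
                best, best_kl = i, d
        if best is None:
            break
        summary.append(best)
        summary_tokens += [w for w in tokenized[best] if w in vocab]
    return [sentences[i] for i in sorted(summary)]

background = Counter({"the": 100, "of": 80, "and": 60, "a": 40, "is": 30})
print(klsum(["KLSum picks sentences greedily.",
             "It matches the document word distribution.",
             "Common background words are ignored."], background, max_words=10))
```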
69

Mining Question and Answer Sites for Automatic Comment Generation

Edmund, Wong 28 April 2014 (has links)
Code comments improve software maintainability, programming productivity, and software reliability. To address the comment scarcity issue in many projects and save developers’ time in writing comments, we propose a new, general automatic comment generation approach, which mines comments from a large programming Question and Answer (Q&A) site. Q&A sites allow programmers to post questions and receive solutions, which contain code segments together with their descriptions, referred to as code-description mappings. We develop AutoComment to extract such mappings, and leverage them to generate description comments automatically for similar code segments matched in open source projects. We apply AutoComment to analyze 92,140 Java and Android tagged Q&A posts to extract 132,767 code-description mappings, which help AutoComment generate 102 comments automatically for 23 Java and Android projects. The number of generated comments is still low, but the user study results show that the majority of the participants consider the generated comments accurate, adequate, concise, and useful in helping them understand the code. One of the advantages of mining Q&A sites for automatic comment generation is that human-written comments can provide information that is not explicitly in the code. In the future, we would like to focus on improving both the yield and quality of the generated comments. To improve the yield, we can replace the token-based clone detection tool with one that can detect addition and reordering of lines to increase the number of code matches. To improve the quality, we can apply advanced natural language processing techniques such as semantic role labeling to analyze the semantics of the sentences, or typed dependencies to analyze the grammatical structure of the sentences.
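A hedged sketch of the matching step described above (standing in for AutoComment's token-based clone detection, which the abstract mentions but does not specify): attach a mined description to a project code segment when token-level similarity is high enough. The Jaccard measure, the threshold, and the example mappings are assumptions for illustration.

```python
# Illustrative sketch (not AutoComment itself): given mined (code, description)
# pairs, suggest a description comment for a similar project code segment.
import re

def tokens(code: str) -> set[str]:
    return set(re.findall(r"[A-Za-z_]\w*", code))

def suggest_comment(segment: str, mappings: list[tuple[str, str]],
                    threshold: float = 0.5) -> str | None:
    seg = tokens(segment)
    best_desc, best_sim = None, 0.0
    for code, description in mappings:
        other = tokens(code)
        sim = len(seg & other) / len(seg | other) if seg | other else 0.0
        if sim > best_sim:
            best_desc, best_sim = description, sim
    return best_desc if best_sim >= threshold else None

mined = [("new BufferedReader(new FileReader(path))", "Open a file for buffered reading"),
         ("Collections.sort(list)", "Sort the list in natural order")]
print(suggest_comment("reader = new BufferedReader(new FileReader(filename))", mined))
# -> "Open a file for buffered reading"
```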
70

Generating paraphrases with greater variation using syntactic phrases

Madsen, Rebecca, January 2006 (has links) (PDF)
Thesis (M.S.), Brigham Young University, Dept. of Computer Science, 2006. Includes bibliographical references (p. 50-53).
