21

Coping with Missing and Incomplete Information in Natural Language Processing with Applications in Sentiment Analysis and Entity Matching

Schneider, Andrew Thomas January 2020 (has links)
Much work in Natural Language Processing (NLP) is broadly concerned with extracting useful information from unstructured text passages. In recent years there has been an increased focus on informal writing as is found in online venues such as Twitter and Yelp. Processing this text introduces additional difficulties for NLP techniques; for example, many of the terms may be unknown due to rapidly changing vocabulary usage. A straightforward NLP approach will have no way of using the information these terms provide. In such information-poor environments of missing and incomplete information, it is necessary to develop novel methods for leveraging the information we have explicitly available to unlock key nuggets of implicitly available information. In this work we explore several such methods and how they can collectively help to improve NLP techniques in general, with a focus on Sentiment Analysis (SA) and Entity Matching (EM). The problem of SA is that of identifying the polarity (positive, negative, neutral) of a speaker or author towards the topic of a given piece of text. SA can focus on various levels of granularity: finding the overall sentiment of a long text document, finding the sentiment of individual sentences or phrases, or finding the sentiment directed toward specific entities and their aspects (attributes). The problem of EM, also known as Record Linkage, is that of determining which records from independent and uncooperative data sources refer to the same real-world entities. Traditional approaches to EM have used the record representation of entities to accomplish this task. With the emergence of social media, entities on the Web are now accompanied by user-generated content, which allows us to apply NLP solutions to the problem. We investigate specifically the following aspects of NLP for missing and incomplete information: (1) Inferring the sentiment polarity (i.e., the positive, negative, and neutral composition) of new terms. (2) Inferring a representation of new vocabulary terms that allows us to compare these terms with known terms with regard to their meaning and sentiment orientation. This idea can be further expanded to derive the representation of larger chunks of text, such as multi-word phrases. (3) Identifying key attributes of highly salient sentiment-bearing passages that allow us to identify such sections of a document, even when the complete text is not analyzable. (4) Using text-based methods to match corresponding entities (e.g., restaurants or hotels) from independent data sources that may miss key identifying attributes such as names or addresses. / Computer and Information Science
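As an illustration of points (1) and (2) above, the sketch below infers the polarity of an out-of-vocabulary term from its nearest labelled neighbours in an embedding space. It is a minimal example under assumed inputs (the `embeddings` and `seed_lexicon` dictionaries are hypothetical placeholders), not the method developed in the thesis.

```python
# Illustrative sketch only: infer the sentiment polarity of an unknown term
# by averaging the polarities of its nearest known neighbours in an embedding
# space. `embeddings` and `seed_lexicon` are hypothetical placeholders.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def infer_polarity(term, embeddings, seed_lexicon, k=5):
    """Estimate a polarity in [-1, 1] for `term` from its k nearest labelled neighbours.

    embeddings   : dict mapping word -> np.ndarray vector
    seed_lexicon : dict mapping word -> polarity score in [-1, 1]
    """
    if term not in embeddings:
        return 0.0  # no information at all: fall back to neutral
    vec = embeddings[term]
    # Rank labelled words by similarity to the unknown term.
    scored = [(cosine(vec, embeddings[w]), s)
              for w, s in seed_lexicon.items() if w in embeddings]
    scored.sort(reverse=True)
    top = scored[:k]
    if not top:
        return 0.0
    # Similarity-weighted average of the neighbours' polarities.
    weights = np.array([max(sim, 0.0) for sim, _ in top])
    scores = np.array([s for _, s in top])
    if weights.sum() == 0:
        return float(scores.mean())
    return float(np.dot(weights, scores) / weights.sum())
```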
22

Generative Chatbot Framework for Cybergrooming Prevention

Wang, Pei 20 December 2021 (has links)
Cybergrooming refers to the crime of establishing close personal relationships with potential victims, commonly teens, for the purpose of sexual exploitation or abuse via online social media platforms. Cybergrooming has been recognized as a serious social problem. However, there have been insufficient programs to provide proactive prevention to protect youth users from cybergrooming. In this thesis, we present a generative chatbot framework, called SERI (Stop cybERgroomIng), that can generate simulated conversations between a perpetrator chatbot and a potential victim chatbot. To realize authentic conversations in the context of cybergrooming, we adopt deep reinforcement learning (DRL)-based dialogue generation to simulate the exchanges between a perpetrator and a potential victim. The design and development of SERI are motivated by the goal of providing a safe and authentic chatting environment that enhances youth users' precautionary awareness of and sensitivity to cybergrooming, while unnecessary ethical issues (e.g., the potential misuse of SERI) are removed or minimized. We developed SERI as a preliminary platform in which the perpetrator chatbot can be deployed in social media environments to interact with human users (i.e., youth) and observe how youth users respond when strangers or acquaintances ask them for private or sensitive information. We evaluated the quality of conversations generated by SERI using open-source, referenced, and unreferenced metrics as well as human evaluation. The evaluation results show that SERI can generate authentic conversations between two chatbots, compared to the original conversations from the datasets used, in terms of perplexity and MaUde scores. / Master of Science / Cybergrooming refers to the crime of building close personal relationships with potential victims, especially youth users such as children and teenagers, for the purpose of sexual exploitation or abuse via online social media platforms. Cybergrooming has been recognized as a serious social problem. However, there have been insufficient methods to provide proactive protection for youth users from cybergrooming. In this thesis, we present a generative chatbot framework, called SERI (Stop cybERgroomIng), that can generate simulated authentic conversations between a perpetrator chatbot and a potential victim chatbot by applying advanced natural language generation models. The design and development of SERI are motivated by the goal of ensuring a safe and authentic environment that strengthens youth users' precautionary awareness of and sensitivity to cybergrooming, while unnecessary ethical issues (e.g., the potential misuse of SERI) are removed or minimized. We used different metrics and methods to evaluate the quality of conversations generated by SERI. The evaluation results show that SERI can generate authentic conversations between two chatbots compared to the original conversations from the datasets used.
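The abstract mentions evaluating generated conversations with perplexity, among other metrics. As background, the sketch below shows how corpus-level perplexity is typically computed from per-token log-probabilities; the `token_log_probs` input is a hypothetical stand-in for scores produced by some language model, not output from SERI.

```python
# Illustrative sketch: corpus-level perplexity from per-token log-probabilities.
# `token_log_probs` is a hypothetical list of lists, one inner list per response,
# holding natural-log probabilities assigned by some language model.
import math

def perplexity(token_log_probs):
    total_log_prob = 0.0
    total_tokens = 0
    for response in token_log_probs:
        total_log_prob += sum(response)
        total_tokens += len(response)
    if total_tokens == 0:
        return float("inf")
    # Perplexity = exp(-average log-probability per token); lower is better.
    return math.exp(-total_log_prob / total_tokens)

# Example: two short responses scored by a hypothetical model.
print(perplexity([[-2.1, -0.7, -1.3], [-0.9, -1.8]]))
```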
23

Joint models for concept-to-text generation

Konstas, Ioannis January 2014 (has links)
Much of the data found on the world wide web is in numeric, tabular, or other non-textual format (e.g., weather forecast tables, stock market charts, live sensor feeds), and thus inaccessible to non-experts or laypersons. However, most conventional search engines and natural language processing tools (e.g., summarisers) can only handle textual input. As a result, data in non-textual form remains largely inaccessible. Concept-to-text generation refers to the task of automatically producing textual output from non-linguistic input, and holds promise for rendering non-linguistic data widely accessible. Several successful generation systems have been produced in the past twenty years. They mostly rely on human-crafted rules or expert-driven grammars, implement a pipeline architecture, and usually operate in a single domain. In this thesis, we present several novel statistical models that take as input a set of database records and generate a description of them in natural language text. Our key idea is to combine the processes of structuring a document (document planning), deciding what to say (content selection) and choosing the specific words and syntactic constructs specifying how to say it (lexicalisation and surface realisation) in a uniform, joint manner. Rather than breaking up the generation process into a sequence of local decisions, we define a probabilistic context-free grammar that globally describes the inherent structure of the input (a corpus of database records and text describing some of them). This joint representation allows the individual processes (i.e., document planning, content selection, and surface realisation) to communicate and influence each other naturally. We recast generation as the task of finding the best derivation tree for a set of input database records and our grammar, and describe several algorithms for decoding in this framework that allow us to intersect the grammar with additional information capturing fluency and syntactic well-formedness constraints. We implement our generators using the hypergraph framework. In contrast to traditional systems, we learn all the necessary document, structural and linguistic knowledge from unannotated data. Additionally, we explore a discriminative reranking approach on the hypergraph representation of our model, by including more refined content selection features. Central to our approach is the idea of porting our models to various domains; we experimented on four widely different domains, namely sportscasting, weather forecast generation, booking flights, and troubleshooting guides. The performance of our systems is competitive with, and often superior to, state-of-the-art systems that use domain-specific constraints, explicit feature engineering or labelled data.
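As a toy illustration of "finding the best derivation tree" under a probabilistic grammar, the sketch below runs a Viterbi-style CKY search over a tiny, invented binarized PCFG. It is not the record-to-text grammar or hypergraph decoder described in the thesis; it only shows the best-derivation idea in miniature.

```python
# Toy illustration of choosing the highest-probability derivation under a PCFG
# with Viterbi CKY. The grammar and sentence below are invented examples.
import math
from collections import defaultdict

# Binary rules: (parent, left_child, right_child) -> probability
binary_rules = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 1.0,
}
# Lexical rules: (parent, word) -> probability
lexical_rules = {
    ("NP", "temperatures"): 0.5,
    ("NP", "ten"): 0.5,
    ("V", "reach"): 1.0,
}

def viterbi_cky(words):
    n = len(words)
    # chart[(i, j)][label] = (log-probability, backpointer) for span words[i:j]
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for (label, word), p in lexical_rules.items():
            if word == w:
                chart[(i, i + 1)][label] = (math.log(p), w)
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (parent, left, right), p in binary_rules.items():
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        score = (math.log(p) + chart[(i, k)][left][0]
                                 + chart[(k, j)][right][0])
                        if parent not in chart[(i, j)] or score > chart[(i, j)][parent][0]:
                            chart[(i, j)][parent] = (score, (left, k, right))
    return chart[(0, n)].get("S")  # best log-prob and split for a full S, if any

print(viterbi_cky(["temperatures", "reach", "ten"]))
```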
24

Latent variable models of distributional lexical semantics

Reisinger, Joseph Simon 24 October 2014 (has links)
Computer Sciences / In order to respond to increasing demand for natural language interfaces---and provide meaningful insight into user query intent---fast, scalable lexical semantic models with flexible representations are needed. Human concept organization is a rich phenomenon that has yet to be accounted for by a single coherent psychological framework: Concept generalization is captured by a mixture of prototype and exemplar models, and local taxonomic information is available through multiple overlapping organizational systems. Previous work in computational linguistics on extracting lexical semantic information from unannotated corpora does not provide adequate representational flexibility and hence fails to capture the full extent of human conceptual knowledge. In this thesis I outline a family of probabilistic models capable of capturing important aspects of the rich organizational structure found in human language that can predict contextual variation, selectional preference and feature-saliency norms to a much higher degree of accuracy than previous approaches. These models account for the cross-cutting structure of concept organization---i.e., selective attention, or the notion that humans make use of different categorization systems for different kinds of generalization tasks---and can be applied to Web-scale corpora. Using these models, natural language systems will be able to infer more comprehensive semantic relations, which in turn may yield improved systems for question answering, text classification, machine translation, and information retrieval.
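One concrete way to give a word more than one organizational "view" is to cluster its occurrence contexts and keep one prototype vector per cluster rather than a single averaged vector. The sketch below is a simplified illustration of that idea on an invented toy corpus using k-means; it is not the latent variable models proposed in the thesis.

```python
# Simplified sketch of multi-prototype word representations: cluster the
# contexts in which a target word occurs and keep one vector per cluster.
# The toy corpus, window size, and cluster count are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

corpus = [
    "the bank raised interest rates",
    "she deposited cash at the bank",
    "we walked along the river bank",
    "the bank of the river flooded",
]
target = "bank"
vocab = sorted({w for sent in corpus for w in sent.split()} - {target})
index = {w: i for i, w in enumerate(vocab)}

def context_vector(tokens, position, window=3):
    # Bag-of-words vector over the words surrounding this occurrence of the target.
    vec = np.zeros(len(vocab))
    lo, hi = max(0, position - window), position + window + 1
    for w in tokens[lo:position] + tokens[position + 1:hi]:
        if w in index:
            vec[index[w]] += 1
    return vec

contexts = []
for sent in corpus:
    tokens = sent.split()
    for pos, w in enumerate(tokens):
        if w == target:
            contexts.append(context_vector(tokens, pos))

X = np.vstack(contexts)
# Two prototypes for "bank": roughly a financial sense and a river sense.
prototypes = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).cluster_centers_
print(prototypes.shape)  # (2, |vocab|)
```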
25

Term recognition using combined knowledge sources

Maynard, Diana Gabrielle January 2000 (has links)
No description available.
26

Development of a hybrid symbolic/connectionist system for word sense disambiguation

Wu, Xinyu January 1995 (has links)
No description available.
27

Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing

Coppola, Gregory Francis January 2015 (has links)
The development of distributed training strategies for statistical prediction functions is important for applications of machine learning, generally, and the development of distributed structured prediction training strategies is important for natural language processing (NLP), in particular. With ever-growing data sets this is, first, because it is easier to increase computational capacity by adding more processor nodes than it is to increase the power of individual processor nodes, and, second, because data sets are often collected and stored in different locations. Iterative parameter mixing (IPM) is a distributed training strategy in which each node in a network of processors optimizes a regularized average loss objective on its own subset of the total available training data, making stochastic (per-example) updates to its own estimate of the optimal weight vector, and communicating with the other nodes by periodically averaging estimates of the optimal vector across the network. This algorithm has been contrasted with a close relative, called here the single-mixture optimization algorithm, in which each node stochastically optimizes an average loss objective on its own subset of the training data, operating in isolation until convergence, at which point the average of the independently created estimates is returned. Recent empirical results have suggested that this IPM strategy produces better models than the single-mixture algorithm, and the results of this thesis add to this picture. The contributions of this thesis are as follows. The first contribution is to produce and analyze an algorithm for decentralized stochastic optimization of regularized average loss objective functions. This algorithm, which we call the distributed regularized dual averaging algorithm, improves over prior work on distributed dual averaging by providing a simpler algorithm (used in the rest of the thesis), better convergence bounds for the case of regularized average loss functions, and certain technical results that are used in the sequel. The central contribution of this thesis is to give an optimization-theoretic justification for the IPM algorithm. While past work has focused primarily on its empirical test-time performance, we give a novel perspective on this algorithm by showing that, in the context of the distributed dual averaging algorithm, IPM constitutes a convergent optimization algorithm for arbitrary convex functions, while the single-mixture algorithm does not. Experiments indeed confirm that the superior test-time performance of models trained using IPM, compared to single-mixture, correlates with better optimization of the objective value on the training set, a fact not previously reported. Furthermore, our analysis of general non-smooth functions justifies the use of distributed large-margin (support vector machine [SVM]) training of structured predictors, which we show yields better test performance than the IPM perceptron algorithm, the only version of IPM to have previously been given a theoretical justification. Our results confirm that IPM training can reach the same level of test performance as a sequentially trained model and can reach better accuracies when one has a fixed budget of training time. Finally, we use the reduction in training time that distributed training allows to experiment with adding higher-order dependency features to a state-of-the-art phrase-structure parsing model.
We demonstrate that adding these features improves out-of-domain parsing results of even the strongest phrase-structure parsing models, yielding a new state-of-the-art for the popular train-test pairs considered. In addition, we show that a feature-bagging strategy, in which component models are trained separately and later combined, is sometimes necessary to avoid feature under-training and get the best performance out of large feature sets.
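The iterative parameter mixing loop described above has a simple skeleton: shard the data, let each node make per-example updates on its shard, and periodically replace every local weight vector with the network-wide average. The sketch below simulates that loop sequentially on a toy least-squares problem; the gradient function and data are invented, and it is not the regularized dual averaging algorithm analysed in the thesis.

```python
# Sequential simulation of iterative parameter mixing (IPM): each simulated
# node runs stochastic updates on its own shard of the data, and after every
# epoch the local weight vectors are averaged across nodes. The gradient
# function and data are hypothetical placeholders.
import numpy as np

def iterative_parameter_mixing(shards, grad, dim, epochs=10, lr=0.05):
    """shards : list of lists of training examples (one list per node)
    grad   : callable(weights, example) -> gradient vector
    """
    weights = [np.zeros(dim) for _ in shards]          # one estimate per node
    for _ in range(epochs):
        for node, shard in enumerate(shards):
            w = weights[node]
            for example in shard:
                w = w - lr * grad(w, example)          # per-example update
            weights[node] = w
        mixed = np.mean(weights, axis=0)               # mixing step
        weights = [mixed.copy() for _ in shards]       # broadcast the average
    return weights[0]

# Toy example: distributed least-squares on synthetic, noiseless data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
data = [(x, float(x @ true_w)) for x in rng.normal(size=(100, 2))]
shards = [data[i::4] for i in range(4)]                # four simulated nodes
grad = lambda w, ex: 2 * (w @ ex[0] - ex[1]) * ex[0]   # squared-error gradient
print(iterative_parameter_mixing(shards, grad, dim=2))  # approaches [2, -1]
```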
28

Evaluating distributional models of compositional semantics

Batchkarov, Miroslav Manov January 2016 (has links)
Distributional models (DMs) are a family of unsupervised algorithms that represent the meaning of words as vectors. They have been shown to capture interesting aspects of semantics. Recent work has sought to compose word vectors in order to model phrases and sentences. The most commonly used measure of a compositional DM's performance to date has been the degree to which it agrees with human-provided phrase similarity scores. The contributions of this thesis are three-fold. First, I argue that existing intrinsic evaluations are unreliable as they make use of small and subjective gold-standard data sets and assume a notion of similarity that is independent of a particular application. Therefore, they do not necessarily measure how well a model performs in practice. I study four commonly used intrinsic datasets and demonstrate that all of them exhibit undesirable properties. Second, I propose a novel framework within which to compare word- or phrase-level DMs in terms of their ability to support document classification. My approach couples a classifier to a DM and provides a setting where classification performance is sensitive to the quality of the DM. Third, I present an empirical evaluation of several methods for building word representations and composing them within my framework. I find that the determining factor in building word representations is data quality rather than quantity; in some cases only a small amount of unlabelled data is required to reach peak performance. Neural algorithms for building single-word representations perform better than counting-based ones regardless of what composition is used, but simple composition algorithms can outperform more sophisticated competitors. Finally, I introduce a new algorithm for improving the quality of distributional thesauri using information from repeated runs of the same non-deterministic algorithm.
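The evaluation framework described here couples a distributional model to a classifier so that classification accuracy reflects vector quality. A minimal, hypothetical version of such a coupling is sketched below: documents are represented as the average of their word vectors and labelled by a nearest-centroid rule. The thesis's actual framework and classifiers differ; higher accuracy under this framing would simply indicate more useful word vectors.

```python
# Minimal sketch of coupling a distributional model to a document classifier:
# each document becomes the average of its word vectors, and a nearest-centroid
# rule assigns labels. Word vectors and documents here are invented placeholders
# standing in for the output of a real distributional model.
import numpy as np

def doc_vector(tokens, word_vectors, dim):
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def evaluate_dm(word_vectors, train_docs, test_docs, dim):
    """train_docs / test_docs: lists of (token_list, label) pairs."""
    # Build one centroid per label from the training documents.
    centroids = {}
    for label in {lab for _, lab in train_docs}:
        members = [doc_vector(toks, word_vectors, dim)
                   for toks, lab in train_docs if lab == label]
        centroids[label] = np.mean(members, axis=0)
    # Accuracy of nearest-centroid classification on the test documents.
    correct = 0
    for toks, lab in test_docs:
        v = doc_vector(toks, word_vectors, dim)
        pred = min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))
        correct += (pred == lab)
    return correct / len(test_docs)
```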
29

Graph-based approaches to word sense induction

Hope, David Richard January 2015 (has links)
This thesis is a study of Word Sense Induction (WSI), the Natural Language Processing (NLP) task of automatically discovering word meanings from text. WSI is an open problem in NLP whose solution would be of considerable benefit to many other NLP tasks. It has, however, been studied by relatively few NLP researchers and often in set ways. Scope therefore exists to apply novel methods to the problem, methods that may improve upon those previously applied. This thesis applies a graph-theoretic approach to WSI. In this approach, word senses are identified by finding particular types of subgraphs in word co-occurrence graphs. A number of original methods for constructing, analysing, and partitioning graphs are introduced, with these methods then incorporated into graph-based WSI systems. These systems are then shown, in a variety of evaluation scenarios, to return results that are comparable to those of the current best performing WSI systems. The main contributions of the thesis are a novel parameter-free soft clustering algorithm that runs in time linear in the number of edges in the input graph, and novel generalisations of the clustering coefficient (a measure of vertex cohesion in graphs) to the weighted case. Further contributions of the thesis include: a review of graph-based WSI systems that have been proposed in the literature; analysis of the methodologies applied in these systems; analysis of the metrics used to evaluate WSI systems; and empirical evidence to verify the usefulness of each novel method introduced in the thesis for inducing word senses.
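For context on the clustering coefficient mentioned above, the sketch below computes the standard unweighted coefficient of a vertex and one well-known weighted variant (the geometric-mean form of Onnela et al.) on an invented toy co-occurrence graph. This is illustrative background only; the thesis's own weighted generalisations are not reproduced here.

```python
# Background sketch: clustering coefficient of a vertex in an unweighted graph,
# plus the geometric-mean weighted variant of Onnela et al. The toy weighted
# co-occurrence graph is invented for illustration.
from itertools import combinations

# Undirected weighted graph as a dict of dicts: graph[u][v] = co-occurrence weight.
graph = {
    "bank":  {"money": 3.0, "river": 1.0, "loan": 2.0},
    "money": {"bank": 3.0, "loan": 2.0},
    "loan":  {"bank": 2.0, "money": 2.0},
    "river": {"bank": 1.0},
}

def clustering(graph, v):
    # Fraction of pairs of neighbours of v that are themselves connected.
    nbrs = list(graph[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))

def weighted_clustering(graph, v):
    # Geometric mean of the three normalised edge weights around each closed
    # triangle at v, averaged over all neighbour pairs.
    w_max = max(w for u in graph for w in graph[u].values())
    nbrs = list(graph[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    total = 0.0
    for a, b in combinations(nbrs, 2):
        if b in graph[a]:
            total += (graph[v][a] * graph[v][b] * graph[a][b] / w_max ** 3) ** (1 / 3)
    return 2.0 * total / (k * (k - 1))

print(clustering(graph, "bank"), weighted_clustering(graph, "bank"))
```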
30

Paraphrase identification using knowledge-lean techniques

Eyecioglu Ozmutlu, Asli January 2016 (has links)
This research addresses the problem of identification of sentential paraphrases; that is, the ability of an estimator to predict well whether two sentential text fragments are paraphrases. The paraphrase identification task has practical importance in the Natural Language Processing (NLP) community because of the need to deal with the pervasive problem of linguistic variation. Accurate methods for identifying paraphrases should help to improve the performance of NLP systems that require language understanding. This includes key applications such as machine translation, information retrieval and question answering, amongst others. Over the course of the last decade, a growing body of research has been conducted on paraphrase identification and it has become an individual working area of NLP. Our objective is to investigate whether techniques that concentrate on automated understanding of text while requiring fewer resources can achieve results comparable to methods employing more sophisticated NLP processing tools and other resources. These techniques, which we call “knowledge-lean”, range from simple, shallow overlap methods based on lexical items or n-grams through to more sophisticated methods that employ automatically generated distributional thesauri. The work begins by focusing on techniques that exploit lexical overlap and text-based statistical techniques that are much less in need of NLP tools. We investigate the question “To what extent can these methods be used for the purpose of a paraphrase identification task?” On two gold-standard datasets, we obtained competitive results on the Microsoft Research Paraphrase Corpus (MSRPC) and reached state-of-the-art results on the Twitter Paraphrase Corpus, using only n-gram overlap features in conjunction with support vector machines (SVMs). These techniques do not require any language-specific tools or external resources and appear to perform well without the need to normalise colloquial language such as that found on Twitter. It was natural to extend the scope of the research and to consider experimenting on another language that is poor in resources. The scarcity of available paraphrase data led us to construct our own corpus; we have constructed a paraphrase corpus in Turkish. This corpus is relatively small but provides a representative collection, including a variety of texts. While there is still debate as to whether a binary or fine-grained judgement best suits a paraphrase corpus, we chose to provide data for a sentential textual similarity task by agreeing on fine-grained scoring, knowing that this could be converted to binary scoring, but not the other way around. The correlation between the results from different corpora is promising. Therefore, it can be surmised that languages poor in resources can benefit from knowledge-lean techniques. Discovering the strengths of knowledge-lean techniques led us to extend them with a new perspective: techniques that use distributional statistical features of text by representing each word as a vector (word2vec). While recent research applies word2vec to larger fragments of text, such as phrases, sentences and even paragraphs, we present a new approach that introduces vectors of character n-grams carrying the same attributes as word vectors. The proposed method can capture syntactic as well as semantic relations without external semantic knowledge, and proves competitive on Twitter data compared to more sophisticated methods.
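In their simplest form, the knowledge-lean features described above reduce to overlap statistics over shared word and character n-grams. The sketch below computes Jaccard overlap features for a sentence pair; the feature set and preprocessing in the thesis differ, and the downstream SVM training is only indicated. Feature vectors like these, paired with paraphrase/non-paraphrase labels, would then be fed to a classifier such as an SVM.

```python
# Illustrative sketch of knowledge-lean paraphrase features: Jaccard overlap of
# the word and character n-gram sets of a sentence pair, for n = 1..3. These
# counts would then feed a classifier such as an SVM; the exact feature set
# used in the thesis differs.
def ngrams(seq, n):
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def overlap_features(s1, s2):
    feats = []
    for n in (1, 2, 3):
        # Word n-grams and character n-grams, lowercased, no other preprocessing.
        for a, b in ((s1.lower().split(), s2.lower().split()),
                     (list(s1.lower()), list(s2.lower()))):
            g1, g2 = ngrams(a, n), ngrams(b, n)
            union = g1 | g2
            feats.append(len(g1 & g2) / len(union) if union else 0.0)
    return feats

pair = ("the movie was surprisingly good", "surprisingly, the film was good")
print(overlap_features(*pair))  # one Jaccard score per n-gram type and order
```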
