  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
61

The design and implementation of PRONTO processor for natural text organization

Anderson, Steven Michael January 2010 (has links)
Typescript (photocopy). / Digitized by Kansas Correctional Industries
62

Evaluating distributional models of compositional semantics

Batchkarov, Miroslav Manov January 2016 (has links)
Distributional models (DMs) are a family of unsupervised algorithms that represent the meaning of words as vectors. They have been shown to capture interesting aspects of semantics. Recent work has sought to compose word vectors in order to model phrases and sentences. The most commonly used measure of a compositional DM's performance to date has been the degree to which it agrees with human-provided phrase similarity scores. The contributions of this thesis are three-fold. First, I argue that existing intrinsic evaluations are unreliable as they make use of small and subjective gold-standard data sets and assume a notion of similarity that is independent of a particular application. Therefore, they do not necessarily measure how well a model performs in practice. I study four commonly used intrinsic datasets and demonstrate that all of them exhibit undesirable properties. Second, I propose a novel framework within which to compare word- or phrase-level DMs in terms of their ability to support document classification. My approach couples a classifier to a DM and provides a setting where classification performance is sensitive to the quality of the DM. Third, I present an empirical evaluation of several methods for building word representations and composing them within my framework. I find that the determining factor in building word representations is data quality rather than quantity; in some cases only a small amount of unlabelled data is required to reach peak performance. Neural algorithms for building single-word representations perform better than counting-based ones regardless of which composition method is used, but simple composition algorithms can outperform more sophisticated competitors. Finally, I introduce a new algorithm for improving the quality of distributional thesauri using information from repeated runs of the same non-deterministic algorithm.
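To make the evaluation framework described in this abstract concrete, the following is a minimal sketch of coupling a distributional model to a document classifier, so that classification accuracy serves as a proxy for the quality of the word vectors and of the composition function. The word vectors, documents, labels, additive composition and the use of scikit-learn's LogisticRegression are illustrative assumptions, not details taken from the thesis.

```python
# A minimal sketch: couple a distributional model (here, a plain dict of word
# vectors) to a document classifier, so classification accuracy reflects the
# quality of the representations and the composition function.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical pre-trained word vectors (in practice: count-based or neural DMs).
word_vectors = {
    "stock": np.array([0.9, 0.1, 0.0]),
    "market": np.array([0.8, 0.2, 0.1]),
    "football": np.array([0.0, 0.9, 0.3]),
    "match": np.array([0.1, 0.8, 0.4]),
}

def compose(tokens, dim=3):
    """Additive composition: represent a document as the sum of its word vectors."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

# Hypothetical labelled documents (e.g. finance = 0, sport = 1).
docs = [["stock", "market"], ["market", "stock", "stock"],
        ["football", "match"], ["match", "football", "football"]]
labels = [0, 0, 1, 1]

X = np.vstack([compose(d) for d in docs])
scores = cross_val_score(LogisticRegression(), X, labels, cv=2)
print("classification accuracy as a proxy for DM quality:", scores.mean())
```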
63

Graph-based approaches to word sense induction

Hope, David Richard January 2015 (has links)
This thesis is a study of Word Sense Induction (WSI), the Natural Language Processing (NLP) task of automatically discovering word meanings from text. WSI is an open problem in NLP whose solution would be of considerable benefit to many other NLP tasks. It has, however, been studied by relatively few NLP researchers and often in set ways. Scope therefore exists to apply novel methods to the problem, methods that may improve upon those previously applied. This thesis applies a graph-theoretic approach to WSI. In this approach, word senses are identified by finding particular types of subgraphs in word co-occurrence graphs. A number of original methods for constructing, analysing, and partitioning graphs are introduced, with these methods then incorporated into graph-based WSI systems. These systems are then shown, in a variety of evaluation scenarios, to return results that are comparable to those of the current best performing WSI systems. The main contributions of the thesis are a novel parameter-free soft clustering algorithm that runs in time linear in the number of edges in the input graph, and novel generalisations of the clustering coefficient (a measure of vertex cohesion in graphs) to the weighted case. Further contributions of the thesis include: a review of graph-based WSI systems that have been proposed in the literature; analysis of the methodologies applied in these systems; analysis of the metrics used to evaluate WSI systems; and empirical evidence to verify the usefulness of each novel method introduced in the thesis for inducing word senses.
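As an illustration of the kind of graph statistic involved, the sketch below computes a weighted clustering coefficient on a toy word co-occurrence graph, using the geometric-mean generalisation of Onnela et al. (2005). This is one standard weighted generalisation from the literature, shown only to illustrate weighted vertex cohesion; it is not claimed to be the generalisation introduced in the thesis, and the toy graph is hypothetical.

```python
# A minimal sketch of a weighted clustering coefficient on a word co-occurrence
# graph, using the geometric-mean (Onnela et al. 2005) formulation.
from itertools import combinations

# Undirected weighted co-occurrence graph as a dict of dicts (toy example).
graph = {
    "bank":  {"money": 4.0, "river": 1.0, "loan": 3.0},
    "money": {"bank": 4.0, "loan": 2.0},
    "loan":  {"bank": 3.0, "money": 2.0},
    "river": {"bank": 1.0},
}

def weighted_clustering(g, v):
    """Geometric-mean weighted clustering coefficient of vertex v."""
    neighbours = list(g[v])
    k = len(neighbours)
    if k < 2:
        return 0.0
    w_max = max(w for nbrs in g.values() for w in nbrs.values())
    total = 0.0
    for u, t in combinations(neighbours, 2):
        if t in g[u]:  # u and t close a triangle with v
            total += (g[v][u] / w_max * g[v][t] / w_max * g[u][t] / w_max) ** (1.0 / 3.0)
    return 2.0 * total / (k * (k - 1))

for vertex in graph:
    print(vertex, round(weighted_clustering(graph, vertex), 3))
```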
64

Paraphrase identification using knowledge-lean techniques

Eyecioglu Ozmutlu, Asli January 2016 (has links)
This research addresses the problem of identifying sentential paraphrases; that is, the ability of an estimator to predict well whether two sentential text fragments are paraphrases. The paraphrase identification task has practical importance in the Natural Language Processing (NLP) community because of the need to deal with the pervasive problem of linguistic variation. Accurate methods for identifying paraphrases should help to improve the performance of NLP systems that require language understanding, including key applications such as machine translation, information retrieval and question answering, amongst others. Over the course of the last decade, a growing body of research has been conducted on paraphrase identification and it has become an individual working area of NLP. Our objective is to investigate whether techniques that concentrate on automated understanding of text while requiring fewer resources can achieve results comparable to methods employing more sophisticated NLP processing tools and other resources. These techniques, which we call “knowledge-lean”, range from simple, shallow overlap methods based on lexical items or n-grams through to more sophisticated methods that employ automatically generated distributional thesauri. The work begins by focusing on techniques that exploit lexical overlap and text-based statistical techniques that have little need of NLP tools. We investigate the question “To what extent can these methods be used for the purpose of a paraphrase identification task?” On two gold-standard datasets, we obtained competitive results on the Microsoft Research Paraphrase Corpus (MSRPC) and reached state-of-the-art results on the Twitter Paraphrase Corpus, using only n-gram overlap features in conjunction with support vector machines (SVMs). These techniques do not require any language-specific tools or external resources and appear to perform well without the need to normalise colloquial language such as that found on Twitter. It was natural to extend the scope of the research and to consider experimenting on another language, one that is poor in resources. The scarcity of available paraphrase data led us to construct our own corpus: a paraphrase corpus in Turkish. This corpus is relatively small but provides a representative collection that includes a variety of texts. While there is still debate as to whether binary or fine-grained judgements best suit a paraphrase corpus, we chose to provide data for a sentential textual similarity task by agreeing on fine-grained scoring, knowing that this can be converted to binary scoring, but not the other way around. The correlation between the results from the different corpora is promising; it can therefore be surmised that resource-poor languages can benefit from knowledge-lean techniques. Discovering the strengths of knowledge-lean techniques led us to extend the work with a new perspective on techniques that use distributional statistical features of text by representing each word as a vector (word2vec). While recent research with word2vec focuses on larger fragments of text such as phrases, sentences and even paragraphs, a new approach is presented that introduces vectors of character n-grams carrying the same attributes as word vectors. The proposed method is able to capture syntactic as well as semantic relations without semantic knowledge, and is shown to be competitive on the Twitter data compared with more sophisticated methods.
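The knowledge-lean baseline described in this abstract, n-gram overlap features fed to an SVM, can be sketched as follows. The particular features (intersection size and overlap ratios), the toy sentence pairs, and the use of scikit-learn's SVC are illustrative assumptions; the thesis's exact feature set and the MSRPC/Twitter data are not reproduced here.

```python
# A minimal sketch of a knowledge-lean setup: represent a sentence pair purely
# by n-gram overlap statistics and train an SVM on those features.
from sklearn.svm import SVC

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_features(s1, s2, n=2):
    """Size of the n-gram intersection plus overlap 'precision' and 'recall'."""
    a, b = ngrams(s1.split(), n), ngrams(s2.split(), n)
    inter = len(a & b)
    return [inter,
            inter / len(a) if a else 0.0,
            inter / len(b) if b else 0.0]

# Hypothetical training pairs: 1 = paraphrase, 0 = not a paraphrase.
pairs = [("the cat sat on the mat", "the cat is on the mat", 1),
         ("he bought a new car", "a new car was bought by him", 1),
         ("the cat sat on the mat", "stock prices fell sharply", 0),
         ("he bought a new car", "the weather is sunny today", 0)]

X = [overlap_features(s1, s2) for s1, s2, _ in pairs]
y = [label for _, _, label in pairs]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([overlap_features("a dog sat on the mat", "the dog is on the mat")]))
```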
65

A corpus-based induction learning approach to natural language processing.

January 1996 (has links)
by Leung Chi Hong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 1996. / Includes bibliographical references (leaves 163-171).
Contents:
Chapter 1. Introduction --- p.1
Chapter 2. Background Study of Natural Language Processing --- p.9 / 2.1. Knowledge-based approach --- p.9 / 2.1.1. Morphological analysis --- p.10 / 2.1.2. Syntactic parsing --- p.11 / 2.1.3. Semantic parsing --- p.16 / 2.1.3.1. Semantic grammar --- p.19 / 2.1.3.2. Case grammar --- p.20 / 2.1.4. Problems of knowledge acquisition in knowledge-based approach --- p.22 / 2.2. Corpus-based approach --- p.23 / 2.2.1. Beginning of corpus-based approach --- p.23 / 2.2.2. An example of corpus-based application: word tagging --- p.25 / 2.2.3. Annotated corpus --- p.26 / 2.2.4. State of the art in the corpus-based approach --- p.26 / 2.3. Knowledge-based approach versus corpus-based approach --- p.28 / 2.4. Co-operation between two different approaches --- p.32
Chapter 3. Induction Learning applied to Corpus-based Approach --- p.35 / 3.1. General model of traditional corpus-based approach --- p.36 / 3.1.1. Division of a problem into a number of sub-problems --- p.36 / 3.1.2. Solution selected from a set of predefined choices --- p.36 / 3.1.3. Solution selection based on a particular kind of linguistic entity --- p.37 / 3.1.4. Statistical correlations between solutions and linguistic entities --- p.37 / 3.1.5. Prediction of the best solution based on statistical correlations --- p.38 / 3.2. First problem in the corpus-based approach: Irrelevance in the corpus --- p.39 / 3.3. Induction learning --- p.41 / 3.3.1. General issues about induction learning --- p.41 / 3.3.2. Reasons for using induction learning in the corpus-based approach --- p.43 / 3.3.3. General model of corpus-based induction learning approach --- p.45 / 3.3.3.1. Preparation of positive corpus and negative corpus --- p.45 / 3.3.3.2. Statistical correlations between solutions and linguistic entities --- p.46 / 3.3.3.3. Combination of the statistical correlations obtained from the positive and negative corpora --- p.48 / 3.4. Second problem in the corpus-based approach: Modification of initial probabilistic approximations --- p.50 / 3.5. Learning feedback modification --- p.52 / 3.5.1. Determination of which correlation scores to be modified --- p.52 / 3.5.2. Determination of the magnitude of modification --- p.53 / 3.5.3. A general algorithm of learning feedback modification --- p.56
Chapter 4. Identification of Phrases and Templates in Domain-specific Chinese Texts --- p.59 / 4.1. Analysis of the problem solved by the traditional corpus-based approach --- p.61 / 4.2. Phrase identification based on positive and negative corpora --- p.63 / 4.3. Phrase identification procedure --- p.64 / 4.3.1. Step 1: Phrase seed identification --- p.65 / 4.3.2. Step 2: Phrase construction from phrase seeds --- p.65 / 4.4. Template identification procedure --- p.67 / 4.5. Experiment and result --- p.70 / 4.5.1. Testing data --- p.70 / 4.5.2. Details of experiments --- p.71 / 4.5.3. Experimental results --- p.72 / 4.5.3.1. Phrases and templates identified in financial news articles --- p.72 / 4.5.3.2. Phrases and templates identified in political news articles --- p.73 / 4.6. Conclusion --- p.74
Chapter 5. A Corpus-based Induction Learning Approach to Improving the Accuracy of Chinese Word Segmentation --- p.76 / 5.1. Background of Chinese word segmentation --- p.77 / 5.2. Typical methods of Chinese word segmentation --- p.78 / 5.2.1. Syntactic and semantic approach --- p.78 / 5.2.2. Statistical approach --- p.79 / 5.2.3. Heuristic approach --- p.81 / 5.3. Problems in word segmentation --- p.82 / 5.3.1. Chinese word definition --- p.82 / 5.3.2. Word dictionary --- p.83 / 5.3.3. Word segmentation ambiguity --- p.84 / 5.4. Corpus-based induction learning approach to improving word segmentation accuracy --- p.86 / 5.4.1. Rationale of approach --- p.87 / 5.4.2. Method of constructing modification rules --- p.89 / 5.5. Experiment and results --- p.94 / 5.6. Characteristics of modification rules constructed in experiment --- p.96 / 5.7. Experiment constructing rules for compound words with suffixes --- p.98 / 5.8. Relationship between modification frequency and Zipf's first law --- p.99 / 5.9. Problems in the approach --- p.100 / 5.10. Conclusion --- p.101
Chapter 6. Corpus-based Induction Learning Approach to Automatic Indexing of Controlled Index Terms --- p.103 / 6.1. Background of automatic indexing --- p.103 / 6.1.1. Definition of index term and indexing --- p.103 / 6.1.2. Manual indexing versus automatic indexing --- p.105 / 6.1.3. Different approaches to automatic indexing --- p.107 / 6.2. Corpus-based induction learning approach to automatic indexing --- p.109 / 6.2.1. Fundamental concept about corpus-based automatic indexing --- p.110 / 6.2.2. Procedure of automatic indexing --- p.111 / 6.2.2.1. Learning process --- p.112 / 6.2.2.2. Indexing process --- p.118 / 6.3. Experiments of corpus-based induction learning approach to automatic indexing --- p.118 / 6.3.1. An experiment evaluating the complete procedures --- p.119 / 6.3.1.1. Testing data used in the experiment --- p.119 / 6.3.1.2. Details of the experiment --- p.119 / 6.3.1.3. Experimental result --- p.121 / 6.3.2. An experiment comparing with the traditional approach --- p.122 / 6.3.3. An experiment determining the optimal indexing score threshold --- p.124 / 6.3.4. An experiment measuring the precision and recall of indexing performance --- p.127 / 6.4. Learning feedback modification --- p.128 / 6.4.1. Positive feedback --- p.129 / 6.4.2. Negative feedback --- p.131 / 6.4.3. Change of indexed proportions of positive/negative training corpus in feedback iterations --- p.132 / 6.4.4. An experiment evaluating the learning feedback modification --- p.134 / 6.4.5. An experiment testing the significance factor in merging process --- p.136 / 6.5. Conclusion --- p.138
Chapter 7. Conclusion --- p.140
Appendix A: Some examples of identified phrases in financial news articles --- p.149 / Appendix B: Some examples of identified templates in financial news articles --- p.150 / Appendix C: Some examples of texts containing the templates in financial news articles --- p.151 / Appendix D: Some examples of identified phrases in political news articles --- p.152 / Appendix E: Some examples of identified templates in political news articles --- p.153 / Appendix F: Some examples of texts containing the templates in political news articles --- p.154 / Appendix G: Syntactic tags used in word segmentation modification rule experiment --- p.155 / Appendix H: An example of semantic approach to automatic indexing --- p.156 / Appendix I: An example of syntactic approach to automatic indexing --- p.158 / Appendix J: Samples of INSPEC and MEDLINE Records --- p.161 / Appendix K: Examples of Promoting and Demoting Words --- p.162
References --- p.163
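The contents note names the core ingredients of this approach: correlation scores between linguistic entities and candidate solutions, estimated on a positive (in-domain) corpus and a negative (contrast) corpus and then combined (sections 3.3.3.1-3.3.3.3, 4.2). The record does not give the exact scoring function, so the sketch below uses a simple difference of relative bigram frequencies purely as an illustrative stand-in for scoring phrase seeds; the toy corpora are hypothetical.

```python
# A minimal, illustrative sketch of scoring candidate phrase seeds against a
# positive and a negative corpus (the actual scoring function in the thesis
# is not specified in this record).
from collections import Counter

positive_corpus = ["stock market rose", "the stock market fell", "market index rose"]
negative_corpus = ["the football match ended", "the market for tickets", "match report"]

def bigram_freqs(corpus):
    """Relative frequencies of word bigrams in a list of sentences."""
    counts = Counter()
    total = 0
    for sentence in corpus:
        toks = sentence.split()
        for i in range(len(toks) - 1):
            counts[(toks[i], toks[i + 1])] += 1
            total += 1
    return {bg: c / total for bg, c in counts.items()}

pos = bigram_freqs(positive_corpus)
neg = bigram_freqs(negative_corpus)

# Score each candidate seed by how much more frequent it is in the positive
# corpus than in the negative corpus.
scores = {bg: pos.get(bg, 0.0) - neg.get(bg, 0.0) for bg in set(pos) | set(neg)}
for bg, s in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(" ".join(bg), round(s, 3))
```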
66

Uniform multilingual sentence generation using flexible lexico-grammatical resources

Kozlowski, Raymond. January 2006 (has links)
Thesis (Ph.D.)--University of Delaware, 2006. / Principal faculty advisors: Kathleen F. McCoy and Vijay K. Shanker, Computer & Information Sciences. Includes bibliographical references.
67

Discovering patterns in databases: the cases for language, music, and unstructured data

Yip, Chi-lap. January 2000 (has links)
Thesis (Ph. D.)--University of Hong Kong, 2001. / Includes bibliographical references (leaves 101-113).
68

Computergestützte Untersuchungen zur Wortbildung am Beispiel von deutschen Zeitungstexten des 19. und 20. Jahrhunderts [Computer-assisted studies of word formation based on German newspaper texts of the 19th and 20th centuries]

Müller, Bernd S. January 1969 (has links)
Diss.--Marburg/Lahn. / Cover title. Bibliography: v. 1, leaves 203-210.
69

Content markup language design principles

Strotmann, Andreas. Kohout, Ladislav. January 2003 (has links)
Thesis (Ph. D.)--Florida State University, 2003. / Advisor: Dr. Ladislav J. Kohout, Florida State University, College of Arts and Sciences, Department of Computer Science. Title and description from dissertation home page (viewed Oct. 2, 2003). Includes bibliographical references.
70

Methods and applications of text-driven toponym resolution with indirect supervision

Speriosu, Michael Adrian 24 September 2013 (has links)
This thesis addresses the problem of toponym resolution. Given an ambiguous placename like Springfield in some natural language context, the task is to automatically predict the location on the earth's surface the author is referring to. Many previous efforts use hand-built heuristics to attempt to solve this problem, looking for specific words in close proximity such as Springfield, Illinois, and disambiguating any remaining toponyms to possible locations close to those already resolved. Such approaches require the data to take a fairly specific form in order to perform well, thus they often have low coverage. Some have applied machine learning to this task in an attempt to build more general resolvers, but acquiring large amounts of high quality hand-labeled training material is difficult. I discuss these and other approaches found in previous work before presenting several new toponym resolvers that rely neither on hand-labeled training material prepared explicitly for this task nor on particular co-occurrences of toponyms in close proximity in the data to be disambiguated. Some of the resolvers I develop reflect the intuition of many heuristic resolvers that toponyms nearby in text tend to (but do not always) refer to locations nearby on Earth, but do not require toponyms to occur in direct sequence with one another. I also introduce several resolvers that use the predictions of a document geolocation system (i.e. one that predicts a location for a piece of text of arbitrary length) to inform toponym disambiguation. Another resolver takes into account these document-level location predictions, knowledge of different administrative levels (country, state, city, etc.), and predictions from a logistic regression classifier trained on automatically extracted training instances from Wikipedia in a probabilistic way. It takes advantage of all content words in each toponym's context (both local window and whole document) rather than only toponyms. One resolver I build that extracts training material for a machine learned classifier from Wikipedia, taking advantage of link structure and geographic coordinates on articles, resolves 83% of toponyms in a previously introduced corpus of news articles correctly, beating the strong but simplistic population baseline. I introduce a corpus of Civil War related writings not previously used for this task on which the population baseline does poorly; combining a Wikipedia informed resolver with an algorithm that seeks to minimize the geographic scope of all predicted locations in a document achieves 86% blind test set accuracy on this dataset. After providing these high performing resolvers, I form the groundwork for more flexible and complex approaches by transforming the problem of toponym resolution into the traveling purchaser problem, modeling the probability of a location given its toponym's textual context and the geographic distribution of all locations mentioned in a document as two components of an objective function to be minimized. As one solution to this incarnation of the traveling purchaser problem, I simulate properties of ants traveling the globe and disambiguating toponyms. The ants' preferences for various kinds of behavior evolves over time, revealing underlying patterns in the corpora that other disambiguation methods do not account for. I also introduce several automated visualizations of texts that have had their toponyms resolved. 
Given a resolved corpus, these visualizations summarize the areas of the globe mentioned and allow the user to refer back to specific passages in the text that mention a location of interest. One visualization presented automatically generates a dynamic tour of the corpus, showing changes in the area referred to by the text as it progresses. Such visualizations are an example of a practical application of work in toponym resolution, and could be used by scholars interested in the geographic connections in any collection of text on both broad and fine-grained levels.
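For reference, the "strong but simplistic population baseline" mentioned in this abstract can be sketched as follows: every toponym is resolved to its most populous gazetteer candidate, ignoring textual context entirely. The tiny gazetteer, its approximate population figures and coordinates are hypothetical stand-ins for a resource such as GeoNames.

```python
# A minimal sketch of the population baseline for toponym resolution.
gazetteer = {
    "springfield": [
        {"admin": "Illinois, USA", "population": 116_000, "latlon": (39.80, -89.64)},
        {"admin": "Massachusetts, USA", "population": 153_000, "latlon": (42.10, -72.59)},
        {"admin": "Missouri, USA", "population": 169_000, "latlon": (37.22, -93.29)},
    ],
    "paris": [
        {"admin": "France", "population": 2_100_000, "latlon": (48.86, 2.35)},
        {"admin": "Texas, USA", "population": 25_000, "latlon": (33.66, -95.56)},
    ],
}

def resolve_population_baseline(toponyms):
    """Pick the most populous candidate for each toponym, ignoring context."""
    resolved = {}
    for t in toponyms:
        candidates = gazetteer.get(t.lower(), [])
        if candidates:
            resolved[t] = max(candidates, key=lambda c: c["population"])
    return resolved

for name, place in resolve_population_baseline(["Springfield", "Paris"]).items():
    print(name, "->", place["admin"], place["latlon"])
```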
