41

An inheritance-based theory of the lexicon in combinatory categorial grammar

McConville, Mark January 2008
This thesis proposes an extended version of the Combinatory Categorial Grammar (CCG) formalism, with the following features: 1. grammars incorporate inheritance hierarchies of lexical types, defined over a simple, feature-based constraint language 2. CCG lexicons are, or at least can be, functions from forms to these lexical types This formalism, which I refer to as ‘inheritance-driven’ CCG (I-CCG), is conceptualised as a partially model-theoretic system, involving a distinction between category descriptions and their underlying category models, with these two notions being related by logical satisfaction. I argue that the I-CCG formalism retains all the advantages of both the core CCG framework and proposed generalisations involving such things as multiset categories, unary modalities or typed feature structures. In addition, I-CCG: 1. provides non-redundant lexicons for human languages 2. captures a range of well-known implicational word order universals in terms of an acquisition-based preference for shorter grammars This thesis proceeds as follows: Chapter 2 introduces the ‘baseline’ CCG formalism, which incorporates just the essential elements of category notation, without any of the proposed extensions. Chapter 3 reviews parts of the CCG literature dealing with linguistic competence in its most general sense, showing how the formalism predicts a number of language universals in terms of either its restricted generative capacity or the prioritisation of simpler lexicons. Chapter 4 analyses the first motivation for generalising the baseline category notation, demonstrating how certain fairly simple implicational word order universals are not formally predicted by baseline CCG, although they intuitively do involve considerations of grammatical economy. Chapter 5 examines the second motivation underlying many of the customised CCG category notations — to reduce lexical redundancy, thus allowing for the construction of lexicons which assign (each sense of) open class words and morphemes to no more than one lexical category, itself denoted by a non-composite lexical type. Chapter 6 defines the I-CCG formalism, incorporating into the notion of a CCG grammar both a type hierarchy of saturated category symbols and an inheritance hierarchy of constrained lexical types. The constraint language is a simple, feature-based, highly underspecified notation, interpreted against an underlying notion of category models — this latter point is crucial, since it allows us to abstract away from any particular inference procedure and focus on the category notation itself. I argue that the partially model-theoretic I-CCG formalism solves the lexical redundancy problem fairly definitively, thereby subsuming all the other proposed variant category notations. Chapter 7 demonstrates that the I-CCG formalism also provides the beginnings of a theory of the CCG lexicon in a stronger sense — with just a small number of substantive assumptions about types, it can be shown to formally predict many implicational word order universals in terms of an acquisition-based preference for simpler lexical inheritance hierarchies, i.e. those with fewer types and fewer constraints. Chapter 8 concludes the thesis.
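To make the idea of an inheritance hierarchy of constrained lexical types concrete, here is a small Python sketch. It is a toy illustration of the general mechanism (types inheriting feature constraints, checked against category models by satisfaction, with the lexicon mapping forms to types), not the I-CCG formalism itself; the type names, features and values are invented for the example.

```python
class LexicalType:
    """A named type with a single parent and a set of feature constraints."""

    def __init__(self, name, parent=None, constraints=None):
        self.name = name
        self.parent = parent
        self.constraints = dict(constraints or {})   # feature -> required value

    def inherited_constraints(self):
        """Constraints declared on this type and on all of its ancestors."""
        merged, node = {}, self
        while node is not None:
            for feat, val in node.constraints.items():
                merged.setdefault(feat, val)
            node = node.parent
        return merged

    def satisfied_by(self, category_model):
        """Check a model (a feature dict) against every inherited constraint."""
        return all(category_model.get(f) == v
                   for f, v in self.inherited_constraints().items())


# A tiny hierarchy: verbs, with transitive verbs as a subtype.
verb = LexicalType("verb", constraints={"result": "s"})
tverb = LexicalType("transitive-verb", parent=verb,
                    constraints={"arg": "np", "slash": "/"})

# A lexicon as a function (here, a dict) from forms to lexical types.
lexicon = {"sleeps": verb, "admires": tverb}

# A concrete category model for a hypothetical transitive verb.
model = {"result": "s", "arg": "np", "slash": "/"}
print(lexicon["admires"].satisfied_by(model))   # True
```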
42

Fördomsfulla associationer i en svensk vektorbaserad semantisk modell / Bias in a Swedish Word Embedding

Jonasson, Michael January 2019
Semantiska vektormodeller är en kraftfull teknik där ords mening kan representeras av vektorer vilka består av siffror. Vektorerna tillåter geometriska operationer vilka fångar semantiskt viktiga förhållanden mellan orden de representerar. I denna studie implementeras och appliceras WEAT-metoden för att undersöka om statistiska förhållanden mellan ord som kan uppfattas som fördomsfulla existerar i en svensk semantisk vektormodell av en svensk nyhetstidning. Resultatet pekar på att ordförhållanden i vektormodellen har förmågan att återspegla flera av de sedan tidigare IAT-dokumenterade fördomar som undersöktes. I studien implementeras och appliceras också WEFAT-metoden för att undersöka vektormodellens förmåga att representera två faktiska statistiska samband i verkligheten, vilket görs framgångsrikt i båda undersökningarna. Resultaten av studien som helhet ger stöd till metoderna som används och belyser samtidigt problematik med att använda semantiska vektormodeller i språkteknologiska applikationer. / Word embeddings are a powerful technique in which word meaning is represented by vectors of numbers. The vectors allow geometric operations that capture semantically important relationships between the words they represent. In this study, WEAT is applied in order to examine whether statistical properties of words pertaining to bias can be found in a Swedish word embedding trained on a corpus from a Swedish newspaper. The results show that the word embedding can represent several of the IAT-documented biases that were tested. A second method, WEFAT, is applied to the word embedding in order to explore the embedding's ability to represent actual statistical properties, which is also done successfully. The results of this study lend support to the validity of both methods, while also illuminating problematic relationships between words in word embeddings.
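To make the WEAT procedure summarised above concrete, the following is a minimal sketch of its effect-size computation over a word embedding. The function names and the `vec` lookup (word to NumPy vector) are assumptions for illustration; the actual target and attribute word lists follow the IAT literature and are not reproduced here.

```python
# Minimal WEAT effect-size sketch over pretrained word vectors.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assoc(w, A, B, vec):
    """s(w, A, B): mean similarity to attribute set A minus attribute set B."""
    return (np.mean([cosine(vec[w], vec[a]) for a in A]) -
            np.mean([cosine(vec[w], vec[b]) for b in B]))

def weat_effect_size(X, Y, A, B, vec):
    """Cohen's-d-style effect size used by WEAT for target sets X and Y."""
    sx = [assoc(x, A, B, vec) for x in X]
    sy = [assoc(y, A, B, vec) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

# Hypothetical usage: X, Y are target word sets, A, B are attribute word sets,
# all present in the embedding's vocabulary.
# effect = weat_effect_size(X, Y, A, B, vec)
```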
43

Sentiment Analysis of Equity Analyst Research Reports using Convolutional Neural Networks

Löfving, Olof January 2019
Natural language processing, a subfield of artificial intelligence and computer science, has recently attracted great research interest due to the vast amount of information created on the internet in the modern era. One of the main natural language processing areas is sentiment analysis, a field that studies the polarity of human natural language and generally tries to categorize it as positive, negative or neutral. In this thesis, sentiment analysis has been applied to research reports written by equity analysts. The objective has been to investigate whether there exists a distinct distribution of the reports and whether one is able to classify sentiment in these reports. The thesis consists of two parts: first, investigating how to divide the reports into different sentiment labelling regimes, and second, categorizing the sentiment using machine learning techniques. Logistic regression as well as several convolutional neural network structures have been used to classify the sentiment. Working with textual data requires mapping text to real-valued representations called features. Several feature extraction methods have been investigated, including Bag of Words, term frequency-inverse document frequency and Word2vec. Out of the tested labelling regimes, classifying the documents using upgrades and downgrades of the report recommendation shows the most promising potential. For this regime, the convolutional neural network architectures outperform logistic regression by a significant margin. Out of the networks tested, a double input channel utilizing two different Word2vec representations performs best. The two representations originate from different sources: one from the set of equity research reports and the other trained by the Google Brain team on an extensive Google News data set. This suggests that combining one representation of topic-specific words with one that better represents more common words enhances classification performance.
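A sketch of the kind of double-input-channel network described above is given below, assuming PyTorch. The exact architecture, dimensions and filter sizes used in the thesis are not specified here; concatenating a domain-specific and a general pretrained Word2vec embedding per token is one plausible reading of the two-channel design.

```python
import torch
import torch.nn as nn

class TwoChannelTextCNN(nn.Module):
    """Sentence classifier with two frozen pretrained embedding channels."""

    def __init__(self, emb_domain, emb_general, n_classes=3,
                 n_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        # emb_domain / emb_general: FloatTensors of shape (vocab_size, dim),
        # assumed to share the same vocabulary indexing.
        self.emb_d = nn.Embedding.from_pretrained(emb_domain, freeze=True)
        self.emb_g = nn.Embedding.from_pretrained(emb_general, freeze=True)
        in_dim = emb_domain.size(1) + emb_general.size(1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_dim, n_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        # Concatenate the two embedding channels along the feature dimension.
        x = torch.cat([self.emb_d(tokens), self.emb_g(tokens)], dim=-1)
        x = x.transpose(1, 2)                       # (batch, in_dim, seq_len)
        # Convolve, apply ReLU, and max-pool over time for each filter size.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))    # logits per sentiment class
```

Training would follow a standard cross-entropy setup; the three output classes stand in for positive, neutral and negative sentiment.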
44

Automatic Error Detection and Correction in Neural Machine Translation : A comparative study of Swedish to English and Greek to English

Papadopoulou, Anthi January 2019
Automatic detection and automatic correction of machine translation output are important steps to ensure an optimal quality of the final output. In this work, we compared the output of neural machine translation for two different language pairs, Swedish to English and Greek to English. This comparison was made using common machine translation metrics (BLEU, METEOR, TER) and syntax-related ones (POSBLEU, WPF, WER over POS classes). It was found that neither the common metrics nor the purely syntax-related ones were able to capture the quality of the machine translation output accurately, but the decomposition of WER over POS classes was the most informative one. A sample of each language was taken to aid in the comparison between manual and automatic error categorization over five error categories, namely reordering errors, inflectional errors, missing words, extra words, and incorrect lexical choices. Both Spearman's ρ and Pearson's r showed a good correlation with human judgment, with values above 0.9. Finally, based on the results of this error categorization, automatic post-editing rules were implemented and applied, and their performance was checked against the sample and the rest of the data set, showing varying results. The impact on the sample was greater, showing improvement in all metrics, while the impact on the rest of the data set was negative. An investigation of this, alongside the fact that correction was not possible for Greek due to extremely free reference translations and a lack of error patterns in spoken language, reinforced the belief that automatic post-editing is tightly connected to consistency in the reference translation, while also suggesting that, in handling machine translation output, more than one reference translation may be needed to ensure better results.
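As an illustration of the most informative metric above, the decomposition of WER over POS classes, here is a hedged sketch: align reference and hypothesis by edit distance and attribute substitutions and deletions to the POS tag of the affected reference word (insertions have no reference word to attribute them to). The function names and the tagging scheme are assumptions, not the thesis's implementation.

```python
from collections import Counter

def align(ref, hyp):
    """Levenshtein alignment; returns (op, ref_index) pairs, op in C/S/D/I."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("C" if ref[i - 1] == hyp[j - 1] else "S", i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("D", i - 1))
            i -= 1
        else:
            ops.append(("I", None))   # insertions carry no reference index
            j -= 1
    return ops

def wer_per_pos(ref_tokens, ref_pos, hyp_tokens):
    """Error rate per POS class, counting substitutions and deletions."""
    errors, totals = Counter(), Counter(ref_pos)
    for op, ridx in align(ref_tokens, hyp_tokens):
        if op in ("S", "D"):
            errors[ref_pos[ridx]] += 1
    return {pos: errors[pos] / totals[pos] for pos in totals}

# Hypothetical usage:
# wer_per_pos(["the", "cat", "sat"], ["DET", "NOUN", "VERB"], ["a", "cat", "sits"])
```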
45

Transcription of Historical Encrypted Manuscripts : Evaluation of an automatic interactive transcription tool.

Johansson, Kajsa January 2019
Countless historical sources are preserved in national libraries and archives all over the world and contain important information about our history. Some of these sources are encrypted to prevent outsiders from reading them. This thesis examines a semi-automated interactive transcription tool, based on unsupervised learning without any labelled training data, that has been developed for the transcription of encrypted sources, and compares it to manual transcription. The tool builds on handwritten text recognition (HTR) techniques and identifies clusters of symbols based on similarity measures. It is evaluated on ciphers with number sequences that have previously been transcribed manually, to measure how well the transcription tool performs. The weaknesses of the tool are described and suggestions for how it can be improved are proposed. Transcription based on HTR techniques and clustering shows promising results, and the unsupervised method based on clustering should be further investigated on ciphers with various symbol sets.
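The core unsupervised step described above, grouping visually similar cipher symbols into clusters that can each be mapped to one transcription label, can be sketched as follows. Raw pixel features and agglomerative clustering are assumptions for illustration; the actual tool's features and clustering method may differ.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_symbols(symbol_images, n_clusters):
    """symbol_images: list of equally sized 2-D numpy arrays, one per glyph."""
    # Flatten each glyph into a pixel vector and length-normalise it.
    feats = np.stack([img.astype(float).ravel() for img in symbol_images])
    feats /= (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-9)
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(feats)
    return labels   # glyphs sharing a label get the same transcription symbol
```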
46

The design and implementation of PRONTO processor for natural text organization

Anderson, Steven Michael January 2010
Typescript (photocopy). / Digitized by Kansas Correctional Industries
47

Evaluating distributional models of compositional semantics

Batchkarov, Miroslav Manov January 2016
Distributional models (DMs) are a family of unsupervised algorithms that represent the meaning of words as vectors. They have been shown to capture interesting aspects of semantics. Recent work has sought to compose word vectors in order to model phrases and sentences. The most commonly used measure of a compositional DM's performance to date has been the degree to which it agrees with human-provided phrase similarity scores. The contributions of this thesis are three-fold. First, I argue that existing intrinsic evaluations are unreliable as they make use of small and subjective gold-standard data sets and assume a notion of similarity that is independent of a particular application. Therefore, they do not necessarily measure how well a model performs in practice. I study four commonly used intrinsic datasets and demonstrate that all of them exhibit undesirable properties. Second, I propose a novel framework within which to compare word- or phrase-level DMs in terms of their ability to support document classification. My approach couples a classifier to a DM and provides a setting where classification performance is sensitive to the quality of the DM. Third, I present an empirical evaluation of several methods for building word representations and composing them within my framework. I find that the determining factor in building word representations is data quality rather than quantity; in some cases only a small amount of unlabelled data is required to reach peak performance. Neural algorithms for building single-word representations perform better than counting-based ones regardless of what composition is used, but simple composition algorithms can outperform more sophisticated competitors. Finally, I introduce a new algorithm for improving the quality of distributional thesauri using information from repeated runs of the same non-deterministic algorithm.
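The proposed evaluation framework, coupling a distributional model to a document classifier so that classification accuracy reflects the model's quality, can be sketched roughly as below. Additive composition and the `vec` lookup (word to NumPy vector) are illustrative assumptions, not the specific models compared in the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_vector(tokens, vec, dim):
    """Additive composition over the words the distributional model covers."""
    known = [vec[t] for t in tokens if t in vec]
    return np.sum(known, axis=0) if known else np.zeros(dim)

def evaluate_dm(train_docs, train_labels, test_docs, test_labels, vec, dim):
    X_tr = np.stack([doc_vector(d, vec, dim) for d in train_docs])
    X_te = np.stack([doc_vector(d, vec, dim) for d in test_docs])
    clf = LogisticRegression(max_iter=1000).fit(X_tr, train_labels)
    return clf.score(X_te, test_labels)   # higher accuracy = better DM (proxy)
```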
48

Graph-based approaches to word sense induction

Hope, David Richard January 2015
This thesis is a study of Word Sense Induction (WSI), the Natural Language Processing (NLP) task of automatically discovering word meanings from text. WSI is an open problem in NLP whose solution would be of considerable benefit to many other NLP tasks. It has, however, been studied by relatively few NLP researchers and often in set ways. Scope therefore exists to apply novel methods to the problem, methods that may improve upon those previously applied. This thesis applies a graph-theoretic approach to WSI. In this approach, word senses are identified by finding particular types of subgraphs in word co-occurrence graphs. A number of original methods for constructing, analysing, and partitioning graphs are introduced, with these methods then incorporated into graph-based WSI systems. These systems are then shown, in a variety of evaluation scenarios, to return results that are comparable to those of the current best performing WSI systems. The main contributions of the thesis are a novel parameter-free soft clustering algorithm that runs in time linear in the number of edges in the input graph, and novel generalisations of the clustering coefficient (a measure of vertex cohesion in graphs) to the weighted case. Further contributions of the thesis include: a review of graph-based WSI systems that have been proposed in the literature; analysis of the methodologies applied in these systems; analysis of the metrics used to evaluate WSI systems, and empirical evidence to verify the usefulness of each novel method introduced in the thesis for inducing word senses.
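For context on the kind of measure being generalised, here is a sketch of one standard weighted clustering coefficient (the Onnela et al. formulation) computed over a word co-occurrence graph stored as a dict of dicts. The thesis proposes its own generalisations to the weighted case; this sketch only illustrates the family of measures involved.

```python
def weighted_clustering(graph, u):
    """Onnela-style weighted clustering coefficient of vertex u.

    graph[a][b] is the (symmetric) co-occurrence weight of edge a-b.
    """
    neigh = list(graph[u])
    k = len(neigh)
    if k < 2:
        return 0.0
    w_max = max(w for nbrs in graph.values() for w in nbrs.values())
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            a, b = neigh[i], neigh[j]
            if b in graph.get(a, {}):   # triangle u-a-b is closed
                total += (graph[u][a] * graph[u][b] * graph[a][b] / w_max ** 3) ** (1 / 3)
    return 2 * total / (k * (k - 1))

# Example on a tiny symmetric co-occurrence graph (weights are invented).
g = {"bank": {"money": 3, "river": 1, "loan": 2},
     "money": {"bank": 3, "loan": 4},
     "loan": {"bank": 2, "money": 4},
     "river": {"bank": 1}}
print(weighted_clustering(g, "bank"))
```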
49

Paraphrase identification using knowledge-lean techniques

Eyecioglu Ozmutlu, Asli January 2016
This research addresses the problem of identification of sentential paraphrases; that is, the ability of an estimator to predict well whether two sentential text fragments are paraphrases. The paraphrase identification task has practical importance in the Natural Language Processing (NLP) community because of the need to deal with the pervasive problem of linguistic variation. Accurate methods for identifying paraphrases should help to improve the performance of NLP systems that require language understanding. This includes key applications such as machine translation, information retrieval and question answering, amongst others. Over the course of the last decade, a growing body of research has been conducted on paraphrase identification and it has become an individual working area of NLP. Our objective is to investigate whether techniques concentrating on automated understanding of text that require fewer resources may achieve results comparable to methods employing more sophisticated NLP processing tools and other resources. These techniques, which we call “knowledge-lean”, range from simple, shallow overlap methods based on lexical items or n-grams through to more sophisticated methods that employ automatically generated distributional thesauri. The work begins by focusing on techniques that exploit lexical overlap and text-based statistical techniques that are much less in need of NLP tools. We investigate the question “To what extent can these methods be used for the purpose of a paraphrase identification task?” On the two gold-standard data sets, we obtained competitive results on the Microsoft Research Paraphrase Corpus (MSRPC) and reached state-of-the-art results on the Twitter Paraphrase Corpus, using only n-gram overlap features in conjunction with support vector machines (SVMs). These techniques do not require any language-specific tools or external resources and appear to perform well without the need to normalise colloquial language such as that found on Twitter. It was natural to extend the scope of the research and to consider experimenting on another language that is poor in resources. The scarcity of available paraphrase data led us to construct our own corpus; we have constructed a paraphrase corpus in Turkish. This corpus is relatively small but provides a representative collection, including a variety of texts. While there is still debate as to whether binary or fine-grained judgements best suit a paraphrase corpus, we chose to provide data for a sentential textual similarity task by agreeing on fine-grained scoring, knowing that this could be converted to binary scoring, but not the other way around. The correlation between the results from different corpora is promising. Therefore, it can be surmised that languages poor in resources can benefit from knowledge-lean techniques. Discovering the strengths of knowledge-lean techniques led us to extend them with a new perspective on techniques that use distributional statistical features of text by representing each word as a vector (word2vec). While recent research focuses on larger fragments of text with word2vec, such as phrases, sentences and even paragraphs, a new approach is presented by introducing vectors of character n-grams that carry the same attributes as word vectors. The proposed method has the ability to capture syntactic relations as well as semantic relations without semantic knowledge, and is shown to be competitive on Twitter compared to more sophisticated methods.
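A minimal sketch of the knowledge-lean baseline described above, n-gram overlap features fed to an SVM, is given below. The particular overlap features (word unigrams, word bigrams, character trigrams) and the Jaccard scoring are illustrative assumptions rather than the thesis's exact feature set.

```python
from sklearn.svm import SVC

def ngrams(text, n, char=False):
    units = list(text) if char else text.split()
    return {tuple(units[i:i + n]) for i in range(len(units) - n + 1)}

def overlap_features(s1, s2):
    """Jaccard overlap for a few n-gram views of the sentence pair."""
    feats = []
    for n, char in [(1, False), (2, False), (3, True)]:
        a, b = ngrams(s1.lower(), n, char), ngrams(s2.lower(), n, char)
        union = len(a | b)
        feats.append(len(a & b) / union if union else 0.0)
    return feats

# Hypothetical usage with labelled sentence pairs (1 = paraphrase, 0 = not):
# X = [overlap_features(s1, s2) for s1, s2 in pairs]
# clf = SVC(kernel="rbf").fit(X, labels)
```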
50

A corpus-based induction learning approach to natural language processing.

January 1996
by Leung Chi Hong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 1996. / Includes bibliographical references (leaves 163-171). / Chapter Chapter 1. --- Introduction --- p.1 / Chapter Chapter 2. --- Background Study of Natural Language Processing --- p.9 / Chapter 2.1. --- Knowledge-based approach --- p.9 / Chapter 2.1.1. --- Morphological analysis --- p.10 / Chapter 2.1.2. --- Syntactic parsing --- p.11 / Chapter 2.1.3. --- Semantic parsing --- p.16 / Chapter 2.1.3.1. --- Semantic grammar --- p.19 / Chapter 2.1.3.2. --- Case grammar --- p.20 / Chapter 2.1.4. --- Problems of knowledge acquisition in knowledge-based approach --- p.22 / Chapter 2.2. --- Corpus-based approach --- p.23 / Chapter 2.2.1. --- Beginning of corpus-based approach --- p.23 / Chapter 2.2.2. --- An example of corpus-based application: word tagging --- p.25 / Chapter 2.2.3. --- Annotated corpus --- p.26 / Chapter 2.2.4. --- State of the art in the corpus-based approach --- p.26 / Chapter 2.3. --- Knowledge-based approach versus corpus-based approach --- p.28 / Chapter 2.4. --- Co-operation between two different approaches --- p.32 / Chapter Chapter 3. --- Induction Learning applied to Corpus-based Approach --- p.35 / Chapter 3.1. --- General model of traditional corpus-based approach --- p.36 / Chapter 3.1.1. --- Division of a problem into a number of sub-problems --- p.36 / Chapter 3.1.2. --- Solution selected from a set of predefined choices --- p.36 / Chapter 3.1.3. --- Solution selection based on a particular kind of linguistic entity --- p.37 / Chapter 3.1.4. --- Statistical correlations between solutions and linguistic entities --- p.37 / Chapter 3.1.5. --- Prediction of the best solution based on statistical correlations --- p.38 / Chapter 3.2. --- First problem in the corpus-based approach: Irrelevance in the corpus --- p.39 / Chapter 3.3. --- Induction learning --- p.41 / Chapter 3.3.1. --- General issues about induction learning --- p.41 / Chapter 3.3.2. --- Reasons of using induction learning in the corpus-based approach --- p.43 / Chapter 3.3.3. --- General model of corpus-based induction learning approach --- p.45 / Chapter 3.3.3.1. --- Preparation of positive corpus and negative corpus --- p.45 / Chapter 3.3.3.2. --- Statistical correlations between solutions and linguistic entities --- p.46 / Chapter 3.3.3.3. --- Combination of the statistical correlations obtained from the positive and negative corpora --- p.48 / Chapter 3.4. --- Second problem in the corpus-based approach: Modification of initial probabilistic approximations --- p.50 / Chapter 3.5. --- Learning feedback modification --- p.52 / Chapter 3.5.1. --- Determination of which correlation scores to be modified --- p.52 / Chapter 3.5.2. --- Determination of the magnitude of modification --- p.53 / Chapter 3.5.3. --- An general algorithm of learning feedback modification --- p.56 / Chapter Chapter 4. --- Identification of Phrases and Templates in Domain-specific Chinese Texts --- p.59 / Chapter 4.1. --- Analysis of the problem solved by the traditional corpus-based approach --- p.61 / Chapter 4.2. --- Phrase identification based on positive and negative corpora --- p.63 / Chapter 4.3. --- Phrase identification procedure --- p.64 / Chapter 4.3.1. --- Step 1: Phrase seed identification --- p.65 / Chapter 4.3.2. --- Step 2: Phrase construction from phrase seeds --- p.65 / Chapter 4.4. --- Template identification procedure --- p.67 / Chapter 4.5. --- Experiment and result --- p.70 / Chapter 4.5.1. --- Testing data --- p.70 / Chapter 4.5.2. 
--- Details of experiments --- p.71 / Chapter 4.5.3. --- Experimental results --- p.72 / Chapter 4.5.3.1. --- Phrases and templates identified in financial news articles --- p.72 / Chapter 4.5.3.2. --- Phrases and templates identified in political news articles --- p.73 / Chapter 4.6. --- Conclusion --- p.74 / Chapter Chapter 5. --- A Corpus-based Induction Learning Approach to Improving the Accuracy of Chinese Word Segmentation --- p.76 / Chapter 5.1. --- Background of Chinese word segmentation --- p.77 / Chapter 5.2. --- Typical methods of Chinese word segmentation --- p.78 / Chapter 5.2.1. --- Syntactic and semantic approach --- p.78 / Chapter 5.2.2. --- Statistical approach --- p.79 / Chapter 5.2.3. --- Heuristic approach --- p.81 / Chapter 5.3. --- Problems in word segmentation --- p.82 / Chapter 5.3.1. --- Chinese word definition --- p.82 / Chapter 5.3.2. --- Word dictionary --- p.83 / Chapter 5.3.3. --- Word segmentation ambiguity --- p.84 / Chapter 5.4. --- Corpus-based induction learning approach to improving word segmentation accuracy --- p.86 / Chapter 5.4.1. --- Rationale of approach --- p.87 / Chapter 5.4.2. --- Method of constructing modification rules --- p.89 / Chapter 5.5. --- Experiment and results --- p.94 / Chapter 5.6. --- Characteristics of modification rules constructed in experiment --- p.96 / Chapter 5.7. --- Experiment constructing rules for compound words with suffixes --- p.98 / Chapter 5.8. --- Relationship between modification frequency and Zipfs first law --- p.99 / Chapter 5.9. --- Problems in the approach --- p.100 / Chapter 5.10. --- Conclusion --- p.101 / Chapter Chapter 6. --- Corpus-based Induction Learning Approach to Automatic Indexing of Controlled Index Terms --- p.103 / Chapter 6.1. --- Background of automatic indexing --- p.103 / Chapter 6.1.1. --- Definition of index term and indexing --- p.103 / Chapter 6.1.2. --- Manual indexing versus automatic indexing --- p.105 / Chapter 6.1.3. --- Different approaches to automatic indexing --- p.107 / Chapter 6.2. --- Corpus-based induction learning approach to automatic indexing --- p.109 / Chapter 6.2.1. --- Fundamental concept about corpus-based automatic indexing --- p.110 / Chapter 6.2.2. --- Procedure of automatic indexing --- p.111 / Chapter 6.2.2.1. --- Learning process --- p.112 / Chapter 6.2.2.2. --- Indexing process --- p.118 / Chapter 6.3. --- Experiments of corpus-based induction learning approach to automatic indexing --- p.118 / Chapter 6.3.1. --- An experiment evaluating the complete procedures --- p.119 / Chapter 6.3.1.1. --- Testing data used in the experiment --- p.119 / Chapter 6.3.1.2. --- Details of the experiment --- p.119 / Chapter 6.3.1.3. --- Experimental result --- p.121 / Chapter 6.3.2. --- An experiment comparing with the traditional approach --- p.122 / Chapter 6.3.3. --- An experiment determining the optimal indexing score threshold --- p.124 / Chapter 6.3.4. --- An experiment measuring the precision and recall of indexing performance --- p.127 / Chapter 6.4. --- Learning feedback modification --- p.128 / Chapter 6.4.1. --- Positive feedback --- p.129 / Chapter 6.4.2. --- Negative feedback --- p.131 / Chapter 6.4.3. --- Change of indexed proportions of positive/negative training corpus in feedback iterations --- p.132 / Chapter 6.4.4. --- An experiment evaluating the learning feedback modification --- p.134 / Chapter 6.4.5. --- An experiment testing the significance factor in merging process --- p.136 / Chapter 6.5. --- Conclusion --- p.138 / Chapter Chapter 7. 
--- Conclusion --- p.140 / Appendix A: Some examples of identified phrases in financial news articles --- p.149 / Appendix B: Some examples of identified templates in financial news articles --- p.150 / Appendix C: Some examples of texts containing the templates in financial news articles --- p.151 / Appendix D: Some examples of identified phrases in political news articles --- p.152 / Appendix E: Some examples of identified templates in political news articles --- p.153 / Appendix F: Some examples of texts containing the templates in political news articles --- p.154 / Appendix G: Syntactic tags used in word segmentation modification rule experiment --- p.155 / Appendix H: An example of semantic approach to automatic indexing --- p.156 / Appendix I: An example of syntactic approach to automatic indexing --- p.158 / Appendix J: Samples of INSPEC and MEDLINE Records --- p.161 / Appendix K: Examples of Promoting and Demoting Words --- p.162 / References --- p.163
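The contents note above outlines the core contrast of the corpus-based induction learning approach: statistical correlations computed from a positive (domain-specific) corpus and a negative (general) corpus are combined to identify phrase seeds. A rough sketch of that idea, using a smoothed log-odds score as a stand-in for the thesis's own correlation and feedback-modification scheme, might look like this.

```python
import math
from collections import Counter

def seed_scores(positive_docs, negative_docs, k=0.5):
    """Score words by association with the positive vs. the negative corpus.

    positive_docs / negative_docs: iterables of token lists. Higher scores
    suggest domain-specific phrase seeds.
    """
    pos = Counter(w for doc in positive_docs for w in doc)
    neg = Counter(w for doc in negative_docs for w in doc)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    vocab = set(pos) | set(neg)
    return {w: math.log(((pos[w] + k) / n_pos) / ((neg[w] + k) / n_neg))
            for w in vocab}

# High-scoring adjacent words can then be joined into longer candidate phrases,
# mirroring the seed-then-construct procedure listed in the contents above.
```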
