241

A Probabilistic Tagging Module Based on Surface Pattern Matching

Eklund, Robert January 1993 (has links)
A problem with automatic tagging and lexical analysis is that it is never 100% accurate. In order to arrive at better figures, one needs to study the character of what is left untagged by automatic taggers. In this paper the untagged residue output by the automatic analyser SWETWOL (Karlsson 1992) at Helsinki is studied. SWETWOL assigns tags to words in Swedish texts mainly through dictionary lookup. The contents of the untagged residue files are described and discussed, and possible ways of solving different problems are proposed. One method of tagging residual output is proposed and implemented: the left-stripping method, in which untagged words are stripped of their left-most letters, looked up in a dictionary and, if found, tagged according to the information in that dictionary. If the stripped word is not found in the dictionary, a match is sought in ending lexica containing statistical information about the word classes associated with that particular word form (i.e., final letter cluster, whether this is a grammatical suffix or not) and the relative frequency of each word class. If a match is found, the word is given graduated tagging according to the statistical information in the ending lexicon. If a match is not found, the word is stripped of what is now its left-most letter and is recursively searched in the dictionary and ending lexica (in that order). The ending lexica employed in this paper are retrieved from a reversed version of Nusvensk Frekvensordbok (Allén 1970) and contain endings of between one and seven letters. The contents of the ending lexica are described and discussed to a certain degree. The programs working according to the principles described are run on files of untagged residual output. Appendices include, among other things, LISP source code, untagged and tagged files, the ending lexica containing one- and two-letter endings, and excerpts from the ending lexica containing three- to seven-letter endings.
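To make the left-stripping procedure concrete, here is a minimal sketch in Python (the thesis's actual implementation is in LISP); the toy dictionary, ending-lexicon contents, and example words are invented for illustration:

```python
# Hedged sketch of the left-stripping method described above, with toy data.
DICTIONARY = {"hund": "NOUN", "springa": "VERB"}        # toy dictionary
ENDING_LEXICA = {                                       # toy ending lexica:
    "ade": {"VERB": 0.92, "NOUN": 0.08},                # ending -> relative
    "or":  {"NOUN": 0.85, "VERB": 0.15},                # word-class frequencies
}

def left_strip_tag(word, max_ending=7):
    """Return a (possibly graduated) tag distribution for an untagged word."""
    remainder = word
    while len(remainder) > 1:
        # 1. Strip the current left-most letter.
        remainder = remainder[1:]
        # 2. Dictionary lookup on the stripped form.
        if remainder in DICTIONARY:
            return {DICTIONARY[remainder]: 1.0}
        # 3. Ending-lexicon lookup: the remainder is now a final letter
        #    cluster of the original word (endings of one to seven letters).
        if len(remainder) <= max_ending and remainder in ENDING_LEXICA:
            return dict(ENDING_LEXICA[remainder])       # graduated tagging
    return {}  # nothing found: the word stays untagged

print(left_strip_tag("xhund"))     # -> {'NOUN': 1.0}
print(left_strip_tag("qqtalade"))  # -> {'VERB': 0.92, 'NOUN': 0.08}
```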
242

Characterizing, classifying and transforming language model distributions

Kniele, Annika January 2023 (has links)
Large Language Models (LLMs) have become ever larger in recent years, typically demonstrating improved performance as the number of parameters increases. This thesis investigates how the probability distributions output by language models differ depending on the size of the model. For this purpose, three features for capturing the differences between the distributions are defined, namely the difference in entropy, the difference in probability mass in different slices of the distribution, and the difference in the number of tokens covering the top-p probability mass. The distributions are then put into different distribution classes based on how they differ from the distributions of the differently-sized model. Finally, the distributions are transformed to be more similar to the distributions of the other model. The results suggest that classifying distributions before transforming them, and adapting the transformations based on which class a distribution is in, improves the transformation results. It is also shown that letting a classifier choose the class label for each distribution yields better results than using random labels. Furthermore, the findings indicate that transforming the distributions using entropy and the number of tokens in the top-p probability mass makes the distributions more similar to the targets, while transforming them based on the probability mass of individual slices of the distributions makes the distributions more dissimilar.
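As a rough illustration of the three features described above, the following sketch computes them for two next-token distributions; the array names, slice boundaries, and top-p value are assumptions, not the thesis's actual settings:

```python
# Hedged sketch of the three distribution features: entropy difference,
# probability-mass difference per rank slice, and difference in the number
# of tokens covering the top-p probability mass.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def slice_masses(p, edges=(10, 100, 1000)):
    """Probability mass in rank slices of the sorted distribution."""
    s = np.sort(p)[::-1]
    bounds = (0,) + edges + (len(s),)
    return [float(s[a:b].sum()) for a, b in zip(bounds, bounds[1:])]

def top_p_size(p, top_p=0.9):
    """Number of tokens needed to cover the top-p probability mass."""
    s = np.sort(p)[::-1]
    return int(np.searchsorted(np.cumsum(s), top_p) + 1)

def distribution_features(p_small, p_large, top_p=0.9):
    return {
        "entropy_diff": entropy(p_large) - entropy(p_small),
        "slice_mass_diff": [a - b for a, b in
                            zip(slice_masses(p_large), slice_masses(p_small))],
        "top_p_size_diff": top_p_size(p_large, top_p) - top_p_size(p_small, top_p),
    }

# Toy usage with random softmax outputs over a 5000-token vocabulary.
rng = np.random.default_rng(0)
p1 = np.exp(rng.normal(size=5000)); p1 /= p1.sum()
p2 = np.exp(2 * rng.normal(size=5000)); p2 /= p2.sum()
print(distribution_features(p1, p2))
```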
243

Homograph Disambiguation and Diacritization for Arabic Text-to-Speech Using Neural Networks / Homografdisambiguering och diakritisering för arabiska text-till-talsystem med hjälp av neurala nätverk

Lameris, Harm January 2021 (has links)
Pre-processing Arabic text for Text-to-Speech (TTS) systems poses major challenges, as Arabic omits short vowels in writing. This omission leads to a large number of homographs, and means that Arabic text needs to be diacritized to disambiguate these homographs and match them with the intended pronunciation. Diacritizing Arabic has generally been achieved with rule-based, statistical, or hybrid methods that combine rule-based and statistical methods. Recently, diacritization methods involving deep learning have shown promise in reducing error rates. These deep-learning methods are not yet commonly used in TTS engines, however. To examine neural diacritization methods for use in TTS engines, we normalized and pre-processed a version of the Tashkeela corpus, a large diacritized corpus consisting largely of Classical Arabic texts, for TTS purposes. We then trained and tested three state-of-the-art Recurrent-Neural-Network-based models on this data set. Additionally, we tested these models on the Wiki News corpus, a test set that contains Modern Standard Arabic (MSA) news articles and thus more closely resembles most TTS queries. The models were evaluated by comparing the Diacritic Error Rate (DER) and Word Error Rate (WER) achieved for each data set to one another and to the DER and WER reported in the original papers. Moreover, the per-diacritic accuracy was examined, and a manual evaluation was performed. For the Tashkeela corpus, all models achieved a lower DER and WER than reported in the original papers, largely as a result of using more training data in addition to the TTS pre-processing steps performed on the data. For the Wiki News corpus, the error rates were higher, largely due to the domain gap between the data sets. We found that for both data sets the models overfit on common patterns and the most common diacritic, and that for the Wiki News corpus the models struggled with Named Entities and loanwords. Purely neural models generally outperformed the model that combined deep learning with rule-based and statistical corrections. These findings highlight the usability of deep learning methods for Arabic diacritization in TTS engines, as well as the need for diacritized corpora that are more representative of Modern Standard Arabic.
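For readers unfamiliar with the metrics, the following hedged sketch shows one plausible way to compute DER and WER for diacritization, assuming predicted and reference strings share the same undiacritized skeleton and are already aligned word-by-word and letter-by-letter; it is not the evaluation code used in the thesis:

```python
# Illustrative DER/WER computation for diacritization (assumption-laden).
ARABIC_DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def split_diacritics(word):
    """Return a list of (base letter, attached diacritics) pairs."""
    pairs = []
    for ch in word:
        if ch in ARABIC_DIACRITICS and pairs:
            pairs[-1] = (pairs[-1][0], pairs[-1][1] + ch)
        else:
            pairs.append((ch, ""))
    return pairs

def der_wer(predicted, reference):
    letter_errors = letters = word_errors = 0
    pred_words, ref_words = predicted.split(), reference.split()
    for pw, rw in zip(pred_words, ref_words):
        p_pairs, r_pairs = split_diacritics(pw), split_diacritics(rw)
        wrong = sum(p[1] != r[1] for p, r in zip(p_pairs, r_pairs))
        letter_errors += wrong
        letters += len(r_pairs)
        word_errors += wrong > 0
    return letter_errors / letters, word_errors / len(ref_words)

pred = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kataba (fatha on each letter)
ref  = "\u0643\u064F\u062A\u0650\u0628\u064E"   # kutiba (passive reading)
print(der_wer(pred, ref))                        # -> (0.666..., 1.0)
```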
244

Iterated learning framework for unsupervised part-of-speech induction

Christodoulopoulos, Christos January 2013 (has links)
Computational approaches to linguistic analysis have been used for more than half a century. The main tools come from the field of Natural Language Processing (NLP) and are based on rule-based or corpora-based (supervised) methods. Despite the undeniable success of supervised learning methods in NLP, they have two main drawbacks: on the practical side, it is expensive to produce the manual annotation (or the rules) required, and it is not easy to find annotators for less common languages. A theoretical disadvantage is that the computational analysis produced is tied to a specific theory or annotation scheme. Unsupervised methods offer the possibility to expand our analyses into more resource-poor languages, and to move beyond the conventional linguistic theories. They are a way of observing patterns and regularities emerging directly from the data and can provide new linguistic insights. In this thesis I explore unsupervised methods for inducing parts of speech across languages. I discuss the challenges in the evaluation of unsupervised learning and, at the same time, by looking at the historical evolution of part-of-speech systems, I make the case that the compartmentalised, traditional pipeline approach of NLP is not ideal for the task. I present a generative Bayesian system that makes it easy to incorporate multiple diverse features, spanning different levels of linguistic structure, such as morphology, lexical distribution, syntactic dependencies and word alignment information, that allow for the examination of cross-linguistic patterns. I test the system using features provided by unsupervised systems in a pipeline mode (where the output of one system is the input to another) and show that the performance of the baseline (distributional) model increases significantly, reaching and in some cases surpassing the performance of state-of-the-art part-of-speech induction systems. I then turn to the unsupervised systems that provided these sources of information (morphology, dependencies, word alignment) and examine the way that part-of-speech information influences their inference. Having established a bi-directional relationship between each system and my part-of-speech inducer, I describe an iterated learning method, where each component system is trained using the output of the other system in each iteration. The iterated learning method improves the performance of both component systems in each task. Finally, using this iterated learning framework, and by using parts of speech as the central component, I produce chains of linguistic structure induction that combine all the component systems to offer a more holistic view of NLP. To show the potential of this multi-level system, I demonstrate its use ‘in the wild’. I describe the creation of a vastly multilingual parallel corpus based on 100 translations of the Bible in a diverse set of languages. Using the multi-level induction system, I induce cross-lingual clusters, and provide some qualitative results of my approach. I show that it is possible to discover similarities between languages that correspond to ‘hidden’ morphological, syntactic or semantic elements.
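The iterated learning loop can be illustrated with a self-contained toy; both "inducers" below are drastic simplifications invented purely to show the control flow of retraining each component on the other's latest output:

```python
# Toy illustration of iterated learning: a POS inducer and a morphology
# inducer are retrained in alternation, each reusing the other's output.

def cluster_ids(signals):
    """Map each word's feature signal to a small integer cluster id."""
    ids = {}
    return {w: ids.setdefault(sig, len(ids)) for w, sig in signals.items()}

def induce_pos(words, morph_classes=None):
    # Toy POS induction: final letter, refined by the morphology class.
    return cluster_ids({w: (w[-1], None if morph_classes is None
                            else morph_classes[w]) for w in set(words)})

def induce_morphology(words, pos_tags=None):
    # Toy morphology induction: final bigram, refined by the induced POS tag.
    return cluster_ids({w: (w[-2:], None if pos_tags is None
                            else pos_tags[w]) for w in set(words)})

def iterated_learning(words, iterations=3):
    pos_tags = induce_pos(words)                            # initial POS-only pass
    for _ in range(iterations):
        morph_classes = induce_morphology(words, pos_tags)  # morphology uses POS
        pos_tags = induce_pos(words, morph_classes)         # POS reuses morphology
    return pos_tags, morph_classes

words = "the cats chased the dogs quickly".split()
print(iterated_learning(words))
```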
245

Finite-state Machine Construction Methods and Algorithms for Phonology and Morphology

Hulden, Mans January 2009 (has links)
This dissertation is concerned with finite state machine-based technology for modeling natural language. Finite-state machines have proven to be efficient computational devices in modeling natural language phenomena in morphology and phonology. Because of their mathematical closure properties, finite-state machines can be manipulated and combined in many flexible ways that closely resemble formalisms used in different areas of linguistics to describe natural language. The use of finite-state transducers in constructing natural language parsers and generators has proven to be a versatile approach to describing phonological alternation, morphological constraints and morphotactics, and syntactic phenomena on the phrase level. The main contributions of this dissertation are the development of a new model of multitape automata, the development of a new logic formalism that can substitute for regular expressions in constructing complex automata, and adaptations of these techniques to solving classical construction problems relating to finite-state transducers, such as modeling reduplication and complex phonological replacement rules. The multitape model presented here goes hand-in-hand with the logic formalism, the latter being a necessary step to constructing the former. These multitape automata can then be used to create entire morphological and phonological grammars, and can also serve as a neutral intermediate tool to ease the construction of automata for other purposes. The construction of large-scale finite-state models for natural language grammars is a very delicate process. Making any solution practicable requires great care in the efficient implementation of low-level tasks such as converting regular expressions, logical statements, sets of constraints, and replacement rules to automata or finite transducers. To support the overall endeavor of showing the practicability of the logical and multitape extensions proposed in this thesis, a detailed treatment of efficient implementation of finite-state construction algorithms for natural language purposes is also presented.
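As a small illustration of the closure properties mentioned above (not code from the dissertation), the following sketch combines two deterministic finite automata by the standard product construction, yielding a machine that accepts exactly the strings accepted by both:

```python
# Product construction: DFAs are closed under intersection.
from itertools import product

def intersect_dfas(dfa1, dfa2):
    """Each DFA is (states, alphabet, delta, start, finals), with delta a dict
    mapping (state, symbol) -> state. Returns the product DFA."""
    states1, alphabet, delta1, start1, finals1 = dfa1
    states2, _, delta2, start2, finals2 = dfa2
    states = set(product(states1, states2))
    delta = {((q1, q2), a): (delta1[(q1, a)], delta2[(q2, a)])
             for (q1, q2) in states for a in alphabet}
    finals = {(q1, q2) for (q1, q2) in states if q1 in finals1 and q2 in finals2}
    return states, alphabet, delta, (start1, start2), finals

def accepts(dfa, string):
    _, _, delta, state, finals = dfa
    for symbol in string:
        state = delta[(state, symbol)]
    return state in finals

# DFA A: even number of 'a's; DFA B: string ends in 'b'.
A = ({0, 1}, {"a", "b"},
     {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
B = ({0, 1}, {"a", "b"},
     {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1}, 0, {1})
AB = intersect_dfas(A, B)
print(accepts(AB, "aab"))   # True: two 'a's and ends in 'b'
print(accepts(AB, "ab"))    # False: odd number of 'a's
```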
246

Automatic error detection in non-native English

De Felice, Rachele January 2008 (has links)
This thesis describes the development of DAPPER ('Determiner And PrePosition Error Recogniser'), a system designed to automatically acquire models of occurrence for English prepositions and determiners to allow for the detection and correction of errors in their usage, especially in the writing of non-native speakers of the language. Prepositions and determiners are focused on because they are parts of speech whose usage is particularly challenging to acquire, both for students of the language and for natural language processing tools. The work presented in this thesis proposes to address this problem by developing a system which can acquire models of correct preposition and determiner occurrence, and can use this knowledge to identify divergences from these models as errors. The contexts of these parts of speech are represented by a sophisticated feature set, incorporating a variety of semantic and syntactic elements. DAPPER is found to perform well on preposition and determiner selection tasks in correct native English text. Results for each preposition and determiner are discussed in detail to understand the possible reasons for variations in performance, and whether these are due to problems with the structure of DAPPER or to deeper linguistic reasons. An in-depth analysis of all features used is also offered, quantifying the contribution of each feature individually. This can help establish whether the decision to include complex semantic and syntactic features is justified in the context of this task. Finally, the performance of DAPPER on non-native English text is assessed. The system is found to be robust when applied to text which does not contain any preposition or determiner errors. On an error correction task, results are mixed: DAPPER shows promising results on preposition selection and determiner confusion (definite vs. indefinite) errors, but is less successful in detecting errors involving missing or extraneous determiners. Several characteristics of learner writing are described, to gain a clearer understanding of what problems arise when natural language processing tools are used with this kind of text. It is concluded that the construction of contextual models is a viable approach to the task of preposition and determiner selection, despite outstanding issues pertaining to the domain of non-native writing.
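The underlying idea of flagging divergences from a model of correct usage can be sketched as follows; the features, training sentences, and classifier are toy assumptions and not DAPPER's actual design:

```python
# Toy contextual model for prepositions: train on correct text, then flag
# written prepositions that disagree with the model's preferred choice.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

PREPOSITIONS = {"in", "on", "at", "to", "for", "of", "with", "by"}

def context_features(tokens, i):
    """Tiny context representation; the real system uses far richer
    syntactic and semantic features."""
    return {"prev": tokens[i - 1] if i > 0 else "<s>",
            "next": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
            "prev2": tokens[i - 2] if i > 1 else "<s>"}

def training_instances(sentences):
    X, y = [], []
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in PREPOSITIONS:
                X.append(context_features(tokens, i))
                y.append(tok)
    return X, y

correct_text = ["she arrived at the station", "he is interested in music",
                "they waited at the station", "we rely on public transport"]
X, y = training_instances(correct_text)
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

def flag_errors(sentence):
    """Yield (position, written preposition, model's preferred preposition)."""
    tokens = sentence.lower().split()
    for i, tok in enumerate(tokens):
        if tok in PREPOSITIONS:
            predicted = model.predict([context_features(tokens, i)])[0]
            if predicted != tok:
                yield i, tok, predicted

print(list(flag_errors("she arrived on the station")))  # likely [(2, 'on', 'at')]
```

A real system would also apply a confidence threshold before flagging a divergence, to avoid over-correcting legitimate but infrequent usages.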
247

Generating affective natural language for parents of neonatal infants

Mahamood, Saad Ali January 2010 (has links)
The thesis presented here describes original research in the field of Natural Language Generation (NLG). NLG is the subfield of artificial intelligence concerned with the automatic production of documents from underlying data. This thesis in particular focuses on developing new and novel methods for generating text that take into consideration the recipient’s level of stress as a factor in adapting the resulting textual output. Taking the recipient’s level of stress into account was particularly salient given the domain in which this research was conducted: providing information for parents of pre-term infants during neonatal intensive care (NICU), a highly technical and stressful environment for parents, where emotional sensitivity must be shown in how information is presented. We investigated the emotional and informational needs of these parents through an extensive review of the past literature and two separate research studies with former and current NICU parents. The NLG system built for this research is called BabyTalk Family (BT-Family), a system that can produce for parents a textual summary of the medical events that have occurred for a baby in the NICU in the last twenty-four hours. The novelty of this system is that it is capable of estimating the recipient’s level of stress and, by using several affective NLG strategies, tailoring its output for a stressed audience, unlike traditional NLG systems, whose output remains unchanged regardless of the emotional state of the recipient. The key innovation in this system was the integration of several affective strategies in the Document Planner for tailoring textual output for stressed recipients. BT-Family’s output was evaluated with thirteen parents who had previously had a baby in neonatal care. We developed a methodology for an evaluation that involved a direct comparison between stressed and unstressed text for the same given medical scenario, for variables such as preference, understandability, helpfulness, and emotional appropriateness. The results obtained showed that the parents overwhelmingly preferred the stressed text for all of the variables measured.
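A highly simplified sketch of the document-planning idea, in which the same medical event is realised differently depending on an estimated stress level; the event structure and strategies shown here are invented placeholders, far simpler than BT-Family's actual affective strategies:

```python
# Toy stress-adapted realisation of a single medical event.
def realise_event(event, stress_level):
    """event: dict with 'name', 'value', 'explanation', 'reassurance'."""
    sentence = f"{event['name']} was {event['value']}."
    if stress_level == "high":
        # Affective strategies (placeholder versions): add a lay explanation
        # and reassurance, and avoid alarming technical detail.
        sentence += f" {event['explanation']} {event['reassurance']}"
    return sentence

event = {"name": "Oxygen saturation",
         "value": "slightly below the normal range overnight",
         "explanation": "This measures how much oxygen the blood is carrying.",
         "reassurance": "The nurses adjusted the support and it is now stable."}

print(realise_event(event, "low"))
print(realise_event(event, "high"))
```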
248

The semantics/pragmatics distinction : a defence of Grice

Greenhall, Owen F. R. January 2006 (has links)
The historical development of Morris’ tripartite distinction between syntax, semantics and pragmatics does not follow a smooth path. Examining definitions of the terms ‘semantic’ and ‘pragmatic’, and the phenomena they have been used to describe, provides insight into alternative approaches to the semantics/pragmatics distinction. Paul Grice’s work receives particular attention, and a taxonomy of philosophical positions, roughly divisible into content minimalist and maximalist groups, is set up. Grice’s often neglected theory of conventional implicature is defended from objections, various tests for the presence of conventional implicature are assessed, and the linguistic properties of conventional implicature are defined. Once rehabilitated, the theoretical utility of conventional implicature is demonstrated via a case study of the semantic import of the gender and number of pronouns in English. The better-known theory of conversational implicature is also examined and refined. New linguistic tests for such implicatures are devised and the refined theory is applied to scalar terms. A pragmatic approach to scalar implicatures is proposed and shown to fare better than alternatives presented by Uli Sauerland, Stephen Levinson and Gennaro Chierchia. With the details of the theory of conversational implicature established, the use made of Grice’s tool in the work of several philosophers is critically evaluated. Kent Bach’s minimalist approach to quantifier domain restriction is examined and criticised. Also, the linguistic evidence for semantic minimalism provided by Herman Cappelen and Ernie Lepore is found wanting. Finally, a content maximalist approach to quantifier domain restriction is proposed. The approach differs from other context maximalist theories, such as Jason Stanley’s, in relying on semantically unarticulated constituents. Stanley’s arguments against such theories are examined. Further applications of the approach are briefly surveyed.
249

ASKNet : automatically creating semantic knowledge networks from natural language text

Harrington, Brian January 2009 (has links)
This thesis details the creation of ASKNet (Automated Semantic Knowledge Network), a system for creating large-scale semantic networks from natural language texts. Using ASKNet as an example, we will show that by using existing natural language processing (NLP) tools, combined with a novel use of spreading activation theory, it is possible to efficiently create high-quality semantic networks on a scale never before achievable. The ASKNet system takes naturally occurring English texts (e.g., newspaper articles) and processes them using existing NLP tools. It then uses the output of those tools to create semantic network fragments representing the meaning of each sentence in the text. Those fragments are then combined by a spreading-activation-based algorithm that attempts to decide which portions of the networks refer to the same real-world entity. This allows ASKNet to combine the small fragments together into a single cohesive resource, which has more expressive power than the sum of its parts. Systems aiming to build semantic resources have typically either overlooked information integration completely, or else dismissed it as being AI-complete, and thus unachievable. In this thesis we will show that information integration is both an integral component of any semantic resource, and achievable through a combination of NLP technologies and novel applications of spreading activation theory. While extraction and integration of all knowledge within a text may be AI-complete, we will show that by processing large quantities of text efficiently, we can compensate for minor processing errors and missed relations with volume and creation speed. If relations are too difficult to extract, or we are unsure which nodes should be integrated at any given stage, we can simply leave them to be picked up later, when we have more information or come across a document which explains the concept more clearly. ASKNet is primarily designed as a proof-of-concept system. However, this thesis will show that it is capable of creating semantic networks larger than any existing similar resource in a matter of days, and furthermore that the networks it creates are of sufficient quality to be used for real-world tasks. We will demonstrate that ASKNet can be used to judge the semantic relatedness of words, achieving results comparable to the best state-of-the-art systems.
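One plausible way to picture the spreading-activation-based integration step is sketched below; the graph, decay parameter, and merge score are illustrative assumptions rather than ASKNet's actual algorithm:

```python
# Toy spreading activation: activation injected at a node spreads along
# weighted edges, and nodes from different fragments whose activation
# patterns overlap strongly become candidates for merging into one entity.

def spread_activation(graph, source, decay=0.5, steps=3):
    """graph: {node: {neighbour: edge_weight}}. Returns activation per node."""
    activation = {source: 1.0}
    frontier = {source: 1.0}
    for _ in range(steps):
        next_frontier = {}
        for node, act in frontier.items():
            for neighbour, weight in graph.get(node, {}).items():
                boost = act * weight * decay
                activation[neighbour] = activation.get(neighbour, 0.0) + boost
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + boost
        frontier = next_frontier
    return activation

def merge_score(graph, node_a, node_b):
    """Overlap of the activation patterns produced by the two candidate nodes."""
    act_a = spread_activation(graph, node_a)
    act_b = spread_activation(graph, node_b)
    shared = (set(act_a) & set(act_b)) - {node_a, node_b}
    return sum(min(act_a[n], act_b[n]) for n in shared)

# Two sentence fragments both mention a "bank" node; the shared neighbours
# ("loan", "money") push the merge score up.
graph = {
    "bank#1": {"loan": 1.0, "money": 0.8},
    "bank#2": {"money": 0.9, "river": 0.7},
    "loan": {"bank#1": 1.0}, "money": {"bank#1": 0.8, "bank#2": 0.9},
    "river": {"bank#2": 0.7},
}
print(merge_score(graph, "bank#1", "bank#2"))
```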
250

The language of humour

Mihalcea, Rada January 2010 (has links)
Humour is one of the most interesting and puzzling aspects of human behaviour. Despite the attention it has received from fields such as philosophy, linguistics, and psychology, there have been only a few attempts to create computational models for humour recognition and analysis. In this thesis, I use corpus-based approaches to formulate and test hypotheses concerned with the processing of verbal humour. The thesis makes two important contributions. First, it brings empirical evidence that computational approaches can be successfully applied to the task of humour recognition. Through experiments performed on very large data sets, I show that automatic classification techniques can be effectively used to distinguish between humorous and non-humorous texts, using content-based features or models of incongruity. Moreover, using a method for measuring feature saliency, I identify and validate several dominant word classes that can be used to characterize humorous text. Second, the thesis provides corpus-based support for the validity of previously formulated linguistic theories, indicating that humour is primarily due to incongruity and humour-specific language. Experiments performed on collections of verbal humour show that both incongruity and content-based features can be successfully used to model humour, and that these features are even more effective when used in tandem.
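A minimal sketch of the corpus-based classification setup, using a bag-of-words Naive Bayes model on a handful of invented one-liners; the real experiments use far larger corpora and richer content and incongruity features:

```python
# Toy humorous vs non-humorous text classifier with content-based features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

humorous = ["I used to be a banker but I lost interest",
            "I'm reading a book on anti-gravity, it's impossible to put down"]
serious = ["The central bank raised interest rates today",
           "The library will be closed on public holidays"]

texts = humorous + serious
labels = [1] * len(humorous) + [0] * len(serious)   # 1 = humorous

classifier = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
classifier.fit(texts, labels)
print(classifier.predict(["I lost interest in my anti-gravity book"]))
```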
