191

An Exploration of the Word2vec Algorithm: Creating a Vector Representation of a Language Vocabulary that Encodes Meaning and Usage Patterns in the Vector Space Structure

Le, Thu Anh 05 1900
This thesis is an exploration and exposition of a highly efficient shallow neural network algorithm called word2vec, which was developed by T. Mikolov et al. to create vector representations of a language vocabulary such that information about the meaning and usage of the vocabulary words is encoded in the vector space structure. Chapter 1 introduces natural language processing, vector representations of language vocabularies, and the word2vec algorithm. Chapter 2 reviews the basic mathematical theory of deterministic convex optimization. Chapter 3 provides background on some concepts from computer science that are used in the word2vec algorithm: Huffman trees, neural networks, and binary cross-entropy. Chapter 4 provides a detailed discussion of the word2vec algorithm itself, including continuous bag of words, skip-gram, hierarchical softmax, and negative sampling. Finally, Chapter 5 explores some applications of vector representations: word categorization, analogy completion, and language translation assistance.
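As a concrete illustration of the skip-gram and negative-sampling ideas this abstract mentions, here is a minimal sketch in plain NumPy. The toy corpus, hyperparameters, and uniform negative-sampling distribution are illustrative assumptions (word2vec itself samples negatives from a smoothed unigram distribution); it is a sketch of the technique, not the thesis's implementation.

```python
# Minimal sketch of skip-gram with negative sampling, one word2vec variant.
# Negatives are drawn uniformly here, whereas word2vec samples from a
# smoothed unigram distribution and avoids sampling the true context word.
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 16                 # vocabulary size, embedding dimension

W_in = rng.normal(0, 0.1, (V, D))     # "input" vectors (the embeddings)
W_out = rng.normal(0, 0.1, (V, D))    # "output" (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, window, k = 0.05, 2, 3            # learning rate, context window, negatives
for epoch in range(200):
    for pos in range(len(corpus)):
        c = idx[corpus[pos]]
        lo, hi = max(0, pos - window), min(len(corpus), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos == pos:
                continue
            # one positive pair plus k negative samples, labelled 1 / 0
            targets = [idx[corpus[ctx_pos]]] + list(rng.integers(0, V, k))
            labels = [1.0] + [0.0] * k
            for tgt, label in zip(targets, labels):
                v_in, v_out = W_in[c].copy(), W_out[tgt].copy()
                grad = sigmoid(v_in @ v_out) - label   # binary cross-entropy gradient
                W_out[tgt] -= lr * grad * v_in
                W_in[c] -= lr * grad * v_out

def cos(a, b):
    u, v = W_in[idx[a]], W_in[idx[b]]
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# words used in similar contexts drift toward similar vectors
print("sim(cat, dog) =", round(cos("cat", "dog"), 3))
print("sim(cat, on)  =", round(cos("cat", "on"), 3))
```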
192

Modelling syntactic gradience with loose constraint-based parsing

Prost, Jean-Philippe January 2008
Thesis submitted for the joint institutional requirements for the double-badged degree of Doctor of Philosophy and Docteur de l'Université de Provence, Spécialité : Informatique. / Thesis (PhD)--Macquarie University, Division of Information and Communication Sciences, Department of Computing, 2008. / Includes bibliography (p. 229-240) and index. / Introduction -- Background -- A model-theoretic framework for PG -- Loose constraint-based parsing -- A computational model for gradience -- Conclusion.

The grammaticality of a sentence has conventionally been treated in a binary way: either a sentence is grammatical or it is not. A growing body of work, however, focuses on intermediate levels of acceptability, sometimes referred to as gradience. To date, the bulk of this work has concerned itself with human assessments of syntactic gradience. This dissertation explores the possibility of building a robust computational model that accords with these human judgements.

We suggest that the concepts of Intersective Gradience and Subsective Gradience introduced by Aarts for modelling graded judgements be extended to cover deviant language. Under such a model, the problem raised by gradience is to classify an utterance as a member of a specific category according to its syntactic characteristics. More specifically, we extend Intersective Gradience (IG) to be concerned with choosing the most suitable syntactic structure for an utterance among a set of candidates, while Subsective Gradience (SG) is extended to be concerned with calculating to what extent the chosen syntactic structure is typical of the category at stake. IG is addressed by relying on a criterion of optimality, while SG is addressed by rating an utterance according to its grammatical acceptability. As for the syntactic characteristics that serve as features for classifying an utterance, our investigation of different frameworks for representing the syntax of natural language shows that they can easily be represented in Model-Theoretic Syntax; we choose Property Grammars (PG), which allows the characterisation of an utterance to be modelled. We present a fully automated solution for modelling syntactic gradience, which characterises any well-formed or ill-formed input sentence, generates an optimal parse for it, and then rates the utterance according to its grammatical acceptability.

Through the development of this new model of gradience, the contribution of this work is threefold.

First, we specify a model-theoretic logical framework for PG, which bridges the gap observed in the existing formalisation regarding the constraint satisfaction and constraint relaxation mechanisms, and how they relate to the projection of a category during the parsing process. This new framework introduces the notion of loose satisfaction, along with a formulation in first-order logic that enables reasoning about the characterisation of an utterance.

Second, we present our implementation of Loose Satisfaction Chart Parsing (LSCP), a dynamic programming approach based on the above mechanisms, which is proven to always find the full parse of optimal merit. Although it has a high theoretical worst-case time complexity, with the help of heuristics it performs well enough to let us experiment with our model of gradience.

And third, after postulating that human acceptability judgements can be predicted by factors derivable from LSCP, we present a numeric model for rating an utterance according to its syntactic gradience. We measure a good correlation with human judgements of grammatical acceptability. Moreover, the model turns out to outperform an existing one from the literature, which was evaluated on manually generated parses. / Mode of access: World Wide Web. / xxviii, 283 p. : ill.
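To make the two-step scheme concrete, here is a toy stand-in for the idea as the abstract describes it: choose the candidate parse of optimal merit (the extended IG step), then rate the utterance's acceptability from its constraint violations (the extended SG step). The constraint names, weights, and both scoring formulas are invented for illustration; they are not the thesis's actual LSCP algorithm or rating model.

```python
# Toy stand-in: pick the candidate parse of optimal merit (IG), then rate
# the utterance from its constraint violations (SG). All weights and
# formulas below are illustrative inventions.
from dataclasses import dataclass

@dataclass
class Candidate:
    parse: str
    satisfied: list    # constraints the parse satisfies
    violated: list     # constraints loosely relaxed (violated)

WEIGHTS = {"agreement": 3.0, "word-order": 2.0, "optional-det": 0.5}

def merit(c):
    # reward satisfied constraints, penalise violations
    return (sum(WEIGHTS.get(k, 1.0) for k in c.satisfied)
            - sum(WEIGHTS.get(k, 1.0) for k in c.violated))

def acceptability(c):
    # fraction of weighted constraint mass that is satisfied, in [0, 1]
    sat = sum(WEIGHTS.get(k, 1.0) for k in c.satisfied)
    vio = sum(WEIGHTS.get(k, 1.0) for k in c.violated)
    return sat / (sat + vio) if sat + vio else 1.0

# two candidate structures for the deviant utterance "the boys was late"
candidates = [
    Candidate("(S (NP the boys) (VP was late))",
              satisfied=["word-order"], violated=["agreement"]),
    Candidate("(S (NP the) (NP boys) (VP was late))",
              satisfied=[], violated=["agreement", "word-order"]),
]

best = max(candidates, key=merit)       # IG: pick the optimal parse
print(best.parse, "acceptability:", round(acceptability(best), 2))
```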
193

Skoner en kleiner vertaalgeheues (Cleaner and smaller translation memories)

Wolff, Friedel 10 1900
Computers can play a useful role in translation. Two approaches are translation memory systems and machine translation systems. With these two technologies a translation memory is used: a bilingual collection of previous translations. This thesis presents methods to improve the quality of a translation memory. A machine learning approach is followed to identify incorrect entries in a translation memory. A variety of learning features in three categories are presented: features associated with text length, features calculated by quality checkers such as translation checkers, a spell checker, and a grammar checker, as well as statistical features computed with the help of external data. The evaluation of translation memory systems is not yet standardised. This thesis points out a number of problems with existing evaluation methods, and an improved evaluation method is developed. By removing the incorrect entries from a translation memory, a smaller, cleaner translation memory is made available to applications. Experiments demonstrate that such a translation memory results in better performance in a translation memory system. As supporting evidence for the value of a cleaner translation memory, an improvement is also shown in training a machine translation system. / School of Computing / Ph. D. (Computer Science)
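A minimal sketch of the classification idea described here might look as follows: compute simple features for each translation-memory entry and train a classifier to flag incorrect entries. Only the text-length feature category is shown; the toy data, feature choices, and use of scikit-learn are illustrative assumptions, not the thesis's actual feature set or implementation.

```python
# Flag incorrect translation-memory entries with a supervised classifier
# over simple length-based features (one of the three feature categories
# the abstract names). Toy data for illustration only.
from sklearn.linear_model import LogisticRegression

def features(source, target):
    # length-based features: absolute lengths, ratio, and difference
    ls, lt = len(source), len(target)
    return [ls, lt, ls / lt if lt else 0.0, abs(ls - lt)]

# (source, target, label), where label 1 marks an incorrect entry
tm = [
    ("Open the file.", "Maak die lêer oop.", 0),
    ("Save your work before closing.", "Stoor jou werk voor jy afsluit.", 0),
    ("Cancel", "Hierdie handeling kan nie ontdoen word nie.", 1),  # mismatched pair
    ("An unexpected error occurred while saving.", "Fout.", 1),    # truncated target
]

X = [features(s, t) for s, t, _ in tm]
y = [label for _, _, label in tm]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# entries predicted incorrect would be removed, leaving a cleaner, smaller TM
print(clf.predict([features("Print the document.", "Druk die dokument.")]))
```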
194

An exploratory study using the predicate-argument structure to develop methodology for measuring semantic similarity of radiology sentences

Newsom, Eric Tyner 12 November 2013
Indiana University-Purdue University Indianapolis (IUPUI) / The amount of information produced as electronic free text in healthcare is increasing to levels that humans cannot process in the course of advancing their professional practice. Information extraction (IE) is a sub-field of natural language processing whose goal is data reduction of unstructured free text. Central to IE is an annotated corpus that frames how IE methods should create the logical expressions necessary for processing the meaning of text. Most annotation approaches seek to maximize meaning and knowledge by chunking sentences into phrases and mapping these phrases to a knowledge source to create a logical expression. However, these studies consistently have problems addressing semantics, and none have addressed the issue of semantic similarity (or synonymy) to achieve data reduction. A successful methodology for data reduction depends on a framework that can represent currently popular phrasal methods of IE but also fully represent the sentence. This study explores and reports on the benefits, problems, and requirements of using the predicate-argument structure (PAS) as that framework. The text from which PAS structures are formed is a convenience sample from a prior study: ten synsets of 100 unique sentences from radiology reports, deemed by domain experts to mean the same thing.
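To illustrate why predicate-argument structures are attractive for measuring semantic similarity, here is a toy sketch: two sentences with different surface forms can reduce to the same predicate and role fillers, making their similarity easy to score. The hand-written PAS dicts and the overlap score below are illustrative assumptions, not the methodology developed in the study.

```python
# Sentences with different surface forms can share the same predicate and
# role fillers; a simple Jaccard overlap over (role, filler) pairs then
# captures their semantic similarity.
def pas_similarity(pas_a, pas_b):
    # overlap of (role, filler) pairs, normalised by their union
    a, b = set(pas_a.items()), set(pas_b.items())
    return len(a & b) / len(a | b)

# "No evidence of pneumothorax." and "Pneumothorax is not seen." reduce to
# the same structure, while an unrelated finding does not.
s1 = {"predicate": "absent", "arg1": "pneumothorax"}
s2 = {"predicate": "absent", "arg1": "pneumothorax"}
s3 = {"predicate": "present", "arg1": "effusion", "argm-loc": "left base"}

print(pas_similarity(s1, s2))   # 1.0: same meaning, different wording
print(pas_similarity(s1, s3))   # 0.0: unrelated findings
```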
195

Context specific text mining for annotating protein interactions with experimental evidence

Pandit, Yogesh 03 January 2014
Indiana University-Purdue University Indianapolis (IUPUI) / Proteins are the building blocks of a biological system. They interact with other proteins to produce unique biological phenomena. Protein-protein interactions play a valuable role in understanding the molecular mechanisms occurring in any biological system. Protein interaction databases are a rich source of protein-interaction information. They gather large amounts of information from published literature to enrich their data, and expert curators perform most of this work manually. The amount of accessible and publicly available literature is growing rapidly; manual annotation is time-consuming and, at the rate at which available information is growing, cannot keep up on its own. Tools are needed to process these huge amounts of data and extract the valuable gist that can help curators proceed faster. When extracting protein-protein interaction evidence from the literature, a mere mention of a protein found by a look-up approach cannot validate the interaction; supporting protein interaction information with experimental evidence can. In this study, we apply machine learning based classification techniques to classify a given protein-interaction document into an interaction detection method. We use biological attributes and experimental factors, different combinations of which define any particular interaction detection method. Then, using the predicted detection methods, proteins identified through named entity recognition techniques, and the part-of-speech composition of sentences, we search for sentences with experimental evidence for a protein-protein interaction. We report an accuracy of 75.1% with an F-score of 47.6% on a dataset containing 2035 training documents and 300 test documents.
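A minimal sketch of the two-stage idea in this abstract: first classify a document into an interaction detection method, then keep only sentences that mention recognised proteins alongside experimental-evidence terms. The method labels, toy protein lexicon, and evidence keywords are illustrative assumptions; the thesis's actual features (biological attributes and experimental factors) and NER components are richer.

```python
# Stage 1: classify a document into an interaction detection method.
# Stage 2: keep sentences mentioning at least two known proteins together
# with an experimental-evidence term. Toy data throughout.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "bait and prey constructs were co-expressed in yeast",
    "reporter gene activation confirmed the two hybrid interaction",
    "the complex was pulled down and detected by western blot",
    "co-immunoprecipitation followed by immunoblotting",
]
labels = ["two hybrid", "two hybrid",
          "coimmunoprecipitation", "coimmunoprecipitation"]

method_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
method_clf.fit(docs, labels)

PROTEINS = {"p53", "mdm2"}                          # toy NER stand-in
EVIDENCE = {"pulled", "immunoprecipitated", "co-expressed", "blot"}

def has_evidence(sentence):
    # crude evidence filter in place of full NER + POS analysis
    toks = set(sentence.lower().replace(".", "").split())
    return len(toks & PROTEINS) >= 2 and bool(toks & EVIDENCE)

sent = "p53 was pulled down together with mdm2."
print(method_clf.predict([sent])[0], has_evidence(sent))
```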
196

Aural Mapping of STEM Concepts Using Literature Mining

Bharadwaj, Venkatesh 06 March 2013
Indiana University-Purdue University Indianapolis (IUPUI) / Recent technological applications have made people's lives heavily dependent on Science, Technology, Engineering, and Mathematics (STEM) and its applications, and understanding basic science is a must in order to use and contribute to this technological revolution. Science education at the middle and high school levels, however, depends heavily on visual representations such as models, diagrams, figures, animations, and presentations. This leaves visually impaired students with very few options to learn science and secure a career in STEM-related areas. Recent experiments have shown that small aural cues called audemes help visually impaired students understand and memorize science concepts. Audemes are non-verbal sound translations of a science concept. To make science concepts available as audemes for visually impaired students, this thesis presents an automatic system for audeme generation from STEM textbooks. The thesis describes the systematic application of multiple natural language processing tools and techniques, such as dependency parsing, POS tagging, information retrieval, semantic mapping of aural words, and machine learning, to transform a science concept into a combination of atomic sounds, thus forming an audeme. We present a rule-based classification method for all STEM-related concepts. This work also presents a novel way of mapping and extracting the most related sounds for the words used in a textbook. Additionally, machine learning methods are used to customize the output according to a user's perception. The system presented is robust, scalable, fully automatic, and dynamically adaptable for audeme generation.
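As a rough illustration of the pipeline this abstract outlines, the sketch below picks out content words from a concept description and maps each to an atomic sound, concatenating them into an audeme. The sound table, file names, and naive stopword filter are hypothetical stand-ins for the thesis's POS tagging, information retrieval, and semantic mapping stages.

```python
# Toy audeme generation: keep content words that have a sound mapping and
# concatenate their atomic sounds. All names below are invented stand-ins.
SOUND_TABLE = {                      # word -> atomic sound clip (invented names)
    "water": "stream_trickle.wav",
    "evaporation": "kettle_hiss.wav",
    "cloud": "wind_soft.wav",
    "rain": "rain_patter.wav",
}
STOPWORDS = {"the", "of", "a", "and", "then", "forms"}

def make_audeme(concept):
    # keep content words that have a sound mapping, in order of mention
    words = [w.strip(".,").lower() for w in concept.split()]
    return [SOUND_TABLE[w] for w in words
            if w not in STOPWORDS and w in SOUND_TABLE]

print(make_audeme("The evaporation of water forms a cloud and then rain."))
# -> ['kettle_hiss.wav', 'stream_trickle.wav', 'wind_soft.wav', 'rain_patter.wav']
```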
