201

Automatic Supervised Thesauri Construction with Roget’s Thesaurus

Kennedy, Alistair H January 2012 (has links)
Thesauri are important tools for many Natural Language Processing applications. Roget's Thesaurus is particularly useful. It is of high quality and has been in development for over a century and a half. Yet its applications have been limited, largely because the only publicly available edition dates from 1911. This thesis proposes and tests methods of automatically updating the vocabulary of the 1911 Roget's Thesaurus. I use the Thesaurus as a source of training data in order to learn from Roget's for the purpose of updating Roget's. The lexicon is updated in two stages. First, I develop a measure of semantic relatedness that enhances existing distributional techniques. I improve existing methods by using known sets of synonyms from Roget's to train a distributional measure to better identify near synonyms. Second, I use the new measure of semantic relatedness to find where in Roget's to place a new word. Existing words from Roget's are used as training data to tune the parameters of three methods of inserting words. Over 5000 new words and word-senses were added using this process. I conduct two kinds of evaluation on the updated Thesaurus. One is on the procedure for updating Roget's. This is accomplished by removing some words from the Thesaurus and testing my system's ability to reinsert them in the correct location. Human evaluation of the newly added words is also performed. Annotators must determine whether a newly added word is in the correct location. They found that in most cases the new words were almost indistinguishable from those already existing in Roget's Thesaurus. The second kind of evaluation is to establish the usefulness of the updated Roget's Thesaurus on actual Natural Language Processing applications. These applications include determining semantic relatedness between word pairs or sentence pairs, identifying the best synonym from a set of candidates, solving SAT-style analogy problems, pseudo-word-sense disambiguation, and sentence ranking for text summarization. The updated Thesaurus consistently performed at least as well as, or better than, the original Thesaurus on all these applications.
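The two-stage idea described above (learn a relatedness measure from known Roget's groupings, then use it to place new words) can be illustrated with a deliberately simplified sketch; the toy corpus, category names, and cosine scoring below are assumptions for illustration, not the thesis's actual training or insertion methods.

```python
# A minimal, illustrative sketch: build simple co-occurrence vectors, score
# relatedness by cosine similarity, and place a new word in the category whose
# existing members are most related to it. Corpus and categories are invented.
from collections import Counter, defaultdict
from math import sqrt

corpus = [
    "the happy child laughed with joy at the cheerful clown",
    "the sad man wept with sorrow and grief at the funeral",
    "a joyful crowd cheered the cheerful and happy performers",
    "her grief and sorrow made the sad evening feel endless",
]

# Co-occurrence vectors within a sentence window.
vectors = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j, c in enumerate(words):
            if i != j:
                vectors[w][c] += 1

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical Roget-style categories used as placement targets.
categories = {
    "Cheerfulness": ["happy", "cheerful", "joy"],
    "Lamentation": ["sad", "sorrow", "grief"],
}

def place(new_word):
    """Return the category whose members are, on average, most related to new_word."""
    scores = {
        name: sum(cosine(vectors[new_word], vectors[m]) for m in members) / len(members)
        for name, members in categories.items()
    }
    return max(scores, key=scores.get)

print(place("joyful"))  # expected to land in "Cheerfulness" on this toy data
```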
202

Evaluating Text Segmentation

Fournier, Christopher January 2013 (has links)
This thesis investigates the evaluation of automatic and manual text segmentation. Text segmentation is the process of placing boundaries within text to create segments according to some task-dependent criterion. An example of text segmentation is topical segmentation, which aims to segment a text according to the subjective definition of what constitutes a topic. A number of automatic segmenters have been created to perform this task, and the question that this thesis answers is how to select the best automatic segmenter for such a task. This requires choosing an appropriate segmentation evaluation metric, confirming the reliability of a manual solution, and then finally employing an evaluation methodology that can select the automatic segmenter that best approximates human performance. A variety of comparison methods and metrics exist for comparing segmentations (e.g., WindowDiff, Pk), and all save a few are able to award partial credit for nearly missing a boundary. Those comparison methods that can award partial credit unfortunately lack consistency, symmetry, intuition, and a host of other desirable qualities. This work proposes a new comparison method named boundary similarity (B), which is based upon a new minimal boundary edit distance to compare two segmentations. Near misses are frequent, even among manual segmenters (as is exemplified by the low inter-coder agreement reported by many segmentation studies). This work adapts some inter-coder agreement coefficients to award partial credit for near misses using the new metric proposed herein, B. The methodologies employed by many works introducing automatic segmenters evaluate them simply by comparing their output to a single manual segmentation of a text, often presenting nothing more than a series of mean performance values (with no standard deviation, no standard error, and little if any statistical hypothesis testing). This work asserts that one segmentation of a text cannot constitute a "true" segmentation; specifically, one manual segmentation is simply one sample of the population of all possible segmentations of a text and of that subset of desirable segmentations. This work further asserts that the adapted inter-coder agreement statistics proposed herein should be used to determine the reproducibility and reliability of a coding scheme and set of manual codings, and then statistical hypothesis testing using the specific comparison methods and methodologies demonstrated herein should be used to select the best automatic segmenter. This work proposes new segmentation evaluation metrics, adapted inter-coder agreement coefficients, and methodologies. Most importantly, this work experimentally compares the state-of-the-art comparison methods to those proposed herein on artificial data that simulates a variety of scenarios and chooses the best one (B). The ability of adapted inter-coder agreement coefficients, based upon B, to discern between various levels of agreement in artificial and natural data sets is then demonstrated. Finally, a contextual evaluation of three automatic segmenters is performed with the state-of-the-art comparison methods and B, using the methodology proposed herein, to demonstrate the benefits and versatility of B as opposed to its counterparts.
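For reference, the window-based baseline metrics named above (Pk and WindowDiff) can be sketched as follows; this is a simplified illustration under assumed conventions (segmentations given as segment sizes, window size of roughly half the mean reference segment length), not the boundary-edit-distance computation behind B.

```python
# Hedged sketch of two baseline window metrics (Pk and WindowDiff).
# Segmentations are given as lists of segment sizes (masses).
def masses_to_positions(masses):
    """Convert segment masses, e.g. [3, 2, 4], to per-unit segment labels."""
    positions = []
    for seg_id, mass in enumerate(masses):
        positions.extend([seg_id] * mass)
    return positions

def boundaries_in_window(positions, start, end):
    """Number of boundaries strictly inside positions[start:end+1]."""
    return sum(1 for i in range(start, end) if positions[i] != positions[i + 1])

def window_diff(reference, hypothesis, k=None):
    ref, hyp = masses_to_positions(reference), masses_to_positions(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same number of units"
    n = len(ref)
    if k is None:  # conventional choice: about half the mean reference segment length
        k = max(2, round(n / (2 * len(reference))))
    errors = sum(
        1
        for i in range(n - k)
        if boundaries_in_window(ref, i, i + k) != boundaries_in_window(hyp, i, i + k)
    )
    return errors / (n - k)

def pk(reference, hypothesis, k=None):
    ref, hyp = masses_to_positions(reference), masses_to_positions(hypothesis)
    n = len(ref)
    if k is None:
        k = max(2, round(n / (2 * len(reference))))
    errors = sum(
        1
        for i in range(n - k)
        if (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
    )
    return errors / (n - k)

ref = [5, 5, 5]          # a 15-unit text with boundaries after units 5 and 10
near_miss = [5, 6, 4]    # one boundary off by a single unit
print(window_diff(ref, near_miss), pk(ref, near_miss))
```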
203

Constructing component-based systems directly from requirements using incremental composition

Nordin, Azlin January 2013 (has links)
In software engineering, system construction typically starts from a requirements specification that has been engineered from raw requirements in a natural language. The specification is used to derive intermediate requirements models such as structured or object-oriented models. Throughout the stages of system construction, these artefacts will be used as reference models. In general, in order to derive a design specification out of the requirements, the entire set of requirements specifications has to be analysed. Such models at best only approximate the raw requirements since these design models are derived as a result of the abstraction process according to the chosen software development methodology, and subjected to the expertise, intuition, judgment and experiences of the analysts or designers of the system. These abstraction models require the analysts to elicit all useful information from the requirements, and there is a potential risk that some information may be lost in the process of model construction. As the use of natural language requirements in system construction is inevitable, the central focus of this study was to use requirements stated in natural language in contrast to any other requirements representation (e.g. modelling artefact). In this thesis, an approach that avoids intermediate requirements models, and maps natural language requirements directly into architectural constructs, and thus minimises information loss during the model construction process, has been defined. This approach has been grounded on the adoption of a component model that supports incremental composition. Incremental composition allows a system to be constructed piece by piece. By mapping a raw requirement to elements of the component model, a partial architecture that satisfies that requirement is constructed. Consequently, by iterating this process for all the requirements, one at a time, the incremental composition to build the system piece by piece directly from the requirements can be achieved.
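A minimal sketch of the incremental-composition idea follows, with an invented keyword-to-component mapping standing in for the thesis's requirement analysis and component model; it only illustrates building a partial architecture per requirement and merging the pieces one at a time.

```python
# Hedged sketch (not the thesis's actual component model) of building a system
# architecture incrementally, one natural-language requirement at a time.
# The mapping rules and component names are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Architecture:
    components: set = field(default_factory=set)
    connectors: set = field(default_factory=set)   # (provider, consumer) pairs

    def compose(self, partial):
        """Incremental composition: merge a partial architecture into the whole."""
        self.components |= partial.components
        self.connectors |= partial.connectors

# Toy mapping from keywords in a raw requirement to component-model elements.
RULES = {
    "login": Architecture({"AuthService", "UserStore"}, {("UserStore", "AuthService")}),
    "report": Architecture({"ReportGenerator", "UserStore"}, {("UserStore", "ReportGenerator")}),
    "notify": Architecture({"Notifier", "AuthService"}, {("AuthService", "Notifier")}),
}

def map_requirement(requirement):
    partial = Architecture()
    for keyword, fragment in RULES.items():
        if keyword in requirement.lower():
            partial.compose(fragment)
    return partial

requirements = [
    "The user shall be able to login with a password.",
    "The system shall notify the user after login.",
    "The manager shall generate a monthly report.",
]

system = Architecture()
for req in requirements:            # one requirement at a time
    system.compose(map_requirement(req))
print(sorted(system.components))
print(sorted(system.connectors))
```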
204

Information Density and Persuasiveness in Naturalistic Data

January 2020 (has links)
abstract: Attitudes play a fundamental role when making critical judgments, and the extremity of people's attitudes can be influenced by their emotions, beliefs, or past experiences and behaviors. Human attitudes and preferences are susceptible to social influence, and attempts to influence or change another person's attitudes are pervasive in all societies. Given the importance of attitudes and attitude change, the current project investigated linguistic aspects of conversations that lead to attitude change by analyzing a dataset mined from Reddit's Change My View (Priniski & Horne, 2018). Analysis of the data was done using Natural Language Processing (NLP), specifically information density, to predict attitude change. Top posts from Reddit's Change My View (N = 510,149) were imported and processed in Python, and information density measures were computed. The results indicate that comments with higher information density are more likely to be awarded a delta and are perceived to be more persuasive. / Dissertation/Thesis / Masters Thesis Psychology 2020
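One crude stand-in for an information-density measure is the Shannon entropy of a comment's word distribution; the sketch below, with invented comments and delta labels, only illustrates the comparison the thesis performs, not its actual NLP pipeline.

```python
# Hedged sketch: a simple information-density proxy (word-distribution entropy)
# compared across delta-awarded and non-awarded comments. Toy data only.
from collections import Counter
from math import log2
from statistics import mean

def word_entropy(text):
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * log2(c / total) for c in counts.values())

comments = [
    ("Rent control reduces mobility because tenants keep below-market units, "
     "shrinking supply and raising prices for everyone else.", True),
    ("I just think you are wrong and that is that.", False),
    ("Studies of Stockholm and San Francisco show long queues and a shadow "
     "market emerging once price ceilings bind.", True),
    ("No way, not buying it at all.", False),
]

awarded = [word_entropy(t) for t, delta in comments if delta]
other = [word_entropy(t) for t, delta in comments if not delta]
print(f"mean entropy, delta-awarded: {mean(awarded):.2f}")
print(f"mean entropy, no delta:      {mean(other):.2f}")
```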
205

Examination of Gender Bias in News Articles

Damin Zhang (11814182) 19 December 2021 (has links)
Reading news articles from online sources has become a major way for many people to obtain information. Authors of news articles can introduce their own biases, intentionally or unintentionally, through the words they choose to describe otherwise neutral and factual information. Such word choices can create conflicts among different social groups and reveal explicit and implicit biases. Any bias within the text can affect the reader's view of the information. One type of bias in natural language is gender bias, which has been discovered in many Natural Language Processing (NLP) models and is largely attributed to implicit biases in the training text corpora. Analyzing gender bias or stereotypes in such large corpora is a hard task. Previous methods of bias detection were applied to short texts like tweets and to manually built datasets, but little work has been done on long texts like news articles in large corpora. Simply detecting bias in annotated text does not help explain how it is generated and reproduced. Instead, we applied structural topic modeling to a large unlabelled corpus of news articles, combining qualitative results with quantitative analysis to examine how gender bias is generated and reproduced. This research extends prior knowledge of bias detection and proposes a method for understanding gender bias in real-world settings. We found that the correlation between author gender and topic-gender prevalence, together with the skewed media-gender distribution, assists in understanding gender bias within news articles.
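The general analysis pattern (fit a topic model to news text, then compare topic prevalence across author gender) can be approximated with plain LDA in scikit-learn; the thesis uses structural topic modeling, and the tiny corpus and gender labels below are invented for illustration.

```python
# Hedged sketch: LDA as a stand-in for structural topic modeling, followed by a
# comparison of topic prevalence by author gender and a look at top topic words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

articles = [
    "quarterly earnings rose as the bank reported record profit and revenue",
    "the striker scored twice as the team won the championship final",
    "the new fashion line debuted at the gala with celebrity guests",
    "lawmakers debated the budget bill and tax policy in parliament",
    "the chef opened a restaurant focused on family recipes and baking",
    "the coach praised the defense after a tense playoff victory",
]
author_gender = ["m", "m", "f", "m", "f", "m"]   # assumed labels for the sketch

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(articles)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)          # per-document topic proportions

# Topic prevalence by author gender (the "topic-gender prevalence" comparison).
for g in ("m", "f"):
    mask = np.array([a == g for a in author_gender])
    print(g, doc_topics[mask].mean(axis=0).round(2))

# Top words per topic, for qualitative inspection of gendered topics.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}:", ", ".join(top))
```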
206

Emergency Medical Service EMR-Driven Concept Extraction From Narrative Text

George, Susanna Serene 08 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / In the midst of a pandemic, with cases ranging from patients whose minor symptoms quickly become fatal to situations such as a STEMI heart attack or a fatal accident injury, the importance of medical research aimed at improving the speed and efficiency of patient care has increased. As researchers in the computing domain work to bring automation to the assistance of first responders, decreasing the cognitive load on the field crew, reducing the time taken to document each patient case, and improving the accuracy of report details have been priorities. This paper presents an information extraction algorithm that custom-engineers existing extraction techniques, combining natural language processing tools such as MetaMap, a syntactic dependency parser such as spaCy for analyzing sentence structure, and regular expressions for matching recurring patterns, to retrieve patient-specific information from medical narratives. These concept-value pairs automatically populate the fields of an EMR form, which can be reviewed and modified manually if needed. The report can then be reused for various medical and billing purposes related to the patient.
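A rough sketch of how the regex and dependency-parsing components might fit together on a run-report narrative; the field names, patterns, and example text are assumptions, and the MetaMap step is omitted.

```python
# Hedged sketch: regular expressions pull structured vitals (recurring patterns)
# from an EMS narrative, and a spaCy pass surfaces candidate concept phrases.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import re
import spacy

narrative = (
    "58 year old male found with crushing chest pain radiating to left arm. "
    "BP 92/60, HR 118, GCS 15. Patient denies shortness of breath. "
    "Aspirin 324 mg administered prior to arrival."
)

# Regex patterns for recurring, highly structured values in run reports.
PATTERNS = {
    "blood_pressure": r"\bBP\s*(\d{2,3}/\d{2,3})",
    "heart_rate": r"\bHR\s*(\d{2,3})",
    "gcs": r"\bGCS\s*(\d{1,2})",
    "medication_dose": r"\b([A-Z][a-z]+)\s+(\d+)\s*mg\b",
}

record = {}
for field, pattern in PATTERNS.items():
    match = re.search(pattern, narrative)
    if match:
        record[field] = match.group(0)

# Parsed noun chunks as candidate clinical concepts for the EMR form.
nlp = spacy.load("en_core_web_sm")
doc = nlp(narrative)
record["candidate_concepts"] = [
    chunk.text for chunk in doc.noun_chunks if chunk.root.pos_ == "NOUN"
]

print(record)
```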
207

A Hybrid Approach to General Information Extraction

Grap, Marie Belen 01 September 2015 (has links)
Information Extraction (IE) is the process of analyzing documents and identifying desired pieces of information within them. Many IE systems have been developed over the last couple of decades, but there is still room for improvement as IE remains an open problem for researchers. This work discusses the development of a hybrid IE system that attempts to combine the strengths of rule-based and statistical IE systems while avoiding their unique pitfalls in order to achieve high performance for any type of information on any type of document. Test results show that this system operates competitively in cases where target information belongs to a highly-structured data type and when critical contextual information is in close proximity to the target.
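One common way to realize such a hybrid, sketched below with invented rules and a toy training set, is to let high-precision rules fire first on structured targets and fall back to a statistical classifier elsewhere; this illustrates the general idea, not the system evaluated in the thesis.

```python
# Hedged sketch: high-precision regex rules handle structured targets, and a
# simple statistical classifier over context words acts as the fallback.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Rule layer: a precise pattern for a highly structured target (dates).
DATE_RULE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

# Statistical layer: does a sentence mention a meeting time? Toy training set.
train_sentences = [
    ("we will meet next tuesday afternoon", 1),
    ("the committee gathers at noon tomorrow", 1),
    ("the report covers quarterly revenue", 0),
    ("attached is the updated budget spreadsheet", 0),
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([s for s, _ in train_sentences])
clf = MultinomialNB().fit(X, [y for _, y in train_sentences])

def extract_meeting_info(sentence):
    rule_hit = DATE_RULE.search(sentence)
    if rule_hit:                      # structured case: trust the rule
        return ("rule", rule_hit.group(0))
    prob = clf.predict_proba(vectorizer.transform([sentence]))[0][1]
    if prob > 0.5:                    # unstructured case: statistical fallback
        return ("statistical", sentence)
    return ("none", None)

print(extract_meeting_info("the deadline is 12/05/2015 for all teams"))
print(extract_meeting_info("let us meet tuesday afternoon to review"))
print(extract_meeting_info("the spreadsheet lists quarterly revenue"))
```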
208

CREATE: Clinical Record Analysis Technology Ensemble

Eglowski, Skylar 01 June 2017 (has links)
In this thesis, we describe an approach that won a psychiatric symptom severity prediction challenge. The challenge was to correctly predict the severity of psychiatric symptoms on a 4-point scale. Our winning submission uses a novel stacked machine learning architecture in which (i) a base data ingestion/cleaning step was followed by the (ii) derivation of a base set of features defined using text analytics, after which (iii) association rule learning was used in a novel way to generate new features, followed by a (iv) feature selection step to eliminate irrelevant features, followed by a (v) classifier training algorithm in which a total of 22 classifiers including new classifier variants of AdaBoost and RandomForest were trained on seven different data views, and (vi) finally an ensemble learning step, in which ensembles of best learners were used to improve on the accuracy of individual learners. All of this was tested via standard 10-fold cross-validation on training data provided by the N-GRID challenge organizers, of which the three best ensembles were selected for submission to N-GRID's blind testing. The best of our submitted solutions garnered an overall final score of 0.863 according to the organizer's measure. All 3 of our submissions placed within the top 10 out of the 65 total submissions. The challenge constituted Track 2 of the 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDOC Individualized Domains (N-GRID) Shared Task in Clinical Natural Language Processing.
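The overall shape of steps (v) and (vi), base learners combined into an ensemble and scored with 10-fold cross-validation, can be sketched with scikit-learn on synthetic data; the feature views, classifier variants, and N-GRID data themselves are not reproduced here.

```python
# Hedged sketch: train AdaBoost and RandomForest base learners, combine them in
# an ensemble, and evaluate with standard 10-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 4-class target mimicking a 4-point severity scale.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=12,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=200, random_state=0)),
]
ensemble = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
)

scores = cross_val_score(ensemble, X, y, cv=10)   # standard 10-fold CV
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```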
209

Modeli srpskog jezika i njihova primena u govornim i jezičkim tehnologijama / Models of the Serbian language and their application in speech and language technologies

Ostrogonac Stevan 21 December 2018 (has links)
A statistical language model, in theory, represents a probability distribution over the set of all possible word sequences of a language. In practice, it is a mechanism for estimating the probabilities of sequences of interest. The mathematical apparatus behind language models is largely language independent. However, the quality of trained models depends not only on the training algorithms, but primarily on the amount and quality of the data available for training. For a language with complex morphology, such as Serbian, the textual corpus for training models must be far larger than the corpus that would be used for a language with relatively simple morphology, such as English. This research covers the development of language models for Serbian, starting with the collection and initial processing of textual content, extending through the adaptation of algorithms and the development of methods for addressing the problem of insufficient training data, and finally to the adaptation and application of the models in different technologies, such as text-to-speech synthesis, automatic speech recognition, and automatic detection and correction of grammar and semantic errors in texts, while also laying the groundwork for applying language models to automatic document classification and other technologies. The core of the development of language models for Serbian is the definition of morphological word classes based on the information contained in a morphological dictionary, which was produced in earlier research.
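The role of morphological classes can be illustrated with a class-based bigram model, in which probabilities are factored through word classes so that sparse word statistics are shared within a class; the toy Serbian corpus, class labels, and smoothing below are assumptions for illustration, not the thesis's models.

```python
# Hedged sketch: estimate P(w_i | w_{i-1}) as P(c_i | c_{i-1}) * P(w_i | c_i),
# so that counts are shared across words of the same morphological class.
from collections import Counter, defaultdict

# word -> morphological class (in the thesis, derived from a morphological dictionary)
word_class = {
    "<s>": "<s>", "deca": "NOUN.pl", "ptice": "NOUN.pl",
    "pevaju": "VERB.pl", "trce": "VERB.pl", "lepo": "ADV",
}
corpus = [
    ["<s>", "deca", "pevaju", "lepo"],
    ["<s>", "ptice", "pevaju"],
    ["<s>", "deca", "trce"],
]

class_bigrams, class_unigrams = Counter(), Counter()
word_given_class = defaultdict(Counter)
for sentence in corpus:
    classes = [word_class[w] for w in sentence]
    for w, c in zip(sentence, classes):
        class_unigrams[c] += 1
        word_given_class[c][w] += 1
    for c1, c2 in zip(classes, classes[1:]):
        class_bigrams[(c1, c2)] += 1

def prob(prev_word, word, alpha=0.1):
    """Class-based bigram probability with simple add-alpha smoothing."""
    c_prev, c = word_class[prev_word], word_class[word]
    v_classes, v_words = len(class_unigrams), len(word_class)
    p_class = (class_bigrams[(c_prev, c)] + alpha) / (class_unigrams[c_prev] + alpha * v_classes)
    p_word = (word_given_class[c][word] + alpha) / (class_unigrams[c] + alpha * v_words)
    return p_class * p_word

# "ptice trce" never occurs, but the class sequence NOUN.pl -> VERB.pl does,
# so the class-based model still assigns it a reasonable probability.
print(prob("ptice", "trce"))
print(prob("ptice", "pevaju"))
```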
210

Multi-Perspective Semantic Information Retrieval in the Biomedical Domain

January 2020 (has links)
abstract: Information Retrieval (IR) is the task of obtaining pieces of data (such as documents or snippets of text) that are relevant to a particular query or need from a large repository of information. IR is a valuable component of several downstream Natural Language Processing (NLP) tasks, such as Question Answering. Practically, IR is at the heart of many widely-used technologies like search engines. While probabilistic ranking functions, such as the Okapi BM25 function, have been utilized in IR systems since the 1970s, modern neural approaches pose certain advantages compared to their classical counterparts. In particular, the release of BERT (Bidirectional Encoder Representations from Transformers) has had a significant impact in the NLP community by demonstrating how the use of a Masked Language Model (MLM) trained on a considerable corpus of data can improve a variety of downstream NLP tasks, including sentence classification and passage re-ranking. IR systems are also important in the biomedical and clinical domains. Given the continuously increasing amount of scientific literature in the biomedical domain, the ability to find answers to specific clinical queries in a repository of millions of articles is of practical value to medics, doctors, and other medical professionals. Moreover, there are domain-specific challenges in the biomedical domain, including handling clinical jargon and evaluating the similarity or relatedness of various medical symptoms when determining the relevance between a query and a sentence. This work presents contributions to several aspects of the Biomedical Semantic Information Retrieval domain. First, it introduces Multi-Perspective Sentence Relevance, a novel methodology of utilizing BERT-based models for contextual IR. The system is evaluated using the BioASQ Biomedical IR Challenge. Finally, practical contributions in the form of a live IR system for medics and a proposed challenge on the Living Systematic Review clinical task are provided. / Dissertation/Thesis / Masters Thesis Computer Science 2020
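The classical side of such a pipeline, Okapi BM25 ranking ahead of a neural re-ranker, can be sketched directly from the scoring formula; the toy documents and parameter values below are assumptions, and the BERT re-ranking stage is only indicated in a comment.

```python
# Hedged sketch: Okapi BM25 scoring of candidate sentences for a clinical query.
from collections import Counter
from math import log

docs = [
    "metformin is a first line treatment for type 2 diabetes",
    "insulin therapy is used when oral agents fail to control glucose",
    "statins reduce cardiovascular risk in patients with high cholesterol",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N
df = Counter(term for d in tokenized for term in set(d))

def bm25(query, doc_tokens, k1=1.5, b=0.75):
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

query = "first line treatment for type 2 diabetes"
ranked = sorted(range(N), key=lambda i: bm25(query, tokenized[i]), reverse=True)
for i in ranked:
    print(round(bm25(query, tokenized[i]), 3), docs[i])
# A BERT-based re-ranker would then re-score the top-k candidates using the
# full query/sentence context before returning results.
```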
