11

From distributional to semantic similarity

Curran, James Richard January 2004 (has links)
Lexical-semantic resources, including thesauri and WORDNET, have been successfully incorporated into a wide range of applications in Natural Language Processing. However they are very difficult and expensive to create and maintain, and their usefulness has been severely hampered by their limited coverage, bias and inconsistency. Automated and semi-automated methods for developing such resources are therefore crucial for further resource development and improved application performance. Systems that extract thesauri often identify similar words using the distributional hypothesis that similar words appear in similar contexts. This approach involves using corpora to examine the contexts each word appears in and then calculating the similarity between context distributions. Different definitions of context can be used, and I begin by examining how different types of extracted context influence similarity. To be of most benefit these systems must be capable of finding synonyms for rare words. Reliable context counts for rare events can only be extracted from vast collections of text. In this dissertation I describe how to extract contexts from a corpus of over 2 billion words. I describe techniques for processing text on this scale and examine the trade-off between context accuracy, information content and quantity of text analysed. Distributional similarity is at best an approximation to semantic similarity. I develop improved approximations motivated by the intuition that some events in the context distribution are more indicative of meaning than others. For instance, the object-of-verb context wear is far more indicative of a clothing noun than get. However, existing distributional techniques do not effectively utilise this information. The new context-weighted similarity metric I propose in this dissertation significantly outperforms every distributional similarity metric described in the literature. Nearest-neighbour similarity algorithms scale poorly with vocabulary and context vector size. To overcome this problem I introduce a new context-weighted approximation algorithm with bounded complexity in context vector size that significantly reduces the system runtime with only a minor performance penalty. I also describe a parallelized version of the system that runs on a Beowulf cluster for the 2 billion word experiments. To evaluate the context-weighted similarity measure I compare ranked similarity lists against gold-standard resources using precision and recall-based measures from Information Retrieval, since the alternative, application-based evaluation, can often be influenced by distributional as well as semantic similarity. I also perform a detailed analysis of the final results using WORDNET. Finally, I apply my similarity metric to the task of assigning words to WORDNET semantic categories. I demonstrate that this new approach outperforms existing methods and overcomes some of their weaknesses.
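As a rough illustration of the distributional pipeline the abstract describes (collect grammatical contexts, weight them, compare context distributions), here is a minimal Python sketch; the PMI weighting and cosine measure are generic stand-ins, not the context-weighted metric the thesis proposes, and the toy data is invented.

```python
import math
from collections import Counter, defaultdict

def build_context_vectors(pairs):
    """pairs: iterable of (word, context) tuples, e.g. ('shirt', 'obj-of:wear')."""
    vectors = defaultdict(Counter)
    for word, ctx in pairs:
        vectors[word][ctx] += 1
    return vectors

def pmi_weight(vectors):
    """Reweight raw counts by (positive) pointwise mutual information."""
    total = sum(sum(v.values()) for v in vectors.values())
    word_tot = {w: sum(v.values()) for w, v in vectors.items()}
    ctx_tot = Counter()
    for v in vectors.values():
        ctx_tot.update(v)
    return {
        w: {c: max(0.0, math.log(n * total / (word_tot[w] * ctx_tot[c])))
            for c, n in v.items()}
        for w, v in vectors.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse context vectors (dicts)."""
    num = sum(u[c] * v[c] for c in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

# Invented toy (word, grammatical-context) pairs.
pairs = (
    [("shirt", "obj-of:wear")] * 3 + [("shirt", "obj-of:get")]
    + [("coat", "obj-of:wear")] * 2 + [("coat", "obj-of:get")]
    + [("idea", "obj-of:get")] * 3 + [("idea", "obj-of:have")] * 2
)
vecs = pmi_weight(build_context_vectors(pairs))
print(cosine(vecs["shirt"], vecs["coat"]))  # high: both occur as objects of 'wear'
print(cosine(vecs["shirt"], vecs["idea"]))  # low: contexts barely overlap
```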
12

A novel stroke prediction model based on clinical natural language processing (NLP) and data mining methods

Sedghi, Elham 30 March 2017 (has links)
Early detection and treatment of stroke can save lives. Before any procedure is planned, the patient is traditionally subjected to a brain scan such as Magnetic Resonance Imaging (MRI) in order to make sure he/she receives a safe treatment. Before any imaging is performed, the patient is checked into the Emergency Room (ER) and clinicians from the Stroke Rapid Assessment Unit (SRAU) perform an evaluation of the patient's signs and symptoms. The question we address in this thesis is: Can Data Mining (DM) algorithms be employed to reliably predict the occurrence of stroke in a patient based on the signs and symptoms gathered by the clinicians and other staff in the ER or the SRAU? A reliable DM algorithm would be very useful in helping the clinicians make a better decision about whether to escalate the case or classify it as a non-life-threatening mimic and not put the patient through unnecessary imaging and tests. Such an algorithm would not only make the life of patients and clinicians easier but would also enable the hospitals to cut down on their costs. Most of the signs and symptoms gathered by clinicians in the ER or the SRAU are stored in free-text format in hospital information systems. Using techniques from Natural Language Processing (NLP), the vocabularies of interest can be extracted and classified. A big challenge in this process is that medical narratives are full of misspelled words and clinical abbreviations. It is a well-known fact that the quality of data mining results crucially depends on the quality of input data. In this thesis, as a first contribution, we describe a procedure to preprocess the raw data and transform it into clean, well-structured data that can be effectively used by DM learning algorithms. Another contribution of this thesis is producing a set of carefully crafted rules to perform detection of negated meaning in free-text sentences. Using these rules, we were able to get the correct semantics of sentences and provide much more useful datasets to DM learning algorithms. This thesis consists of three main parts. In the first part, we focus on building classifiers to reliably distinguish stroke and Transient Ischemic Attack (TIA) from mimic cases. For this, we used text extracted from the "chief complaint" and "history of patient illness" fields available in the patients' files at the Victoria General Hospital (VGH). In collaboration with stroke specialists, we identified a well-defined set of stroke-related keywords. Next, we created practical tools to accurately assign keywords from this set to each patient. Then, we performed extensive experiments for finding the right learning algorithm to build the best classifier that provides a good balance between sensitivity, specificity, and a host of other quality indicators. In the second part, we focus on the most important mimic case, migraine, and how to effectively distinguish it from stroke or TIA. This is a challenging problem because migraine has many signs and symptoms that are similar to those of stroke or TIA. Another challenge we address is the imbalance that our datasets have with respect to migraine. Namely, the migraine cases are a minority of the overall cases. In order to alleviate this rarity problem, we propose a randomization procedure which is able to drastically improve the classifier quality. Finally, in the third part, we provide a detailed study on data mining algorithms for extracting the most important predictors that can help to detect and prevent posterior circulation stroke.
We compared our findings with the attributes reported by the Heart and Stroke Foundation of Canada, and the features found in our study performed better in accuracy, sensitivity, and ROC. / Graduate
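The negation rules themselves are not given in the abstract; the sketch below is only a minimal NegEx-style illustration of rule-based negation scoping, with an invented trigger list, scope terminators, and example sentence.

```python
import re

# Hypothetical trigger and terminator lists; the thesis' actual rule set is not reproduced here.
NEGATION_TRIGGERS = {"no", "denies", "without", "not"}
SCOPE_TERMINATORS = {"but", "however", "although"}

def negated_terms(sentence, terms):
    """Return the subset of `terms` that fall inside a negation scope."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    negated = set()
    scope_open = False
    for tok in tokens:
        if tok in NEGATION_TRIGGERS:
            scope_open = True      # open a negation scope
        elif tok in SCOPE_TERMINATORS:
            scope_open = False     # a conjunction ends the scope
        elif scope_open and tok in terms:
            negated.add(tok)
    return negated

print(negated_terms("Patient denies facial droop but reports dizziness",
                    {"facial", "droop", "dizziness"}))
# -> {'facial', 'droop'}  ('dizziness' lies outside the negation scope)
```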
13

SemNet: the knowledge representation of LOLITA

Baring-Gould, Sengan January 2000 (has links)
Many systems of Knowledge Representation exist, but none were designed specifically for general-purpose, large-scale natural language processing. This thesis introduces a set of metrics to evaluate the suitability of representations for this purpose, derived from an analysis of the problems such processing introduces. These metrics address three broad categories of question: Is the representation sufficiently expressive to perform its task? What implications does its design have for the architecture of the system using it? What inefficiencies are intrinsic to its design? An evaluation of existing Knowledge Representation systems reveals that none of them satisfies the needs of general-purpose, large-scale natural language processing. To remedy this lack, this thesis develops a new representation: SemNet. SemNet benefits not only from the detailed requirements analysis but also from insights gained from its use as the core representation of the large-scale, general-purpose system LOLITA (Large-scale Object-based Linguistic Interactor, Translator, and Analyser). The mapping process between natural language and the representation is presented in detail, showing that the representation achieves its goals in practice.
14

Inference of string mappings for speech technology

Jansche, Martin, January 2003 (has links)
Thesis (Ph. D.)--Ohio State University, 2003. / Title from first page of PDF file. Document formatted into pages; contains xv, 268 p.; also includes graphics. Includes abstract and vita. Advisor: Chris Brew, Dept. of Linguistics. Includes bibliographical references (p. 252-266) and index.
15

Changing group dynamics through computerized language feedback

Tausczik, Yla Rebecca 20 November 2012 (has links)
Why do some groups of people work well together while others do not? It is commonly accepted that effective groups communicate well. Yet one of the biggest roadblocks facing the study of group communication is that it is extremely difficult to capture real-world group interactions and analyze the words people use in a timely manner. This project overcame this limitation in two ways. First, a broader and more systematic study of group processes was conducted by using a computerized text analysis program (Linguistic Inquiry and Word Count) that automatically codes natural language using pre-established rules. Groups that work well together typically exchange more knowledge and establish good social relationships, which is reflected in the way that they use words. The group dynamics of over 500 student discussion groups interacting via group chat were assessed by studying their language use. Second, a language feedback system was built to experimentally test the importance of certain group processes on group satisfaction and performance. It is now possible to provide language feedback by processing natural language dialogue using computerized text analysis in real time. The language feedback system can change the way the group works by providing individualized recommendations. In this way it is possible to manipulate group processes naturalistically. Together these studies provided evidence that important group processes can be detected even using simplistic natural language processing, and preliminary evidence that providing real-time feedback based on the words students use in a group discussion can improve learning by changing how the group works together. / text
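Linguistic Inquiry and Word Count (LIWC) relies on proprietary category dictionaries; the sketch below only illustrates the general word-category counting idea with a tiny invented dictionary, not LIWC's actual categories or the study's feedback system.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary for illustration; LIWC's real word lists are proprietary.
CATEGORIES = {
    "positive_emotion": {"good", "great", "agree", "thanks"},
    "cognitive": {"think", "because", "know", "maybe"},
    "social": {"we", "you", "they", "team"},
}

def category_rates(message):
    """Return each category's share of the message's word tokens."""
    tokens = re.findall(r"[a-z']+", message.lower())
    counts = Counter()
    for tok in tokens:
        for cat, words in CATEGORIES.items():
            if tok in words:
                counts[cat] += 1
    total = len(tokens) or 1
    return {cat: counts[cat] / total for cat in CATEGORIES}

print(category_rates("I think we should go with your idea because it sounds great"))
```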
16

A hybrid approach to fuzzy name search incorporating language-based and text-based principles

Wu, Paul Horng Jyh, Na, Jin Cheon, Khoo, Christopher S.G. January 2007 (has links)
Name Search is an important search function in various types of information retrieval systems, such as online library catalogs and electronic yellow pages. It is also difficult due to the high degree of fuzziness required in matching name variants. Previous approaches to name search systems use ad hoc combinations of search heuristics. This paper first discusses two approaches to name modeling, the natural language processing (NLP) and the information retrieval (IR) models, and proposes a hybrid approach. The approach demonstrates a critical combination of complementary NLP and IR features that produces more effective fuzzy name matching. Two principles, position-as-attribute and position-transition-likelihood, are introduced as the principles for integrating the advantageous aspects of both approaches. They have been implemented in a hybrid NLP-IR model system called Friendly Name Search (FNS) for real-world applications in multilingual directory searches on the Singapore Yellow Pages website.
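The abstract only names the two principles; the sketch below is a loose interpretation in which token position acts as an attribute and cross-position matches are discounted (a crude stand-in for position-transition likelihood), using Python's difflib rather than the paper's actual matcher. The names and penalty value are invented.

```python
from difflib import SequenceMatcher

def token_similarity(a, b):
    """Approximate string similarity between two name tokens (0..1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def name_match_score(query, candidate, position_penalty=0.1):
    """Score two names, treating token position as an attribute: each query
    token is matched to its best candidate token, and matches that cross
    positions are discounted by a small penalty."""
    q_tokens, c_tokens = query.split(), candidate.split()
    if not q_tokens or not c_tokens:
        return 0.0
    total = 0.0
    for qi, q in enumerate(q_tokens):
        best = 0.0
        for ci, c in enumerate(c_tokens):
            best = max(best, token_similarity(q, c) - position_penalty * abs(qi - ci))
        total += best
    return total / len(q_tokens)

print(name_match_score("Tan Ah Kow", "Ah Kow Tan"))   # transposed tokens still score high
print(name_match_score("Tan Ah Kow", "Lim Bee Hwa"))  # unrelated name scores low
```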
17

A shallow parser based on closed-class words to capture relations in biomedical text

Leroy, Gondy, Chen, Hsinchun, Martinez, Jesse D. 06 1900 (has links)
Artificial Intelligence Lab, Department of MIS, University of Arizona / Natural language processing for biomedical text currently focuses mostly on entity and relation extraction. These entities and relations are usually pre-specified entities, e.g., proteins, and pre-specified relations, e.g., inhibit relations. A shallow parser that captures the relations between noun phrases automatically from free text has been developed and evaluated. It uses heuristics and a noun phraser to capture entities of interest in the text. Cascaded finite state automata structure the relations between individual entities. The automata are based on closed-class English words and model generic relations not limited to specific words. The parser also recognizes coordinating conjunctions and captures negation in text, a feature usually ignored by others. Three cancer researchers evaluated 330 relations extracted from 26 abstracts of interest to them. There were 296 relations correctly extracted from the abstracts resulting in 90% precision of the relations and an average of 11 correct relations per abstract.
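As a rough, single-stage illustration of the idea (not the authors' cascaded automata), a pattern over pre-chunked noun phrases with closed-class negation markers might look like this sketch; the bracketed input format and example sentence are assumptions.

```python
import re

# Assumes noun phrases have already been bracketed by a noun phraser, e.g. "[p53]".
# One cascade stage: NP  (verb and closed-class words)  NP  ->  relation tuple.
RELATION = re.compile(
    r"\[(?P<np1>[^\]]+)\]\s+(?P<link>(?:\w+\s+)*?\w+)\s+\[(?P<np2>[^\]]+)\]"
)
NEGATION = re.compile(r"\b(?:not|no|never|cannot)\b")  # closed-class negation markers

def extract_relations(chunked_sentence):
    """Return (np1, linking words, np2, negated?) tuples from a chunked sentence."""
    triples = []
    for m in RELATION.finditer(chunked_sentence):
        link = m.group("link")
        triples.append((m.group("np1"), link, m.group("np2"),
                        bool(NEGATION.search(link))))
    return triples

print(extract_relations("[p53] does not inhibit [cell growth]"))
# -> [('p53', 'does not inhibit', 'cell growth', True)]
```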
18

Simplifying natural language for aphasic readers

Devlin, Siobhan Lucy January 1999 (has links)
No description available.
19

Word sense selection in texts: an integrated model

Kwong, Oi Yee. January 1900 (has links)
Thesis (Ph. D.)--University of Cambridge, 2000. / Cover title. "September 2000." Includes bibliographical references.
20

Predicting Depression and Suicide Ideation in the Canadian Population Using Social Media Data

Skaik, Ruba 30 June 2021 (has links)
The economic burden of mental illness costs Canada billions of dollars every year. Millions of people suffer from mental illness, and only a fraction receives adequate treatment. Identifying people with mental illness requires initiation from those in need, available medical services, and professional experts' time allocation. These resources might not be available all the time. The common practice is to rely on clinical data, which is generally collected after the illness is developed and reported. Moreover, such clinical data is incomplete and hard to obtain. An alternative data source is conducting surveys through phone calls, interviews, or mail, but this is costly and time-consuming. Social media analysis has brought advances in leveraging population data to understand mental health problems. Thus, analyzing social media posts can be an essential alternative for identifying mental disorders throughout the Canadian population. Big data research of social media may also endorse standard surveillance approaches and provide decision-makers with usable information. More precisely, social media analysis has shown promising results for public health assessment and monitoring. In this research, we explore the task of automatically analyzing social media textual data using Natural Language Processing (NLP) and Machine Learning (ML) techniques to detect signs of mental health disorders that need attention, such as depression and suicide ideation. Considering the lack of comprehensive annotated data in this field, we propose a methodology for transfer learning to utilize the information hidden in a training sample and leverage it on a different dataset to choose the best-generalized model to be applied at the population level. We also present evidence that ML models designed to predict suicide ideation using Reddit data can utilize the knowledge they encoded to make predictions on Twitter data, even though the two platforms differ in purpose, structure, and limitations. In our proposed models, we use feature engineering with supervised machine learning algorithms (such as SVM, LR, RF, XGBoost, and GBDT), and we compare their results with those of deep learning algorithms (such as LSTM, Bi-LSTM, and CNNs). We adopt the CNN model for depression classification, which obtained the highest F1-score on the test dataset (0.898) and a recall of 0.941. This model is later used to estimate the depression level of the population. For suicide ideation detection, we used the CNN model with pre-trained fastText word embeddings and linguistic features (LIWC). The model achieved an F1-score of 0.936 and a recall of 0.88 to predict suicide ideation at the user level on the test set. To compare our models' predictions with official statistics, we used the 2015-2016 population-based Canadian Community Health Survey (CCHS) on Mental Health and Well-being conducted by Statistics Canada. The data is used to estimate depression and suicidality in Canadian provinces and territories. For depression, respondents (n=53,050) from 8 provinces/territories filled in the Patient Health Questionnaire-9 (PHQ-9). Each survey respondent with a score ≥ 10 on the PHQ-9 was interpreted as having moderate to severe depression because this score is frequently used as a screening cut-point. The weighted percentage of depression prevalence during 2015 for females and males aged 15 to 75 was 11.5% and 8.1%, respectively (with 54.2% females and 45.8% males).
Our model was applied to a population-representative dataset of 24,251 Twitter users who posted 1,735,200 tweets during 2015, yielding a Pearson correlation of 0.88 for both sex and age within the seven provinces and the NT territory included in the CCHS. A correlation of 0.95 was calculated for age and sex (separately), and our model estimated that 10% of the sample dataset shows evidence of depression (58.3% females and 41.7% males). For the second task, suicide ideation, Statistics Canada (2015) estimated the total number of people who reported serious suicidal thoughts at 3,396,700 persons, i.e., 9.514% of the total population, whereas our models estimated that 10.6% of the population sample were at risk of suicide ideation (59% females and 41% males). The Pearson correlation coefficients between actual suicide ideation within the last 12 months and the model's predictions for each province, by age, by sex, and by both combined, were all above 0.62, which indicates a reasonable correlation.
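The province-level comparison amounts to correlating model-estimated prevalence with survey prevalence; a minimal sketch follows, using invented per-province numbers rather than the CCHS figures or the thesis' actual estimates.

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Hypothetical per-province prevalence (%) for illustration only.
survey_prevalence = {"ON": 11.0, "QC": 9.5, "BC": 10.2, "AB": 9.8, "NS": 12.1}
model_estimate   = {"ON": 10.4, "QC": 9.0, "BC": 10.9, "AB": 9.1, "NS": 11.5}

provinces = sorted(survey_prevalence)
r = correlation([survey_prevalence[p] for p in provinces],
                [model_estimate[p] for p in provinces])
print(f"Pearson r between survey and model estimates: {r:.2f}")
```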
