41.
Predicting Depression and Suicide Ideation in the Canadian Population Using Social Media Data. Skaik, Ruba. 30 June 2021.
The economic burden of mental illness costs Canada billions of dollars every year. Millions of people suffer from mental illness, and only a fraction receives adequate treatment. Identifying people with mental illness requires initiation from those in need, available medical services, and professional experts’ time allocation. These resources might not be available all the time. The common practice is to rely on clinical data, which is generally collected after the illness is developed and reported. Moreover, such clinical data is incomplete and hard to obtain. An alternative data source is conducting surveys through phone calls, interviews, or mail, but this is costly and time-consuming. Social media analysis has brought advances in leveraging population data to understand mental health problems. Thus, analyzing social media posts can be an essential alternative for identifying mental disorders throughout the Canadian population. Big data research on social media may also complement standard surveillance approaches and provide decision-makers with usable information. Indeed, social media analysis has shown promising results for public health assessment and monitoring. In this research, we explore the task of automatically analyzing social media textual data using Natural Language Processing (NLP) and Machine Learning (ML) techniques to detect signs of mental health disorders that need attention, such as depression and suicide ideation. Given the lack of comprehensive annotated data in this field, we propose a transfer-learning methodology that utilizes the information hidden in a training sample and leverages it on a different dataset to choose the best-generalizing model to apply at the population level. We also present evidence that ML models designed to predict suicide ideation using Reddit data can utilize the knowledge they encoded to make predictions on Twitter data, even though the two platforms differ in purpose, structure, and limitations. In our proposed models, we use feature engineering with supervised machine learning algorithms (such as SVM, LR, RF, XGBoost, and GBDT), and we compare their results with those of deep learning algorithms (such as LSTM, Bi-LSTM, and CNNs). For depression classification, we adopt the CNN model that obtained the highest F1-score on the test dataset (0.898), with a recall of 0.941. This model is later used to estimate the depression level of the population. For suicide ideation detection, we used a CNN model with pre-trained fastText word embeddings and linguistic features (LIWC); it achieved an F1-score of 0.936 and a recall of 0.88 for predicting suicide ideation at the user level on the test set.
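As a rough illustration of the kind of classifier this abstract describes, the sketch below combines a text CNN over frozen pre-trained embeddings with auxiliary linguistic features concatenated before the classifier; the layer sizes, filter widths, and feature dimensions are illustrative assumptions, not the thesis's reported hyperparameters.

# A minimal sketch (TensorFlow/Keras assumed): text CNN over frozen
# pre-trained embeddings plus LIWC-style features. All sizes here are
# illustrative assumptions, not the thesis's actual configuration.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMB_DIM, MAX_LEN, N_LIWC = 50000, 300, 100, 64

# In practice this matrix would hold pre-trained fastText vectors.
embedding_matrix = np.random.rand(VOCAB_SIZE, EMB_DIM).astype("float32")

tokens = layers.Input(shape=(MAX_LEN,), name="token_ids")
liwc = layers.Input(shape=(N_LIWC,), name="liwc_features")

x = layers.Embedding(
    VOCAB_SIZE, EMB_DIM,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False)(tokens)
# Parallel convolutions of several widths capture local n-gram patterns.
convs = [layers.GlobalMaxPooling1D()(layers.Conv1D(128, w, activation="relu")(x))
         for w in (3, 4, 5)]
x = layers.Concatenate()(convs + [liwc])     # fuse text and LIWC features
x = layers.Dropout(0.5)(x)
output = layers.Dense(1, activation="sigmoid", name="at_risk")(x)

model = keras.Model([tokens, liwc], output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.Recall()])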
To compare our models’ predictions with official statistics, we used the 2015-2016 population-based Canadian Community Health Survey (CCHS) on Mental Health and Well-being conducted by Statistics Canada. The data is used to estimate depression and suicidality in Canadian provinces and territories.
For depression, respondents (n=53,050) from 8 provinces/territories filled in the Patient Health Questionnaire-9 (PHQ-9). Each survey respondent with a score ≥ 10 on the PHQ-9 was interpreted as having moderate to severe depression, because this score is frequently used as a screening cut-point. The weighted percentage of depression prevalence during 2015 for females and males aged 15 to 75 was 11.5% and 8.1%, respectively (the sample was 54.2% female and 45.8% male). Our model was applied to a population-representative dataset of 24,251 Twitter users who posted 1,735,200 tweets during 2015; the sample matches the CCHS demographics with a Pearson correlation of 0.88 for sex and age combined, and 0.95 for age and sex taken separately, within the seven provinces and the Northwest Territories included in the CCHS. Our model estimated that 10% of the sample dataset shows evidence of depression (58.3% females and 41.7% males).
For the second task, suicide ideation, Statistics Canada (2015) estimated the total number of people who reported serious suicidal thoughts at 3,396,700 persons, i.e., 9.514% of the total population, whereas our models estimated that 10.6% of the population sample were at risk of suicide ideation (59% females and 41% males). The Pearson correlation coefficients between actual suicide ideation within the last 12 months and the model's predictions for each province, per age, sex, and both combined, were greater than 0.62, which indicates a reasonable correlation.
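The validation described above boils down to correlating survey-based prevalence with model-estimated prevalence across regions. A minimal sketch of that step, with placeholder numbers rather than the actual CCHS or model figures:

# Illustrative per-province comparison of survey prevalence vs. model
# estimates (%); the figures below are placeholders, not real data.
from scipy.stats import pearsonr

survey_prevalence = [9.8, 8.9, 10.2, 9.1, 10.5, 9.7, 11.0, 10.8]
model_prevalence = [9.5, 9.2, 10.6, 8.8, 10.1, 9.9, 11.4, 10.2]

r, p_value = pearsonr(survey_prevalence, model_prevalence)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")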
42.
Coping with Missing and Incomplete Information in Natural Language Processing with Applications in Sentiment Analysis and Entity Matching. Schneider, Andrew Thomas. January 2020.
Much work in Natural Language Processing (NLP) is broadly concerned with extracting useful information from unstructured text passages. In recent years there has been an increased focus on informal writing as found in online venues such as Twitter and Yelp. Processing this text introduces additional difficulties for NLP techniques; for example, many of the terms may be unknown due to rapidly changing vocabulary usage. A straightforward NLP approach has no way to use the information these terms provide. In such "information poor" environments of missing and incomplete information, it is necessary to develop novel methods for leveraging the information we have explicitly available to unlock key nuggets of implicitly available information. In this work we explore several such methods and how they can collectively help to improve NLP techniques in general, with a focus on Sentiment Analysis (SA) and Entity Matching (EM). The problem of SA is that of identifying the polarity (positive, negative, neutral) of a speaker or author towards the topic of a given piece of text. SA can focus on various levels of granularity: finding the overall sentiment of a long text document, finding the sentiment of individual sentences or phrases, or finding the sentiment directed toward specific entities and their aspects (attributes). The problem of EM, also known as Record Linkage, is that of determining which records from independent and uncooperative data sources refer to the same real-world entities. Traditional approaches to EM have used the record representation of entities to accomplish this task. With the rise of social media, entities on the Web are now accompanied by user-generated content, which allows us to apply NLP solutions to the problem. We investigate specifically the following aspects of NLP for missing and incomplete information: (1) Inferring the sentiment polarity (i.e., the positive, negative, and neutral composition) of new terms. (2) Inferring a representation of new vocabulary terms that allows us to compare these terms with known terms with regard to their meaning and sentiment orientation; this idea can be further expanded to derive the representation of larger chunks of text, such as multi-word phrases. (3) Identifying key attributes of highly salient sentiment-bearing passages that allow us to identify such sections of a document, even when the complete text is not analyzable. (4) Using text-based methods to match corresponding entities (e.g., restaurants or hotels) from independent data sources that may miss key identifying attributes such as names or addresses.
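For instance, point (2) can be realized by assigning an out-of-vocabulary term the sentiment of its nearest known neighbours in an embedding space. A toy sketch under that assumption; the vectors and lexicon below are placeholders, not the thesis's actual data or method:

# Toy lexicon: term -> (embedding, polarity in [-1, 1]). Real systems
# would use corpus-derived embeddings and a curated sentiment lexicon.
import numpy as np

known = {
    "great": (np.array([0.9, 0.1]), 1.0),
    "terrible": (np.array([-0.8, 0.2]), -1.0),
    "okay": (np.array([0.1, 0.0]), 0.0),
}

def infer_polarity(vec, k=2):
    """Average the polarity of the k most cosine-similar known terms."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    ranked = sorted(known.values(), key=lambda ev: cos(vec, ev[0]), reverse=True)
    return sum(polarity for _, polarity in ranked[:k]) / k

# An unseen slang term whose context-derived vector lands near "great":
print(infer_polarity(np.array([0.85, 0.15])))  # positive-leaning score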
43.
Enhancing Accessibility in Black-Box Attack Research with BinarySelect. Shatarupa Ghosh. 28 April 2024.
Adversarial text attack research is crucial for evaluating NLP model robustness and addressing privacy concerns. However, the increasing complexity of transformer and pretrained language models has led to significant time and resource requirements for training and testing. This challenge is particularly pronounced in black-box attacks, where hundreds or thousands of queries may be needed to identify critical words leveraged by the target model. To overcome this, we introduce BinarySelect, a novel method combining binary search with adversarial attack techniques to reduce query numbers significantly while maintaining attack effectiveness. Our experiments show that BinarySelect requires far fewer queries than traditional methods, making adversarial attack research more accessible to researchers with limited resources. We demonstrate the efficacy of BinarySelect across multiple datasets and classifiers, showcasing its potential for efficient adversarial attack exploration and addressing related black-box challenges.
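A hedged sketch of the core idea as I read it, not necessarily the paper's exact procedure: query the victim model on masked halves of the input and recurse into the more influential half, locating an important token in roughly O(log n) queries instead of the O(n) needed by word-by-word deletion.

# One plausible reading of binary-search word selection (assumption,
# not the published algorithm): descend into whichever half of the
# token span causes the larger score drop when masked.
def find_influential(tokens, score, mask="[UNK]"):
    """Locate one influential token; score(tokens) is the black-box query."""
    base = score(tokens)
    lo, hi = 0, len(tokens)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Mask each half of the current span and re-query the model.
        left = tokens[:lo] + [mask] * (mid - lo) + tokens[mid:]
        right = tokens[:mid] + [mask] * (hi - mid) + tokens[hi:]
        left_drop, right_drop = base - score(left), base - score(right)
        lo, hi = (lo, mid) if left_drop >= right_drop else (mid, hi)
    return lo  # index of an influential token

# Toy usage: a "model" that only rewards the presence of "good".
toy = "the movie was surprisingly good overall".split()
print(find_influential(toy, lambda t: 1.0 if "good" in t else 0.0))  # -> 4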
44.
Use of Assembly Inspired Instructions in the Allowance of Natural Language Processing in ROS. Kakusa, Takondwa Lisungu. 08 August 2018.
Natural language processing is a growing field, widely used in both industrial and commercial settings. Though it is difficult to create a natural language system that can robustly react to and handle every situation, it is quite possible to design the system to react to a specific instruction or scenario. The problem with current natural language systems used in machines, though, is that they are focused on single instructions: they work to complete the instruction given and then wait for the next instruction. In this way, they are not set up to respond to possible conditions that are explained to them.
In the system designed and explained in this thesis, the goal is to fix this problem by introducing a method of adjusting to these conditions. The contributions made in this thesis are: to design a set of instruction types that allow for conditional statements within natural language instructions; to create a modular system using ROS that allows for more robust communication and integration; and to allow for an interconnection between the written text and the derived instructions that makes sentence construction more seamless and natural for the user.
The work in this thesis is limited in focus to the objective of obstacle traversal. The ideas and methodology, though, can be seen to extend into future work in the area.
General audience abstract: With the growth of natural language processing and the development of artificial intelligence, it is important to look at how best to allow these to work together. The main goal of this project is to find a way of integrating natural language so that it can be used to program a robot and, in so doing, to develop a method of translation that is not only efficient but also easy to understand. We have found we can accomplish this by creating a system that not only creates a direct correlation between the sentence and the instruction generated for the robot to understand, but is also able to break down complex sentences and paragraphs into multiple different instructions. This allows for much greater robustness in the system.
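To make the conditional-instruction idea concrete, here is a hypothetical illustration of compiling a conditional sentence into assembly-like instructions; the instruction names and the crude pattern match are my assumptions, not the thesis's actual instruction set or parser.

# Hypothetical sketch: compile 'if <condition>, <action>' sentences
# into a tiny branching program. Instruction names are invented.
import re

def compile_sentence(sentence):
    """Turn a conditional sentence into a small instruction list."""
    match = re.match(r"if (.+?), (.+)", sentence.lower())
    if match:
        condition, action = match.groups()
        return [
            ("TEST", condition),    # evaluate a sensor condition
            ("BRANCH_FALSE", 3),    # jump past the action if it fails
            ("ACT", action),
            ("HALT",),
        ]
    return [("ACT", sentence.lower()), ("HALT",)]

for instruction in compile_sentence("If you detect an obstacle, turn left"):
    print(instruction)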
45.
Generative Chatbot Framework for Cybergrooming Prevention. Wang, Pei. 20 December 2021.
Cybergrooming refers to the crime of establishing personal close relationships with potential victims, commonly teens, for the purpose of sexual exploitation or abuse via online social media platforms. Cybergrooming has been recognized as a serious social problem. However, there have been insufficient programs to provide proactive prevention to protect youth users from cybergrooming. In this thesis, we present a generative chatbot framework, called SERI (Stop cybERgroomIng), that can generate simulated conversations between a perpetrator chatbot and a potential victim chatbot. To realize authentic conversations in the context of cybergrooming, we use deep reinforcement learning (DRL)-based dialogue generation to simulate the conversations between a perpetrator and a potential victim. The design and development of SERI are motivated by providing a safe and authentic chatting environment to enhance youths' precautionary awareness of and sensitivity to cybergrooming, while any unnecessary ethical issues (e.g., the potential misuse of SERI) are removed or minimized. We developed SERI as a preliminary platform in which the perpetrator chatbot can be deployed in social media environments to interact with human users (i.e., youth) and observe how youth users respond to strangers or acquaintances when asked for private or sensitive information by the perpetrator. We evaluated the quality of conversations generated by SERI based on open-source, referenced, and unreferenced metrics as well as human evaluation. The evaluation results show that SERI can generate authentic conversations between two chatbots, compared to the original conversations from the datasets used, in terms of perplexity and MaUde scores.
General audience abstract: Cybergrooming refers to the crime of building personal close relationships with potential victims, especially youth users such as children and teenagers, for the purpose of sexual exploitation or abuse via online social media platforms. Cybergrooming has been recognized as a serious social problem. However, there have been insufficient methods to provide proactive protection for youth users from cybergrooming. In this thesis, we present a generative chatbot framework, called SERI (Stop cybERgroomIng), that can generate simulated authentic conversations between a perpetrator chatbot and a potential victim chatbot by applying advanced natural language generation models. The design and development of SERI are motivated by ensuring a safe and authentic environment to strengthen youths' precautionary awareness of and sensitivity to cybergrooming, while any unnecessary ethical issues (e.g., the potential misuse of SERI) are removed or minimized. We used different metrics and methods to evaluate the quality of conversations generated by SERI. The evaluation results show that SERI can generate authentic conversations between two chatbots, compared to the original conversations from the datasets used.
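One half of that automatic evaluation, perplexity, can be computed as sketched below; GPT-2 is assumed here as the scoring model, which may differ from the scorer actually used in the thesis.

# Sketch of perplexity scoring for generated dialogue, assuming GPT-2
# (an assumption; the thesis's scoring model may differ).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    """Lower perplexity suggests more fluent, human-like text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

print(perplexity("Hey, how was school today?"))
print(perplexity("school today was how hey the?"))  # expect a higher value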
46.
Natural Language Processing of Stories. Kaley Rittichier. 28 April 2022.
In this thesis, I deal with the task of computationally processing stories with a focus on multidisciplinary ends, specifically in Digital Humanities and Cultural Analytics. In the process, I collect, clean, investigate, and predict from two datasets. The first is a dataset of 2,302 open-source literary works categorized by the time period they are set in. These works were all collected from Project Gutenberg. The classification of the time period in which each work is set was derived by collecting and inspecting Library of Congress subject classifications, Wikipedia categories, and literary factsheets from SparkNotes. The second is a dataset of 6,991 open-source literary works categorized by the hierarchical location the work is set in; these labels were constructed from Library of Congress subject classifications and SparkNotes factsheets. These datasets are the first of their kind and can help move forward an understanding of 1) the presentation of settings in stories and 2) the effect the settings have on our understanding of the stories.
47.
Joint models for concept-to-text generation. Konstas, Ioannis. January 2014.
Much of the data found on the world wide web is in numeric, tabular, or other non-textual format (e.g., weather forecast tables, stock market charts, live sensor feeds), and thus inaccessible to non-experts or laypersons. However, most conventional search engines and natural language processing tools (e.g., summarisers) can only handle textual input. As a result, data in non-textual form remains largely inaccessible. Concept-to-text generation refers to the task of automatically producing textual output from non-linguistic input, and holds promise for rendering non-linguistic data widely accessible. Several successful generation systems have been produced in the past twenty years. They mostly rely on human-crafted rules or expert-driven grammars, implement a pipeline architecture, and usually operate in a single domain. In this thesis, we present several novel statistical models that take as input a set of database records and generate a description of them in natural language text. Our unique idea is to combine the processes of structuring a document (document planning), deciding what to say (content selection), and choosing the specific words and syntactic constructs specifying how to say it (lexicalisation and surface realisation) in a uniform joint manner. Rather than breaking up the generation process into a sequence of local decisions, we define a probabilistic context-free grammar that globally describes the inherent structure of the input (a corpus of database records and text describing some of them). This joint representation allows individual processes (i.e., document planning, content selection, and surface realisation) to communicate and influence each other naturally. We recast generation as the task of finding the best derivation tree for a set of input database records given our grammar, and describe several algorithms for decoding in this framework that allow us to intersect the grammar with additional information capturing fluency and syntactic well-formedness constraints. We implement our generators using the hypergraph framework. Contrary to traditional systems, we learn all the necessary document, structural, and linguistic knowledge from unannotated data. Additionally, we explore a discriminative reranking approach on the hypergraph representation of our model, by including more refined content selection features. Central to our approach is the idea of porting our models to various domains; we experimented on four widely different domains, namely sportscasting, weather forecast generation, booking flights, and troubleshooting guides. The performance of our systems is competitive with, and often superior to, state-of-the-art systems that use domain-specific constraints, explicit feature engineering, or labelled data.
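A toy rendering of the central idea, with invented rules, probabilities, and templates: grammar rules jointly choose a record (content selection) and a template (surface realisation), so the best derivation settles both at once. A real decoder searches all derivations (e.g., over a hypergraph) rather than expanding greedily as here.

# Invented PCFG: each rule is (probability, child symbol or template).
import math

rules = {
    "S": [(0.6, "R(temp)"), (0.4, "R(wind)")],
    "R(temp)": [(0.7, "temperatures reach {temp} degrees"),
                (0.3, "a high of {temp}")],
    "R(wind)": [(1.0, "winds of {wind} mph")],
}

def best_derivation(symbol, record):
    """Greedily expand the highest-probability rule at each step."""
    if symbol not in rules:  # terminal template: realise the record
        return 0.0, symbol.format(**record)
    prob, child = max(rules[symbol], key=lambda rule: rule[0])
    logp, text = best_derivation(child, record)
    return math.log(prob) + logp, text

print(best_derivation("S", {"temp": 21, "wind": 10}))
# -> (log-probability, 'temperatures reach 21 degrees')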
48.
Influence of limiting working memory resources on contextual facilitation in language processing. Stewart, Oliver William Thomas. January 2014.
Language processing is a complex task requiring the integration of many different streams of information. Theorists have considered that working memory plays an important role in language processing and that a reduction in available working memory resources will reduce the efficacy of the system. In debate, however, is whether there exists a single pool of resources from which all language processes draw, or whether the resource pool is functionally fractionated into modular subsections (e.g., syntactic processing, lexical processing, etc.). This thesis investigates the role that working memory capacity plays in the utilisation of context to facilitate language processing. We experimentally manipulated the resources available to each participant using a titrated extrinsic memory load (a string of digits whose length was tailored to each participant). Participants had to maintain the digits in memory while reading target sentences. Using this methodology we conducted six eyetracking experiments to investigate how a reduction of working memory resources influences the use of context in different language processes. Two experiments examined the resolution of syntactic ambiguities (reduced relative clauses); three examined the resolution of lexical ambiguities (balanced homonyms such as "appendix"); and one explored semantic predictability (It was a windy day so the boy went to the park to fly his… kite). General conclusions are hard to draw in the face of variable findings. All three experiment areas (syntactic, lexical, and semantic) show that memory loads interact with context, but there is little consistency as to where and how this occurs. In the syntactic experiments we see hints of a general degradation in context use (supporting Single Resource Theories), whereas in the lexical and semantic experiments we see mixed support leaning in the direction of Multiple Resource Theories. Additionally, while individual experiments suggest that limiting working memory resources reduces the role that context plays in guiding both syntactic and lexical ambiguity resolution, more sophisticated statistical investigation indicates that these findings are not reliable. Taken together, the findings of all the experiments lead us to tentatively conclude that imposing limitations on working memory resources can influence the use of context in some language processes, but also that this influence is variable, subtle, and hard to detect statistically.
49.
Applicability analysis of computational double entendre humor recognition with machine learning methods. Johansson, David. January 2016.
No description available.
50.
Latent variable models of distributional lexical semantics. Reisinger, Joseph Simon. 24 October 2014.
In order to respond to increasing demand for natural language interfaces, and to provide meaningful insight into user query intent, fast, scalable lexical semantic models with flexible representations are needed. Human concept organization is a rich phenomenon that has yet to be accounted for by a single coherent psychological framework: concept generalization is captured by a mixture of prototype and exemplar models, and local taxonomic information is available through multiple overlapping organizational systems. Previous work in computational linguistics on extracting lexical semantic information from unannotated corpora does not provide adequate representational flexibility and hence fails to capture the full extent of human conceptual knowledge. In this thesis I outline a family of probabilistic models capable of capturing important aspects of the rich organizational structure found in human language that can predict contextual variation, selectional preference, and feature-saliency norms to a much higher degree of accuracy than previous approaches. These models account for the cross-cutting structure of concept organization (i.e., selective attention, or the notion that humans make use of different categorization systems for different kinds of generalization tasks) and can be applied to Web-scale corpora. Using these models, natural language systems will be able to infer more comprehensive semantic relations, which in turn may yield improved systems for question answering, text classification, machine translation, and information retrieval.
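One concrete form such flexible representations can take is multi-prototype vectors: clustering a word's contexts so each sense-like cluster receives its own vector. A toy sketch with an assumed corpus and cluster count:

# Toy multi-prototype sketch: cluster occurrences of an ambiguous word
# by their contexts, giving one vector per use rather than per word.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

contexts = [  # occurrences of the ambiguous word "bank"
    "deposit money at the bank branch",
    "the bank raised interest rates",
    "fishing on the river bank at dawn",
    "grassy bank beside the stream",
]

vectors = TfidfVectorizer().fit_transform(contexts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for context, sense in zip(contexts, labels):
    print(sense, context)  # financial vs. riverside uses should separate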