111
Automatic Protein Function Annotation Through Text Mining. Toonsi, Sumyyah, 25 August 2019.
The knowledge of a protein's function is essential to many studies in molecular biology, genetic experiments, and protein-protein interactions. The Gene Ontology (GO) captures gene products' functions in classes and establishes relationships between them. Manually annotating proteins with GO functions from the biomedical literature is a tedious process which calls for automation. We develop a novel, dictionary-based method to annotate proteins with functions from text. We extract text-based features from words matched against a dictionary of GO. Since a class is included upon any word match with its class description, negative samples outnumber positive ones. To mitigate this imbalance, we apply strict rules before weakly labeling the dataset according to the curated annotations. Furthermore, we discard samples of low statistical evidence and train a logistic regression classifier. The results of a 5-fold cross-validation show a high precision of 91% and an accuracy of 96% in the best-performing fold. The worst fold showed a precision of 80% and an accuracy of 95%. We conclude by explaining how this method can be used for similar annotation problems.
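A minimal sketch, in Python with scikit-learn, of the kind of pipeline the abstract describes: match words against a GO class description dictionary, turn the matches into simple text-based features for weakly labeled candidates, and train a logistic regression classifier evaluated with cross-validation. The feature set, toy texts, and labels are illustrative assumptions, not the thesis's actual data or features.

```python
# Sketch (not the author's code): dictionary matching + weak labels + logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def candidate_features(text, go_class_words):
    """Simple dictionary-match features for one (text, GO class) candidate."""
    tokens = text.lower().split()
    matches = [t for t in tokens if t in go_class_words]
    return [
        len(matches),                                      # number of matched words
        len(matches) / max(len(tokens), 1),                # match density
        len(set(matches)) / max(len(go_class_words), 1),   # class-description coverage
    ]

# Toy weakly labeled candidates: 1 = supported by a curated annotation, 0 = not.
go_words = {"binding", "kinase", "activity", "transport"}
texts = [
    "the protein shows kinase activity and atp binding",
    "expression was measured in liver tissue samples",
    "membrane transport activity was observed in vitro",
    "the study describes sequencing protocols only",
]
labels = np.array([1, 0, 1, 0])
X = np.array([candidate_features(t, go_words) for t in texts])

clf = LogisticRegression()
scores = cross_val_score(clf, X, labels, cv=2)  # 5-fold in the thesis; 2 folds for the toy data
print("cross-validated accuracy:", scores.mean())
```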
112
Quality of SQL Code Security on StackOverflow and Methods of Prevention. Klock, Robert, 29 July 2021.
No description available.
113
A Machine Learning Approach to Predicting Community Engagement on Social Media During Disasters. Alshehri, Adel, 01 July 2019.
The use of social media is expanding significantly and can serve a variety of purposes. Over the last few years, users of social media have played an increasing role in the dissemination of emergency and disaster information. It is becoming more common for affected populations and other stakeholders to turn to Twitter to gather information about a crisis when decisions need to be made and action is taken. However, social media platforms, especially Twitter, present some drawbacks when it comes to gathering information during disasters. These drawbacks include information overload, messages written in an informal format, and the presence of noise and irrelevant information. These factors make gathering accurate information online very challenging and confusing, which in turn may affect the ability of the public, communities, and organizations to prepare for, respond to, and recover from disasters. To address these challenges, we present an integrated three-part (clustering-classification-ranking) framework that helps users sift through the masses of Twitter data to find useful information. In the first part, we build standard machine learning models to automatically extract and identify topics present in a text and to derive hidden patterns exhibited by a dataset. In the second part, we develop binary and multi-class classification models of Twitter data to categorize each tweet as relevant or irrelevant and to further classify relevant tweets into four types of community engagement: reporting information, expressing negative engagement, expressing positive engagement, and asking for information. In the third part, we propose a binary classification model to categorize the collected tweets as high- or low-priority. We present an evaluation of the effectiveness of detecting events using a variety of features derived from Twitter posts, namely textual content, term frequency-inverse document frequency, linguistic, sentiment, psychometric, temporal, and spatial features. Our framework also provides insights for researchers and developers building more robust socio-technical systems for identifying types of online community engagement and ranking high-priority tweets in disaster situations.
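A minimal sketch of the classification stage of such a framework, assuming TF-IDF features and logistic regression as stand-ins for the thesis's models: one binary relevant/irrelevant classifier and one multi-class classifier for the four engagement types. The toy tweets and labels are illustrative only.

```python
# Sketch (not the thesis code): relevance filtering followed by engagement-type classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "Bridge on Route 9 is flooded, avoid the area",
    "Does anyone know if the shelter on Main St is open?",
    "So grateful to the volunteers helping tonight",
    "Check out my new playlist",
]
relevance = [1, 1, 1, 0]                                   # 1 = disaster-relevant
engagement = ["reporting", "asking", "positive", None]     # labels only for relevant tweets

relevance_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
relevance_clf.fit(tweets, relevance)

relevant = [t for t, e in zip(tweets, engagement) if e is not None]
labels = [e for e in engagement if e is not None]
engagement_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
engagement_clf.fit(relevant, labels)

new_tweet = ["Power lines are down near the school"]
if relevance_clf.predict(new_tweet)[0] == 1:
    print("engagement type:", engagement_clf.predict(new_tweet)[0])
```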
114
Interpretability for Deep Learning Text Classifiers. Lucaci, Diana, 14 December 2020.
The ubiquitous presence of automated decision-making systems whose performance is comparable to that of humans has brought attention to the necessity of interpretability for the generated predictions. Whether the goal is predicting the system's behavior when the input changes, building user trust, or assisting experts in improving machine learning methods, interpretability is paramount when the problem is not sufficiently validated in real applications and when unacceptable results lead to significant consequences.

While humans have no standard interpretations for the decisions they make, the complexity of systems with advanced information-processing capacities conceals the detailed explanations for individual predictions, encapsulating them under layers of abstraction and complex mathematical operations. Interpretability for deep learning classifiers thus becomes a challenging research topic in which the ambiguity of the problem statement allows for multiple exploratory paths.

Our work focuses on generating natural language interpretations for individual predictions of deep learning text classifiers. We propose a framework for extracting and identifying the phrases of the training corpus that influence the prediction confidence the most, through unsupervised key phrase extraction and neural predictions. We assess the contribution margin that the added justification has when the deep learning model predicts the class probability of a text instance, by introducing a contribution metric that quantifies the fidelity of the explanation to the model. We assess both the performance impact of the proposed approach on the classification task, as quantitative analysis, and the quality of the generated justifications, through extensive qualitative and error analysis.

This methodology captures the most influential phrases of the training corpus as explanations that reveal the linguistic features used for individual test predictions, allowing humans to predict the behavior of the deep learning classifier.
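One way to make such a contribution metric concrete is to compare the model's predicted class probability before and after a candidate phrase is removed from the input. The sketch below illustrates that idea with a simple scikit-learn classifier standing in for the deep learning model; the texts, labels, and phrases are assumptions, not the thesis's definition of the metric.

```python
# Sketch (not the thesis implementation): phrase contribution as the drop in class probability.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "the battery life is excellent and the screen is bright",
    "the battery died after a week and support was unhelpful",
    "excellent build quality, highly recommended",
    "screen cracked on arrival, very disappointed",
]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative review

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

def phrase_contribution(model, text, phrase, target_class=1):
    """Drop in predicted probability for target_class when the phrase is removed."""
    p_full = model.predict_proba([text])[0][target_class]
    ablated = text.replace(phrase, " ").strip()
    p_ablated = model.predict_proba([ablated])[0][target_class]
    return p_full - p_ablated

text = "the battery life is excellent and the screen is bright"
for phrase in ["excellent", "battery life", "the screen"]:
    print(phrase, "->", round(phrase_contribution(model, text, phrase), 3))
```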
115
Automated Extraction Of Associations Between Methylated Genes and Diseases From Biomedical Literature. Bin Res, Arwa A., 12 1900.
Associations between methylated genes and diseases have been investigated in several studies, and it is critical to have such information available for a better understanding of diseases and for clinical decisions. However, such information is scattered across a large number of electronic publications and is difficult to search for manually. Therefore, the goal of the project is to develop a machine learning model that can efficiently extract such information. Twelve machine learning algorithms were applied and compared on this problem using three approaches: document-term frequency matrices, position weight matrices, and a hybrid approach that combines the previous two. The best results were obtained by the hybrid approach with a random forest model that, in a 10-fold cross-validation, achieved an F-score and accuracy of nearly 85% and 84%, respectively. On a completely separate testing set, an F-score and accuracy of 89% and 88%, respectively, were obtained. Based on this model, we developed a tool that automates the extraction of associations between methylated genes and diseases from electronic text. Our study contributes an efficient method for extracting specific types of associations from free text, and the methodology developed here can be extended to other, similar association extraction problems.
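A minimal sketch of the hybrid feature idea: concatenate document-term frequency counts with simple position-weighted keyword features and train a random forest evaluated by cross-validation. The keyword list, the position-weighting scheme, and the toy sentences are assumptions rather than the thesis's actual matrices.

```python
# Sketch (not the thesis code): document-term frequencies + position-weighted keywords + random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

sentences = [
    "methylation of BRCA1 is associated with breast cancer",
    "the gene was sequenced using standard protocols",
    "hypermethylation of MLH1 correlates with colorectal cancer risk",
    "samples were stored at minus eighty degrees",
]
labels = np.array([1, 0, 1, 0])  # 1 = states a methylated-gene/disease association

keywords = ["methylation", "hypermethylation", "associated", "correlates", "cancer"]

def position_weighted(sentence):
    """Weight each keyword occurrence by its (reversed) position in the sentence."""
    tokens = sentence.lower().split()
    feats = []
    for kw in keywords:
        weight = 0.0
        for i, tok in enumerate(tokens):
            if kw in tok:
                weight += 1.0 - i / len(tokens)  # earlier mentions weigh more
        feats.append(weight)
    return feats

dtm = CountVectorizer().fit_transform(sentences).toarray()   # document-term frequencies
pwm = np.array([position_weighted(s) for s in sentences])    # position-weighted keyword features
hybrid = np.hstack([dtm, pwm])

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, hybrid, labels, cv=2)  # 10-fold in the thesis; 2 folds for the toy data
print("accuracy:", scores.mean())
```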
116
Intelligent Prediction of Stock Market Using Text and Data Mining Techniques. Raahemi, Mohammad, 04 September 2020.
The stock market undergoes many fluctuations on a daily basis. These changes can be challenging to anticipate. Understanding such volatility is beneficial to investors, as it empowers them to make informed decisions to avoid losses and to invest when opportunities are predicted to earn funds. The objective of this research is to use text mining and data mining techniques to discover the relationship between news articles and stock price fluctuations. There are a variety of sources for news articles, including Bloomberg, Google Finance, Yahoo Finance, Factiva, Thomson Reuters, and Twitter. In our research, we use the Factiva and Intrinio news databases. These databases provide daily analytical articles about the general stock market, as well as daily changes in stock prices. The focus of this research is on understanding the news articles that influence stock prices. We believe that different types of stocks in the market behave differently, and news articles could provide indications of different stock price movements. The goal of this research is to create a framework that uses text mining and data mining algorithms to correlate different types of news articles with stock fluctuations to predict whether to "Buy", "Sell", or "Hold" a specific stock. We train Doc2Vec models on 1 GB of financial news from Factiva to convert news articles into vectors of 100 dimensions. After preprocessing the data, including labeling and balancing it, we build five predictive models, namely Neural Networks, SVM, Decision Tree, KNN, and Random Forest, to predict stock movements (Buy, Sell, or Hold). We evaluate the performance of the predictive models in terms of accuracy and area under the ROC curve. We conclude that SVM provides the best performance among the five models for predicting stock movement.
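A minimal sketch of the core pipeline, assuming the gensim Doc2Vec implementation and scikit-learn's SVM: train document vectors of 100 dimensions and fit a classifier to predict Buy/Sell/Hold. The toy articles and labels are illustrative, not data from Factiva or Intrinio.

```python
# Sketch (not the thesis pipeline): Doc2Vec article vectors + SVM for Buy/Sell/Hold.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

articles = [
    "tech shares rally after strong quarterly earnings report",
    "regulator opens investigation into accounting practices",
    "company reiterates guidance, analysts see little change",
    "profit warning issued amid falling demand",
]
labels = ["Buy", "Sell", "Hold", "Sell"]

# Train Doc2Vec on the (toy) corpus; each article is tagged with its index.
corpus = [TaggedDocument(words=a.split(), tags=[i]) for i, a in enumerate(articles)]
d2v = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

# Represent each article by its learned 100-dimensional vector and train the SVM.
X = [d2v.dv[i] for i in range(len(articles))]
svm = SVC(kernel="rbf")
svm.fit(X, labels)

new_article = "earnings beat expectations and dividend raised".split()
print(svm.predict([d2v.infer_vector(new_article)])[0])
```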
117
Analyzing and evaluating security features in software requirements. Hayrapetian, Allenoush, 28 October 2016.
Indiana University-Purdue University Indianapolis (IUPUI) / Software requirements for complex projects often contain specifications of non-functional attributes (e.g., security-related features). The process of analyzing such requirements for standards compliance is laborious and error prone. Due to the inherently free-flowing nature of software requirements, it is tempting to apply Natural Language Processing (NLP) and Machine Learning (ML) based techniques to analyze these documents. In this thesis, we propose a novel semi-automatic methodology that assesses the security requirements of a software system with respect to completeness and ambiguity, creating a bridge between the requirements documents and compliance with security standards.
Security standards, e.g., those introduced by the ISO and OWASP, are compared against annotated software project documents for textual entailment relationships (NLP), and the results are used to train a neural network model (ML) for classifying security-based requirements. Hence, this approach aims to identify the appropriate structures that underlie software requirements documents. Once such structures are formalized and empirically validated, they will provide guidelines to software organizations for generating comprehensive and unambiguous requirements specification documents as related to security-oriented features. The proposed solution will assist organizations during the early phases of developing secure software and reduce overall development effort and costs.
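As a simplified stand-in for the entailment-plus-neural-network approach described above, the sketch below trains a small neural network to flag requirement sentences as security-related, using plain TF-IDF features in place of entailment results against ISO or OWASP clauses; the example requirements and labels are assumptions.

```python
# Sketch (simplified stand-in): small neural network classifying security-related requirements.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

requirements = [
    "All passwords shall be stored using a salted hash function",
    "The system shall display the report in under two seconds",
    "User sessions shall expire after fifteen minutes of inactivity",
    "The interface shall support both metric and imperial units",
]
labels = [1, 0, 1, 0]  # 1 = security-related requirement

clf = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
clf.fit(requirements, labels)

print(clf.predict(["Data in transit shall be encrypted with TLS 1.2 or higher"]))
```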
118
Translational drug interaction study using text mining technology. Wu, Heng-Yi, 15 August 2017.
Indiana University-Purdue University Indianapolis (IUPUI) / Drug-Drug Interaction (DDI) is one of the major causes of adverse drug reactions (ADR) and has been demonstrated to threaten public health. It causes an estimated 195,000 hospitalizations and 74,000 emergency room visits each year in the USA alone. Current DDI research investigates different scopes of drug interactions: molecular-level pharmacogenetic interactions (PG), pharmacokinetic interactions (PK), and clinical pharmacodynamic consequences (PD). All three types of experiments are important, but they play different roles in DDI research. Because diverse disciplines and varied studies are involved, interaction evidence is often not available across all three types, which creates knowledge gaps that hinder both DDI and pharmacogenetics research.

In this dissertation, we propose to distinguish the three types of DDI evidence (in vitro PK, in vivo PK, and clinical PD studies) and identify all knowledge gaps in the experimental evidence for them. This is a collective intelligence effort in which a text mining tool is developed for the large-scale mining and analysis of drug-interaction information, so that it can be applied to retrieve, categorize, and extract DDI information from the published literature available on PubMed. To this end, three tasks are carried out in this work. First, the lexica, ontology, and corpora needed for distinguishing the three types of studies were prepared. Beyond the lexica prepared in this work, a comprehensive dictionary of drug metabolites and reactions, which is critical to in vitro PK studies, is still lacking in public databases. Thus, second, a named entity recognition tool is proposed to identify drug metabolites and reactions in free text. Third, text mining tools for retrieving DDI articles and extracting DDI evidence are developed. With these tools, the knowledge gaps across all three types of DDI evidence can be identified, and the gaps between knowledge of the molecular mechanisms underlying DDI and their clinical consequences can be closed by predicting DDI from the retrieved drug-gene interaction information, exemplifying how the tools and methods can advance DDI pharmacogenetics research.
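A minimal sketch of the categorization step, assuming simple keyword cues in place of the curated lexica and ontology: assign a retrieved abstract to one of the three DDI evidence types. The cue lists and the example abstract are illustrative only.

```python
# Sketch (not the dissertation's tool): keyword-based categorization into DDI evidence types.
EVIDENCE_CUES = {
    "in vitro PK": ["microsome", "hepatocyte", "cyp3a4 inhibition", "ic50", "in vitro"],
    "in vivo PK":  ["auc", "cmax", "clearance", "healthy volunteers", "plasma concentration"],
    "clinical PD": ["adverse event", "bleeding", "qt prolongation", "hospitalization", "efficacy"],
}

def categorize(abstract: str) -> str:
    """Return the evidence type whose cues appear most often in the abstract."""
    text = abstract.lower()
    scores = {etype: sum(text.count(cue) for cue in cues)
              for etype, cues in EVIDENCE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"

example = ("Coadministration increased midazolam AUC by 3.2-fold and "
           "reduced clearance in healthy volunteers.")
print(categorize(example))  # -> in vivo PK
```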
119
Does Quality Management Practice Influence Performance in the Healthcare Industry? Xie, Heng, 08 1900.
This research examines the relationship between quality management (QM) practices and performance in the healthcare industry through three studies. The results contribute both to advancing QM theory and to developing a unique text mining method, illustrated by examining QM in the healthcare industry. Essay 1 explains the relationship between operational performance and QM practices in the healthcare industry. This study analyzed findings from the literature using meta-analysis, and we applied confirmatory semantic analysis (CSA) to examine the Baldrige winners' applications. Essay 2 examines the benefits associated with an effective QM program in the healthcare industry. This study addresses the research question of how effective QM practice results in improved hospital performance by comparing the performance of Baldrige Award-winning hospitals with matching hospitals, the state average, and the national average. The results show that the Baldrige Award can lead to an increase in patient satisfaction in certain periods. Essay 3 discusses the contribution of an online clinic appointment system (OCAS) to QM practices. An enhanced trust model was built to understand the mechanism of patients' trust formation in the OCAS. Understanding the determinants of patients' trust and willingness to use the OCAS can provide valuable guidance for medical institutions establishing health information technology-based services within quality service improvement programs. This research makes three significant contributions. First, it analyzes the role of QM practices in the healthcare industry. Second, it develops a unique text mining method. Third, it provides a validated trust model and contributes to the body of research on trust in healthcare information technology.
120
TEXT MINER FOR HYPERGRAPHS USING OUTPUT SPACE SAMPLING. Tirupattur, Naveen, 16 August 2011.
Indiana University-Purdue University Indianapolis (IUPUI) / Text mining is the process of extracting high-quality knowledge from the analysis of textual data. Rapidly growing interest and focus on research in many fields is resulting in an overwhelming amount of research literature. This literature is a vast source of knowledge, but due to its huge volume it is practically impossible for researchers to extract the knowledge manually. Hence, there is a need for an automated approach to extracting knowledge from unstructured data, and text mining is the right approach for automated extraction of knowledge from textual data.

The objective of this thesis is to mine documents pertaining to research literature, to find novel associations among entities appearing in that literature using Incremental Mining. Traditional text mining approaches provide binary associations, but it is important to understand the context in which these associations occur. For example, entity A has an association with entity B in the context of entity C. These contexts can be visualized as multi-way associations among the entities, which are represented by a hypergraph. This thesis describes the extraction of such multi-way associations among entities using Frequent Itemset Mining, and the application of a new concept called output space sampling to extract such multi-way associations in a space- and time-efficient manner. We incorporate the concept of personalization into output space sampling so that the user can specify his or her interests as the frequent hyper-associations are extracted from the text.
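A minimal sketch of the multi-way association idea: treat the entities recognized in each document as a transaction and count frequent entity itemsets of a fixed size, which correspond to hyperedges in a hypergraph. The documents, support threshold, and itemset size are assumptions, and the sketch omits the output space sampling and personalization that make the thesis approach space- and time-efficient.

```python
# Sketch (not the thesis system): frequent entity itemsets as candidate hyperedges.
from itertools import combinations
from collections import Counter

# Entities recognized in each document (one transaction per document).
documents = [
    {"geneA", "diseaseX", "drugZ"},
    {"geneA", "diseaseX", "drugZ", "geneB"},
    {"geneA", "diseaseX", "drugZ"},
    {"geneB", "diseaseY"},
]
MIN_SUPPORT = 2   # minimum number of documents an itemset must appear in
MAX_SIZE = 3      # size of the multi-way associations (hyperedges) to report

counts = Counter()
for entities in documents:
    for itemset in combinations(sorted(entities), MAX_SIZE):
        counts[itemset] += 1

hyperedges = {itemset: c for itemset, c in counts.items() if c >= MIN_SUPPORT}
for itemset, support in hyperedges.items():
    print(itemset, "support =", support)
# e.g. ('diseaseX', 'drugZ', 'geneA') support = 3: a 3-way association that would
# become a hyperedge connecting these three entities.
```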