About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
281

Application of Data mining in Medical Applications

Eapen, Arun George January 2004 (has links)
Abstract: Data mining is a relatively new field of research whose major objective is to acquire knowledge from large amounts of data. In medical and health care, regulations and the availability of computers have made large amounts of data accessible. Practitioners are expected to use all of these data in their work, yet such volumes cannot be processed by humans quickly enough to make diagnoses, prognoses, and treatment schedules. A major objective of this thesis is to evaluate data mining tools in medical and health care applications and to develop a tool that can help make timely and accurate decisions. Two medical databases are considered: one for describing the various tools and the other as the case study. The first is related to breast cancer and the second to the minimum data set for mental health (MDS-MH). The breast cancer database consists of 10 attributes and the MDS-MH dataset of 455 attributes. As many data mining algorithms and tools are available, we evaluate only a few on these applications and develop classification rules that can be used in prediction. Our results indicate that for the major case study, the mental health problem, accuracies of 70 to 80% are achievable. A further extension of this work is to make the classification rules available on mobile devices such as PDAs: patient information is entered directly on the PDA and classified against the rules stored there, providing real-time assistance to practitioners.
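As a toy illustration of the deployment idea above (classification rules stored on a device and applied to entered patient data), the following sketch applies if-then rules to a patient record. The attribute names, thresholds, and classes are hypothetical and not taken from the thesis:

```python
def classify(record, rules, default="unknown"):
    """Return the class of the first rule whose conditions all hold on the record."""
    for conditions, label in rules:
        if all(pred(record.get(attr)) for attr, pred in conditions.items()):
            return label
    return default

# Hypothetical rules: each pair is ({attribute: predicate}, predicted class).
rules = [
    ({"clump_thickness": lambda v: v is not None and v >= 7}, "malignant"),
    ({"clump_thickness": lambda v: v is not None and v <= 3,
      "uniformity": lambda v: v is not None and v <= 2}, "benign"),
]

patient = {"clump_thickness": 8, "uniformity": 5}
print(classify(patient, rules))  # -> malignant
```

A real rule set would be produced by a rule learner on the training data; the point here is only that applying stored rules is cheap enough for a handheld device.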
282

arules - A Computational Environment for Mining Association Rules and Frequent Item Sets

Hornik, Kurt, Grün, Bettina, Hahsler, Michael January 2005 (has links) (PDF)
Mining frequent itemsets and association rules is a popular and well-researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets, and association rules. (authors' abstract)
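The level-wise candidate generation behind Apriori can be sketched in a few lines of Python (a naive educational version; arules itself delegates to Christian Borgelt's optimized C implementations):

```python
def frequent_itemsets(transactions, min_support):
    """Naive level-wise (Apriori-style) mining of frequent itemsets.
    Returns {itemset: support count}. Educational sketch only."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    result, size = {}, 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: s for c, s in counts.items() if s >= min_support}
        result.update(frequent)
        size += 1
        # Candidate generation: unions of frequent sets one item larger.
        current = list({a | b for a in frequent for b in frequent
                        if len(a | b) == size})
    return result

baskets = [frozenset(t) for t in
           [{"milk", "bread"}, {"milk", "bread", "eggs"},
            {"bread", "eggs"}, {"milk", "eggs"}]]
print(frequent_itemsets(baskets, min_support=2))
```

Association rules are then derived from the frequent itemsets by splitting each into antecedent and consequent and checking a confidence threshold.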
283

Knowledge-Intensive Subgroup Mining - Techniques for Automatic and Interactive Discovery / Wissensintensive Subgruppenentdeckung – Automatische und Interaktive Entdeckungsmethoden

Atzmüller, Martin January 2006 (has links) (PDF)
Data mining has proved its significance in various domains and applications. As an important subfield of the general data mining task, subgroup mining can be used, e.g., for marketing purposes in business domains, or for quality profiling and analysis in medical domains. The goal is to efficiently discover novel, potentially useful and ultimately interesting knowledge. However, in real-world situations these requirements often cannot be fulfilled, e.g., if the applied methods do not scale for large data sets, if too many results are presented to the user, or if many of the discovered patterns are already known to the user. This thesis proposes a combination of several techniques in order to cope with the sketched problems: We discuss automatic methods, including heuristic and exhaustive approaches, and especially present the novel SD-Map algorithm for exhaustive subgroup discovery that is fast and effective. For an interactive approach we describe techniques for subgroup introspection and analysis, and we present advanced visualization methods, e.g., the zoomtable that directly shows the most important parameters of a subgroup and that can be used for optimization and exploration. We also describe various visualizations for subgroup comparison and evaluation in order to support the user during these essential steps. Furthermore, we propose to include possibly available background knowledge that is easy to formalize into the mining process. We can utilize the knowledge in many ways: To focus the search process, to restrict the search space, and ultimately to increase the efficiency of the discovery method. We especially present background knowledge to be applied for filtering the elements of the problem domain, for constructing abstractions, for aggregating values of attributes, and for the post-processing of the discovered set of patterns. 
Finally, the techniques are combined into a knowledge-intensive process supporting both automatic and interactive methods for subgroup mining. The practical significance of the proposed approach strongly depends on the available tools. We introduce the VIKAMINE system as a highly integrated environment for knowledge-intensive active subgroup mining. Also, we present an evaluation consisting of two parts: With respect to objective evaluation criteria, i.e., comparing the efficiency and the effectiveness of the subgroup discovery methods, we provide an experimental evaluation using generated data. For that task we present a novel data generator that allows a simple and intuitive specification of the data characteristics. The results of the experimental evaluation indicate that the novel SD-Map method outperforms the other described algorithms in efficiency on data sets similar to those of the intended application, and also surpasses the heuristic methods with respect to precision and recall. Subjective evaluation criteria include the user acceptance, the benefit of the approach, and the interestingness of the results. We present five case studies utilizing the presented techniques: The approach has been successfully implemented in medical and technical applications using real-world data sets. The method was very well accepted by the users, who were able to discover novel, useful, and interesting knowledge.
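Subgroup discovery as described above can be sketched with a naive exhaustive search that scores attribute-value conjunctions by weighted relative accuracy (WRAcc), a standard subgroup quality function. This toy version enumerates all conjunctions rather than using the FP-tree-based approach of SD-Map, and the data is invented:

```python
from itertools import combinations, product

def wracc(subgroup, data, target):
    """Weighted relative accuracy: coverage * (target share in the
    subgroup minus target share overall)."""
    covered = [r for r in data if all(r[a] == v for a, v in subgroup)]
    if not covered:
        return 0.0
    p0 = sum(1 for r in data if r[target]) / len(data)
    p = sum(1 for r in covered if r[target]) / len(covered)
    return len(covered) / len(data) * (p - p0)

def discover(data, attrs, target, max_len=2, top=3):
    """Exhaustively score every attribute-value conjunction up to max_len."""
    scored = []
    for k in range(1, max_len + 1):
        for combo in combinations(attrs, k):
            domains = [sorted({r[a] for r in data}) for a in combo]
            for vals in product(*domains):
                sg = tuple(zip(combo, vals))
                scored.append((wracc(sg, data, target), sg))
    scored.sort(key=lambda s: -s[0])
    return scored[:top]

# Invented toy data: is smoking associated with illness?
data = [
    {"sex": "m", "smoker": True,  "ill": True},
    {"sex": "m", "smoker": True,  "ill": True},
    {"sex": "f", "smoker": True,  "ill": True},
    {"sex": "f", "smoker": False, "ill": False},
    {"sex": "m", "smoker": False, "ill": False},
    {"sex": "f", "smoker": False, "ill": True},
]
print(discover(data, ["sex", "smoker"], "ill")[0])
```

The background-knowledge techniques in the thesis would prune this search space (filtering attributes, aggregating values) instead of enumerating it blindly.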
284

Data Mining for Car Insurance Claims Prediction

Huangfu, Dan 27 April 2015 (has links)
A key challenge for the insurance industry is to charge each customer an appropriate price for the risk they represent. Risk varies widely from customer to customer, and a deep understanding of different risk factors helps predict the likelihood and cost of insurance claims. The goal of this project is to see how well various statistical methods perform in predicting bodily injury liability insurance claim payments based on the characteristics of the insured customers' vehicles, for a particular dataset from Allstate Insurance Company. We tried several statistical methods, including logistic regression, the Tweedie compound gamma-Poisson model, principal component analysis (PCA), response averaging, and regression and decision trees. Of all the models we tried, PCA combined with a regression tree produced the best results. This is somewhat surprising given the widespread use of the Tweedie model for insurance claim prediction problems.
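The Tweedie model mentioned above arises from a compound Poisson-gamma mechanism: a Poisson-distributed number of claims per policy, each with a gamma-distributed severity. A minimal simulation sketch (parameter values are illustrative, not fitted to the Allstate data):

```python
import math
import random

def poisson(rng, lam):
    """Poisson draw via Knuth's method (stdlib random has no Poisson)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_aggregate_claims(n_policies, freq_lambda, sev_shape, sev_scale, seed=0):
    """Total claim cost per policy under a compound Poisson-gamma model,
    the mechanism behind the Tweedie distribution. Parameters are
    illustrative, not fitted to any real data."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_policies):
        n_claims = poisson(rng, freq_lambda)
        totals.append(sum(rng.gammavariate(sev_shape, sev_scale)
                          for _ in range(n_claims)))
    return totals

totals = simulate_aggregate_claims(1000, freq_lambda=0.2,
                                   sev_shape=2.0, sev_scale=500.0)
zero_share = sum(1 for t in totals if t == 0) / len(totals)
print(round(zero_share, 2))  # most policies have no claims: a point mass at zero
```

The simulated totals show why Tweedie regression is popular for claim data: a large point mass at exactly zero combined with a continuous right-skewed tail, which ordinary least squares handles poorly.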
285

Profiling television viewing using data mining

Chanza, Martin Mudongo 25 April 2013 (has links)
A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science Johannesburg, February 2013 / This study conducted a critical review of data-mining techniques used to extract meaningful information from very large databases. The study aimed to determine cluster analysis methods suitable for the analysis of binary television-viewing data. Television-viewing data from the South African Broadcasting Corporation was used for the analysis. Partitioning and hierarchical clustering methods are compared in the dissertation. The study also examines distance measures used in the clustering of binary data. Particular consideration was given to methods for determining the most appropriate number of clusters to extract. Based on the results of the cluster analysis, four television-viewer profiles were determined. These viewer profiles will enable the South African Broadcasting Corporation to provide viewer-targeted programming.
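For binary viewing data like this, the choice of distance measure matters because most viewer-programme entries are zero. A small sketch contrasting two common binary distances (the viewing vectors are invented for illustration):

```python
def jaccard_distance(a, b):
    """1 - |a AND b| / |a OR b| over binary vectors; joint absences
    (0,0 positions) are ignored, which suits sparse viewing data."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1 - both / either

def simple_matching_distance(a, b):
    """Fraction of disagreeing positions; joint absences count as agreement."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

# Two viewers who share one watched programme among many unwatched slots:
v1 = [1, 1, 0, 0, 0, 0, 0, 0]
v2 = [1, 0, 1, 0, 0, 0, 0, 0]
print(jaccard_distance(v1, v2))          # -> 0.666...: quite dissimilar
print(simple_matching_distance(v1, v2))  # -> 0.25: low only because of shared zeros
```

With many unwatched programmes, simple matching makes almost everyone look alike, which is why Jaccard-type measures are often preferred before clustering binary data.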
286

Harnessing User Data to Improve Facebook Features

Epstein, Greg January 2010 (has links)
Thesis advisor: Sergio Alvarez / The recent explosion of online social networking through sites like Twitter, MySpace, and Facebook has millions of users spending hours a day sorting through information on their friends, coworkers, and other contacts. These networks also house massive amounts of user activity information that is often used for advertising purposes but can be put to other uses as well. Facebook, now the most popular network in terms of registered users, active users, and page rank, offers only a sparse set of built-in filtering and predictive tools, such as "suggesting a friend" or the "Top News" feed filter. These basic tools seem to underutilize the information that Facebook stores on all of its users. This paper explores how to better use the available Facebook data to create more useful tools that assist users in sorting through their activities on Facebook. / Thesis (BS) — Boston College, 2010. / Submitted to: Boston College. College of Arts and Sciences. / Discipline: Computer Science Honors Program. / Discipline: College Honors Program. / Discipline: Computer Science.
287

Sequential Data Mining and its Applications to Pharmacovigilance

Qin, Xiao 02 April 2019 (has links)
With the phenomenal growth of digital devices, coupled with their ever-increasing capabilities of data generation and storage, sequential data is becoming more and more ubiquitous in a wide spectrum of application scenarios. There are various embodiments of sequential data, such as temporal databases, time series, and text (word sequences), where the first is synchronous over time and the latter two are often generated in an asynchronous fashion. In order to derive precious insights, it is critical to learn and understand the behavior dynamics as well as the causality relationships across sequences. Pharmacovigilance is defined as the science and activities relating to the detection, assessment, understanding and prevention of adverse drug reactions (ADRs) or other drug-related problems. In the post-marketing phase, the effectiveness and the safety of drugs are monitored by regulatory agencies, in what is known as post-marketing surveillance. Spontaneous Reporting Systems (SRS), e.g., the U.S. Food and Drug Administration Adverse Event Reporting System (FAERS), collect drug safety complaints over time, providing the key evidence to support regulatory actions towards the reported products. With the rapid growth of reporting volume and velocity, data mining techniques promise to be effective in helping drug safety reviewers perform surveillance tasks in a timely fashion. My dissertation studies the problem of exploring, analyzing and modeling various types of sequential data within a typical SRS: Temporal Correlations Discovery and Exploration. An SRS can be seen as a temporal database where each transaction encodes the co-occurrence of some reported drugs and observed ADRs in a time frame. Temporal association rule learning (TARL) has been proven to be a prime candidate to derive associations among the objects in such a temporal database.
However, TARL is parameterized and computationally expensive, making it difficult to use for discovering interesting associations among drugs and ADRs in a timely fashion. Worse yet, existing interestingness measures fail to capture the significance of certain types of association in the context of pharmacovigilance, e.g., drug-drug interaction (DDI) related ADRs. To discover DDI-related ADRs using TARL, we propose an interestingness measure that aligns with the DDI semantics. We propose an interactive temporal association analytics framework that supports real-time temporal association derivation and exploration. Anomaly Detection in Time Series. Abnormal reports may reveal meaningful ADR cases that are overlooked by frequency-based data mining approaches such as association rule learning, where patterns are derived from frequently occurring events. In addition, the sense of abnormality or rareness may vary in different contexts. For example, an ADR that normally occurs in the adult population may rarely happen in the youth population, but with life-threatening outcomes. The local outlier factor (LOF) is identified as a suitable approach to capture such local abnormal phenomena. However, existing LOF algorithms and their variations fail to cope with high-velocity data streams due to their high algorithmic complexity. We propose new local outlier semantics that leverage kernel density estimation (KDE) to effectively detect local outliers from streaming data. A strategy to continuously detect top-N KDE-based local outliers over streams is also designed, called KELOS -- the first linear-time-complexity streaming local outlier detection approach. Text Modeling. Language modeling (LM) is a fundamental problem in many natural language processing (NLP) tasks. LM is the development of probabilistic models that are able to predict the next word in a sequence given the words that precede it.
Recently, LM has been advanced by the success of recurrent neural networks (RNNs), which overcome the Markov assumption made in traditional statistical language models. In theory, RNNs such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks can "remember" arbitrarily long spans of history if provided with enough capacity. However, they do not perform well on very long sequences in practice, as the gradient computation for RNNs becomes increasingly ill-behaved as the expected dependency becomes longer. One way of tackling this problem is to feed succinct information that encodes the semantic structure of the entire document, such as latent topics, as context to guide the modeling process. Clinical narratives that describe complex medical events are often accompanied by meta-information such as a patient's demographics, diagnoses and medications. This structured information implicitly relates to the logical and semantic structure of the entire narrative, and thus affects vocabulary choices for the narrative composition. To leverage this meta-information, we propose a supervised topic compositional neural language model, called MeTRNN, that integrates the strength of supervised topic modeling in capturing global semantics with the capacity of contextual recurrent neural networks (RNNs) in modeling local word dependencies.
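The KDE-based scoring idea behind KELOS can be sketched in a simple batch form: estimate a density at every point and flag the lowest-density points as outliers. This toy version is one-dimensional and non-streaming, unlike KELOS itself:

```python
import math

def kde_density(x, sample, bandwidth=1.0):
    """Gaussian kernel density estimate at x from a 1-D sample."""
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
               for p in sample) / norm

def top_outliers(sample, n=1, bandwidth=1.0):
    """Rank points by ascending estimated density (lowest density first)."""
    return sorted(sample, key=lambda x: kde_density(x, sample, bandwidth))[:n]

readings = [1.0, 1.2, 0.9, 1.1, 1.05, 8.0]
print(top_outliers(readings))  # -> [8.0]
```

A streaming variant would maintain the density estimates incrementally as points arrive and expire, which is where the linear-time design of KELOS matters.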
288

Modélisation et optimisation non convexe basées sur la programmation DC et DCA pour la résolution de certaines classes des problèmes en fouille de données et cryptologie / The non-convex modeling and optimization based on the DC programming and DCA for the resolution of certain classes of problems in Data Mining and cryptology

Le, Hoai Minh 24 October 2007 (has links)
This thesis is dedicated to non-convex modeling and optimization based on DC programming and DCA for certain classes of problems from two important domains: Data Mining and Cryptology. These are non-convex optimization problems of very large dimension, for which the search for good solution methods remains of great current interest. Our work relies mainly on DC programming and DCA. This choice is motivated by the robustness and performance of DC programming and DCA, their adaptation to the structures of the problems treated, and their ability to solve large-scale problems. The thesis is divided into three parts. The first part, entitled Methodology, presents the theoretical tools that serve as references for the others: the first chapter concerns DC programming and DCA, while the second covers genetic algorithms. In the second part, we develop DC programming and DCA for solving two classes of problems in Data Mining. In chapter four, we consider the fuzzy clustering model FCM and develop DC programming and DCA for its resolution; several DC formulations corresponding to different DC decompositions are proposed. Our work on hierarchical clustering (chapter five) is motivated by one of its interesting and very important applications, namely multicast communication. This is a non-convex, non-differentiable problem of very large dimension, which we reformulate as three different DC programs and for which we develop the corresponding DCAs. The third part focuses on Cryptology. The first chapter concerns the construction of balanced Boolean functions with a high degree of nonlinearity, one of the crucial problems in Cryptography. Several versions combining the two approaches, DCA and genetic algorithms (GA), are studied with the aim of simultaneously exploiting the effectiveness of each approach. The second work concerns cryptanalysis techniques for an identification scheme based on the two problems "Perceptron" (PP) and "Permuted Perceptron" (PPP). In the last chapter, we propose a method for solving both PP and PPP by DCA together with a cutting-plane method.
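The DCA iteration underlying this work can be illustrated on a one-dimensional toy DC program: minimize f(x) = x^4 - x^2 written as g - h with g(x) = x^4 and h(x) = x^2, both convex. Each step linearizes h at the current iterate and minimizes the resulting convex surrogate in closed form. This example is ours for illustration, not one of the thesis applications:

```python
def dca_step(x):
    """One DCA iteration for f(x) = x**4 - x**2, decomposed as g - h with
    g(x) = x**4 and h(x) = x**2, both convex. Linearizing h at x_k and
    minimizing g(x) - h'(x_k) * x gives 4*x**3 = 2*x_k, i.e. the update
    x_{k+1} = (x_k / 2) ** (1/3)."""
    return (x / 2) ** (1 / 3) if x >= 0 else -((-x / 2) ** (1 / 3))

x = 2.0
for _ in range(50):
    x = dca_step(x)
print(round(x, 6))  # -> 0.707107, i.e. 1/sqrt(2), a critical point of f
```

The fixed point satisfies x^3 = x/2, i.e. x^2 = 1/2, which is exactly where f'(x) = 4x^3 - 2x vanishes: DCA converges to a critical point of the non-convex objective, which is its general guarantee.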
289

APPLICATIONS OF DATA MINING IN HEALTHCARE

Bo Peng (6618929) 10 June 2019 (has links)
With increases in the quantity and quality of healthcare related data, data mining tools have the potential to improve people's standard of living through personalized and predictive medicine. In this thesis we improve the state-of-the-art in data mining for several problems in the healthcare domain. In problems such as drug-drug interaction prediction and Alzheimer's Disease (AD) biomarker discovery and prioritization, current methods either require tedious feature engineering or have unsatisfactory performance. New effective computational tools are needed that can tackle these complex problems. In this dissertation, we develop new algorithms for two healthcare problems: high-order drug-drug interaction prediction and amyloid imaging biomarker prioritization in Alzheimer's Disease. Drug-drug interactions (DDIs) and their associated adverse drug reactions (ADRs) represent a significant detriment to the public health. Existing research on DDIs primarily focuses on pairwise DDI detection and prediction. Effective computational methods for high-order DDI prediction are desired. In this dissertation, I present a deep learning based model D3I for cardinality-invariant and order-invariant high-order DDI prediction. The proposed models achieve a 0.740 F1 value and a 0.847 AUC value on high-order DDI prediction, and outperform classical methods on order-2 DDI prediction. These results demonstrate the strong potential of D3I and deep learning based models in tackling the prediction problems of high-order DDIs and their induced ADRs. The second problem I consider in this thesis is amyloid imaging biomarker discovery, for which I propose an innovative machine learning paradigm enabling precision medicine in this domain. The paradigm tailors the imaging biomarker discovery process to individual characteristics of a given patient. I implement this paradigm using a newly developed learning-to-rank method PLTR. The PLTR model seamlessly integrates two objectives for joint optimization: pushing up relevant biomarkers and ranking among relevant biomarkers. The empirical study of PLTR conducted on the ADNI data yields promising results to identify and prioritize individual-specific amyloid imaging biomarkers based on the individual's structural MRI data. The resulting top-ranked imaging biomarkers have the potential to aid personalized diagnosis and disease subtyping.
290

Ensemble Learning Algorithms for the Analysis of Bioinformatics Data

Unknown Date (has links)
Developments in advanced technologies, such as DNA microarrays, have generated tremendous amounts of data available to researchers in the field of bioinformatics. These state-of-the-art technologies present not only unprecedented opportunities to study biological phenomena of interest, but also significant challenges in terms of processing the data. Furthermore, these datasets inherently exhibit a number of challenging characteristics, such as class imbalance, high dimensionality, small dataset size, noisy data, and complexity in terms of hard-to-distinguish decision boundaries between the classes within the data. In recognition of the aforementioned challenges, this dissertation utilizes a variety of machine-learning and data-mining techniques, such as ensemble classification algorithms in conjunction with data sampling and feature selection techniques, to alleviate these problems while improving the classification results of models built on these datasets. However, in building classification models, researchers and practitioners encounter the challenge that no single classifier performs relatively well in all cases. Thus, numerous classification approaches, such as ensemble learning methods, have been developed to address this problem successfully in a majority of circumstances. Ensemble learning is a promising technique that generates multiple classification models and then combines their decisions into a single final result. Ensemble learning often performs better than single base classifiers at classification tasks. This dissertation conducts thorough empirical research by implementing a series of case studies to evaluate how ensemble learning techniques can be utilized to enhance overall classification performance, as well as improve the generalization ability of ensemble models.
This dissertation investigates ensemble learning techniques based on the boosting, bagging, and random forest algorithms, and proposes a number of modifications to the existing ensemble techniques to further improve the classification results. It examines the effectiveness of ensemble learning techniques in accounting for the challenging characteristics of class imbalance and difficult-to-learn class decision boundaries. Next, it looks into ensemble methods that are relatively tolerant of class noise, and that not only account for the problem of class noise but also improve classification performance. This dissertation also examines the joint effects of data sampling along with ensemble techniques, to determine whether sampling can further improve the classification performance of the built ensemble models. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2016. / FAU Electronic Theses and Dissertations Collection
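A minimal sketch of the bagging idea discussed above: bootstrap-resample the training set, fit a weak learner (here a one-feature decision stump) on each resample, and aggregate by majority vote. The dataset and parameters are invented for illustration:

```python
import random

def stump_fit(X, y):
    """Fit a one-feature threshold stump by maximizing training accuracy."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[f] >= t else -sign for row in X]
                acc = sum(p == yi for p, yi in zip(pred, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, f, t, sign)
    _, f, t, sign = best
    return lambda row: sign if row[f] >= t else -sign

def bagging_fit(X, y, n_estimators=11, seed=0):
    """Bootstrap-aggregate stumps and predict by (odd-sized) majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(stump_fit([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: 1 if sum(m(row) for m in models) > 0 else -1

# Invented 1-D two-class data with labels in {-1, +1}:
X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
y = [-1, -1, -1, 1, 1, 1]
model = bagging_fit(X, y)
print(model([0.1]), model([1.0]))
```

Boosting and random forests differ mainly in how the resamples are weighted and how the per-model feature choices are randomized; the aggregate-the-votes structure is the same.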
