Global ETD Search

1	The Cauchy-Net Mixture Model for Clustering with Anomalous Data Slifko, Matthew D. 11 September 2019 (has links) We live in the data explosion era. The unprecedented amount of data offers a potential wealth of knowledge but also brings about concerns regarding ethical collection and usage. Mistakes stemming from anomalous data have the potential for severe, real-world consequences, such as when building prediction models for housing prices. To combat anomalies, we develop the Cauchy-Net Mixture Model (CNMM). The CNMM is a flexible Bayesian nonparametric tool that employs a mixture between a Dirichlet Process Mixture Model (DPMM) and a Cauchy distributed component, which we call the Cauchy-Net (CN). Each portion of the model offers benefits, as the DPMM eliminates the limitation of requiring a fixed number of a components and the CN captures observations that do not belong to the well-defined components by leveraging its heavy tails. Through isolating the anomalous observations in a single component, we simultaneously identify the observations in the net as warranting further inspection and prevent them from interfering with the formation of the remaining components. The result is a framework that allows for simultaneously clustering observations and making predictions in the face of the anomalous data. We demonstrate the usefulness of the CNMM in a variety of experimental situations and apply the model for predicting housing prices in Fairfax County, Virginia. / Doctor of Philosophy / We live in the data explosion era. The unprecedented amount of data offers a potential wealth of knowledge but also brings about concerns regarding ethical collection and usage. Mistakes stemming from anomalous data have the potential for severe, real-world consequences, such as when building prediction models for housing prices. To combat anomalies, we develop the Cauchy-Net Mixture Model (CNMM). The CNMM is a flexible tool for identifying and isolating the anomalies, while simultaneously discovering cluster structure and making predictions among the nonanomalous observations. The result is a framework that allows for simultaneously clustering and predicting in the face of the anomalous data. We demonstrate the usefulness of the CNMM in a variety of experimental situations and apply the model for predicting housing prices in Fairfax County, Virginia. Bayesian Nonparametrics Dirichlet Process Mixture Model Clustering Anomaly Detection
2	Bayesian variable selection in clustering via dirichlet process mixture models Kim, Sinae 17 September 2007 (has links) The increased collection of high-dimensional data in various fields has raised a strong interest in clustering algorithms and variable selection procedures. In this disserta- tion, I propose a model-based method that addresses the two problems simultane- ously. I use Dirichlet process mixture models to define the cluster structure and to introduce in the model a latent binary vector to identify discriminating variables. I update the variable selection index using a Metropolis algorithm and obtain inference on the cluster structure via a split-merge Markov chain Monte Carlo technique. I evaluate the method on simulated data and illustrate an application with a DNA microarray study. I also show that the methodology can be adapted to the problem of clustering functional high-dimensional data. There I employ wavelet thresholding methods in order to reduce the dimension of the data and to remove noise from the observed curves. I then apply variable selection and sample clustering methods in the wavelet domain. Thus my methodology is wavelet-based and aims at clustering the curves while identifying wavelet coefficients describing discriminating local features. I exemplify the method on high-dimensional and high-frequency tidal volume traces measured under an induced panic attack model in normal humans. Bayesian inference Clustering Dirichlet process mixture model DNA microarray data analysis variable selection wavelet shrinkage
3	Approaches to Find the Functionally Related Experiments Based on Enrichment Scores: Infinite Mixture Model Based Cluster Analysis for Gene Expression Data Li, Qian 18 October 2013 (has links) No description available. Statistics Microarray data Clustering Analysis Dirichlet process mixture model Infinite Mixture model Enrichment score Functional category
4	Bayesian Nonparametric Reliability Analysis Using Dirichlet Process Mixture Model Cheng, Nan 03 October 2011 (has links) No description available. Engineering Dirichlet Process Mixture Model (DPMM)
5	Recurrent-Event Models for Change-Points Detection Li, Qing 23 December 2015 (has links) The driving risk of novice teenagers is the highest during the initial period after licensure but decreases rapidly. This dissertation develops recurrent-event change-point models to detect the time when driving risk decreases significantly for novice teenager drivers. The dissertation consists of three major parts: the first part applies recurrent-event change-point models with identical change-points for all subjects; the second part proposes models to allow change-points to vary among drivers by a hierarchical Bayesian finite mixture model; the third part develops a non-parametric Bayesian model with a Dirichlet process prior. In the first part, two recurrent-event change-point models to detect the time of change in driving risks are developed. The models are based on a non-homogeneous Poisson process with piecewise constant intensity functions. It is shown that the change-points only occur at the event times and the maximum likelihood estimators are consistent. The proposed models are applied to the Naturalistic Teenage Driving Study, which continuously recorded textit{in situ} driving behaviour of 42 novice teenage drivers for the first 18 months after licensure using sophisticated in-vehicle instrumentation. The results indicate that crash and near-crash rate decreases significantly after 73 hours of independent driving after licensure. The models in part one assume identical change-points for all drivers. However, several studies showed that different patterns of risk change over time might exist among the teenagers, which implies that the change-points might not be identical among drivers. In the second part, change-points are allowed to vary among drivers by a hierarchical Bayesian finite mixture model, considering that clusters exist among the teenagers. The prior for mixture proportions is a Dirichlet distribution and a Markov chain Monte Carlo algorithm is developed to sample from the posterior distributions. DIC is used to determine the best number of clusters. Based on the simulation study, the model gives fine results under different scenarios. For the Naturalist Teenage Driving Study data, three clusters exist among the teenagers: the change-points are 52.30, 108.99 and 150.20 hours of driving after first licensure correspondingly for the three clusters; the intensity rates increase for the first cluster while decrease for other two clusters; the change-point of the first cluster is the earliest and the average intensity rate is the highest. In the second part, model selection is conducted to determine the number of clusters. An alternative is the Bayesian non-parametric approach. In the third part, a Dirichlet process Mixture Model is proposed, where the change-points are assigned a Dirichlet process prior. A Markov chain Monte Carlo algorithm is developed to sample from the posterior distributions. Automatic clustering is expected based on change-points without specifying the number of latent clusters. Based on the Dirichlet process mixture model, three clusters exist among the teenage drivers for the Naturalistic Teenage Driving Study. The change-points of the three clusters are 96.31, 163.83, and 279.19 hours. The results provide critical information for safety education, safety countermeasure development, and Graduated Driver Licensing policy making. / Ph. D. Bayesian Finite Mixture Model Clustering Constant Piecewise Intensity Dirichlet Process Mixture Model Maximum Likelihood Estimate Naturalistic Teenage Driving Study Non-Homogeneous Poisson Process
6	Exploring Single-molecule Heterogeneity and the Price of Cell Signaling Wang, Tenglong 25 January 2022 (has links) No description available. Physics Biophysics Energetic costs of signaling Kinase-phosphatase enzymatic network ATP consumption Mutual information Information theory Evolutionary pressure Atomic force microscopy Non-parametric Bayesian learning Dirichlet process mixture model "Chinese restaurant process"
7	Non-Parametric Clustering of Multivariate Count Data Tekumalla, Lavanya Sita January 2017 (has links) (PDF) The focus of this thesis is models for non-parametric clustering of multivariate count data. While there has been significant work in Bayesian non-parametric modelling in the last decade, in the context of mixture models for real-valued data and some forms of discrete data such as multinomial-mixtures, there has been much less work on non-parametric clustering of Multi-variate Count Data. The main challenges in clustering multivariate counts include choosing a suitable multivariate distribution that adequately captures the properties of the data, for instance handling over-dispersed data or sparse multivariate data, at the same time leveraging the inherent dependency structure between dimensions and across instances to get meaningful clusters. As the first contribution, this thesis explores extensions to the Multivariate Poisson distribution, proposing efficient algorithms for non-parametric clustering of multivariate count data. While Poisson is the most popular distribution for count modelling, the Multivariate Poisson often leads to intractable inference and a suboptimal t of the data. To address this, we introduce a family of models based on the Sparse-Multivariate Poisson, that exploit the inherent sparsity in multivariate data, reducing the number of latent variables in the formulation of Multivariate Poisson leading to a better t and more efficient inference. We explore Dirichlet process mixture model extensions and temporal non-parametric extensions to models based on the Sparse Multivariate Poisson for practical use of Poisson based models for non-parametric clustering of multivariate counts in real-world applications. As a second contribution, this thesis addresses moving beyond the limitations of Poisson based models for non-parametric clustering, for instance in handling over dispersed data or data with negative correlations. We explore, for the first time, marginal independent inference techniques based on the Gaussian Copula for multivariate count data in the Dirichlet Process mixture model setting. This enables non-parametric clustering of multivariate counts without limiting assumptions that usually restrict the marginal to belong to a particular family, such as the Poisson or the negative-binomial. This inference technique can also work for mixed data (combination of counts, binary and continuous data) enabling Bayesian non-parametric modelling to be used for a wide variety of data types. As the third contribution, this thesis addresses modelling a wide range of more complex dependencies such as asymmetric and tail dependencies during non-parametric clustering of multivariate count data with Vine Copula based Dirichlet process mixtures. While vine copula inference has been well explored for continuous data, it is still a topic of active research for multivariate counts and mixed multivariate data. Inference for multivariate counts and mixed data is a hard problem owing to ties that arise with discrete marginal. An efficient marginal independent inference approach based on extended rank likelihood, based on recent work in the statistics literature, is proposed in this thesis, extending the use vines for multivariate counts and mixed data in practical clustering scenarios. This thesis also explores the novel systems application of Bulk Cache Preloading by analysing I/O traces though predictive models for temporal non-parametric clustering of multivariate count data. State of the art techniques in the caching domain are limited to exploiting short-range correlations in memory accesses at the milli-second granularity or smaller and cannot leverage long range correlations in traces. We explore for the first time, Bulk Cache Preloading, the process of pro-actively predicting data to load into cache, minutes or hours before the actual request from the application, by leveraging longer range correlation at the granularity of minutes or hours. This enables the development of machine learning techniques tailored for caching due to relaxed timing constraints. Our approach involves a data aggregation process, converting I/O traces into a temporal sequence of multivariate counts, that we analyse with the temporal non-parametric clustering models proposed in this thesis. While the focus of our thesis is models for non-parametric clustering for discrete data, particularly multivariate counts, we also hope our work on bulk cache preloading paves the way to more inter-disciplinary research for using data mining techniques in the systems domain. As an additional contribution, this thesis addresses multi-level non-parametric admixture modelling for discrete data in the form of grouped categorical data, such as document collections. Non-parametric clustering for topic modelling in document collections, where a document is as-associated with an unknown number of semantic themes or topics, is well explored with admixture models such as the Hierarchical Dirichlet Process. However, there exist scenarios, where a doc-ument requires being associated with themes at multiple levels, where each theme is itself an admixture over themes at the previous level, motivating the need for multilevel admixtures. Consider the example of non-parametric entity-topic modelling of simultaneously learning entities and topics from document collections. This can be realized by modelling a document as an admixture over entities while entities could themselves be modeled as admixtures over topics. We propose the nested Hierarchical Dirichlet Process to address this gap and apply a two level version of our model to automatically learn author entities and topics from research corpora. Multivariate Count Data Clustering Mixture Models Non-parametric Clustering Bulk Cache Preloading Dirichlet Process Mixture Models Spatio-Temporal Data Aggregation Sparse Multivariate Poisson MultiVariate Poisson (MVP) Copulas Nested Hierarchical Dirichlet Processes Dirichlet Process Mixtures Sparse-Multivariate Poisson Dirichlet Process Mixture Model Computer Science

1

Page generated in 0.0852 seconds