1 |
Linkage Based Dirichlet Processes
Song, Yuhyun 08 February 2017 (has links)
We live in the era of Big Data, with significantly richer computational resources than two decades ago. The confluence of computational resources and large volumes of data has increased researchers' interest in developing feasible Markov Chain Monte Carlo (MCMC) algorithms for large parameter spaces. Dirichlet Process Mixture Models (DPMMs) have become a Bayesian mainstay for modeling heterogeneous structure, that is, clusters, especially when the number of clusters is unknown, and established MCMC methods exist for fitting them. In contrast to many ad hoc clustering methods, using Dirichlet Processes (DPs) in a model provides a flexible, probabilistic approach for automatically estimating both the cluster structure and the number of clusters. Although DPs are nonparametric, they still depend on a base measure and a concentration parameter, both of which can heavily influence inference.
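As background on the role of this parameter (a standard property of DPs, not a result of this thesis), the concentration parameter alpha directly sets the prior expected number of clusters: under the Chinese restaurant process representation, n observations yield on average sum_{i=1}^{n} alpha/(alpha + i - 1) clusters, which grows roughly like alpha * log(n). A minimal sketch:

```python
import numpy as np

def expected_num_clusters(n, alpha):
    """Prior expected number of clusters for n observations under a Dirichlet
    process with concentration parameter alpha (Chinese restaurant process)."""
    i = np.arange(n)
    return float(np.sum(alpha / (alpha + i)))  # sum_{i=1}^{n} alpha / (alpha + i - 1)

# small concentration -> few clusters a priori; large concentration -> many
for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha:>5}: E[#clusters | n=1000] ~ {expected_num_clusters(1000, alpha):.1f}")
```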
Determining the concentration parameter is critical, since it adjusts the a priori cluster expectation, yet typical approaches for specifying this parameter are rather cavalier. In this work, we propose a new method for automatically and adaptively determining this parameter, which directly calibrates distances between clusters through an explicit link function within the DP. Furthermore, we extend our method to mixture models with Nested Dirichlet Processes (NDPs), which cluster multilevel data and depend on the specification of a vector of concentration parameters. We detail how to incorporate our method into Markov chain Monte Carlo algorithms, and we illustrate our findings through a series of comparative simulation studies and applications. / Ph. D. / We live in the era of Big Data, with significantly richer computational resources than two decades ago. The confluence of computational resources and large volumes of data has increased researchers' desire to develop efficient Markov Chain Monte Carlo (MCMC) algorithms for models such as the Dirichlet process mixture model. The Dirichlet process mixture model has become popular for clustering analyses because it provides a flexible, generative model that automatically determines both the cluster structure and the number of clusters. However, a clustering solution inferred from a Dirichlet process mixture model is affected by its hyperparameters: the base measure and the concentration parameter.
Determining the concentration parameter is critical, since it adjusts the a priori cluster expectation, yet typical approaches for specifying this parameter are rather cavalier. In this work, we propose a new method for automatically and adaptively determining this parameter, which directly calibrates distances between clusters. Furthermore, we extend our method to mixture models with Nested Dirichlet Processes (NDPs), which cluster multilevel data and depend on the specification of a vector of concentration parameters. We present simulation studies that show the performance of the developed methods, along with applications such as modeling timelines for building construction data and clustering U.S. median household income data.
This work makes three contributions: 1) the developed methods are straightforward to incorporate into any Markov chain Monte Carlo algorithm; 2) when estimating the concentration parameter, the methods calibrate against a probability distance between clusters and make full use of the information carried by the observations in the identified clusters; and 3) the methods can be extended to other variants of the Dirichlet process, for instance, hierarchical Dirichlet processes or dependent Dirichlet processes.
|
2 |
Nonparametric Bayesian analysis of some clustering problems
Ray, Shubhankar 30 October 2006 (has links)
Nonparametric Bayesian models have been researched extensively in the past 10 years
following the work of Escobar and West (1995) on sampling schemes for Dirichlet processes.
The infinite mixture representation of the Dirichlet process makes it useful
for clustering problems where the number of clusters is unknown. We develop nonparametric
Bayesian models for two different clustering problems, namely functional
and graphical clustering.
We propose a nonparametric Bayes wavelet model for clustering of functional or
longitudinal data. The wavelet modelling is aimed at the resolution of global and
local features during clustering. The model also allows the elicitation of prior belief
about the regularity of the functions and has the ability to adapt to a wide range
of functional regularity. Posterior inference is carried out by Gibbs sampling with
conjugate priors for fast computation. We use simulated as well as real datasets to
illustrate the suitability of the approach over other alternatives.
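As an aside on the wavelet representation described above, the sketch below (a simplified preprocessing step, not the thesis's Bayesian model) replaces each curve, observed on a common dyadic grid, with its discrete wavelet coefficients, which separate global (coarse-scale) and local (fine-scale) features before any clustering model is applied. The use of the PyWavelets package, and the choice of the db4 wavelet and decomposition level, are assumptions made for illustration.

```python
import numpy as np
import pywt

def wavelet_features(curves, wavelet="db4", level=4):
    """Represent each observed curve by its discrete wavelet coefficients,
    concatenating the approximation (global) and detail (local) coefficients."""
    feats = []
    for y in curves:
        coeffs = pywt.wavedec(y, wavelet, level=level)  # [approx, detail_L, ..., detail_1]
        feats.append(np.concatenate(coeffs))
    return np.asarray(feats)

# toy usage: 20 noisy curves sampled on a dyadic grid; the rows of X can then be clustered
t = np.linspace(0, 1, 128)
curves = [np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(t.size) for _ in range(20)]
X = wavelet_features(curves)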
The functional clustering model is extended to analyze splice microarray data.
New microarray technologies probe consecutive segments along genes to observe alternative
splicing (AS) mechanisms that produce multiple proteins from a single gene.
Clues regarding the number of splice forms can be obtained by clustering the functional
expression profiles from different tissues. The analysis was carried out on the Rosetta dataset (Johnson et al., 2003) to obtain a splice variant by tissue distribution
for all the 10,000 genes. We were able to identify a number of splice forms that appear
to be unique to cancer.
We propose a Bayesian model for partitioning graphs depicting dependencies
in a collection of objects. After suitable transformations and modelling techniques,
the problem of graph cutting can be approached by nonparametric Bayes clustering.
We draw motivation from recent work (Dhillon, 2001) showing the equivalence of
kernel k-means clustering and certain graph-cutting algorithms. It is shown that
loss functions similar to the kernel k-means objective arise naturally in this model,
and that minimizing the associated posterior risk yields an effective graph-cutting strategy.
We present here results from the analysis of two microarray datasets, namely the
melanoma dataset (Bittner et al., 2000) and the sarcoma dataset (Nykter et al.,
2006).
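For context on the Dhillon (2001) connection invoked above, graph-cut objectives can be optimized by kernel k-means run with a suitably constructed kernel matrix. The sketch below is plain (unweighted) kernel k-means on a generic kernel matrix; it omits the specific kernel construction and point weights that the exact equivalence requires, and it is a deterministic baseline rather than the Bayesian model of the abstract.

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=100, seed=0):
    """Minimal kernel k-means on a precomputed kernel (or graph affinity) matrix K.
    Distances to cluster centroids use only kernel evaluations:
    ||phi(x_i) - m_c||^2 = K_ii - (2/|c|) sum_{j in c} K_ij + (1/|c|^2) sum_{j,l in c} K_jl."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if idx.size == 0:
                dist[:, c] = np.inf          # empty cluster: never the closest
                continue
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

# toy usage: RBF kernel on random 2-D points standing in for a graph affinity matrix
X = np.random.default_rng(0).normal(size=(60, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
labels = kernel_kmeans(K, k=3)
```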
|
3 |
Discovering interpretable topics in free-style text: diagnostics, rare topics, and topic supervision
Zheng, Ning 07 January 2008 (has links)
No description available.
|
4 |
Unveiling Covariate Inclusion Structures In Economic Growth Regressions Using Latent Class Analysis
Crespo Cuaresma, Jesus, Grün, Bettina, Hofmarcher, Paul, Humer, Stefan, Moser, Mathias January 2016 (has links) (PDF)
We propose the use of Latent Class Analysis methods to analyze the covariate inclusion patterns across specifications resulting from Bayesian model averaging exercises. Using Dirichlet Process clustering, we are able to identify and describe dependency structures among variables in terms of inclusion in the specifications that compose the model space. We apply the method to two datasets of potential determinants of economic growth. Clustering the posterior covariate inclusion structure of the model space formed by linear regression models reveals interesting patterns of complementarity and substitutability across economic growth determinants.
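As an illustration of the kind of latent class structure described above (a simplified sketch, not the authors' implementation), the covariate inclusion indicators across specifications form a binary matrix, and a basic latent class model for such data is a finite mixture of independent Bernoullis, fit here by EM. The number of classes K, the smoothing constants, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def bernoulli_mixture_em(X, K, n_iter=200, seed=0):
    """EM for a mixture of independent Bernoullis (a basic latent class model)
    on a binary matrix X of shape (n_specifications, n_covariates)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                  # class weights
    theta = rng.uniform(0.25, 0.75, (K, d))   # per-class inclusion probabilities
    for _ in range(n_iter):
        # E-step: responsibilities via log-sum-exp for numerical stability
        log_p = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: light smoothing keeps probabilities away from 0 and 1
        Nk = resp.sum(axis=0)
        pi = (Nk + 1e-3) / (n + K * 1e-3)
        theta = (resp.T @ X + 1e-3) / (Nk[:, None] + 2e-3)
    return pi, theta, resp

# toy usage: 200 specifications over 12 covariates, 3 assumed latent classes
X = (np.random.default_rng(1).random((200, 12)) < 0.4).astype(float)
weights, inclusion_probs, resp = bernoulli_mixture_em(X, K=3)
```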
|
5 |
Benchmark estimation for Markov Chain Monte Carlo samplers
Guha, Subharup 18 June 2004 (has links)
No description available.
|
6 |
Um modelo Bayesiano semi-paramétrico para o monitoramento "on-line" de qualidade de Taguchi para atributos / A semi-parametric model for Taguchi's On-Line Quality-Monitoring Procedure for Attributes
Tsunemi, Miriam Harumi 27 April 2009 (has links)
In this work, we propose an alternative model for Taguchi's on-line quality-monitoring procedure for attributes under a Bayesian nonparametric framework. The model applies to production processes in which the sequence of defective fractions increases gradually over a cycle (a common situation when, for example, equipment deteriorates little by little), unlike the models of Taguchi, Nayebpour and Woodall, and Nandi and Sreehari (1997), which allow the defective fraction to take at most three values, or those of Nandi and Sreehari (1999) and Trindade, Ho and Quinino (2007), which consider only simple deterioration functions. The development builds on the work of Ferguson and Antoniak to obtain the posterior distribution of an unknown measure P, associated with an unknown distribution function F that represents the sequence of defective fractions over a cycle, under a mixture of Dirichlet processes prior. The results are applied to the estimation of the distribution function F, and the Bayes estimates are analysed through some particular cases.
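As background for the Ferguson and Antoniak machinery cited above (the thesis works with the richer mixture-of-Dirichlet-processes prior; this sketch covers only the simpler single-DP case), the posterior mean of an unknown distribution function F under a DP(alpha, G0) prior is a weighted average of the prior guess G0 and the empirical distribution function. The uniform prior guess and the toy data below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import uniform

def dp_posterior_mean_cdf(x_obs, grid, alpha, G0_cdf):
    """Posterior mean of an unknown distribution function F under a single
    Dirichlet process prior DP(alpha, G0) (Ferguson, 1973):
        E[F(t) | data] = (alpha * G0(t) + n * F_n(t)) / (alpha + n)."""
    x_obs = np.asarray(x_obs)
    n = x_obs.size
    F_n = np.array([(x_obs <= t).mean() for t in grid])  # empirical CDF on the grid
    return (alpha * G0_cdf(grid) + n * F_n) / (alpha + n)

# toy usage: observed defective fractions over a cycle, uniform prior guess on [0, 1]
fractions = np.array([0.01, 0.02, 0.02, 0.05, 0.08])
grid = np.linspace(0, 1, 101)
F_hat = dp_posterior_mean_cdf(fractions, grid, alpha=2.0, G0_cdf=uniform(0, 1).cdf)
```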
|
7 |
Bayesian models for DNA microarray data analysis
Lee, Kyeong Eun 29 August 2005 (has links)
Selection of significant genes via expression patterns is important in a microarray problem. Owing to the small sample size and the large number of variables (genes), the selection process can be unstable. This research proposes a hierarchical Bayesian model for gene (variable) selection. We employ latent variables in a regression setting and use a Bayesian mixture prior to perform the variable selection. Due to the binary nature of the data, the posterior distributions of the parameters are not in explicit form, and we use a combination of truncated sampling and Markov chain Monte Carlo (MCMC) computation techniques to simulate the posterior distributions. The Bayesian model is flexible enough to identify the significant genes as well as to perform future predictions. The method is applied to cancer classification via cDNA microarrays. In particular, the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the method is used to identify the set of significant genes that classify BRCA1 versus the others. Microarray data can also be used in survival models. We address how to reduce the dimension when building the model by selecting significant genes, as well as how to assess the estimated survival curves. Additionally, we consider the well-known Weibull regression and semiparametric proportional hazards (PH) models for survival analysis. With microarray data, we need to consider the case where the number of covariates p exceeds the number of samples n. Specifically, for a given vector of response values, which are times to event (death or censoring times), and p gene expressions (covariates), we address how to reduce the dimension by selecting the genes that control the survival time. This approach enables us to estimate the survival curve when n << p. In our approach, rather than fixing the number of selected genes, we assign a prior distribution to this number. This creates additional flexibility by allowing the imposition of constraints, such as bounding the dimension via a prior, which in effect works as a penalty. To implement our methodology, we use a Markov chain Monte Carlo (MCMC) method. We demonstrate the use of the methodology with (a) diffuse large B-cell lymphoma (DLBCL) complementary DNA (cDNA) data and (b) breast carcinoma data. Lastly, we propose a mixture of Dirichlet process models using the discrete wavelet transform for curve clustering. In order to characterize these time-course gene expressions, we consider them as trajectory functions of time and gene-specific parameters and obtain their wavelet coefficients by a discrete wavelet transform. We then build cluster curves using a mixture of Dirichlet process priors.
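To make the phrase "truncated sampling" concrete, the sketch below is a generic Albert-Chib-style Gibbs sampler for Bayesian probit regression, not the thesis's full gene-selection sampler: the binary response is augmented with latent Gaussian variables whose full conditionals are truncated normals, after which the regression coefficients have a Gaussian full conditional. The Gaussian prior variance tau2 and the absence of a selection prior are simplifying assumptions.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=1000, tau2=100.0, seed=0):
    """Minimal Albert-Chib Gibbs sampler for Bayesian probit regression
    with a N(0, tau2 I) prior on the coefficients."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)  # posterior covariance (fixed design)
    draws = []
    for _ in range(n_iter):
        # 1. Sample latent z_i from normals truncated to agree with the observed label
        mu = X @ beta
        lo = np.where(y == 1, -mu, -np.inf)        # z > 0 when y = 1
        hi = np.where(y == 1, np.inf, -mu)         # z < 0 when y = 0
        z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # 2. Sample beta from its Gaussian full conditional
        m = V @ (X.T @ z)
        beta = rng.multivariate_normal(m, V)
        draws.append(beta.copy())
    return np.array(draws)
```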
|
8 |
Non-Parametric Clustering of Multivariate Count Data
Tekumalla, Lavanya Sita January 2017 (has links) (PDF)
The focus of this thesis is models for non-parametric clustering of multivariate count data. While there has been significant work in Bayesian non-parametric modelling in the last decade, in the context of mixture models for real-valued data and some forms of discrete data such as multinomial mixtures, there has been much less work on non-parametric clustering of multivariate count data. The main challenges in clustering multivariate counts include choosing a suitable multivariate distribution that adequately captures the properties of the data, for instance handling over-dispersed or sparse multivariate data, while at the same time leveraging the inherent dependency structure between dimensions and across instances to obtain meaningful clusters.
As the first contribution, this thesis explores extensions to the Multivariate Poisson distribution, proposing efficient algorithms for non-parametric clustering of multivariate count data. While the Poisson is the most popular distribution for count modelling, the Multivariate Poisson often leads to intractable inference and a suboptimal fit of the data. To address this, we introduce a family of models based on the Sparse Multivariate Poisson that exploit the inherent sparsity in multivariate data, reducing the number of latent variables in the formulation of the Multivariate Poisson and leading to a better fit and more efficient inference. We explore Dirichlet process mixture model extensions and temporal non-parametric extensions to models based on the Sparse Multivariate Poisson for practical use of Poisson-based models for non-parametric clustering of multivariate counts in real-world applications. As a second contribution, this thesis addresses moving beyond the limitations of Poisson-based models for non-parametric clustering, for instance in handling over-dispersed data or data with negative correlations. We explore, for the first time, marginal independent inference techniques based on the Gaussian copula for multivariate count data in the Dirichlet process mixture model setting. This enables non-parametric clustering of multivariate counts without the limiting assumptions that usually restrict the marginals to a particular family, such as the Poisson or the negative binomial. This inference technique also works for mixed data (combinations of count, binary, and continuous data), enabling Bayesian non-parametric modelling to be used for a wide variety of data types. As the third contribution, this thesis addresses modelling a wide range of more complex dependencies, such as asymmetric and tail dependencies, during non-parametric clustering of multivariate count data with vine copula based Dirichlet process mixtures. While vine copula inference has been well explored for continuous data, it is still a topic of active research for multivariate counts and mixed multivariate data. Inference for multivariate counts and mixed data is a hard problem owing to ties that arise with discrete marginals. An efficient marginal independent inference approach based on the extended rank likelihood, building on recent work in the statistics literature, is proposed in this thesis, extending the use of vines to multivariate counts and mixed data in practical clustering scenarios.
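As a deliberately simplified baseline for the kind of model the first contribution builds on, the sketch below runs collapsed Gibbs sampling for a Dirichlet process mixture of independent Poisson likelihoods with Gamma(a, b) priors, where the Chinese-restaurant-process predictive for each cluster is a product of negative binomials. The Sparse Multivariate Poisson construction and the copula-based extensions of the thesis are not reproduced here; the independence assumption across dimensions and the hyperparameter values are assumptions for illustration.

```python
import numpy as np
from scipy.special import gammaln

def log_predictive(x, s, m, a=1.0, b=1.0):
    """Log predictive of count vector x for a cluster currently holding m points with
    per-dimension count sums s, under independent Poisson likelihoods with Gamma(a, b)
    priors (the marginal predictive is a product of negative binomials)."""
    r = a + s
    return float(np.sum(gammaln(x + r) - gammaln(r) - gammaln(x + 1.0)
                        + r * (np.log(b + m) - np.log(b + m + 1.0))
                        - x * np.log(b + m + 1.0)))

def crp_gibbs_poisson(X, alpha=1.0, n_iter=50, seed=0):
    """Collapsed Gibbs sampler for a DP mixture of independent Poissons (CRP form)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = np.zeros(n, dtype=int)                         # start with one big cluster
    sums, sizes = {0: X.sum(axis=0).astype(float)}, {0: n}
    for _ in range(n_iter):
        for i in range(n):
            k = z[i]
            sums[k] -= X[i]; sizes[k] -= 1             # remove point i from its cluster
            if sizes[k] == 0:
                del sums[k], sizes[k]
            keys = list(sums)
            logw = [np.log(sizes[k]) + log_predictive(X[i], sums[k], sizes[k]) for k in keys]
            logw.append(np.log(alpha) + log_predictive(X[i], np.zeros(d), 0))
            logw = np.array(logw)
            w = np.exp(logw - logw.max()); w /= w.sum()
            j = rng.choice(len(w), p=w)
            k_new = keys[j] if j < len(keys) else (max(sums, default=-1) + 1)
            z[i] = k_new
            sums[k_new] = sums.get(k_new, np.zeros(d)) + X[i]
            sizes[k_new] = sizes.get(k_new, 0) + 1
    return z
```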
This thesis also explores a novel systems application, Bulk Cache Preloading, by analysing I/O traces through predictive models for temporal non-parametric clustering of multivariate count data. State-of-the-art techniques in the caching domain are limited to exploiting short-range correlations in memory accesses at millisecond granularity or smaller and cannot leverage long-range correlations in traces. We explore, for the first time, Bulk Cache Preloading, the process of proactively predicting data to load into the cache minutes or hours before the actual request from the application, by leveraging longer-range correlations at the granularity of minutes or hours. The relaxed timing constraints enable the development of machine learning techniques tailored for caching. Our approach involves a data aggregation process that converts I/O traces into a temporal sequence of multivariate counts, which we analyse with the temporal non-parametric clustering models proposed in this thesis. While the focus of this thesis is models for non-parametric clustering of discrete data, particularly multivariate counts, we also hope our work on bulk cache preloading paves the way for more interdisciplinary research on using data mining techniques in the systems domain.
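For the data aggregation step mentioned above, a hedged sketch of one way to turn a raw block I/O trace into a temporal sequence of multivariate counts: the column names (timestamp, block_address), the window length, and the number of address bins are all assumptions made for illustration, not the thesis's actual trace schema.

```python
import pandas as pd

def trace_to_counts(trace_csv, freq="1min", n_bins=64):
    """Aggregate a raw block I/O trace into per-window request counts:
    rows = time windows, columns = block-address bins."""
    df = pd.read_csv(trace_csv, parse_dates=["timestamp"])          # assumed columns
    df["bin"] = pd.cut(df["block_address"], bins=n_bins, labels=False)
    counts = (df.groupby([pd.Grouper(key="timestamp", freq=freq), "bin"])
                .size()
                .unstack(fill_value=0))
    return counts
```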
As an additional contribution, this thesis addresses multi-level non-parametric admixture modelling for discrete data in the form of grouped categorical data, such as document collections. Non-parametric clustering for topic modelling in document collections, where a document is associated with an unknown number of semantic themes or topics, is well explored with admixture models such as the Hierarchical Dirichlet Process. However, there exist scenarios where a document needs to be associated with themes at multiple levels, where each theme is itself an admixture over themes at the previous level, motivating the need for multilevel admixtures. Consider the example of non-parametric entity-topic modelling, where entities and topics are learned simultaneously from document collections. This can be realized by modelling a document as an admixture over entities, while entities are themselves modeled as admixtures over topics. We propose the nested Hierarchical Dirichlet Process to address this gap and apply a two-level version of our model to automatically learn author entities and topics from research corpora.
|