Global ETD Search

11	Topics in One-Way Supervised Biclustering Using Gaussian Mixture Models Wong, Monica January 2017 Cluster analysis identifies homogeneous groups that are relevant within a population. In model-based clustering, group membership is estimated using a parametric finite mixture model, commonly the mathematically tractable Gaussian mixture model. One-way clustering methods can be restrictive in cases where there are suspected relationships between the variables in each component, leading to the idea of biclustering, which refers to clustering both observations and variables simultaneously. When the relationships between the variables are known, biclustering becomes one-way supervised. To this end, this thesis focuses on a novel one-way supervised biclustering family based on the Gaussian mixture model. In cases where biclustering may be overestimating the number of components in the data, a model averaging technique utilizing Occam's window is applied to produce better clustering results. Automatic outlier detection is introduced into the biclustering family using mixtures of contaminated Gaussian mixture models. Algorithms for model-fitting and parameter estimation are presented for the techniques described in this thesis, and simulation and real data studies are used to assess their performance. / Thesis / Doctor of Philosophy (PhD) Biclustering One-way supervision Finite mixture models Model-based clustering
12	Clustering Response-Stressor Relationships in Ecological Studies Gao, Feng 31 July 2008 This research is motivated by an issue frequently encountered in water quality monitoring and ecological assessment. One concern for researchers and watershed resource managers is how the biological community in a watershed is affected by human activities. The conventional single-model approach based on regression and logistic regression usually fails to adequately model the relationship between biological responses and environmental stressors, since the study samples are collected over a large spatial region and the response-stressor relationships are usually weak in this situation. In this dissertation, we propose two alternative modeling approaches that partition the whole region of study into disjoint subregions and model the response-stressor relationships within subregions simultaneously. In our examples, these modeling approaches found stronger relationships within subregions and should help resource managers improve impairment assessment and decision making. The first approach is an adjusted Bayesian classification and regression tree (ABCART). It is based on the Bayesian classification and regression tree approach (BCART) and is modified to accommodate spatial partitions in ecological studies. The second approach is a Voronoi diagram based partition approach. This approach uses the Voronoi diagram technique to randomly partition the whole region into subregions with a predetermined minimum sample size. The optimal partition/cluster is selected by Monte Carlo simulation. We propose several model selection criteria for optimal partitioning and modeling according to the nature of the study and extend the approach to multivariate analysis to find the underlying structure of response-stressor relationships.
We also propose a multivariate hotspot detection approach (MHDM) to find the region where the response-stressor relationship is the strongest according to an R-square-like criterion. Several sets of ecological data are studied in this dissertation to illustrate the implementation of the above partition modeling approaches. The findings from these studies are consistent with other studies. / Ph. D. model selection CCA RDA BCART Voronoi diagrams Model based clustering
13	Aspect Mining Using Model-Based Clustering Rand McFadden, Renata 01 January 2011 Legacy systems contain critical and complex business code that has been in use for a long time. This code is difficult to understand, maintain, and evolve, in large part due to crosscutting concerns: software system features, such as persistence, logging, and error handling, whose implementation is spread across multiple modules. Aspect-oriented techniques separate crosscutting concerns from the base code, using separate modules called aspects and, thus, simplifying the legacy code. Aspect mining techniques identify aspect candidates so that the legacy code can be refactored into aspects. This study investigated an automated aspect mining method in which a vector-space model clustering approach was used with model-based clustering. The vector-space model clustering approach has been researched for aspect mining using a number of different heuristic clustering methods and producing mixed results. Prior to this study, this model had not been researched with model-based algorithms, even though they have grown in popularity because they lend themselves to statistical analysis and show results that are as good as or better than heuristic clustering methods. This study investigated the effectiveness of model-based clustering for identifying aspects when compared against heuristic methods, such as k-means clustering and agglomerative hierarchical clustering, using six different vector-space models. The study's results indicated that model-based clustering can, in fact, be more effective than heuristic methods and showed good promise for aspect mining. In general, model-based algorithms performed better in not spreading the methods of the concerns across the multiple clusters but did not perform as well in not mixing multiple concerns in the same cluster.
Model-based algorithms were also significantly better at partitioning the data such that, given an ordered list of clusters, fewer clusters and methods would need to be analyzed to find all the concerns. In addition, model-based algorithms automatically determined the optimal number of clusters, which was a great advantage over heuristic-based algorithms. Lastly, the study found that the new vector-space models performed better, relative to aspect mining, than previously defined vector-space models. aspect mining aspect-oriented programming crosscutting concerns model-based clustering vector-space model Computer Sciences
14	Mixture model cluster analysis under different covariance structures using information complexity Erar, Bahar 01 August 2011 In this thesis, a mixture-model cluster analysis technique under different covariance structures of the component densities is developed and presented. The technique captures the compactness, orientation, shape, and volume of component clusters in one expert system that handles Gaussian high-dimensional heterogeneous data sets, adding flexibility to currently practiced cluster analysis techniques. Two approaches to parameter estimation are considered and compared: one using the Expectation-Maximization (EM) algorithm and another following a Bayesian framework using the Gibbs sampler. We develop and score several forms of the ICOMP criterion of Bozdogan (1994, 2004) as our fitness function to choose the number of component clusters, to choose the correct component covariance matrix structure among nine candidate covariance structures, and to select the optimal parameters and the best-fitting mixture model. We demonstrate our approach on simulated data sets and a large real data set, focusing on early detection of breast cancer. We show that our approach improves on existing methods in terms of the probability of classification error. Gaussian mixture model-based clustering information complexity Gibbs sampler eigenvalue decomposition Multivariate Analysis Statistical Models
15	Model-Based Clustering for Gene Expression and Change Patterns Jan, Yi-An 29 July 2011 It is important to study gene expression and change patterns over a time period because biologically related gene groups are likely to share similar patterns. In this study, similar gene expression and change patterns are found via a model-based clustering method. Fourier and wavelet coefficients of gene expression data are used as the clustering variables. A two-stage model-based method is proposed for stepwise clustering of expression and change patterns. A simulation study is performed to investigate the effectiveness of the proposed methodology. Yeast cell cycle data are analyzed. Gene expression Model-based clustering Wavelet coefficients Fourier coefficients Yeast cell cycle data
16	Spatial stochastic processes for yield and reliability management with applications to nano electronics Hwang, Jung Yoon 17 February 2005 This study uses the spatial features of defects on the wafers to examine the detection and control of process variation in semiconductor fabrication. It applies spatial stochastic processes to semiconductor yield modeling and the extrinsic reliability estimation model. New yield models of integrated circuits based on the spatial point process are established. The defect density, which varies according to location on the wafer, is modeled by the spatial nonhomogeneous Poisson process. To capture the variations in defect patterns between wafers, a random coefficient model and model-based clustering are applied. Model-based clustering is also applied to fabrication process control to detect defect clusters that are generated by assignable causes. An extrinsic reliability model using defect data and a statistical defect growth model are developed based on the new yield model. yield modeling reliability integrated circuit spatial stochastic processes model-based clustering
17	Mixture models for ROC curve and spatio-temporal clustering Cheam, Amay SM January 2016 Finite mixture models have had a profound impact on the history of statistics, contributing to modelling heterogeneous populations, generalizing distributional assumptions, and lately, presenting a convenient framework for classification and clustering. A novel approach, via Gaussian mixture distribution, is introduced for modelling receiver operating characteristic curves. The absence of a closed-form for a functional form leads to employing the Monte Carlo method. This approach performs excellently compared to the existing methods when applied to real data. In practice, the data are often non-normal, atypical, or skewed. It is apparent that non-Gaussian distributions should be introduced in order to better fit these data. Two non-Gaussian mixtures, i.e., t distribution and skew t distribution, are proposed and applied to real data. A novel mixture is presented to cluster spatial and temporal data. The proposed model defines each mixture component as a mixture of autoregressive polynomial with logistic links. The new model performs significantly better than the most well-known model-based clustering techniques when applied to real data. / Thesis / Doctor of Philosophy (PhD) Finite mixture models ROC curve Spatio-temporal data Functional data Model-based clustering EM algorithm
18	Extending Growth Mixture Models and Handling Missing Values via Mixtures of Non-Elliptical Distributions Wei, Yuhong January 2017 Growth mixture models (GMMs) are used to model intra-individual change and inter-individual differences in change and to detect underlying group structure in longitudinal studies. Regularly, these models are fitted under the assumption of normality, an assumption that is frequently invalid. To this end, this thesis focuses on the development of novel non-elliptical growth mixture models to better fit real data. Two non-elliptical growth mixture models, via the multivariate skew-t distribution and the generalized hyperbolic distribution, are developed and applied to simulated and real data. Furthermore, these two non-elliptical growth mixture models are extended to accommodate missing values, which are near-ubiquitous in real data. Recently, finite mixtures of non-elliptical distributions have flourished and facilitated the flexible clustering of data featuring longer tails and asymmetry. However, in practice, real data often have missing values, and so work in this direction is also pursued. A novel approach, via mixtures of the generalized hyperbolic distribution and mixtures of the multivariate skew-t distributions, is presented to handle missing values in the mixture model-based clustering context. To increase parsimony, families of mixture models have been developed by imposing constraints on the component scale matrices whenever missing data occur. Next, a mixture of generalized hyperbolic factor analyzers model is also proposed to cluster high-dimensional data with different patterns of missing values. Two missingness indicator matrices are also introduced to ease the computational burden. The algorithms used for parameter estimation are presented, and the performance of the methods is illustrated on simulated and real data.
/ Thesis / Doctor of Philosophy (PhD) Growth Mixture Model Model-Based Clustering EM Algorithm Missing Data Finite Mixture Models
19	Statistical computation and inference for functional data analysis Jiang, Huijing 09 November 2010 My doctoral research dissertation focuses on two aspects of functional data analysis (FDA): FDA under spatial interdependence and FDA for multi-level data. The first part of my thesis focuses on developing modeling and inference procedures for functional data under spatial dependence. The methodology introduced in this part is motivated by a research study on inequities in accessibility to financial services. The first research problem in this part is concerned with a novel model-based method for clustering random time functions which are spatially interdependent. A cluster consists of time functions which are similar in shape. The time functions are decomposed into spatial global and time-dependent cluster effects using a semi-parametric model. We also assume that the clustering membership is a realization from a Markov random field. Under these model assumptions, we borrow information across curves from nearby locations, resulting in enhanced estimation accuracy of the cluster effects and of the cluster membership. In a simulation study, we assess the estimation accuracy of our clustering algorithm under a series of settings: small number of time points, high noise level and varying dependence structures. Over all simulation settings, the spatial-functional clustering method outperforms existing model-based clustering methods. In the case study presented in this project, we estimate and classify service accessibility patterns varying over a large geographic area (California and Georgia) and over a period of 15 years. The focus of this study is on financial services, but the method generally applies to any other service operation. The second research project of this part studies an association analysis of space-time varying processes, which is rigorous, computationally feasible and implementable with standard software.
We introduce general measures to model different aspects of the temporal and spatial association between processes varying in space and time. Using a nonparametric spatiotemporal model, we show that the proposed association estimators are asymptotically unbiased and consistent. We complement the point association estimates with simultaneous confidence bands to assess the uncertainty in the point estimates. In a simulation study, we evaluate the accuracy of the association estimates with respect to the sample size as well as the coverage of the confidence bands. In the case study in this project, we investigate the association between service accessibility and income level. The primary objective of this association analysis is to assess whether there are significant changes in the income-driven equity of financial service accessibility over time and to identify potential under-served markets. The second part of the thesis discusses novel statistical methodology for analyzing multilevel functional data including a clustering method based on a functional ANOVA model and a spatio-temporal model for functional data with a nested hierarchical structure. In this part, I introduce and compare a series of clustering approaches for multilevel functional data. For brevity, I present the clustering methods for two-level data: multiple samples of random functions, each sample corresponding to a case and each random function within a sample/case corresponding to a measurement type. A cluster consists of cases which have similar within-case means (level-1 clustering) or similar between-case means (level-2 clustering). Our primary focus is to compare a model-based clustering method with more straightforward hard clustering methods. The clustering model is based on a multilevel functional principal component analysis. In a simulation study, we assess the estimation accuracy of our clustering algorithm under a series of settings: small vs. 
moderate number of time points, high noise level and small number of measurement types. We demonstrate the applicability of the clustering analysis to a real data set consisting of time-varying sales for multiple products sold by a large retailer in the U.S. My ongoing research work in multilevel functional data analysis is developing a statistical model for estimating temporal and spatial associations of a series of time-varying variables with an intrinsic nested hierarchical structure. This work has great potential in many real applications where the data are areal data collected from different data sources and over geographic regions of different spatial resolution. Service distribution equity Multi-level data Model-based clustering Spatio-temporal Functional data analysis Multilevel models (Statistics) Markov random fields
20	Mixture Model Averaging for Clustering Wei, Yuhong 30 April 2012 Model-based clustering is based on a finite mixture of distributions, where each mixture component corresponds to a different group, cluster, subpopulation, or part thereof. Gaussian mixture distributions are most often used. Criteria commonly used in choosing the number of components in a finite mixture model include the Akaike information criterion, Bayesian information criterion, and the integrated completed likelihood. The best model is taken to be the one with the highest (or lowest) value of a given criterion. This approach is not reasonable because it is practically impossible to decide what to do when the difference between the best values of two models under such a criterion is ‘small’. Furthermore, it is not clear how such values should be calibrated in different situations with respect to sample size and random variables in the model, nor does it take into account the magnitude of the likelihood. It is, therefore, worthwhile considering a model-averaging approach. We consider an averaging of the top M mixture models and consider applications in clustering and classification. In the course of model averaging, the top M models often have different numbers of mixture components. Therefore, we propose a method of merging Gaussian mixture components in order to get the same number of clusters for the top M models. The idea is to list all the combinations of components for merging, and then choose the combination corresponding to the biggest adjusted Rand index (ARI) with the ‘reference model’. A weight is defined to quantify the importance of each model. The effectiveness of mixture model averaging for clustering is demonstrated on simulated and real data using the pgmm package, where the ARI from mixture model averaging is greater than that of the corresponding best model.
The attractive feature of mixture model averaging is its computational efficiency; it only uses the conditional membership probabilities. Herein, Gaussian mixture models are used but the approach could be applied effectively without modification to other mixture models. / Paul McNicholas mclust merging mixture component mixture model model averaging Model selection model-based clustering parameter estimation pgmm adjusted Rand index
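The merging step described in the abstract of entry 20 (list every way of merging components, then keep the one with the largest ARI against a reference partition) can be sketched as below. This is only an illustration under stated assumptions, not the thesis's implementation: it works on hard cluster labels rather than conditional membership probabilities, the function names `adjusted_rand_index` and `best_merge` are hypothetical, and brute-force enumeration is practical only for a handful of components.

```python
from collections import Counter
from itertools import product
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two hard labelings (at least two items)."""
    n = len(a)
    # Count pairs of items placed together in both, in a, and in b.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings trivial
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


def best_merge(labels, reference, k):
    """Merge the components in `labels` into k groups (hypothetical helper).

    Tries every assignment of components to k groups and returns the merged
    labeling with the largest ARI against `reference`, plus that ARI.
    """
    components = sorted(set(labels))
    best_score, best_labels = float("-inf"), None
    for assignment in product(range(k), repeat=len(components)):
        if len(set(assignment)) != k:  # skip merges that leave a group empty
            continue
        mapping = dict(zip(components, assignment))
        merged = [mapping[x] for x in labels]
        score = adjusted_rand_index(merged, reference)
        if score > best_score:
            best_score, best_labels = score, merged
    return best_labels, best_score


# Toy example: a 4-component labeling merged down to the reference's 2 clusters.
four_components = [0, 0, 1, 1, 2, 2, 3, 3]
reference_model = [0, 0, 0, 0, 1, 1, 1, 1]
merged_labels, ari = best_merge(four_components, reference_model, k=2)
print(merged_labels, ari)
```

Because the ARI is invariant to relabeling, the enumeration visits each distinct merge several times; a set-partition generator would avoid the repeats at the cost of a longer example.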

Search results