Global ETD Search

1	Přístupy k shlukování funkčních dat / Approaches to Functional Data Clustering Pešout, Pavel January 2007 (has links) Classification is a very common task in information processing and important problem in many sectors of science and industry. In the case of data measured as a function of a dependent variable such as time, the most used algorithms may not pattern each of the individual shapes properly, because they are interested only in the choiced measurements. For the reason, the presented paper focuses on the specific techniques that directly address the curve clustering problem and classifying new individuals. The main goal of this work is to develop alternative methodologies through the extension to various statistical approaches, consolidate already established algorithms, expose their modified forms fitted to demands of clustering issue and compare some efficient curve clustering methods thanks to reported extensive simulated data experiments. Last but not least is made, for the sake of executed experiments, comprehensive confrontation of effectual utility. Proposed clustering algorithms are based on two principles. Firstly, it is presumed that the set of trajectories may be probabilistic modelled as sequences of points generated from a finite mixture model consisting of regression components and hence the density-based clustering methods using the Maximum Likehood Estimation are investigated to recognize the most homogenous partitioning. Attention is paid to both the Maximum Likehood Approach, which assumes the cluster memberships to be some of the model parameters, and the probabilistic model with the iterative Expectation-Maximization algorithm, that assumes them to be random variables. To deal with the hidden data problem both Gaussian and less conventional gamma mixtures are comprehended with arranging for use in two dimensions. To cope with data with high variability within each subpopulation it is introduced two-level random effects regression mixture with the ability to let an individual vary from the template for its group. Secondly, it is taken advantage of well known K-Means algorithm applied to the estimated regression coefficients, though. The task of the optimal data fitting is devoted, because K-Means is not invariant to linear transformations. In order to overcome this problem it is suggested integrating clustering issue with the Markov Chain Monte Carlo approaches. What is more, this paper is concerned in functional discriminant analysis including linear and quadratic scores and their modified probabilistic forms by using random mixtures. Alike in K-Means it is shown how to apply Fisher's method of canonical scores to the regression coefficients. Experiments of simulated datasets are made that demonstrate the performance of all mentioned methods and enable to choose those with the most result and time efficiency. Considerable boon is the facture of new advisable application advances. Implementation is processed in Mathematica 4.0. Finally, the possibilities offered by the development of curve clustering algorithms in vast research areas of modern science are examined, like neurology, genome studies, speech and image recognition systems, and future investigation with incorporation with ubiquitous computing is not forbidden. Utility in economy is illustrated with executed application in claims analysis of some life insurance products. The goals of the thesis have been achieved.
2	Non-Parametric Clustering of Multivariate Count Data Tekumalla, Lavanya Sita January 2017 (has links) (PDF) The focus of this thesis is models for non-parametric clustering of multivariate count data. While there has been significant work in Bayesian non-parametric modelling in the last decade, in the context of mixture models for real-valued data and some forms of discrete data such as multinomial-mixtures, there has been much less work on non-parametric clustering of Multi-variate Count Data. The main challenges in clustering multivariate counts include choosing a suitable multivariate distribution that adequately captures the properties of the data, for instance handling over-dispersed data or sparse multivariate data, at the same time leveraging the inherent dependency structure between dimensions and across instances to get meaningful clusters. As the first contribution, this thesis explores extensions to the Multivariate Poisson distribution, proposing efficient algorithms for non-parametric clustering of multivariate count data. While Poisson is the most popular distribution for count modelling, the Multivariate Poisson often leads to intractable inference and a suboptimal t of the data. To address this, we introduce a family of models based on the Sparse-Multivariate Poisson, that exploit the inherent sparsity in multivariate data, reducing the number of latent variables in the formulation of Multivariate Poisson leading to a better t and more efficient inference. We explore Dirichlet process mixture model extensions and temporal non-parametric extensions to models based on the Sparse Multivariate Poisson for practical use of Poisson based models for non-parametric clustering of multivariate counts in real-world applications. As a second contribution, this thesis addresses moving beyond the limitations of Poisson based models for non-parametric clustering, for instance in handling over dispersed data or data with negative correlations. We explore, for the first time, marginal independent inference techniques based on the Gaussian Copula for multivariate count data in the Dirichlet Process mixture model setting. This enables non-parametric clustering of multivariate counts without limiting assumptions that usually restrict the marginal to belong to a particular family, such as the Poisson or the negative-binomial. This inference technique can also work for mixed data (combination of counts, binary and continuous data) enabling Bayesian non-parametric modelling to be used for a wide variety of data types. As the third contribution, this thesis addresses modelling a wide range of more complex dependencies such as asymmetric and tail dependencies during non-parametric clustering of multivariate count data with Vine Copula based Dirichlet process mixtures. While vine copula inference has been well explored for continuous data, it is still a topic of active research for multivariate counts and mixed multivariate data. Inference for multivariate counts and mixed data is a hard problem owing to ties that arise with discrete marginal. An efficient marginal independent inference approach based on extended rank likelihood, based on recent work in the statistics literature, is proposed in this thesis, extending the use vines for multivariate counts and mixed data in practical clustering scenarios. This thesis also explores the novel systems application of Bulk Cache Preloading by analysing I/O traces though predictive models for temporal non-parametric clustering of multivariate count data. State of the art techniques in the caching domain are limited to exploiting short-range correlations in memory accesses at the milli-second granularity or smaller and cannot leverage long range correlations in traces. We explore for the first time, Bulk Cache Preloading, the process of pro-actively predicting data to load into cache, minutes or hours before the actual request from the application, by leveraging longer range correlation at the granularity of minutes or hours. This enables the development of machine learning techniques tailored for caching due to relaxed timing constraints. Our approach involves a data aggregation process, converting I/O traces into a temporal sequence of multivariate counts, that we analyse with the temporal non-parametric clustering models proposed in this thesis. While the focus of our thesis is models for non-parametric clustering for discrete data, particularly multivariate counts, we also hope our work on bulk cache preloading paves the way to more inter-disciplinary research for using data mining techniques in the systems domain. As an additional contribution, this thesis addresses multi-level non-parametric admixture modelling for discrete data in the form of grouped categorical data, such as document collections. Non-parametric clustering for topic modelling in document collections, where a document is as-associated with an unknown number of semantic themes or topics, is well explored with admixture models such as the Hierarchical Dirichlet Process. However, there exist scenarios, where a doc-ument requires being associated with themes at multiple levels, where each theme is itself an admixture over themes at the previous level, motivating the need for multilevel admixtures. Consider the example of non-parametric entity-topic modelling of simultaneously learning entities and topics from document collections. This can be realized by modelling a document as an admixture over entities while entities could themselves be modeled as admixtures over topics. We propose the nested Hierarchical Dirichlet Process to address this gap and apply a two level version of our model to automatically learn author entities and topics from research corpora. Multivariate Count Data Clustering Mixture Models Non-parametric Clustering Bulk Cache Preloading Dirichlet Process Mixture Models Spatio-Temporal Data Aggregation Sparse Multivariate Poisson MultiVariate Poisson (MVP) Copulas Nested Hierarchical Dirichlet Processes Dirichlet Process Mixtures Sparse-Multivariate Poisson Dirichlet Process Mixture Model Computer Science

Search results

Přístupy k shlukování funkčních dat / Approaches to Functional Data Clustering

Non-Parametric Clustering of Multivariate Count Data