11 |
Non-Parametric Clustering of Multivariate Count Data
Tekumalla, Lavanya Sita, January 2017
This thesis focuses on models for non-parametric clustering of multivariate count data. While there has been significant work on Bayesian non-parametric modelling in the last decade, mainly in the context of mixture models for real-valued data and some forms of discrete data such as multinomial mixtures, there has been much less work on non-parametric clustering of multivariate count data. The main challenges in clustering multivariate counts include choosing a suitable multivariate distribution that adequately captures the properties of the data, for instance handling over-dispersed or sparse multivariate data, while leveraging the inherent dependency structure between dimensions and across instances to obtain meaningful clusters.
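As a small illustration of what "over-dispersed" means here: a Poisson variable has variance equal to its mean, so a variance-to-mean ratio well above 1 in a count dimension suggests the Poisson is a poor fit. The sketch below is not taken from the thesis; the function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio of one count dimension (about 1 for Poisson data)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return pvariance(counts) / m


def per_dimension_dispersion(rows):
    """Apply dispersion_index to every column of a list of count vectors."""
    return [dispersion_index(col) for col in zip(*rows)]


# Hypothetical toy data: 6 instances, 2 count dimensions.
# The second dimension is sparse (mostly zeros) with occasional large counts.
rows = [(2, 0), (3, 0), (2, 9), (3, 0), (2, 0), (3, 9)]
ratios = per_dimension_dispersion(rows)
print(ratios)  # second ratio is 6.0, far above the Poisson value of 1
```

A ratio far above 1 in a sparse dimension, as in the second column of the toy data, is the kind of behaviour a plain Poisson marginal cannot capture.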
As the first contribution, this thesis explores extensions to the Multivariate Poisson distribution and proposes efficient algorithms for non-parametric clustering of multivariate count data. While the Poisson is the most popular distribution for count modelling, the Multivariate Poisson often leads to intractable inference and a suboptimal fit of the data. To address this, we introduce a family of models based on the Sparse Multivariate Poisson that exploit the inherent sparsity in multivariate data, reducing the number of latent variables in the formulation of the Multivariate Poisson and leading to a better fit and more efficient inference. We explore Dirichlet process mixture model extensions and temporal non-parametric extensions of models based on the Sparse Multivariate Poisson, making Poisson-based models practical for non-parametric clustering of multivariate counts in real-world applications.
As the second contribution, this thesis moves beyond the limitations of Poisson-based models for non-parametric clustering, for instance in handling over-dispersed data or data with negative correlations. We explore, for the first time, marginal-independent inference techniques based on the Gaussian copula for multivariate count data in the Dirichlet process mixture model setting. This enables non-parametric clustering of multivariate counts without the limiting assumptions that usually restrict the marginals to a particular family, such as the Poisson or the negative binomial. The inference technique also works for mixed data (combinations of counts, binary and continuous data), enabling Bayesian non-parametric modelling to be used for a wide variety of data types.
As the third contribution, this thesis addresses modelling a wide range of more complex dependencies, such as asymmetric and tail dependencies, during non-parametric clustering of multivariate count data with vine copula based Dirichlet process mixtures. While vine copula inference has been well explored for continuous data, it is still a topic of active research for multivariate counts and mixed multivariate data. Inference for multivariate counts and mixed data is a hard problem owing to the ties that arise with discrete marginals. This thesis proposes an efficient marginal-independent inference approach based on the extended rank likelihood, building on recent work in the statistics literature, extending the use of vines to multivariate counts and mixed data in practical clustering scenarios.
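The abstract does not spell out the inference algorithm, but the general idea behind rank-likelihood approaches can be sketched. Each observed value in a dimension is paired with a latent continuous value, and the latent values only have to respect the ordering of the observations, so tied counts impose no constraint on each other. The helper below (a hypothetical name, not the thesis's code) computes the interval a latent value is confined to under that ordering constraint.

```python
import math


def rank_bounds(y, z, i):
    """Interval that latent z[i] must stay in so that the ordering of z agrees
    with the ordering of the observed values y (ties impose no constraint)."""
    lower = max((z[k] for k in range(len(y)) if y[k] < y[i]), default=-math.inf)
    upper = min((z[k] for k in range(len(y)) if y[k] > y[i]), default=math.inf)
    return lower, upper


# Hypothetical observed counts in one dimension and their current latent values.
y = [0, 0, 3, 1, 3]
z = [-1.2, -0.4, 0.9, 0.1, 1.5]

print(rank_bounds(y, z, 3))  # (-0.4, 0.9): between the zeros and the threes
print(rank_bounds(y, z, 0))  # (-inf, 0.1): the tied zero at index 1 is ignored
```

In a Gibbs-style sampler, one would typically redraw each latent value from its conditional distribution truncated to this interval; because only the ordering of the data enters, no parametric form for the marginals is needed.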
This thesis also explores the novel systems application of Bulk Cache Preloading by analysing I/O traces through predictive models for temporal non-parametric clustering of multivariate count data. State-of-the-art techniques in the caching domain are limited to exploiting short-range correlations in memory accesses at millisecond granularity or smaller and cannot leverage long-range correlations in traces. We explore, for the first time, Bulk Cache Preloading: the process of proactively predicting data to load into cache minutes or hours before the actual request from the application, by leveraging longer-range correlations at the granularity of minutes or hours. The relaxed timing constraints make it possible to develop machine learning techniques tailored for caching. Our approach involves a data aggregation process that converts I/O traces into a temporal sequence of multivariate counts, which we analyse with the temporal non-parametric clustering models proposed in this thesis. While the focus of our thesis is models for non-parametric clustering of discrete data, particularly multivariate counts, we also hope our work on bulk cache preloading paves the way to more interdisciplinary research on using data mining techniques in the systems domain.
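The data aggregation step can be pictured with a minimal sketch. The abstract does not describe the actual scheme, so everything here is an assumption: the trace format (timestamp in seconds, block address), the fixed-width time windows, and the partition of the address space into equal-sized regions are all hypothetical choices.

```python
from collections import Counter


def aggregate_trace(trace, window_s, region_blocks, n_regions):
    """Turn (timestamp_seconds, block_address) records into one count vector
    per time window; entry r counts the accesses that fell in address region r."""
    per_window = {}
    for ts, block in trace:
        w = int(ts // window_s)
        # Addresses beyond the last region are clamped into it.
        r = min(block // region_blocks, n_regions - 1)
        per_window.setdefault(w, Counter())[r] += 1
    if not per_window:
        return []
    first, last = min(per_window), max(per_window)
    # Windows with no accesses become all-zero vectors, so the sequence is regular.
    return [[per_window.get(w, Counter())[r] for r in range(n_regions)]
            for w in range(first, last + 1)]


# Hypothetical trace: five accesses spread over a little more than three minutes.
trace = [(5.0, 10), (12.5, 10), (61.0, 250), (65.2, 12), (190.0, 120)]
print(aggregate_trace(trace, window_s=60, region_blocks=100, n_regions=3))
# [[2, 0, 0], [1, 0, 1], [0, 0, 0], [0, 1, 0]]
```

Each row is one multivariate count observation, and the row order gives the temporal sequence that a temporal clustering model would consume.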
As an additional contribution, this thesis addresses multi-level non-parametric admixture modelling for discrete data in the form of grouped categorical data, such as document collections. Non-parametric clustering for topic modelling in document collections, where a document is associated with an unknown number of semantic themes or topics, is well explored with admixture models such as the Hierarchical Dirichlet Process. However, there are scenarios where a document needs to be associated with themes at multiple levels, each theme being itself an admixture over themes at the previous level, which motivates the need for multi-level admixtures. Consider the example of non-parametric entity-topic modelling, that is, simultaneously learning entities and topics from document collections. This can be realized by modelling a document as an admixture over entities, while the entities are themselves modelled as admixtures over topics. We propose the nested Hierarchical Dirichlet Process to address this gap and apply a two-level version of our model to automatically learn author entities and topics from research corpora.
|
12 |
Modélisation et utilisation des erreurs de pseudodistances GNSS en environnement transport pour l’amélioration des performances de localisation / Modeling and use of GNSS pseudorange errors in transport environment to enhance the localization performances
Viandier, Nicolas, 07 June 2011
GNSS are now widely used in the transport field. The scientific community currently aims to develop transport applications requiring high accuracy, availability and integrity. These systems offer a continuous positioning service. Performance is determined by the system parameters but also by the environment in which the signals propagate. The characteristics of propagation through the atmosphere are well known. However, it is more difficult to anticipate and analyze the impact of the environment close to the antenna, which can be composed, for instance, of urban obstacles or vegetation. For several years, research at LEOST and LAGIS has aimed at understanding the propagation environment and using this knowledge as supplementary information alongside the GNSS information. This approach aims to reduce the number of sensors in the localization system, and consequently its complexity and cost. The work in this thesis mainly provides more realistic pseudorange error models and reception state models. After a characterization of the observation error, several pseudorange error models are proposed. These models are the finite Gaussian mixture model and the Dirichlet process mixture. The model parameters are estimated jointly with the state vector containing the position, using a suitable filtering solution such as the Rao-Blackwellized particle filter. Letting the noise model evolve allows it to adapt to the environment and consequently to provide a more accurate position. Each step of this work has been tested and validated on simulated and real data.
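To make the finite Gaussian mixture error model concrete, here is a minimal sketch of evaluating such a density and asking which component most plausibly produced a given pseudorange error. The two components and all of their parameters are hypothetical illustrations (a narrow zero-mean component and a wide biased one), not values estimated in the thesis.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def mixture_pdf(x, components):
    """Density of a finite Gaussian mixture; components are (weight, mean, std)."""
    if not math.isclose(sum(w for w, _, _ in components), 1.0):
        raise ValueError("mixture weights must sum to 1")
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)


def responsibilities(x, components):
    """Posterior probability of each component given an observed error x."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in components]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical parameters, in metres: (weight, mean, standard deviation).
components = [(0.7, 0.0, 2.0), (0.3, 12.0, 8.0)]

print(mixture_pdf(0.0, components))
print(responsibilities(15.0, components))  # dominated by the wide, biased component
```

In a joint estimation scheme such as the one the abstract describes, the weights, means and standard deviations would be unknowns estimated alongside the position rather than fixed constants as in this sketch.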
|
13 |
Estimation Bayésienne non Paramétrique de Systèmes Dynamiques en Présence de Bruits Alpha-Stables / Nonparametric Bayesian Estimation of Dynamical Systems in the Presence of Alpha-Stable Noise
Jaoua, Nouha, 06 June 2013
In the signal processing literature, noise sources are often assumed to be Gaussian. However, in a growing number of applications the disturbances encountered depart strongly from the conventional Gaussian or Gaussian mixture models, and this assumption can lead to a loss of resolution and/or accuracy. This is particularly the case for impulsive noise, which is found in several areas, especially telecommunications. Alpha-stable distributions are well suited to modeling this type of noise. In this context, the main objective of this thesis is to propose novel robust methods for the joint estimation of the state and the noise in impulsive environments. Inference is performed within a Bayesian framework using sequential Monte Carlo methods. First, this problem is addressed for OFDM transmission systems, assuming that channel distortions follow symmetric alpha-stable distributions. A sequential Monte Carlo algorithm (particle filter) is proposed for the joint estimation of the transmitted OFDM symbols and the alpha-stable noise parameters. The problem is then tackled in the more general context of nonlinear dynamic systems. A flexible Bayesian nonparametric model based on Dirichlet process mixtures is introduced to model the alpha-stable noise. Sequential Monte Carlo filters based on efficient importance densities are developed to jointly estimate the state and the unknown noise probability densities.
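For readers unfamiliar with impulsive noise, the following sketch draws symmetric alpha-stable samples with the Chambers-Mallows-Stuck transform, a standard simulation method that the abstract itself does not mention. The function name and parameter choices are illustrative assumptions. With alpha = 2 the law is Gaussian with variance 2 (for unit scale), with alpha = 1 it is Cauchy, and smaller alpha gives heavier tails.

```python
import math
import random


def sample_sas(alpha, rng, scale=1.0):
    """One draw from a symmetric alpha-stable law (0 < alpha <= 2), obtained
    with the Chambers-Mallows-Stuck transform of a uniform and an exponential."""
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must lie in (0, 2]")
    v = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    while w == 0.0:  # guard against a (vanishingly unlikely) zero draw
        w = rng.expovariate(1.0)
    x = (math.sin(alpha * v) / math.cos(v) ** (1.0 / alpha)
         * (math.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha))
    return scale * x


rng = random.Random(0)
gaussian_like = [sample_sas(2.0, rng) for _ in range(5)]
impulsive = [sample_sas(1.2, rng) for _ in range(5)]
print(gaussian_like)
print(impulsive)  # occasional very large values are expected for alpha < 2
```

The occasional extreme draws for alpha below 2 are exactly what a Gaussian noise assumption fails to account for, which motivates the more flexible noise models the thesis proposes.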
|
14 |
Exploring Single-molecule Heterogeneity and the Price of Cell Signaling
Wang, Tenglong, 25 January 2022
No description available.
|