1 |
On Multi-Scale Refinement of Discrete Data
Dehghani Tafti, Pouya 10 1900
Multi-resolution analysis can be interpreted from both the Fourier domain and the temporal/spatial domain. While a Fourier-domain interpretation helps in designing powerful machinery for multi-resolution refinement on regular point sets and lattices, most of its techniques cannot be directly generalized to irregular sampling. In this thesis we therefore provide a new definition and formulation of multi-resolution refinement, based on a temporal/spatial-domain understanding, that is general enough to allow multi-resolution approximation of different spaces of functions by processing samples (or observations) that may be irregularly distributed or even obtained using different sampling methods. We then provide a construction for designing and implementing classes of refinement schemes in these general settings. The framework for multi-resolution refinement that we discuss includes and extends the existing mathematical machinery for multi-resolution analysis; the suggested construction unifies many of the schemes currently in use and, more importantly, allows schemes to be designed for many new settings. / Thesis / Master of Applied Science (MASc)
|
2 |
Modelos para dados de contagem com superdispersão: uma aplicação em um experimento agronômico / Models for count data with overdispersion: application in an agronomic experiment
Batista, Douglas Toledo 26 June 2015
The reference model for count data is the Poisson model, whose main feature is the assumption that the mean and the variance are equal. However, this mean-variance relationship does not always hold in observational data. Often the observed variance is greater than the expected variance, a phenomenon known as overdispersion. The aim of this work is to apply generalized linear models in order to select a model that satisfactorily accommodates the overdispersion present in count data. The data come from an experiment that aimed to evaluate and characterize the parameters involved in the flowering of adult orange trees of the variety "x11" grafted onto lemon trees of the varieties "Cravo" and "Swingle". First, a Poisson model with the canonical link function was fitted. The deviance, the Pearson X2 statistic, and the half-normal plot showed strong evidence of overdispersion. The negative binomial and quasi-Poisson models were then used as alternatives to the Poisson model. The quasi-Poisson model gave the best fit to the data, allowing more accurate inferences and practical interpretations of the model parameters.
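The overdispersion check mentioned in this abstract can be illustrated with a minimal sketch. It assumes an intercept-only Poisson model, so the fitted mean is simply the sample mean; the counts are hypothetical and are not taken from the thesis, and `pearson_dispersion` is an illustrative name. A Pearson dispersion statistic well above 1 suggests overdispersion.

```python
from statistics import mean


def pearson_dispersion(counts, n_params=1):
    """Pearson chi-squared statistic divided by residual degrees of freedom.

    Assumes an intercept-only Poisson model, so every fitted mean equals
    the sample mean. Values near 1 are consistent with a Poisson model;
    values well above 1 suggest overdispersion.
    """
    mu = mean(counts)
    chi2 = sum((y - mu) ** 2 / mu for y in counts)
    return chi2 / (len(counts) - n_params)


# Hypothetical counts, for illustration only.
counts = [0, 1, 0, 12, 3, 0, 25, 2, 0, 7]
phi = pearson_dispersion(counts)
print(f"estimated dispersion: {phi:.2f}")
```

In a quasi-Poisson fit, a dispersion estimate of this kind is what scales the Poisson variance, which is why standard errors grow when the estimate exceeds 1.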
|
4 |
Extending the Information Partition Function: Modeling Interaction Effects in Highly Multivariate, Discrete Data
Cannon, Paul C. 28 December 2007
Because of the huge amounts of data made available by the technology boom in the late twentieth century, new methods are required to turn data into usable information. Much of this data is categorical in nature, which makes estimation difficult in highly multivariate settings. In this thesis we review various multivariate statistical methods, discuss various statistical methods of natural language processing (NLP), and discuss a general class of models described by Erosheva (2002) called generalized mixed membership models. We then propose extensions of the information partition function (IPF) derived by Engler (2002), Oliphant (2003), and Tolley (2006) that will allow modeling of discrete, highly multivariate data in linear models. We report results of the modified IPF model on the World Health Organization's Survey on Global Aging (SAGE).
|
5 |
A Contrast Pattern based Clustering Algorithm for Categorical Data
Fore, Neil Koberlein 13 October 2010
No description available.
|
6 |
Modélisation bayésienne des changements aux niches écologiques causés par le réchauffement climatique / Bayesian modeling of changes to ecological niches caused by global warming
Akpoué, Blache Paul 05 1900
This thesis presents estimation methods and algorithms for analysing count data in particular and discrete data in general. It is part of an NSERC strategic project named CC-Bio, which aims to assess the impact of climate change on the distribution of plant and animal species in Québec.
After a brief introduction to the concepts of biogeography and to generalized linear mixed models in chapters 1 and 2 respectively, the thesis focuses on three main ideas.
First, chapter 3 introduces a new form of distribution whose components have Poisson or Skellam marginal distributions. This new specification makes it possible to incorporate relevant information about the nature of the correlations between all the components. We also present some properties of this distribution. Unlike the multivariate Poisson distribution that it generalizes, it can handle variables with positive and/or negative correlations. A simulation study illustrates the estimation methods in the two-dimensional case. The results obtained by Bayesian methods via Markov chain Monte Carlo (MCMC) suggest a fairly low relative bias of less than 5% for the regression coefficients of the means, whereas those of the covariance term seem somewhat more volatile.
Second, chapter 4 presents an extension of multivariate Poisson regression with random effects having a gamma density. Because species abundance data are highly dispersed, which would make the resulting estimators and standard deviations misleading, we favour an approach based on Monte Carlo integration with importance sampling. The approach remains the same as in the previous chapter: the idea is to simulate independent latent variables so as to reduce the problem to the framework of conventional generalized linear mixed models (GLMM) with gamma-distributed random effects. Although the assumption that the dispersion parameters are known a priori seems too strong, a sensitivity analysis based on goodness of fit demonstrates the robustness of the method.
Third, in the last chapter, we focus on the definition and construction of a measure of concordance, and hence of correlation, for zero-augmented count data through Gaussian copula models. In contrast to Kendall's tau, whose values lie in an interval whose bounds depend on the frequency of ties, this measure has the advantage of taking its values on the interval (-1, 1). Originally introduced to model correlations between continuous variables, its extension to the discrete case implies certain restrictions, and its values no longer cover the entire interval (-1, 1) but only a subset of it. Indeed, the new measure can be interpreted as the correlation between the continuous random variables whose discretization yields our non-negative discrete observations. Two estimation methods for zero-augmented models are presented, in the frequentist and Bayesian settings, based respectively on maximum likelihood and Gauss-Hermite integration. Finally, a simulation study shows the robustness and the limits of our approach.
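The remark about Kendall's tau and ties can be made concrete with a small sketch. It uses the tau-a variant (concordant minus discordant pairs, divided by the total number of pairs), which is an assumption on our part since the abstract does not say which variant is meant; the function name and data are hypothetical. With tied observations, even two identical variables do not reach 1.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant, so ties
    shrink the range of values the statistic can attain.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# No ties: perfect agreement gives exactly 1.
tau_no_ties = kendall_tau_a([1, 2, 3, 4], [1, 2, 3, 4])

# Hypothetical zero-heavy counts: y equals x, yet ties cap the value below 1.
tau_with_ties = kendall_tau_a([0, 0, 1, 1], [0, 0, 1, 1])
print(tau_no_ties, tau_with_ties)
```

Here two of the six pairs are tied, so the largest attainable value is 4/6, which illustrates why the bounds depend on how often ties occur.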
|
7 |
Influence, information and item response theory in discrete data analysis
Magis, David 04 May 2007
The main purpose of this thesis is to consider usual statistical tests for discrete data and to present some recent developments around them. Contents can be divided into three parts.
In the first part we consider the general issue of misclassification and its impact on usual test results. A suggested diagnostic examination of the misclassification process leads to simple and direct investigation tools to determine whether conclusions are very sensitive to classification errors. An additional probabilistic approach is presented, in order to refine the discussion in terms of the risk of getting contradictory conclusions whenever misclassified data occur.
In the second part we propose a general approach to the issue of multiple sub-testing procedures. In particular, when the null hypothesis is rejected, we show that the usual re-application of the test to selected parts of the data can lead to inconsistencies. The method we discuss is based on the concept of decisive subsets, defined as the smallest number of categories sufficient to reject the null hypothesis, whatever the counts of the remaining categories. In this framework, we present an iterative step-by-step detection process based on successive interval building and category count comparison. Several examples highlight the gain our method can bring with respect to classical approaches.
The third and last part is devoted to item response theory, a field of psychometrics. After a short introduction to the topic, we first propose two enhanced iterative estimators of proficiency. Several theoretical properties and simulation results indicate that these methods improve on the usual Bayesian estimators, notably in terms of bias. Furthermore, we study the link between response pattern misfit and the subject's variability (the latter treated as an individual latent trait). More precisely, we present maximum-likelihood-based joint estimators of the subject's parameters (ability and variability); several simulations suggest that the enhanced estimators also offer a major gain over classical ones, mainly in terms of estimator bias.
|
9 |
Machine Learning Approaches to Reveal Discrete Signals in Gene Expression
Changlin Wan (12450321) 24 April 2022
Gene expression is an intricate process that determines cell types and functions in metazoans. Most of its regulation is communicated through discrete signals, such as whether the DNA helix is open or whether an enzyme binds its target. Understanding the regulatory signals of this selective expression process is essential to fully comprehending biological mechanisms and complex biological systems. In this research, we seek to reveal the discrete signals in gene expression using novel machine learning approaches. Specifically, we focus on two types of data: chromatin conformation capture (3C) and single-cell RNA sequencing (scRNA-seq). To identify potential regulators, we use a new hypergraph neural network to predict genome interactions, and find that gene co-regulation may result from a shared enhancer element. To reveal the discrete expression states in scRNA-seq data, we propose a novel model called LTMG that accounts for biological noise and shows better goodness of fit than existing models. Next, we apply Boolean matrix factorization to find co-regulation modules from the identified expression states, revealing a general property of cancer cells across different patients. Lastly, to find more reliable modules, we analyze the bias in the data and propose BIND, the first algorithm to quantify column- and row-wise bias in a binary matrix.
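For readers unfamiliar with Boolean matrix factorization, the sketch below shows only the Boolean matrix product that such a factorization tries to match; it is not the thesis's algorithm, and the matrices and the name `boolean_product` are hypothetical. An entry of the product is 1 when at least one latent module links the row to the column.

```python
def boolean_product(a, b):
    """Boolean matrix product: entry (i, j) is OR over k of (a[i][k] AND b[k][j])."""
    inner = len(b)
    cols = len(b[0])
    return [
        [int(any(row[k] and b[k][j] for k in range(inner))) for j in range(cols)]
        for row in a
    ]


# Hypothetical factors: 3 cells x 2 modules, and 2 modules x 3 genes.
cell_by_module = [[1, 0], [1, 1], [0, 1]]
module_by_gene = [[1, 1, 0], [0, 1, 1]]

reconstructed = boolean_product(cell_by_module, module_by_gene)
for row in reconstructed:
    print(row)
```

Unlike the ordinary matrix product, overlapping modules do not add up: the middle entry of the second row stays 1 even though both modules contribute to it.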
|
10 |
Probabilistic Modeling of Multi-relational and Multivariate Discrete Data
Wu, Hao 07 February 2017
Modeling and discovering knowledge from multi-relational and multivariate discrete data is a crucial task that arises in many research and application domains, e.g. text mining, intelligence analysis, epidemiology, social science, etc. In this dissertation, we study and address three problems involving the modeling of multi-relational discrete data and multivariate multi-response count data, viz. (1) discovering surprising patterns from multi-relational data, (2) constructing a generative model for multivariate categorical data, and (3) simultaneously modeling multivariate multi-response count data and estimating covariance structures between multiple responses.
To discover surprising multi-relational patterns, we first study the "where do I start?" problem originating from intelligence analysis. By studying nine methods with origins in association analysis, graph metrics, and probabilistic modeling, we identify several classes of algorithmic strategies that can supply starting points to analysts, and thus help to discover interesting multi-relational patterns from datasets. To actually mine for interesting multi-relational patterns, we represent the multi-relational patterns as dense and well-connected chains of biclusters over multiple relations, and model the discrete data by the maximum entropy principle, such that in a statistically well-founded way we can gauge the surprisingness of a discovered bicluster chain with respect to what we already know. We design an algorithm for approximating the most informative multi-relational patterns, and provide strategies to incrementally organize discovered patterns into the background model. We illustrate how our method is adept at discovering the hidden plot in multiple synthetic and real-world intelligence analysis datasets. Our approach naturally generalizes traditional attribute-based maximum entropy models for single relations, and further supports iterative, human-in-the-loop, knowledge discovery.
To build a generative model for multivariate categorical data, we apply the maximum entropy principle to propose a categorical maximum entropy model such that in a statistically well-founded way we can optimally use given prior information about the data, and are unbiased otherwise. Generally, inferring the maximum entropy model could be infeasible in practice. Here, we leverage the structure of the categorical data space to design an efficient model inference algorithm to estimate the categorical maximum entropy model, and we demonstrate how the proposed model is adept at estimating underlying data distributions. We evaluate this approach against both simulated data and US census datasets, and demonstrate its feasibility using an epidemic simulation application.
Modeling data with multivariate count responses is a challenging problem due to the discrete nature of the responses. Existing methods for univariate count responses cannot be easily extended to the multivariate case, since the dependency among multiple responses needs to be properly accounted for. To model multivariate data with multiple count responses, we propose a novel multivariate Poisson log-normal model (MVPLN). By simultaneously estimating the regression coefficients and the inverse covariance matrix over the latent variables with an efficient Monte Carlo EM algorithm, the proposed model takes advantage of associations among multiple count responses to improve prediction accuracy. Simulation studies and applications to real-world data are conducted to systematically evaluate the performance of the proposed method in comparison with conventional methods. / Ph. D. / In this decade of big data, massive amounts of data of various types are generated every day in different research areas and industry sectors. Among these, text data, i.e. text documents, are important to many research and real-world applications. One challenge in analyzing massive text data is deciding which documents to investigate first in order to start the analysis, and how to identify the stories and plots, if any, hidden inside the documents. For example, in intelligence analysis, common questions that analysts ask when examining intelligence documents are ‘How is a suspect connected to the passenger manifest on this flight?’ and ‘How do distributed terrorist cells interface with each other?’. This crucial task is known as storytelling. In the first half of this dissertation, we study this problem and design mathematical models and computer algorithms that automatically identify useful information in text data, helping analysts discover hidden stories and plots in massive collections of text documents. We also incorporate visual analytics techniques and design a visualization system to support human-in-the-loop exploratory data analysis, so that analysts can interact with the algorithms and models iteratively to investigate given datasets.
In the second half of this dissertation, we study two problems that arise in the domain of public health. When an epidemic of a disease occurs, e.g. during flu season, public health officials need to set policies in advance to prevent or alleviate it. A data-driven approach is to base such policies on simulation results and predictions derived from historical data. One problem commonly faced in epidemic simulation is that researchers would like to run simulations with real-world data, so that the results are close to real-world scenarios, while at the same time protecting the private information of individuals. To solve this problem, we design and implement a mathematical model that generates a realistic synthetic population from U.S. Census survey data to support epidemic simulation. Using flu as an example, we also propose a mathematical model to study associations between different types of flu using information collected from social media such as Twitter. We believe that identifying such associations will help officials make appropriate public health policies.
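The maximum entropy principle used for categorical data in this dissertation can be illustrated in its simplest setting. The sketch below assumes the only prior information is the two marginal distributions of a two-way table, in which case the maximum entropy joint distribution is the product of the marginals; this is a textbook special case and not the dissertation's inference algorithm, and all names and numbers are hypothetical.

```python
from math import log


def entropy(joint):
    """Shannon entropy (in nats) of a joint distribution given as a nested list."""
    return -sum(p * log(p) for row in joint for p in row if p > 0)


def max_entropy_joint(row_marginal, col_marginal):
    """Maximum entropy joint distribution when only the marginals are constrained."""
    return [[r * c for c in col_marginal] for r in row_marginal]


# Hypothetical marginals for two binary categorical attributes.
rows = [0.5, 0.5]
cols = [0.5, 0.5]

independent = max_entropy_joint(rows, cols)
# Another joint with the same marginals but extra structure (dependence).
dependent = [[0.4, 0.1], [0.1, 0.4]]

print(entropy(independent), entropy(dependent))
```

Both tables satisfy the same marginal constraints, but the product form has the larger entropy, which is the sense in which a maximum entropy model uses the given information and is unbiased otherwise.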
|