21

Inferencia e diagnostico em modelos para dados de contagem com excesso de zeros / Inference and diagnostic in zero-inflated count data models

Monzón Montoya, Alejandro Guillermo 13 August 2018 (has links)
Advisor: Victor Hugo Lachos Davila / Master's dissertation (mestrado), Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica / Original issue date: 2009 / Abstract: When analyzing count data, a higher frequency of zeros than expected under the usual distributions is sometimes observed, so that standard regression analysis is not applicable; the excess zeros may also induce over-dispersion in the data. In this work, four types of models for zero-inflated count data are presented: the zero-inflated binomial (ZIB), zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB) and zero-inflated beta-binomial (ZIBB) regression models.
We use the EM algorithm to obtain maximum likelihood estimates of the parameters of the proposed models and, using the complete-data likelihood function, we develop local influence measures following the approach of Zhu and Lee (2001) and Lee and Xu (2004). We also discuss the calculation of residuals for the ZIB and ZIP regression models, with the aim of identifying atypical observations and/or model misspecification. Finally, results obtained for two real data sets are reported, illustrating the usefulness of the proposed methodology. / Master's degree in Statistics
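To make the zero-inflated Poisson mechanism concrete, here is a minimal EM sketch for the ZIP model without covariates. It is an illustration of the general technique the abstract names, not the thesis's own implementation; all function names are mine.

```python
import math
import random

def rpois(lam, rng):
    # Poisson sampling via Knuth's multiplication method
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        p *= rng.random()
        k += 1
    return k - 1

def fit_zip_em(y, n_iter=300):
    """EM estimates (pi, lam) for a zero-inflated Poisson model:
    with probability pi the count is a structural zero, otherwise
    it is drawn from Poisson(lam)."""
    n = len(y)
    pi, lam = 0.5, max(sum(y) / n, 1e-8)  # crude starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each observed zero
        # is a structural zero rather than a Poisson zero
        p_zero = pi + (1.0 - pi) * math.exp(-lam)
        z = [pi / p_zero if yi == 0 else 0.0 for yi in y]
        # M-step: closed-form updates
        pi = sum(z) / n
        w = sum(1.0 - zi for zi in z)
        lam = sum((1.0 - zi) * yi for zi, yi in zip(z, y)) / w
    return pi, lam

# Simulate ZIP data with known parameters and recover them
rng = random.Random(42)
true_pi, true_lam = 0.3, 4.0
y = [0 if rng.random() < true_pi else rpois(true_lam, rng)
     for _ in range(20000)]
pi_hat, lam_hat = fit_zip_em(y)
```

The regression versions in the thesis replace the scalar pi and lam with link functions of covariates, but the E-step and M-step structure is the same.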
22

Equações de estimação generalizadas com resposta binomial negativa: modelando dados correlacionados de contagem com sobredispersão / Generalized estimating equations with negative binomial responses: modeling correlated count data with overdispersion

Clarissa Cardoso Oesselmann 12 December 2016 (has links)
An assumption that is common in the analysis of regression models is that of independent responses. However, when working with longitudinal or grouped data this assumption may not hold. Several methodologies address this problem; perhaps the best known in the non-Gaussian context is that of Generalized Estimating Equations (GEE), which has similarities with Generalized Linear Models (GLM). These similarities involve specifying the model around distributions of the exponential family and specifying a variance function. The only difference is that a working correlation matrix, parameterizing the correlation structure within experimental units, is also inserted into this function. The main objective of this dissertation is to study how these models behave in a specific situation: count data with overdispersion. In GLM this problem is solved by fitting a model with a negative binomial (NB) response, and the idea is the same for GEE. This dissertation reviews the existing GEE theory in general and for the specific case where the marginal responses follow negative binomial distributions, and shows how the methodology is applied in practice through three examples of correlated count data.
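The marginal structure the abstract relies on (negative binomial counts, correlated within clusters) can be sketched with a shared Gamma frailty. This is a simulation illustration of the data-generating setting, not the GEE estimator itself; the moment estimate of the overdispersion parameter at the end is a simple stand-in for the fitted dispersion, and all names are illustrative.

```python
import math
import random

def rpois(lam, rng):
    # Poisson sampling via Knuth's multiplication method
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        p *= rng.random()
        k += 1
    return k - 1

def simulate_nb_clusters(n_clusters, cluster_size, mu, theta, rng):
    """Negative binomial counts via a Gamma-Poisson mixture, with the
    Gamma frailty shared inside each cluster.  Sharing the frailty makes
    counts within a cluster positively correlated, while each count is
    marginally NB with mean mu and variance mu + mu^2 / theta."""
    clusters = []
    for _ in range(n_clusters):
        g = rng.gammavariate(theta, 1.0 / theta)  # mean-1 frailty
        clusters.append([rpois(mu * g, rng) for _ in range(cluster_size)])
    return clusters

rng = random.Random(7)
mu, theta = 5.0, 2.0
data = simulate_nb_clusters(4000, 5, mu, theta, rng)
flat = [y for cl in data for y in cl]
m = sum(flat) / len(flat)
v = sum((y - m) ** 2 for y in flat) / (len(flat) - 1)
# Method-of-moments overdispersion estimate from Var = mu + mu^2 / theta
theta_hat = m * m / (v - m)
```

A full GEE fit would additionally solve the estimating equations with a working correlation matrix (e.g. exchangeable) over these clusters.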
23

Analyse statistique de données biologiques à haut débit / Statistical analysis of high-throughput biological data

Aubert, Julie 07 February 2017 (has links)
The technological progress of the last twenty years has allowed the emergence of a high-throughput biology based on large-scale data obtained automatically. Statisticians have an important role to play in the modelling and analysis of these data, which are numerous, noisy, sometimes heterogeneous and collected at different scales. This role can take several forms: the statistician may propose new concepts or methods inspired by the questions raised by this biology; may propose a fine modelling of the phenomena observed by means of these technologies; and, when methods exist and require only adaptation, may act as an expert who knows the methods, their limits and their advantages. The work presented in this thesis lies at the interface between applied mathematics and biology, and mostly concerns the latter two roles.
In the first part, I introduce different methods developed with my co-authors for the analysis of high-throughput biological data, based on latent variable models. These models explain an observed phenomenon using hidden (latent) variables. The simplest latent variable model is the mixture model, and the first two methods presented are examples of it: the first in a context of multiple testing and the second in the framework of defining a hybridization threshold for microarray data. I also present a coupled hidden Markov chain model for the detection of copy-number variations in genomics that takes into account the dependence between individuals, due for example to genetic proximity. For this model we propose an approximate inference based on a variational approximation, exact inference being intractable as the number of individuals increases.
We also define a latent block model describing an underlying structure in blocks of rows and columns, adapted to count data from microbial ecology. Metabarcoding and metagenomic data record the abundance of each unit of interest (for example, each micro-organism) of a microbial community within an environment (plant rhizosphere, human digestive tract, ocean). These data typically show stronger dispersion than expected under the most classical models (over-dispersion). Biclustering is a way to study the interactions between the structure of microbial communities and the biological samples from which they are derived. We proposed to model this phenomenon with a Poisson-Gamma distribution and developed another variational approximation for this particular latent block model, as well as a model selection criterion. The flexibility and performance of the model are illustrated on three real data sets.
A second part is devoted to the analysis of transcriptomic data derived from DNA microarray and RNA sequencing technologies. The first section concerns the normalization of the data (detection and correction of technical biases) and presents two new methods that I proposed with my co-authors, along with a comparison of methods to which I contributed.
The second section, devoted to experimental design, presents a method for analyzing so-called dye-switch designs. In the last part, I use two collaborative projects (an analysis of differentially expressed genes from microarray data, and an analysis of the translatome in sea urchins from RNA-sequencing data) to show how statistical skills are mobilized and the added value that statistics brings to genomics projects.
24

Modely s Touchardovým rozdělením / Models with Touchard Distribution

Ibukun, Michael Abimbola January 2021 (has links)
In 2018, Raul Matsushita, Donald Pianto, Bernardo B. De Andrade, André Cançado & Sergio Da Silva published a paper titled "Touchard distribution", which presented a model that is a two-parameter extension of the Poisson distribution. This model has its normalizing constant related to the Touchard polynomials, hence the name of this model. This diploma thesis is concerned with the properties of the Touchard distribution for which delta is known. Two asymptotic tests based on two different statistics were carried out for comparison in a Touchard model with two independent samples, supported by simulations in R.
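For reference, in the parameterization I understand from Matsushita et al. (2018), the Touchard pmf is p(x) proportional to lam^x (x+1)^delta / x!, with a normalizing constant tau(lam, delta) tied to the Touchard polynomials; delta = 0 collapses it to the Poisson. A small sketch under that assumption (truncating tau numerically; names are illustrative):

```python
import math

def touchard_pmf(x, lam, delta, j_max=200):
    """Touchard pmf p(x) = lam^x (x+1)^delta / (x! * tau(lam, delta)),
    with tau(lam, delta) = sum_j lam^j (j+1)^delta / j! truncated at
    j_max terms.  Terms are built iteratively to avoid overflowing
    factorials."""
    tau, term = 0.0, 1.0  # term = lam^j / j!, starting at j = 0
    for j in range(j_max):
        tau += term * (j + 1) ** delta
        term *= lam / (j + 1)
    num = (x + 1) ** delta  # numerator lam^x / x! * (x+1)^delta
    for k in range(1, x + 1):
        num *= lam / k
    return num / tau

# Sanity checks: the pmf sums to one, and delta = 0 recovers Poisson
total = sum(touchard_pmf(x, 3.0, 1.5) for x in range(60))
poisson_gap = touchard_pmf(4, 3.0, 0.0) - math.exp(-3.0) * 3.0 ** 4 / math.factorial(4)
```

The extra factor (x+1)^delta inflates (delta > 0) or deflates (delta < 0) the tail relative to the Poisson, which is what makes the model a two-parameter extension.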
25

Generalized Principal Component Analysis: Dimensionality Reduction through the Projection of Natural Parameters

Landgraf, Andrew J. 15 October 2015 (has links)
No description available.
26

Stochastic models for MRI lesion count sequences from patients with relapsing remitting multiple sclerosis

Li, Xiaobai 14 July 2006 (has links)
No description available.
27

Exploratory and inferential multivariate statistical techniques for multidimensional count and binary data with applications in R

Ntushelo, Nombasa Sheroline 12 1900 (has links)
Thesis (MComm)--Stellenbosch University, 2011. / Abstract: The analysis of multidimensional (multivariate) data sets is a very important area of research in applied statistics. Over the decades many techniques have been developed to deal with such data sets, including inferential analysis, regression analysis, discriminant analysis, cluster analysis and many more exploratory methods. Most of these methods deal with data containing numerical variables; however, there are also powerful methods in the literature for multidimensional binary and count data. The primary purpose of this thesis is to discuss the exploratory and inferential techniques that can be used for binary and count data. Chapter 2 gives the detail of correspondence analysis and canonical correspondence analysis, methods used to analyze data in contingency tables. Chapter 3 is devoted to cluster analysis; it explains four well-known clustering methods and discusses the distance (dissimilarity) measures available in the literature for binary and count data. Chapter 4 explains metric and non-metric multidimensional scaling, which can be used to represent binary or count data in a lower-dimensional Euclidean space. Chapter 5 presents an inferential method called the analysis of distance. This method uses reasoning similar to the analysis of variance, but the inference is based on a pseudo F-statistic, with the p-value obtained using permutations of the data. Chapter 6 contains real-world applications of the above methods to two special data sets, the Biolog data and the Barents Fish data. The secondary purpose of the thesis is to demonstrate how the above techniques can be performed in the software package R. Several R packages and functions are discussed throughout the thesis, their usage is demonstrated with appropriate examples, and attention is given to the interpretation of the output and graphics. The thesis ends with some general conclusions and ideas for further research.
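The analysis-of-distance idea in Chapter 5 (a pseudo F-statistic with a permutation p-value, in the spirit of PERMANOVA) can be sketched compactly. This is a generic illustration of the technique, not the thesis's own code; all names are illustrative.

```python
import random

def pseudo_f(d, labels):
    """Pseudo F-statistic from a squared-distance decomposition:
    between-group vs. within-group sums of squared distances."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(d[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(d[idx[a]][idx[b]] ** 2
                         for a in range(len(idx))
                         for b in range(a + 1, len(idx))) / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permutation_p(d, labels, n_perm=199, seed=0):
    """p-value from permuting the group labels."""
    rng = random.Random(seed)
    f_obs = pseudo_f(d, labels)
    count = 0
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        if pseudo_f(d, perm) >= f_obs:
            count += 1
    return (1 + count) / (1 + n_perm)

# Two clearly separated one-dimensional groups
values = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]
labels = ["a"] * 5 + ["b"] * 5
d = [[abs(u - w) for w in values] for u in values]
f_obs = pseudo_f(d, labels)
p = permutation_p(d, labels)
```

Because the test works directly on a distance matrix, any dissimilarity suitable for binary or count data (e.g. Jaccard or Bray-Curtis) can be plugged in for `d`.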
28

Count data modelling and tourism demand

Hellström, Jörgen January 2002 (has links)
This thesis consists of four papers concerning the modelling of count data and tourism demand. Three of the papers focus on the integer-valued autoregressive moving average (INARMA) model class, and especially on the INAR(1) model. The fourth paper studies the interaction between households' choice of the number of leisure trips and the number of overnight stays within a bivariate count data modelling framework. Paper [I] extends the basic INAR(1) model to enable more flexible and realistic empirical economic applications. The model is generalized by relaxing some of its basic independence assumptions, and results are given in terms of first and second order conditional and unconditional moments. Extensions to general INAR(p), time-varying, multivariate and threshold models are also considered. Estimation by conditional least squares and generalized method of moments techniques is feasible. Monte Carlo simulations for two of the extended models indicate reasonable estimation and testing properties. An illustration based on the number of Swedish mechanical paper and pulp mills is considered. Paper [II] considers the robustness of a conventional Dickey-Fuller (DF) test for testing for a unit root in the INAR(1) model. Finite sample distributions for a model with Poisson distributed disturbance terms are obtained by Monte Carlo simulation. These distributions are wider than those of AR(1) models with normally distributed error terms. As the drift and sample size, respectively, increase, the distributions appear to tend to t(T-2) and standard normal distributions. The main results are summarized by an approximating equation that also enables calculation of critical values for any sample and drift size. Paper [III] utilizes the INAR(1) model to model the day-to-day movements in the number of guest nights in hotels. By cross-sectional and temporal aggregation an INARMA(1,1) model for monthly data is obtained.
The approach enables easy interpretation and econometric modelling of the parameters in terms of daily mean check-in and check-out probabilities. Approaches accounting for seasonality through dummies and differenced series, as well as forecasting, are studied empirically for a series of Norwegian guest nights in Swedish hotels. In a forecast evaluation, the improvement from introducing economic variables is minute. Paper [IV] empirically studies households' joint choice of the number of leisure trips and the total number of nights to stay on these trips. The paper introduces a bivariate count hurdle model to account for the relatively high frequencies of zeros. A truncated bivariate mixed Poisson-lognormal distribution, allowing for both positive and negative correlation between the count variables, is utilized. Inflation techniques are used to account for the clustering of leisure time on weekends. Simulated maximum likelihood is used as the estimation method. A small policy study indicates that households substitute trips for nights as travel costs increase.
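The INAR(1) recursion at the heart of the thesis, X_t = alpha o X_{t-1} + eps_t with binomial thinning "o" and Poisson innovations, is easy to simulate, and the conditional least squares estimator the abstract mentions reduces to a lag-1 regression. A minimal sketch (illustrative names, not the thesis's code):

```python
import math
import random

def rpois(lam, rng):
    # Poisson sampling via Knuth's multiplication method
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        p *= rng.random()
        k += 1
    return k - 1

def simulate_inar1(alpha, lam, n, rng):
    """INAR(1): X_t = alpha o X_{t-1} + eps_t, where 'o' is binomial
    thinning (each of the X_{t-1} units survives with probability
    alpha) and eps_t ~ Poisson(lam)."""
    x = [rpois(lam / (1.0 - alpha), rng)]  # start near stationarity
    for _ in range(n - 1):
        survivors = sum(1 for _ in range(x[-1]) if rng.random() < alpha)
        x.append(survivors + rpois(lam, rng))
    return x

rng = random.Random(3)
alpha, lam = 0.5, 2.0
x = simulate_inar1(alpha, lam, 20000, rng)
# Conditional least squares: regress x_t on x_{t-1}
xp, xc = x[:-1], x[1:]
mp, mc = sum(xp) / len(xp), sum(xc) / len(xc)
cov = sum((a - mp) * (b - mc) for a, b in zip(xp, xc)) / len(xp)
var = sum((a - mp) ** 2 for a in xp) / len(xp)
alpha_hat = cov / var
lam_hat = mc - alpha_hat * mp
```

In the hotel application, alpha plays the role of the daily probability of staying on (not checking out) and lam governs the mean check-in flow.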
29

Modeling Mortality Rates In The WikiLeaks Afghanistan War Logs

Rusch, Thomas, Hofmarcher, Paul, Hatzinger, Reinhold, Hornik, Kurt 09 1900 (has links) (PDF)
The WikiLeaks Afghanistan war logs contain more than 76 000 reports about fatalities and their circumstances in the US-led Afghanistan war, covering the period from January 2004 to December 2009. In this paper we use those reports to build statistical models to help us understand the mortality rates associated with specific circumstances. We choose an approach that combines Latent Dirichlet Allocation (LDA) with negative binomial based recursive partitioning. LDA is used to process the natural language information contained in each report summary; we estimate latent topics and assign each report to one of them. These topics, in addition to other variables in the data set, subsequently serve as explanatory variables for modeling the number of fatalities of the civilian population, ISAF Forces, Anti-Coalition Forces and the Afghan National Police or military, as well as the combined number of fatalities. Modeling is carried out with manifest mixtures of negative binomial distributions estimated with model-based recursive partitioning. For each group of fatalities, we identify segments with different mortality rates that correspond to a small number of topics and other explanatory variables as well as their interactions. Furthermore, we carve out the similarities between segments and connect them to stories that have been covered in the media. This provides an unprecedented description of the war in Afghanistan covered by the war logs. Additionally, our approach can serve as an example of how modern statistical methods may lead to extra insight if applied to problems of data journalism. / Series: Research Report Series / Department of Statistics and Mathematics
30

[en] A BIVARIATE GARMA MODEL WITH CONDITIONAL POISSON DISTRIBUTION / [pt] UM MODELO GARMA BIVARIADO COM DISTRIBUIÇÃO CONDICIONAL DE POISSON

PRISCILLA FERREIRA DA SILVA 02 May 2014 (has links)
[en] Generalized autoregressive moving average linear models (GARMA) allow the modelling of count time series with a correlation structure similar to that of ARMA models. In this work we develop an extension of the univariate Poisson GARMA model by specifying a bivariate Poisson model through the distribution presented in Kocherlakota and Kocherlakota (1992), which we call the Poisson BGARMA model. The proposed model is suitable for stationary count series and, through appropriate link functions, allows seasonality and trend effects to be introduced deterministically. The usual properties of the maximum likelihood estimators (bias, efficiency and distribution) were investigated using Monte Carlo simulations. To assess the performance and goodness of fit of the proposed model, it was applied to two pairs of real bivariate count series. The first pair consists of monthly counts of neonatal deaths for two ranges of days of life; the second consists of daily car accident counts in two periods, afternoon and evening. Compared with a fitted bivariate Gaussian Vector Autoregressive (VAR) model, the results indicate that the Poisson BGARMA model adequately captures the variation of pairs of count series and produces forecasts with acceptable errors, in addition to probabilistic forecasts for the series.
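The bivariate Poisson distribution of Kocherlakota and Kocherlakota (1992) is commonly presented via the trivariate-reduction construction, in which a shared Poisson component induces (non-negative) correlation between the two counts. A simulation sketch under that assumption, with illustrative names:

```python
import math
import random

def rpois(lam, rng):
    # Poisson sampling via Knuth's multiplication method
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        p *= rng.random()
        k += 1
    return k - 1

def rbivpois(lam1, lam2, lam3, rng):
    """Bivariate Poisson via trivariate reduction: X = U + W, Y = V + W
    with U ~ Poisson(lam1), V ~ Poisson(lam2), W ~ Poisson(lam3)
    independent.  The shared component W gives Cov(X, Y) = lam3, so only
    non-negative correlation is possible."""
    u, v, w = rpois(lam1, rng), rpois(lam2, rng), rpois(lam3, rng)
    return u + w, v + w

rng = random.Random(11)
lam1, lam2, lam3 = 2.0, 1.0, 1.5
sample = [rbivpois(lam1, lam2, lam3, rng) for _ in range(20000)]
mx = sum(x for x, _ in sample) / len(sample)   # expected lam1 + lam3
my = sum(y for _, y in sample) / len(sample)   # expected lam2 + lam3
cov_xy = sum((x - mx) * (y - my) for x, y in sample) / len(sample)
```

In the BGARMA setting the three intensities become time-varying functions of past observations through link functions, but the conditional distribution at each time step has this same structure.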
