Global ETD Search

31	Exploratory and inferential multivariate statistical techniques for multidimensional count and binary data with applications in R Ntushelo, Nombasa Sheroline 12 1900 Thesis (MComm)--Stellenbosch University, 2011. / ENGLISH ABSTRACT: The analysis of multidimensional (multivariate) data sets is a very important area of research in applied statistics. Over the decades many techniques have been developed to deal with such data sets. These multivariate techniques include inferential analysis, regression analysis, discriminant analysis, cluster analysis and many more exploratory methods. Most of these methods deal with cases where the data contain numerical variables. However, there are powerful methods in the literature that also deal with multidimensional binary and count data. The primary purpose of this thesis is to discuss the exploratory and inferential techniques that can be used for binary and count data. Chapter 2 gives the details of correspondence analysis and canonical correspondence analysis, methods used to analyze data in contingency tables. Chapter 3 is devoted to cluster analysis: it explains four well-known clustering methods and discusses the distance (dissimilarity) measures available in the literature for binary and count data. Chapter 4 explains metric and non-metric multidimensional scaling, which can be used to represent binary or count data in a lower-dimensional Euclidean space. Chapter 5 presents a method for inferential analysis called the analysis of distance. This method uses reasoning similar to that of the analysis of variance, but the inference is based on a pseudo F-statistic, with the p-value obtained from permutations of the data. Chapter 6 contains real-world applications of the above methods to two data sets, the Biolog data and the Barents Fish data.
The secondary purpose of the thesis is to demonstrate how the above techniques can be performed in the software package R. Several R packages and functions are discussed throughout the thesis, and their usage is demonstrated with appropriate examples. Attention is also given to the interpretation of the output and graphics. The thesis ends with some general conclusions and ideas for further research. Statistical techniques; Binary and count data; Exploratory analysis; Inferential analysis
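The permutation-based pseudo F-statistic mentioned for Chapter 5 of record 31 can be sketched as follows. This is a minimal illustration and is not taken from the thesis, which works in R. It assumes one common way of partitioning squared distances into between-group and within-group sums of squares, and the function names `pseudo_f` and `permutation_p_value` are hypothetical.

```python
import random


def pseudo_f(dist, labels):
    """Pseudo F-statistic from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    a = len(groups)
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same quantity computed inside each group.
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        pairs = sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pairs / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (a - 1)) / (ss_within / (n - a))


def permutation_p_value(dist, labels, n_perm=999, seed=0):
    """Observed pseudo-F and its p-value under random relabelling of the rows."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    # The observed labelling counts as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (assumed for illustration): six one-dimensional points in two groups.
points = [0, 1, 2, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
dist = [[abs(p - q) for q in points] for p in points]
f_stat, p_value = permutation_p_value(dist, labels)
```

For Euclidean distances on a single variable this statistic coincides with the ordinary one-way ANOVA F. The permutation step matters when the distances come from a dissimilarity measure for binary or count data, where no reference distribution is available.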
32	Count data modelling and tourism demand Hellström, Jörgen January 2002 This thesis consists of four papers concerning modelling of count data and tourism demand. For three of the papers the focus is on the integer-valued autoregressive moving average model class (INARMA), and especially on the INAR(1) model. The fourth paper studies the interaction between households' choice of number of leisure trips and number of overnight stays within a bivariate count data modelling framework. Paper [I] extends the basic INAR(1) model to enable more flexible and realistic empirical economic applications. The model is generalized by relaxing some of its basic independence assumptions. Results are given in terms of first- and second-order conditional and unconditional moments. Extensions to general INAR(p), time-varying, multivariate and threshold models are also considered. Estimation by conditional least squares and generalized method of moments techniques is feasible. Monte Carlo simulations for two of the extended models indicate reasonable estimation and testing properties. An illustration based on the number of Swedish mechanical paper and pulp mills is considered. Paper [II] considers the robustness of a conventional Dickey-Fuller (DF) test for a unit root in the INAR(1) model. Finite sample distributions for a model with Poisson distributed disturbance terms are obtained by Monte Carlo simulation. These distributions are wider than those of AR(1) models with normally distributed error terms. As the drift and sample size, respectively, increase the distributions appear to tend to T-2) and standard normal distributions. The main results are summarized by an approximating equation that also enables calculation of critical values for any sample and drift size. Paper [III] uses the INAR(1) model to describe the day-to-day movements in the number of guest nights in hotels. 
By cross-sectional and temporal aggregation an INARMA(1,1) model for monthly data is obtained. The approach enables easy interpretation and econometric modelling of the parameters, in terms of daily mean check-in and check-out probability. Empirically, approaches that account for seasonality by dummies and by differencing the series, as well as forecasting, are studied for a series of Norwegian guest nights in Swedish hotels. In a forecast evaluation the improvement from introducing economic variables is minute. Paper [IV] empirically studies households' joint choice of the number of leisure trips and the total number of nights to stay on these trips. The paper introduces a bivariate count hurdle model to account for the relatively high frequencies of zeros. A truncated bivariate mixed Poisson lognormal distribution, allowing for both positive and negative correlation between the count variables, is utilized. Inflation techniques are used to account for clustering of leisure time to weekends. Simulated maximum likelihood is used as the estimation method. A small policy study indicates that households substitute trips for nights as the travel costs increase. / Includes 4 papers. Time series; Count data; INARMA; Unit root; Aggregation; Forecasting; Tourism; Truncation; Inflation; Simulated maximum likelihood; Bivariate hurdle model
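The INAR(1) model at the centre of record 32 can be illustrated with a short simulation. This is a sketch under stated assumptions, not code from the thesis. It assumes the standard binomial-thinning form with Poisson arrivals: each unit counted yesterday stays with probability `alpha`, and new arrivals are Poisson with mean `lam`. The names `poisson_draw` and `simulate_inar1` are hypothetical.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_inar1(alpha, lam, n, seed=0):
    """Simulate n steps of X_t = (alpha thinned X_{t-1}) + Poisson(lam) arrivals."""
    rng = random.Random(seed)
    # Start from the stationary distribution, which is Poisson(lam / (1 - alpha)).
    x = poisson_draw(lam / (1 - alpha), rng)
    path = []
    for _ in range(n):
        # Binomial thinning: each of the x units survives independently.
        survivors = sum(1 for _ in range(x) if rng.random() < alpha)
        x = survivors + poisson_draw(lam, rng)
        path.append(x)
    return path


# Assumed parameter values, chosen only for illustration.
path = simulate_inar1(alpha=0.5, lam=2.0, n=20000)
sample_mean = sum(path) / len(path)
```

In the hotel setting of Paper [III], the survivors play the role of guests who do not check out and the Poisson term plays the role of check-ins. The sample mean should lie close to the stationary mean `lam / (1 - alpha)`.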
33	Modeling Mortality Rates In The WikiLeaks Afghanistan War Logs Rusch, Thomas, Hofmarcher, Paul, Hatzinger, Reinhold, Hornik, Kurt 09 1900 The WikiLeaks Afghanistan war logs contain more than 76 000 reports about fatalities and their circumstances in the US-led Afghanistan war, covering the period from January 2004 to December 2009. In this paper we use those reports to build statistical models to help us understand the mortality rates associated with specific circumstances. We choose an approach that combines Latent Dirichlet Allocation (LDA) with negative binomial based recursive partitioning. LDA is used to process the natural language information contained in each report summary. We estimate latent topics and assign each report to one of them. These topics, in addition to other variables in the data set, subsequently serve as explanatory variables for modeling the number of fatalities of the civilian population, ISAF Forces, Anti-Coalition Forces and the Afghan National Police or military, as well as the combined number of fatalities. Modeling is carried out with manifest mixtures of negative binomial distributions estimated with model-based recursive partitioning. For each group of fatalities, we identify segments with different mortality rates that correspond to a small number of topics and other explanatory variables as well as their interactions. Furthermore, we carve out the similarities between segments and connect them to stories that have been covered in the media. This provides an unprecedented description of the war in Afghanistan covered by the war logs. Additionally, our approach can serve as an example of how modern statistical methods may lead to extra insight if applied to problems of data journalism. (author's abstract) / Series: Research Report Series / Department of Statistics and Mathematics
34	[en] A BIVARIATE GARMA MODEL WITH CONDITIONAL POISSON DISTRIBUTION / [pt] UM MODELO GARMA BIVARIADO COM DISTRIBUIÇÃO CONDICIONAL DE POISSON PRISCILLA FERREIRA DA SILVA 02 May 2014 Generalized linear autoregressive moving average (GARMA) models allow the modeling of count time series with a correlation structure similar to that of ARMA models. In this work we develop a multivariate extension of the univariate Poisson GARMA model by specifying a bivariate Poisson model based on the distribution presented in Kocherlakota and Kocherlakota (1992); we call it the Poisson BGARMA model. The proposed model is suitable for stationary count series, and seasonality and trend effects can be introduced deterministically through appropriate link functions. The usual properties of the maximum likelihood estimators (bias, efficiency and distribution) were investigated using Monte Carlo simulations. To assess the performance and goodness of fit of the proposed model, it was applied to two pairs of real bivariate count series. The first pair consists of monthly counts of neonatal deaths in two age ranges (in days of life). The second pair consists of daily counts of car accidents in two periods: afternoon and evening. Compared with the results from fitting a bivariate Gaussian Vector Autoregressive (VAR) model, the results indicate that the Poisson BGARMA model is able to properly capture the variability of pairs of count series and to produce forecasts with acceptable errors, as well as probabilistic forecasts for the series. COUNT DATA; GENERALIZED LINEAR MODELS; GARMA; BIVARIATE POISSON
35	From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering Frühwirth-Schnatter, Sylvia, Malsiner-Walli, Gertraud January 2019 In model-based clustering, mixture models are used to group data points into clusters. A useful concept introduced for Gaussian mixtures by Malsiner Walli et al. (Stat Comput 26:303-324, 2016) is that of sparse finite mixtures, where the prior distribution on the weight distribution of a mixture with K components is chosen in such a way that a priori the number of clusters in the data is random and is allowed to be smaller than K with high probability. The number of clusters is then inferred a posteriori from the data. The present paper makes the following contributions in the context of sparse finite mixture modelling. First, it is illustrated that the concept of sparse finite mixture is very generic and easily extended to cluster various types of non-Gaussian data, in particular discrete data and continuous multivariate data arising from non-Gaussian clusters. Second, sparse finite mixtures are compared to Dirichlet process mixtures with respect to their ability to identify the number of clusters. For both model classes, a random hyper prior is considered for the parameters determining the weight distribution. By suitable matching of these priors, it is shown that the choice of this hyper prior is far more influential on the cluster solution than whether a sparse finite mixture or a Dirichlet process mixture is taken into consideration.
36	Modelos estatísticos para mapeamento de QTL associados a dados de contagem / Statistical models for QTL mapping associated to counting data Kamogawa, Karen Pallotta Tunin 15 May 2009 The main objective of this study was to analyze and compare statistical approaches to QTL mapping for ectoparasite resistance in cattle. The animals, subjected to artificial infestation, were periodically evaluated by counts, such as the number of ticks. These data are repeated measures and, as a rule, do not meet, or only partially meet, the usual requirements of QTL mapping analysis, among them normally distributed and independent errors. It is not yet clear what the best strategy is for analyzing data of this kind. Some alternatives are data transformations that allow the use of software already available, or the development of programs that use other distributions such as the Poisson or the zero-inflated Poisson (ZIP). This work is part of a partnership between EMBRAPA Gado de Leite and ESALQ/USP to develop a QTL mapping project in crossbred cattle (Gyr x Holstein) for several traits, including parasite resistance. A total of 263 F2 animals, genotyped for 5 molecular markers on chromosome 23, were used in an attempt to map QTL for tick resistance. Data collected from this F2 population and data simulated under different scenarios form the basis for comparing QTL analysis and mapping strategies. Classical mapping models, and the use of transformations of the original data, were compared with Poisson regression and ZIP models. The Poisson and ZIP models gave the best results for zero-inflated count data, but in other scenarios transforming the original data proved equally efficient. Depending on the purpose of the mapping (locating the QTL or estimating its effect), each model has its advantages and limitations. It is therefore always advisable to carry out a preliminary descriptive analysis of the data so that the best model is chosen. Bovines; Count data; Genetic mapping; QTL; Animal genetic resistance; Selection; Ticks; ZIP model
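The zero-inflated Poisson (ZIP) distribution named in record 36 can be written down in a few lines. The sketch below is not the thesis's mapping software. It assumes the usual two-part definition, in which an observation is a structural zero with probability `pi` and otherwise follows a Poisson distribution with mean `lam`. The names `zip_pmf` and `zip_log_likelihood`, the toy counts and the parameter values are all hypothetical.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson: extra zeros with probability pi."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    extra_zero = pi if k == 0 else 0.0
    return extra_zero + (1 - pi) * poisson


def zip_log_likelihood(counts, lam, pi):
    """Log-likelihood of independent counts under ZIP(lam, pi)."""
    return sum(math.log(zip_pmf(k, lam, pi)) for k in counts)


# Toy tick counts with many zeros (assumed for illustration only).
counts = [0, 0, 0, 0, 0, 0, 0, 0, 3, 4]
ll_poisson = zip_log_likelihood(counts, lam=0.7, pi=0.0)  # plain Poisson at the sample mean
ll_zip = zip_log_likelihood(counts, lam=3.5, pi=0.8)      # hand-picked ZIP parameters
```

Setting `pi=0.0` recovers the plain Poisson model, so the two log-likelihoods can be compared directly. With this many zeros the ZIP fit is clearly better, which matches the abstract's remark about zero-inflated counts.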
37	Differences and similarities in work absence behavior: empirical evidence from micro data Nilsson, Maria January 2005 This thesis consists of three self-contained essays about absenteeism. Essay I analyzes whether the design of the insurance system affects work absence, i.e. the classic insurance problem of moral hazard. Several reforms of the sickness insurance system were implemented during the period 1991-1996. Using negative binomial models with fixed effects, the analysis shows that both workers and employers changed their behavior due to the reforms. We also find that the extent of moral hazard varies depending on work contract structures. The reforms reducing the compensation levels decreased workers' absence, both the number of absent days and the number of absence spells. The reform in 1992, introducing sick pay paid by the employers, also decreased absence levels, which can probably be explained by changes in personnel policy such as increased use of monitoring and screening of workers. Essay II examines the background to gender differences in work absence. Women are found, as in many earlier studies, to have higher absence levels than men. Our analysis, using finite mixture models, reveals that there is a group of women, comprising about 41% of the women in our sample, that has a high average demand for absence. Among men, the high-demand group is smaller, consisting of about 36% of the male sample. Absence behavior differs as much between groups within gender as it does between men and women. Access to panel data covering the period 1971-1991 enables an analysis of the increased gender gap over time. Our analysis shows that the increased gender gap can be attributed to changes in behavior rather than in observable characteristics. Essay III analyzes the difference in work absence between natives and immigrants. Immigrants are found to have higher absence than natives when measured as the number of absent days. 
For the number of absence spells, the pattern for immigrants and natives is about the same. The analysis, using panel data and count data models, shows that natives and immigrants have different characteristics concerning family situation, work conditions and health. We also find that natives and immigrants respond differently to these characteristics. We find, for example, that the absence of natives and of immigrants is related differently to both economic incentives and work environment. Finally, our analysis shows that differences in work conditions and work environment can explain only a minor part of the ethnic differences in absence during the 1980s. Moral hazard; Gender difference; Immigrants; Panel data; Count data models; Fixed effects; Finite mixture models; Economics
38	A novel approach to modeling and predicting crash frequency at rural intersections by crash type and injury severity level Deng, Jun, active 2013 24 March 2014 Safety at intersections is of significant interest to transportation professionals due to the large number of possible conflicts that occur at those locations. In particular, rural intersections have been recognized as one of the most hazardous locations on roads. However, most models of crash frequency at rural intersections, and road segments in general, do not differentiate between crash type (such as angle, rear-end or sideswipe) and injury severity (such as fatal injury, non-fatal injury, possible injury or property damage only). Thus, there is a need to be able to identify the differential impacts of intersection-specific and other variables on crash types and severity levels. This thesis builds upon the work of Bhat et al. (2013b) to formulate and apply a novel approach for the joint modeling of crash frequency and combinations of crash type and injury severity. The proposed framework explicitly links a count data model (to model crash frequency) with a discrete choice model (to model combinations of crash type and injury severity). It uses a multinomial probit kernel for the discrete choice model and introduces unobserved heterogeneity in both the crash frequency model and the discrete choice model, while also accommodating excess zeros. The results show that the type of traffic control and the number of entering roads are the most important determinants of crash counts and crash type/injury severity. They also underscore the value of the proposed model both for data fit and for accurate estimation of variable effects. Crashes in rural intersections; Injury severity level; Crash type; Multivariate count data; Generalized ordered-response; Multinomial probit; Multivariate normal distribution
39	Modelos estatísticos para mapeamento de QTL associados a dados de contagem / Statistical models for QTL mapping associated to counting data Karen Pallotta Tunin Kamogawa 15 May 2009 The main objective of this study was to analyze and compare statistical approaches to QTL mapping for ectoparasite resistance in cattle. The animals, subjected to artificial infestation, were periodically evaluated by counts, such as the number of ticks. These data are repeated measures and, as a rule, do not meet, or only partially meet, the usual requirements of QTL mapping analysis, among them normally distributed and independent errors. It is not yet clear what the best strategy is for analyzing data of this kind. Some alternatives are data transformations that allow the use of software already available, or the development of programs that use other distributions such as the Poisson or the zero-inflated Poisson (ZIP). This work is part of a partnership between EMBRAPA Gado de Leite and ESALQ/USP to develop a QTL mapping project in crossbred cattle (Gyr x Holstein) for several traits, including parasite resistance. A total of 263 F2 animals, genotyped for 5 molecular markers on chromosome 23, were used in an attempt to map QTL for tick resistance. Data collected from this F2 population and data simulated under different scenarios form the basis for comparing QTL analysis and mapping strategies. Classical mapping models, and the use of transformations of the original data, were compared with Poisson regression and ZIP models. The Poisson and ZIP models gave the best results for zero-inflated count data, but in other scenarios transforming the original data proved equally efficient. Depending on the purpose of the mapping (locating the QTL or estimating its effect), each model has its advantages and limitations. It is therefore always advisable to carry out a preliminary descriptive analysis of the data so that the best model is chosen. Bovines; Count data; Genetic mapping; QTL; Animal genetic resistance; Selection; Ticks; ZIP model
40	Exploring the Importance of Accounting for Nonlinearity in Correlated Count Regression Systems from the Perspective of Causal Estimation and Inference Zhang, Yilei 07 1900 Indiana University-Purdue University Indianapolis (IUPUI) / The main motivation for nearly all empirical economic research is to provide scientific evidence that can be used to assess causal relationships of interest. Essential to such assessments is the rigorous specification and accurate estimation of parameters that characterize the causal relationship between a presumed causal variable of interest, whose value is to be set and altered in the context of a relevant counterfactual, and a designated outcome of interest. Relationships of this type are typically characterized by an effect parameter (EP), and estimation of the EP is the objective of the empirical analysis. The present research focuses on cases in which the regression outcome of interest is a vector that has count-valued elements (i.e., the model under consideration comprises a multi-equation system). This research examines the importance of accounting for nonlinearity and cross-equation correlations in correlated count regression systems from the perspective of causal estimation and inference. We evaluate the efficiency and accuracy gains of estimating bivariate count-valued systems-of-equations models by comparing three pairs of models: (1) Zellner's Seemingly Unrelated Regression (SUR) versus Count-Outcome SUR - Conway Maxwell Poisson (CMP); (2) CMP SUR versus Single-Equation CMP Approach; (3) CMP SUR versus Poisson SUR. We show via simulation studies that it is more efficient to estimate jointly than equation-by-equation, and that it is more efficient to account for nonlinearity. We also apply our model and estimation method to real-world health care utilization data, where the dependent variables are correlated counts: count of physician office-visits, and count of non-physician health professional office-visits. 
The presumed causal variable is private health insurance status. Our model results in a reduction of at least 30% in standard errors for key policy EPs (e.g., the Average Incremental Effect). Our results are enabled by our development of a Stata program for approximating two-dimensional integrals via Gauss-Legendre Quadrature. Correlated Count Data; Gauss-Legendre Quadrature; Health Care Demand; Policy Effect Estimation; System of Equations Estimation; Treatment Effects
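Record 40 relies on Gauss-Legendre quadrature for two-dimensional integrals. The sketch below illustrates the general technique in Python and is not the author's Stata program. It assumes a fixed three-point rule applied as a tensor product over a rectangle, and the name `gauss_legendre_2d` is hypothetical.

```python
import math

# Three-point Gauss-Legendre nodes and weights on [-1, 1].
NODES = (-math.sqrt(3 / 5), 0.0, math.sqrt(3 / 5))
WEIGHTS = (5 / 9, 8 / 9, 5 / 9)


def gauss_legendre_2d(f, ax, bx, ay, by):
    """Approximate the integral of f(x, y) over [ax, bx] x [ay, by]."""
    # Affine maps from [-1, 1] to each integration interval.
    half_x, mid_x = (bx - ax) / 2, (ax + bx) / 2
    half_y, mid_y = (by - ay) / 2, (ay + by) / 2
    total = 0.0
    for xi, wi in zip(NODES, WEIGHTS):
        for yj, wj in zip(NODES, WEIGHTS):
            total += wi * wj * f(mid_x + half_x * xi, mid_y + half_y * yj)
    # The Jacobian of the two affine maps scales the weighted sum.
    return half_x * half_y * total


# Integral of x^2 * y^2 over the unit square, whose exact value is 1/9.
approx = gauss_legendre_2d(lambda x, y: x ** 2 * y ** 2, 0.0, 1.0, 0.0, 1.0)
```

A three-point rule integrates polynomials up to degree five in each variable exactly, so the example above is exact up to rounding. Smooth non-polynomial integrands, such as those arising from correlated count likelihoods, would normally call for more nodes.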
