Global ETD Search

1	Regression analysis of big count data via a-optimal subsampling Zhao, Xiaofeng 19 July 2018 Indiana University-Purdue University Indianapolis (IUPUI) / There are two computational bottlenecks for Big Data analysis: (1) the data is too large for a desktop to store, and (2) the computing task takes too long to finish. While the Divide-and-Conquer approach easily breaks the first bottleneck, the Subsampling approach beats both of them simultaneously. Uniform sampling and nonuniform sampling (Leverage Scores sampling) are frequently used in the recent development of fast randomized algorithms. However, both approaches, as Peng and Tan (2018) have demonstrated, are not effective in extracting important information from data. In this thesis, we conduct regression analysis for big count data via A-optimal subsampling. We derive A-optimal sampling distributions by minimizing the trace of certain dispersion matrices in general estimating equations (GEE). We point out that the A-optimal distributions have the same running times as the full data M-estimator. To compute the distributions quickly, we propose the A-optimal Scoring Algorithm, which is implementable by parallel computing and sequentially updatable for stream data, and has a faster running time than that of the full data M-estimator. We present asymptotic normality for the estimates in GEEs and in generalized count regression. A data truncation method is introduced. We conduct extensive simulations to evaluate the numerical performance of the proposed sampling distributions. We apply the proposed A-optimal subsampling method to analyze two real count data sets, the Bike Sharing data and the Blog Feedback data. Our results in both simulations and real data sets indicate that the A-optimal distributions substantially outperform the uniform distribution and have faster running times than the full data M-estimators. Big Count Data A-optimal Regression
2	Estimation of zero-inflated count time series models with and without covariates Ghanney, Bartholomew Embir 03 November 2015 Zero inflation occurs when the proportion of zeros of a model is greater than the proportion of zeros of the corresponding Poisson model. This situation is very common in count data. In order to model zero-inflated count time series data, we propose the zero-inflated autoregressive conditional Poisson (ZIACP) model by extending the autoregressive conditional Poisson (ACP) model of Ghahramani and Thavaneswaran (2009). The stationarity conditions and the autocorrelation functions of the ZIACP model are provided. Based on the expectation maximization (EM) algorithm, an estimation method is developed. A simulation study shows that the estimation method is accurate and reliable as long as the sample size is reasonably large. Three real data examples, the syphilis data of Yang (2012), the arson data of Zhu (2012) and the polio data of Kitromilidou and Fokianos (2015), are studied to compare the performance of the proposed model with other competitive models in the literature. / February 2016 Count data Zero inflated process Poisson
3	Econometric analysis of non-standard count data Godwin, Ryan T. 21 November 2012 This thesis discusses various issues in the estimation of models for count data. In the first part of the thesis, we derive an analytic expression for the bias of the maximum likelihood estimator (MLE) of the parameter in a doubly-truncated Poisson distribution, which proves highly effective as a means of bias correction. We explore the circumstances under which bias is likely to be problematic, and provide some indication of the statistical significance of the bias. Over a range of sample sizes, our method outperforms the alternative of bias correction via the parametric bootstrap. We show that MLEs obtained from sample sizes which elicit appreciable bias also have sampling distributions which are unsuited to be approximated by large-sample asymptotics, and bootstrapping confidence intervals around our bias-adjusted estimator is preferred, as two tiers of bootstrapping may incur a heavy computational burden. Modelling count data where the counts are strictly positive is often accomplished using a positive Poisson distribution. Inspection of the data sometimes reveals an excess of ones, analogous to zero-inflation in a regular Poisson model. The latter situation has well developed methods for modelling and testing, such as the zero-inflated Poisson (ZIP) model, and a score test for zero-inflation in a ZIP model. The issue of count inflation in a positive Poisson distribution does not seem to have been considered in a similar way. In the second part of the thesis, we propose a one-inflated positive Poisson (OIPP) model, and develop a score test to determine whether there are “too many” ones for a positive Poisson model to fit well. We explore the performance of our score test, and compare it to a likelihood ratio test, via Monte Carlo simulation. We find that the score test performs well, and that the OIPP model may be useful in many cases.
The third part of the thesis considers the possibility of one-inflation in zero-truncated data, when overdispersion is present. We propose a new model to deal with such a phenomenon, the one-inflated zero-truncated negative binomial (OIZTNB) model. The finite sample properties of the maximum likelihood estimators for the parameters of such a model are discussed. This chapter considers likelihood ratio tests which assist in specifying the OIZTNB model, and investigates the finite sample properties of such tests. The OIZTNB model is illustrated using the medpar data set, which describes the hospital length of stay for a set of patients in Arizona. This is a data set that is widely used to highlight the merits of the zero-truncated negative binomial (ZTNB) model. We find that our OIZTNB model fits the data better than does the ZTNB model, and this leads us to conclude that the data are generated by a one-inflated process. / Graduate econometrics count-data truncation count-inflation overdispersion
4	New Nonparametric Tests for Panel Count Data Zhao, Xingqiu 04 1900 Statistical analysis of panel count data is an important topic to a number of applied fields including biology, engineering, econometrics, medicine, and public health. Panel count data include observations on subjects over multiple time points where the response variable is a count or recurrent event process when only the numbers of events occurring between observation time points are available. The choice of method for analyzing panel count data usually depends on the relationship between the observation times and the response variable and questions of interest. Most of the previous research was done when the observation times are fixed. If the observation times are random, the data structure becomes more challenging since the observation times for individual subjects vary in addition to the incompleteness of observations. The model-based approach was used to deal with such data. However, this method relies on extra assumptions on the observation scheme and thus is restrictive in practice. In this dissertation, we discuss the problem of multi-sample nonparametric comparison of counting processes with panel count data, which arise naturally when recurrent events are considered. For the problem considered, we develop some new nonparametric tests.
First, we construct a class of nonparametric test statistics based on the integrated weighted differences between the estimated mean functions of the count processes, where the isotonic regression estimate is used for the mean functions. The asymptotic distributions of the proposed statistics are derived and their finite-sample properties are examined through Monte Carlo simulations. A panel count data set from a cancer study is analyzed and presented as an illustrative example.
As shown through Monte Carlo simulations, the nonparametric maximum likelihood estimator (NPMLE) of the mean function is more efficient than the nonparametric maximum pseudo-likelihood estimator (NPMPLE). However, no nonparametric tests have been discussed in the literature for panel count data based on the NPMLE since the NPMLE is more complicated both theoretically and computationally. It is, therefore, particularly important to develop nonparametric tests based on the NPMLE for panel count data.
In the second part of the dissertation, we focus on the situation when treatment indicators can be regarded as independent and identically distributed random variables and propose a nonparametric test in this case using the maximum likelihood estimator. The asymptotic property of the test statistic is derived. Simulation studies are carried out which suggest that the proposed method works well for practical situations, and is more powerful than the existing tests based on the NPMPLEs of the mean functions.
In the third part of the dissertation, we consider more general situations. We construct a class of nonparametric tests based on the accumulated weighted differences between the rates of increase of the estimated mean functions of the counting processes over observation times, where the nonparametric maximum likelihood approach is used to estimate the mean functions instead of the nonparametric maximum pseudo-likelihood. The asymptotic distributions of the proposed statistics are derived and their finite-sample properties are evaluated by means of Monte Carlo simulations. The simulation results show that the proposed methods work quite well and the tests based on NPMLE are more powerful than those based on NPMPLE. Two real data sets are analyzed and presented as illustrative examples.
The last part of the dissertation discusses a special type of panel count data, namely, current status or case 1 interval-censored data. Such data often occur in tumorigenicity experiments. For nonparametric two-sample comparison based on censored or interval-censored data, most of the existing methods have focused on testing the hypothesis that specifies the two population distributions to be identical under the assumption that observation or censoring times have the same distribution. We consider the nonparametric Behrens-Fisher hypothesis (NBFH) under this setting. For this purpose, we study the asymptotic property of the nonparametric maximum likelihood estimator of the probability that an observation from the first distribution exceeds an observation from the second distribution. A nonparametric test for the NBFH is proposed and the asymptotic normality of the proposed test is established. The method is evaluated using simulation studies and illustrated by a set of real data from a tumorigenicity experiment. / Thesis / Doctor of Philosophy (PhD)
5	Inference for Bivariate Conway-Maxwell-Poisson Distribution and Its Application in Modeling Bivariate Count Data Wang, Xinyi January 2019 In recent actuarial literature, the bivariate Poisson regression model has been found to be useful for modeling paired count data. However, the basic assumption of marginal equi-dispersion may be quite restrictive in practice. To overcome this limitation, we consider here the recently developed bivariate Conway–Maxwell–Poisson (CMP) distribution. As a distribution that allows data dispersion, the bivariate CMP distribution is a flexible distribution which includes the bivariate Poisson, bivariate Bernoulli and bivariate Geometric distributions all as special cases. We discuss inferential methods for this CMP distribution. An application to automobile insurance data demonstrates its usefulness as an alternative framework to the commonly used bivariate Poisson model. / Thesis / Master of Science (MSc) bivariate Conway–Maxwell–Poisson bivariate count data
6	An Investigation into the Determinants of Innovation in the New Zealand Biotechnology Sector Marsh, Dan January 2004 This thesis synthesises theoretical and empirical knowledge from four strands of the innovation literature and then uses this knowledge to develop a framework for analysing the determinants of innovation. The framework is tested on one part of the New Zealand economy - the biotechnology sector - an area of rapid technological change where innovation is of particular significance. Theoretical approaches to the economics of innovation and technological change are reviewed with particular reference to the neo-classical, endogenous growth, evolutionary and systems of innovation approaches. Alternative methods of measuring innovation output and innovation rate are also discussed. This is followed by a series of hypotheses regarding the determinants of innovation and a review of their place in the innovation literature. The thesis includes a detailed description of the New Zealand biotechnology sector based on a re-analysis of the first comprehensive (1998/99) survey of biotechnology in New Zealand, data from an original (2002) survey conducted by the author, data from interviews with senior management in a sample of biotechnology firms and a detailed review of secondary sources. This material is used in chapter 5 to address the question 'Does New Zealand have an innovation system for biotechnology?' Count data regression models and data from the 1998/99 and 2002 surveys are then used to test the framework's innovation hypotheses. Hypothesis testing focuses on the effects of several determinants (firm size, firm type, conduct of R&D, involvement in modern biotechnology, specialisation, and alliances) on innovation output and the innovation rate. Results relating to the effect of demand, technological opportunity and appropriability are also reported.
The analysis in this thesis confirms the importance of most of the innovation determinants included in the framework. It also provides a detailed examination of the biotechnology sector and empirical insights into the innovation behaviour of biotech enterprises in New Zealand. Prior to the analysis in this thesis, knowledge of the sector's parameters was very limited or absent. Innovation Biotechnology New Zealand Innovation System Count Data Alliances
7	On accommodating spatial dependence in bicycle and pedestrian injury counts by severity level Narayanamoorthy, Sriram 04 March 2013 This thesis proposes a new spatial multivariate count model to jointly analyze the traffic crash-related counts of pedestrians and bicyclists by injury severity. The modeling framework is applied to predict injury counts at a Census tract level, based on crash data from Manhattan, New York. The results highlight the need to use a multivariate modeling system for the analysis of injury counts by road-user type and injury severity level, while also accommodating spatial dependence effects in injury counts. / text Multivariate count data Spatial econometrics Crash analysis Composite marginal likelihood
8	Count models : with applications to price plans in mobile telecommunication industry Kim, Yeolib 30 November 2010 This research assesses the performance of the over-dispersed Poisson regression model and the negative binomial model on count data. It examines the association between price plan features of mobile phone services and the number of people who adopt the plan. Mobile service data from a sample of one million customers, covering February 2006 to September 2009, is used to estimate the models. Under three main categories, customer type, age, and handset price, we run the models based on price plan features. Estimates are derived by maximum likelihood estimation (MLE). Root mean squared error (RMSE) is used to assess the statistical fit of all the regression models. Then, we construct four estimation and holdout samples, leaving out one, three, six, and twelve months. The estimation sample constitutes the in-sample (IS) and the holdout sample represents the out-sample (OS). By estimating on the IS, we predict the OS. Root mean squared error of prediction (RMSEP) is checked to see how accurate the prediction is. Results generally suggest that academic year start (AYS), seasonality, duration of months since launch of price plan (DMLP), basic fees, rate with no discount (RND), free call minutes (FCM), free data (FD), free text messaging (FTM), free perk rating (FPR), and handset support all show significant effects. Which effects are significant depends on the segment. The RMSE and RMSEP show that the over-dispersed Poisson model outperforms the negative binomial model. Further implications and limitations of the results are discussed. / text Count data Mobile telecommunication service Poisson model Negative binomial model
9	A Switching Regressions Framework for Models with Count-Valued Omni-Dispersed Outcomes: Specification, Estimation and Causal Inference Manalew, Wondimu Samuel 02 1900 Indiana University-Purdue University Indianapolis (IUPUI) / In this dissertation, I develop a regression-based approach to the specification and estimation of the effect of a presumed causal variable on a count-valued outcome of interest. Statistics for relevant causal inference are also derived. As an illustration and as a basis for comparing alternative parametric specifications with respect to ease of implementation, computational efficiency and statistical performance, the proposed models and estimation methods are used to analyze household fertility decisions. I estimate the effect of a counterfactually imposed additional year of wife’s education on actual family size (AFS) and desired family size (DFS) [count-valued variables]. In order to ensure the causal interpretability of the effect parameter as I define it, the underlying regression model is cast in a potential outcomes (PO) framework. The specification of the relevant data generating process (DGP) is also derived. The regression-based approach developed in the dissertation, in addition to taking explicit account of the fact that the outcome of interest is count-valued, is designed to account for potential sample selection bias due to a particular data deficiency in the count data context and to accommodate the possibility that some structural aspects of the model may vary with the value of a binary switching variable. Moreover, my approach loosens the equi-dispersion constraint [conditional mean (CM) equals conditional variance (CV)] that plagues conventional (Poisson) count-outcome regression models.
This is a particularly important feature of my model and method because in most contexts in empirical economics the data are either over-dispersed (CM < CV) or under-dispersed (CM > CV) – fertility models are usually characterized by the latter. Alternative count data models were discussed and compared using simulated and real data. The simulation results and estimation results using real data suggest that the estimated effects from my proposed models (models that loosen the equi-dispersion constraint, account for the sample selection, and accommodate variability in structural aspects of the models due to a switching variable) substantively differ from estimates from conventional linear and count regression specifications. causal inference count data omni-dispersion switching regression
10	The Impact of Two-Rate Taxes on Construction in Pennsylvania Plassmann, Florenz 10 July 1997 The evaluation of policy-relevant economic research requires an ethical foundation. Classical liberal theory provides the requisite foundation for this dissertation, which uses various econometric tools to estimate the effects of shifting some of the property tax from buildings to land in 15 cities in Pennsylvania. Economic theory predicts that such a shift will lead to higher building activity. However, this prediction has so far received little empirical support. The first part of the dissertation examines the effect of the land-building tax differential on the number of building permits that were issued in 219 municipalities in Pennsylvania between 1972 and 1994. For such count data a conventional analysis based on a continuous distribution leads to incorrect results; a discrete maximum likelihood analysis with a negative binomial distribution is more appropriate. Two models, a non-linear and a fixed effects model, are developed to examine the influence of the tax differential. Both models suggest that this influence is positive, albeit not statistically significant. Application of maximum likelihood techniques is computationally cumbersome if the assumed distribution of the data cannot be written in closed form. The negative binomial distribution is the only discrete distribution with a variance that is larger than its mean that can easily be applied, although it might not be the best approximation of the true distribution of the data. The second part of the dissertation uses a Markov Chain Monte Carlo method to examine the influence of the tax differential on the number of building permits, under the assumption that building permits are generated by a Poisson process whose parameter varies lognormally. Contrary to the analysis in the first part, the tax is shown to have a strong and significantly positive impact on the number of permits.
The third part of the dissertation uses a fixed-effects weighted least squares method to estimate the effect of the tax differential on the value per building permit. The tax coefficient is not significantly different from zero. Still, the overall impact of the tax differential on the total value of construction is shown to be positive and statistically significant. / Ph. D. Land Value Tax Henry George Gibbs Sampler Count Data Liberalism
