471. Robust Principal Component Analysis. Kpamegan, Neil Racheed. 15 May 2018
In multivariate analysis, principal component analysis (PCA) is a widely used method that appears in many different fields. Although it has been shown to work well when data follow a multivariate normal distribution, classical PCA suffers when data are heavy-tailed. Assuming the data follow a stable distribution, we show through simulations that a new, modified PCA is better suited to heavy-tailed data: it estimates the correct number of components more accurately than classical PCA and more accurately identifies the subspace spanned by the important components.
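As an illustrative sketch (not the stable-distribution method proposed in the dissertation), the following Python snippet shows how heavy tails can distort classical PCA and how a simple robust alternative, spatial-sign PCA, can recover the signal subspace more reliably. The dimensions, degrees of freedom, and noise level are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate heavy-tailed data with a 2-dimensional signal subspace in 10 dimensions.
n, p, k, df = 500, 10, 2, 2.5
loadings = rng.normal(size=(p, k))
scores = rng.standard_t(df, size=(n, k))
X = scores @ loadings.T + 0.1 * rng.standard_t(df, size=(n, p))

def top_eigvecs(cov, k):
    """Return the k leading eigenvectors of a covariance-like matrix."""
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, np.argsort(vals)[::-1][:k]]

# Classical PCA: eigenvectors of the sample covariance (sensitive to heavy tails).
classical = top_eigvecs(np.cov(X, rowvar=False), k)

# Spatial-sign PCA: eigenvectors of the covariance of direction vectors (robust).
centered = X - np.median(X, axis=0)
signs = centered / np.linalg.norm(centered, axis=1, keepdims=True)
robust = top_eigvecs(signs.T @ signs / n, k)

# Compare each estimate with the true loading subspace via the largest principal angle.
true_basis = np.linalg.qr(loadings)[0]
def subspace_angle(est):
    s = np.linalg.svd(true_basis.T @ est, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s.min(), -1, 1)))   # 0 degrees = perfect recovery

print("classical PCA angle (degrees):   ", round(subspace_angle(classical), 1))
print("spatial-sign PCA angle (degrees):", round(subspace_angle(robust), 1))
```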
472. Inferences on Gamma Distributions: Uncensored and Censored Cases. Wang, Xiao. 16 September 2017
We propose inferential methods for constructing an upper confidence limit for an upper percentile of a gamma distribution and for finding confidence intervals based on samples with multiple detection limits. The proposed methods are based on the fiducial approach. Computational algorithms are provided, and numerical results are given to assess the performance of the proposed methods and to compare them with competing procedures. The fiducial approach provides accurate inference for estimating the gamma mean and percentiles; in general it is very satisfactory and is applicable even to small sample sizes.

We also derive likelihood ratio test (LRT) statistics for testing equality of shape parameters, equality of scale parameters, equality of several gamma means, and homogeneity of several independent gamma distributions. Extensive simulation studies for each testing problem indicate that the percentiles of the null distributions of the LRT statistics are affected mainly by the number of distributions being compared and the sample sizes, and only weakly by the parameters. The simulation studies also show that the procedures are very satisfactory in terms of coverage probabilities and powers.

Illustrative examples with practical and simulated data sets are given.
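As a rough, hedged stand-in (a plain parametric bootstrap, not the fiducial construction described above, and not expected to match its small-sample accuracy), the sketch below shows the general shape of the computation for an upper confidence limit on an upper gamma percentile from an uncensored sample.

```python
import numpy as np
from scipy import stats

def gamma_percentile_ucl(x, q=0.95, conf=0.95, n_boot=2000, seed=0):
    """Parametric-bootstrap upper confidence limit for the q-th percentile of a gamma fit."""
    rng = np.random.default_rng(seed)
    a_hat, _, scale_hat = stats.gamma.fit(x, floc=0)    # MLE with location fixed at zero
    boot = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.gamma(shape=a_hat, scale=scale_hat, size=len(x))
        a_b, _, scale_b = stats.gamma.fit(xb, floc=0)
        boot[b] = stats.gamma.ppf(q, a_b, scale=scale_b)
    return np.quantile(boot, conf)

# Simulated uncensored sample (shape=2, scale=3); the true 95th percentile is about 14.2.
rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=40)
a_hat, _, scale_hat = stats.gamma.fit(x, floc=0)
print("estimated 95th percentile:", round(stats.gamma.ppf(0.95, a_hat, scale=scale_hat), 2))
print("95% bootstrap upper confidence limit:", round(gamma_percentile_ucl(x), 2))
```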
473. Predicting the Success of Running Back Prospects in the National Football League. Merritt, Kevin M. 06 September 2017
Analysts for National Football League teams use statistics in a multitude of ways, including game planning, setting game-day rosters, and evaluating incoming talent. Focusing on the running back position, we attempt to improve upon models designed to predict the future success of incoming collegiate players, and we introduce some models of our own. For running backs drafted from 1999 to 2013, we use data from each player's college career, combine workouts, pro day workouts, and physical measurements. Using linear regression, recursive partitioning decision trees, principal component analysis, zero-inflated negative binomial regression, hurdle negative binomial regression, and zero-inflated truncated normal regression, we develop models for three different success criteria: a weighted combination of games played and started, yards per rushing attempt, and career yards from scrimmage.
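The following sketch fits one of the count models named above, a zero-inflated negative binomial regression, using statsmodels on simulated data. The predictors (a 40-yard dash time and college yards per carry) and all data-generating numbers are hypothetical placeholders, not the dissertation's feature set or results.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

rng = np.random.default_rng(2)
n = 400

# Hypothetical prospect measurables (placeholders for combine/college data).
forty_time = rng.normal(4.55, 0.08, n)      # 40-yard dash time in seconds
college_ypc = rng.normal(5.5, 0.7, n)       # college yards per carry
X = sm.add_constant(np.column_stack([forty_time, college_ypc]))

# Outcome: a career count (e.g., games started) with excess zeros, since many
# drafted backs never earn a meaningful role regardless of their measurables.
p_structural_zero = 1 / (1 + np.exp(-(-1.0 + 6.0 * (forty_time - 4.55))))
structural_zero = rng.random(n) < p_structural_zero
mu = np.exp(2.5 + 0.5 * (college_ypc - 5.5) - 4.0 * (forty_time - 4.55))
counts = rng.negative_binomial(n=5, p=5 / (5 + mu))
y = np.where(structural_zero, 0, counts)

# Zero-inflated negative binomial: a logit model for the structural zeros plus an
# NB2 count model, both using the same covariates in this simplified sketch.
model = ZeroInflatedNegativeBinomialP(y, X, exog_infl=X, p=2)
result = model.fit(method="bfgs", maxiter=1000, disp=False)
print(result.summary())
```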
474. Statistical Modeling and Analysis for Biomedical Applications. Ho, Christine. 07 July 2017
This dissertation discusses approaches to two different applied statistical challenges arising from genomics and biomedical research. The first takes advantage of the richness of whole-genome sequencing data, which can uncover both regions of chromosomal aberration and highly specific information on point mutations. We propose a method to reconstruct parts of a tumor's history of chromosomal aberration using only data from a single time point. We apply the method, the first of its kind, to data from eight patients with squamous cell skin cancer and find that knockout of the tumor suppressor gene TP53 occurs early in that cancer type.

While the first chapter highlights what is possible with a deep analysis of data from a single patient, the second chapter looks at the opposite situation: aggregating data from several patients to identify gene expression signals for disease phenotypes. We provide a method for hierarchical multilabel classification that combines separate classifiers for each node in the hierarchy. The first calls produced by our method improve upon the state of the art, resulting in better performance in the early part of the precision-recall curve. We apply the method to disease classifiers constructed from public microarray data, whose relationships to each other are given by a known medical hierarchy.
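As a generic illustration of the hierarchical multilabel setting (not the dissertation's method), the sketch below shows one simple way to reconcile separate node-wise classifiers with a label hierarchy: cap each child's predicted probability at its parent's so that predictions respect the hierarchy. The disease labels and scores are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical disease hierarchy: child label -> parent label (None for the root).
parent: Dict[str, Optional[str]] = {
    "disease": None,
    "cardiovascular": "disease",
    "hypertension": "cardiovascular",
    "metabolic": "disease",
    "type2_diabetes": "metabolic",
}

# Hypothetical per-node classifier outputs for one patient sample.
raw_scores = {
    "disease": 0.90,
    "cardiovascular": 0.40,
    "hypertension": 0.55,   # inconsistent: exceeds its parent's score
    "metabolic": 0.70,
    "type2_diabetes": 0.30,
}

def make_consistent(scores, parent):
    """Cap each node's probability at its parent's, resolving parents first."""
    adjusted = {}
    def resolve(node):
        if node in adjusted:
            return adjusted[node]
        p = parent[node]
        cap = 1.0 if p is None else resolve(p)
        adjusted[node] = min(scores[node], cap)
        return adjusted[node]
    for node in scores:
        resolve(node)
    return adjusted

print(make_consistent(raw_scores, parent))
# "hypertension" is capped at 0.40 so the prediction respects the hierarchy.
```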
475. Inference and Prediction Problems for Spatial and Spatiotemporal Data. Cervone, Daniel Leonard. 17 July 2015
This dissertation focuses on prediction and inference problems for complex spatiotemporal systems. I explore three specific problems in this area, motivated by real data examples, and discuss the theoretical motivations for the proposed methodology, implementation details, and inference/performance on data of interest.
Chapter 1 introduces a novel time series model that improves the accuracy of lung tumor tracking for radiotherapy. Tumor tracking requires real-time, multiple-step-ahead forecasting of a quasi-periodic time series recording instantaneous tumor locations. Our proposed model is a location-mixture autoregressive (LMAR) process that admits multimodal conditional distributions, fast approximate inference using the EM algorithm, and accurate multiple-step-ahead predictive distributions. Compared with other families of mixture autoregressive models, LMAR is easier to fit (with a smaller parameter space) and better suited to online inference and multiple-step-ahead forecasting, as there is no need for Monte Carlo. Against other candidate models in statistics and machine learning, our model provides superior predictive performance for clinical data.
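For context only, the sketch below sets up the multiple-step-ahead forecasting task on a simulated quasi-periodic trace and fits a plain autoregressive baseline with statsmodels. It is a baseline of the kind LMAR is compared against, not the LMAR model itself, and the series length, lag order, and horizon are arbitrary assumptions.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(3)

# Simulated quasi-periodic, breathing-like trace (a stand-in for tumor position data).
t = np.arange(1200)
period = 120 + 10 * np.sin(2 * np.pi * t / 900)          # slowly drifting period
signal = np.sin(2 * np.pi * t / period) + 0.05 * rng.normal(size=t.size)

train, test = signal[:1000], signal[1000:]

# Plain autoregressive baseline fitted to the training segment.
res = AutoReg(train, lags=30).fit()

# Multiple-step-ahead forecast: predict the next `horizon` points in one shot.
horizon = 20
forecast = res.predict(start=len(train), end=len(train) + horizon - 1)

rmse = np.sqrt(np.mean((forecast - test[:horizon]) ** 2))
print(f"{horizon}-step-ahead RMSE of the AR baseline: {rmse:.3f}")
```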
Chapter 2 develops a stochastic process model for the spatiotemporal evolution of a basketball possession, based on tracking data that record each player's exact location at 25 Hz. Our model comprises multiresolution transition kernels that simultaneously describe players' continuous motion dynamics along with their decisions, ball movements, and other discrete actions. Many such actions occur very sparsely in player × location space, so we use hierarchical models to share information across different players in the league and across disjoint regions of the basketball court, a challenging problem given the scale of our data (over 400 players and 1 billion space-time observations) and the computational cost of inferential methods in spatial statistics. Our framework, in addition to offering valuable insight into individual players' behavior and decision-making, allows us to estimate the instantaneous expected point value of an NBA possession by averaging over all possible future possession paths.
In Chapter 3, we investigate Gaussian process regression where inputs are subject to measurement error. For instance, in spatial statistics, input measurement errors occur when the geographical locations of observed data are not known exactly. Such sources of error are not special cases of "nugget" or microscale variation, and they require alternative methods for both interpolation and parameter estimation. We discuss some theory for kriging in this regime, as well as the use of Hybrid Monte Carlo to provide predictive distributions (and parameter estimates, if necessary). Through a simulation study and an analysis of northern hemisphere temperature data from the summer of 2011, we show that appropriate methods for incorporating location measurement error are essential to reliable inference in this regime. / Statistics
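The simulation below (scikit-learn, one spatial dimension) is not the dissertation's Hybrid Monte Carlo approach; it only illustrates the problem being addressed: fitting an ordinary Gaussian process to mismeasured input locations degrades interpolation relative to using the true locations. The test function, noise levels, and kernel are assumptions made for the sketch.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)

# True spatial surface observed with small output noise.
def f(x):
    return np.sin(3 * x) + 0.5 * np.cos(7 * x)

n = 80
x_true = rng.uniform(0, 3, n)                   # true (unknown) locations
y = f(x_true) + 0.05 * rng.normal(size=n)
x_noisy = x_true + 0.15 * rng.normal(size=n)    # recorded locations with measurement error

kernel = 1.0 * RBF(length_scale=0.5) + WhiteKernel(noise_level=0.05)
x_grid = np.linspace(0, 3, 200)[:, None]

def fit_and_score(x_obs):
    """Fit an ordinary GP to (x_obs, y) and report RMSE against the true surface."""
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(x_obs[:, None], y)
    pred = gp.predict(x_grid)
    return np.sqrt(np.mean((pred - f(x_grid.ravel())) ** 2))

print("interpolation RMSE with exact locations:      ", round(fit_and_score(x_true), 3))
print("interpolation RMSE with mismeasured locations:", round(fit_and_score(x_noisy), 3))
```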
476. Methods for Effectively Combining Group- and Individual-Level Data. Smoot, Elizabeth. 17 July 2015
In observational studies researchers often have access to multiple sources of information but ultimately choose to apply well-established statistical methods that do not take advantage of the full range of information available. In this dissertation I discuss three methods that are able to incorporate this additional data and show how using each improves the quality of the analysis.
First, in Chapters 1 and 2, I focus on methods for improving estimator efficiency in studies in which both population-level (group) and individual-level data are available. In such settings, the hybrid design for ecological inference efficiently combines the two sources of information; in practice, however, maximizing the likelihood is often computationally intractable. I propose and develop an alternative, computationally efficient approximation to the hybrid likelihood, and I demonstrate that this approximation incurs no penalty in terms of increased bias or reduced efficiency.
Second, in Chapters 3 and 4, I highlight the problem of applying standard analyses to outcome-dependent sampling schemes in settings in which study units are cluster-correlated. I demonstrate that incorporating known outcome totals into the likelihood via inverse probability weights results in valid estimation and inference. I further discuss the applicability of outcome-dependent sampling schemes in resource-limited settings, specifically to the analysis of national ART programs in sub-Saharan Africa. I propose the cluster-stratified case-control study as a valid and logistically reasonable study design in such resource-poor settings, discuss balanced versus unbalanced sampling techniques, and address the practical trade-off between logistical considerations and the statistical efficiency of cluster-stratified case-control versus case-control studies.
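As a generic, simplified sketch of the inverse-probability-weighting idea under outcome-dependent sampling (ignoring the cluster correlation and stratification central to these chapters), the snippet below samples all cases but only a fraction of controls and then weights by the inverse selection probabilities. In simple logistic regression the exposure log-odds ratio survives case-control sampling either way; the weights are what recover the population intercept and hence absolute risk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

# Full (hypothetical) population with a binary outcome depending on one exposure.
N = 50_000
exposure = rng.normal(size=N)
p_y = 1 / (1 + np.exp(-(-2.0 + 0.8 * exposure)))
y = rng.binomial(1, p_y)

# Outcome-dependent sampling: keep every case but only 10% of controls.
p_sample = np.where(y == 1, 1.0, 0.10)
keep = rng.random(N) < p_sample

X = exposure[keep].reshape(-1, 1)
y_s = y[keep]
w = 1.0 / p_sample[keep]              # inverse probability of selection

# Essentially unpenalized logistic regression (large C), with and without IPW.
naive = LogisticRegression(C=1e9, max_iter=1000).fit(X, y_s)
ipw = LogisticRegression(C=1e9, max_iter=1000).fit(X, y_s, sample_weight=w)

print("true  intercept, slope: -2.00, 0.80")
print("naive intercept, slope:", round(naive.intercept_[0], 2), round(naive.coef_[0, 0], 2))
print("IPW   intercept, slope:", round(ipw.intercept_[0], 2), round(ipw.coef_[0, 0], 2))
```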
Finally, in Chapter 5, I demonstrate the benefit of incorporating the full range of possible outcomes into an observational data analysis, as opposed to running the analysis on a pre-selected set of outcomes. Testing all possible outcomes for associations with the exposure inherently incorporates negative controls into the analysis and further validates a study's statistically significant results. I apply this technique to an investigation of the relationship between particulate air pollution and causes of hospital admission. / Biostatistics
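A minimal sketch of the "test every outcome" idea on simulated data: one exposure is tested against many outcomes, most of which are truly null and therefore act as implicit negative controls, and the resulting p-values are adjusted for multiplicity. The sample sizes, effect sizes, and use of a Benjamini-Hochberg adjustment are illustrative assumptions, not the chapter's analysis.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(10)

# Hypothetical setting: one exposure tested against many admission-cause outcomes,
# only a handful of which carry a real association.
n, n_outcomes, n_signal = 2000, 200, 5
exposure = rng.normal(size=n)
outcomes = rng.normal(size=(n, n_outcomes))
outcomes[:, :n_signal] += 0.12 * exposure[:, None]      # a few real associations

# Simple per-outcome association tests (Pearson correlation used here for brevity).
pvals = np.array([stats.pearsonr(exposure, outcomes[:, j])[1] for j in range(n_outcomes)])

# Adjust for testing every outcome; associations that survive are more credible.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("outcomes flagged after FDR control:", np.where(reject)[0])
```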
477. Extensions of Randomization-Based Methods for Causal Inference. Lee, Joseph Jiazong. 17 July 2015
In randomized experiments, the random assignment of units to treatment groups justifies many of the traditional analysis methods for evaluating causal effects. Specifying subgroups of units for further examination after observing outcomes, however, may partially nullify any advantages of randomized assignment when data are analyzed naively. Some previous statistical literature has treated all post-hoc analyses homogeneously as entirely invalid and thus uninterpretable. Alternative analysis methods and the extent of the validity of such analyses remain largely unstudied. Chapter 1 proposes a novel, randomization-based method that generates valid post-hoc subgroup p-values, provided we know exactly how the subgroups were constructed. If we do not know the exact subgrouping procedure, our method may still place helpful bounds on the significance level of estimated effects. Chapter 2 extends the proposed methodology to generate valid posterior predictive p-values for partially post-hoc subgroup analyses, i.e., analyses that compare existing experimental data, from which a subgroup specification is derived, to new, subgroup-only data. Both chapters are motivated by pharmaceutical examples in which subgroup analyses played pivotal and controversial roles. Chapter 3 extends our randomization-based methodology to more general randomized experiments with multiple testing and nuisance unknowns. The results are valid familywise tests that are doubly advantageous, in terms of statistical power, over traditional methods. We apply our methods to data from the United States Job Training Partnership Act (JTPA) Study, where our analyses lead to different conclusions regarding the significance of estimated JTPA effects. In all chapters, we investigate the operating characteristics and demonstrate the advantages of our methods through a series of simulations. / Statistics
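The snippet below is a schematic of the Chapter 1 idea under simplifying assumptions (a single known subgrouping rule and a sharp null of no effect), not the dissertation's exact procedure: because the rule that constructs the subgroup is known, it is re-applied within every re-randomization, so the reference distribution accounts for having picked the best-looking subgroup.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated completely randomized experiment in which the treatment truly has no effect.
n = 200
z = rng.permutation(np.repeat([0, 1], n // 2))     # treatment assignment
x = rng.normal(size=n)                             # baseline covariate
y = x + rng.normal(size=n)                         # outcome, unaffected by z

def subgroup_rule(y, z, x):
    """Known post-hoc rule: pick the covariate half with the larger apparent effect."""
    halves = [x <= np.median(x), x > np.median(x)]
    effects = [y[h & (z == 1)].mean() - y[h & (z == 0)].mean() for h in halves]
    chosen = int(np.argmax(np.abs(effects)))
    return halves[chosen], effects[chosen]

subset_obs, stat_obs = subgroup_rule(y, z, x)

# Randomization test that re-applies the selection rule for every re-randomization,
# so "picking the best-looking subgroup" is built into the null distribution.
n_perm = 2000
null_stats = np.empty(n_perm)
for b in range(n_perm):
    _, null_stats[b] = subgroup_rule(y, rng.permutation(z), x)

p_value = np.mean(np.abs(null_stats) >= abs(stat_obs))
print("selected-subgroup estimate:", round(stat_obs, 3), " randomization p-value:", round(p_value, 3))
```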
478. Ordinal Outcome Prediction and Treatment Selection in Personalized Medicine. Shen, Yuanyuan. 01 May 2017
In personalized medicine, two important tasks are predicting disease risk and selecting appropriate treatments for individuals based on their baseline information. This dissertation focuses on providing improved risk prediction for ordinal outcome data and on proposing a score-based test to identify informative markers for treatment selection. In Chapter 1, we take up the first problem and propose a disease risk prediction model for ordinal outcomes. Traditional ordinal outcome models leave out intermediate models, which may lead to suboptimal prediction performance, and they do not allow for nonlinear covariate effects. To overcome these limitations, a continuation ratio kernel machine (CRKM) model is proposed both to let the data reveal the underlying model and to capture potential nonlinear effects among predictors, so that prediction accuracy is maximized. In Chapter 2, we develop a kernel machine (KM) score test that can efficiently identify markers that are predictive of treatment differences. This new approach overcomes the shortcomings of the standard Wald test, which is scale-dependent and accounts only for linear effects among predictors. To do this, we propose a model-free score test statistic and implement the KM framework. Simulations and real data applications demonstrate the advantage of our methods over the Wald test. In Chapter 3, building on the procedure proposed in Chapter 2, we further impose a sparsity assumption on the predictors to account for the real-world problem of sparse signals. We incorporate the generalized higher criticism (GHC) to threshold the signals in a group and maintain high detection power. A comprehensive comparison of the procedures in Chapters 2 and 3 demonstrates the advantages and disadvantages of the different procedures under different scenarios. / Biostatistics
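For orientation, the sketch below fits a plain continuation-ratio model as a sequence of conditional binary logistic regressions on simulated ordinal data and converts the conditional probabilities into category probabilities. The CRKM model in Chapter 1 replaces the linear predictor with a kernel machine, so this is only the scaffolding; the covariates and cutpoints are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Simulated ordinal outcome with 4 levels driven by two baseline covariates.
n, K = 1000, 4
X = rng.normal(size=(n, 2))
latent = 1.2 * X[:, 0] - 0.8 * X[:, 1] + rng.logistic(size=n)
y = np.digitize(latent, bins=[-1.5, 0.0, 1.5])           # ordinal levels 0..3

# Continuation-ratio model: for each level k < K-1, model P(Y = k | Y >= k).
models = []
for k in range(K - 1):
    at_risk = y >= k                                      # units that reached level k
    target = (y[at_risk] == k).astype(int)
    models.append(LogisticRegression(C=1e9, max_iter=1000).fit(X[at_risk], target))

def predict_ordinal_probs(x_new):
    """Convert the conditional continuation probabilities into category probabilities."""
    x_new = np.atleast_2d(x_new)
    surv = np.ones(len(x_new))                            # running P(Y >= k)
    probs = np.zeros((len(x_new), K))
    for k, m in enumerate(models):
        stop_here = m.predict_proba(x_new)[:, 1]          # P(Y = k | Y >= k)
        probs[:, k] = surv * stop_here
        surv = surv * (1 - stop_here)
    probs[:, K - 1] = surv
    return probs

print(predict_ordinal_probs([[0.5, -0.5]]).round(3))
```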
479. Essays in Causal Inference and Public Policy. Feller, Avi Isaac. 17 July 2015
This dissertation addresses statistical methods for understanding treatment effect variation in randomized experiments, both in terms of variation across pre-treatment covariates and variation across post-randomization intermediate outcomes. These methods are then applied to data from the National Head Start Impact Study (HSIS), a large-scale randomized evaluation of the federally funded preschool program, which has become an important part of the policy debate in early childhood education.
Chapter 2 proposes a randomization-based approach for testing for the presence of treatment effect variation not explained by observed covariates. The key challenge in using this approach is that the average treatment effect, generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. We explore potential solutions and advocate for a method that guarantees valid tests in finite samples despite this nuisance. We also show how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. Finally, we apply this method to the HSIS and find that there is indeed significant unexplained treatment effect variation.
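The toy snippet below conveys the flavor of the testing problem, not the chapter's method: it runs a Fisher-style randomization test for unexplained effect variation using a variance-gap statistic, with the average treatment effect plugged in from the data. That plug-in step is precisely the nuisance-parameter difficulty the chapter resolves, so this naive version does not share the finite-sample validity guarantee described above.

```python
import numpy as np

rng = np.random.default_rng(8)

# Simulated experiment with treatment effects that vary idiosyncratically.
n = 300
z = rng.permutation(np.repeat([0, 1], n // 2))
tau = rng.normal(loc=1.0, scale=1.5, size=n)        # heterogeneous unit-level effects
y = rng.normal(size=n) + z * tau

def var_gap(y, z):
    """Test statistic: difference in outcome variances between the two arms."""
    return y[z == 1].var(ddof=1) - y[z == 0].var(ddof=1)

stat_obs = var_gap(y, z)

# Plug-in "constant effect" null: impute the missing potential outcomes by shifting
# observed outcomes by the estimated ATE, then re-randomize. The plug-in estimate is
# the nuisance-parameter issue the chapter treats rigorously; this sketch ignores it.
ate_hat = y[z == 1].mean() - y[z == 0].mean()
y0_imp = np.where(z == 1, y - ate_hat, y)
y1_imp = y0_imp + ate_hat

n_perm = 2000
null_stats = np.empty(n_perm)
for b in range(n_perm):
    z_b = rng.permutation(z)
    y_b = np.where(z_b == 1, y1_imp, y0_imp)
    null_stats[b] = var_gap(y_b, z_b)

p_value = np.mean(np.abs(null_stats) >= abs(stat_obs))
print("observed variance gap:", round(stat_obs, 2), " approximate p-value:", round(p_value, 3))
```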
Chapter 3 leverages model-based principal stratification to assess treatment effect variation across an intermediate outcome in the HSIS. In particular, we estimate differential impacts of Head Start by alternative care setting, the care that children would receive in the absence of the offer to enroll in Head Start. We find strong, positive short-term effects of Head Start on receptive vocabulary for those Compliers who would otherwise be in home-based care. By contrast, we find no meaningful impact of Head Start on vocabulary for those Compliers who would otherwise be in other center-based care. Our findings suggest that alternative care type is a potentially important source of variation in Head Start.
Chapter 4 reviews the literature on principal score methods, which rely on predictive covariates rather than outcomes for estimating principal causal effects. We clarify the role of the Principal Ignorability assumption in this approach and show that there are in fact two versions: Strong and Weak Principal Ignorability. We then explore several approaches proposed in the literature and assess their finite-sample properties via simulation. Finally, we propose some extensions to the case of two-sided noncompliance and apply these ideas to the HSIS, finding mixed results. / Statistics
480. Exploring the Role of Randomization in Causal Inference. Ding, Peng. 17 July 2015
This manuscript includes three topics in causal inference, all under the randomization inference framework (Neyman, 1923; Fisher, 1935a; Rubin, 1978), presented in three self-contained chapters.
Chapter 1. Under the potential outcomes framework, causal effects are defined as comparisons between potential outcomes under treatment and control. To infer causal effects from randomized experiments, Neyman proposed to test the null hypothesis of zero average causal effect (Neyman's null), and Fisher proposed to test the null hypothesis of zero individual causal effect (Fisher's null). Although the subtle difference between Neyman's null and Fisher's null has caused much controversy and confusion for both theoretical and practical statisticians, a careful comparison between the two approaches has been lacking in the literature for more than eighty years. I fill in this historical gap by making a theoretical comparison between them and highlighting an intriguing paradox that has not been recognized by previous researchers. Logically, Fisher's null implies Neyman's null. It is therefore surprising that, in actual completely randomized experiments, rejection of Neyman's null does not imply rejection of Fisher's null in many realistic situations, including the case of a constant causal effect. Furthermore, I show that this paradox also exists in other commonly used experiments, such as stratified experiments, matched-pair experiments, and factorial experiments. Asymptotic analyses, numerical examples, and real data examples all support this surprising phenomenon. Besides its historical and theoretical importance, this paradox also has useful practical implications for modern researchers.
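To make the two nulls concrete, the snippet below analyzes one simulated completely randomized experiment with both tests: a Neymanian test of zero average effect using the conservative variance estimator, and a Fisher randomization test of the sharp null of zero effect for every unit, using the same difference-in-means statistic. It is a toy illustration of the setup, not a reproduction of the chapter's comparison or its paradox.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# One completely randomized experiment with a constant treatment effect of 0.4.
n = 100
z = rng.permutation(np.repeat([0, 1], n // 2))
y = rng.normal(size=n) + 0.4 * z

def diff_in_means(y, z):
    return y[z == 1].mean() - y[z == 0].mean()

tau_hat = diff_in_means(y, z)

# Neyman: difference in means with the conservative (Neyman) variance estimate.
v_hat = y[z == 1].var(ddof=1) / (z == 1).sum() + y[z == 0].var(ddof=1) / (z == 0).sum()
p_neyman = 2 * stats.norm.sf(abs(tau_hat) / np.sqrt(v_hat))

# Fisher: randomization test of the sharp null of zero effect for every unit.
n_perm = 5000
perm_stats = np.array([diff_in_means(y, rng.permutation(z)) for _ in range(n_perm)])
p_fisher = np.mean(np.abs(perm_stats) >= abs(tau_hat))

print("estimated ATE:", round(tau_hat, 3))
print("Neyman p-value:", round(p_neyman, 4), " Fisher p-value:", round(p_fisher, 4))
```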
Chapter 2. Causal inference in completely randomized treatment-control studies with binary outcomes is discussed from Fisherian, Neymanian, and Bayesian perspectives, using the potential outcomes framework. A randomization-based justification of Fisher's exact test is provided. Arguing that the crucial assumption of a constant causal effect is often unrealistic and holds only in extreme cases, some new asymptotic and Bayesian inferential procedures are proposed. The proposed procedures exploit the intrinsic non-additivity of unit-level causal effects, can be applied to linear and nonlinear estimands, and dominate the existing methods, as verified theoretically and through simulation studies.
Chapter 3. Recent literature has underscored the critical role of treatment effect variation in estimating and understanding causal effects. This approach, however, is in contrast to much of the foundational research on causal inference; Neyman, for example, avoided such variation through his focus on the average treatment effect and his definition of the confidence interval. In this chapter, I extend the Neymanian framework to explicitly allow both for treatment effect variation explained by covariates, known as the systematic component, and for unexplained treatment effect variation, known as the idiosyncratic component. This perspective enables estimation and testing of impact variation without imposing a model on the marginal distributions of potential outcomes, with the workhorse approach of regression with interaction terms being a special case. My approach leads to two practical results. First, I combine estimates of systematic impact variation with sharp bounds on overall treatment variation to obtain bounds on the proportion of total impact variation explained by a given model, which is essentially an R-squared for treatment effect variation. Second, by using covariates to partially account for the correlation of potential outcomes problem, I exploit this perspective to sharpen the bounds on the variance of the average treatment effect estimate itself. As long as the treatment effect varies across observed covariates, the resulting bounds are sharper than the current sharp bounds in the literature. I apply these ideas to a large randomized evaluation in educational research, showing that these results are meaningful in practice. / Statistics