1 |
Inference and Prediction Problems for Spatial and Spatiotemporal Data. Cervone, Daniel Leonard, 17 July 2015.
This dissertation focuses on prediction and inference problems for complex spatiotemporal systems. I explore three specific problems in this area---motivated by real data examples---and discuss the theoretical motivations for the proposed methodology, implementation details, and inference/performance on data of interest.
Chapter 1 introduces a novel time series model that improves the accuracy of lung tumor tracking for radiotherapy. Tumor tracking requires real-time, multiple-step ahead forecasting of a quasi-periodic time series recording instantaneous tumor locations. Our proposed model is a location-mixture autoregressive (LMAR) process that admits multimodal conditional distributions, fast approximate inference using the EM algorithm, and accurate multiple-step ahead predictive distributions. Compared with other families of mixture autoregressive models, LMAR is easier to fit (with a smaller parameter space) and better suited to online inference and multiple-step ahead forecasting, as there is no need for Monte Carlo. Against other candidate models in statistics and machine learning, our model provides superior predictive performance for clinical data.
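As a rough illustration of the mixture-autoregressive idea behind this chapter (a generic two-component Gaussian mixture autoregression fit by EM, not the LMAR model itself), the following Python sketch simulates a regime-switching series, fits the mixture by EM, and reports the resulting multimodal one-step-ahead predictive distribution. All data and settings are simulated and purely illustrative.

```python
# Toy two-component mixture autoregression fit by EM; a stand-in for the multimodal
# conditional distributions described above, not the LMAR model from the dissertation.
import numpy as np

rng = np.random.default_rng(0)

# Simulate a series whose transitions alternate between two AR(1) regimes,
# so the true conditional distribution of y_t given y_{t-1} is bimodal.
T = 600
y = np.zeros(T)
for t in range(1, T):
    phi = 0.9 if rng.random() < 0.5 else -0.9
    y[t] = phi * y[t - 1] + 0.3 * rng.standard_normal()

x, z = y[:-1], y[1:]               # lagged values and responses
K = 2
pi = np.full(K, 0.5)               # mixing weights
a = np.zeros(K)                    # intercepts
b = np.array([0.5, -0.5])          # AR coefficients (symmetry-breaking start)
s2 = np.ones(K)                    # component noise variances

def normal_pdf(v, mean, var):
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(200):               # EM iterations
    # E-step: responsibility of each component for each observed transition.
    dens = np.stack([pi[k] * normal_pdf(z, a[k] + b[k] * x, s2[k]) for k in range(K)])
    r = dens / dens.sum(axis=0, keepdims=True)
    # M-step: weighted least squares and weighted variance per component.
    for k in range(K):
        w = r[k]
        X = np.column_stack([np.ones_like(x), x])
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        a[k], b[k] = beta
        s2[k] = np.sum(w * (z - X @ beta) ** 2) / np.sum(w)
    pi = r.mean(axis=1)

# The one-step-ahead predictive density at the last observation is a Gaussian mixture,
# which can be multimodal -- no Monte Carlo is needed at this horizon.
print("component means:", a + b * y[-1], "weights:", pi, "variances:", s2)
```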
Chapter 2 develops a stochastic process model for the spatiotemporal evolution of a basketball possession based on tracking data that records each player's exact location at 25Hz. Our model comprises multiresolution transition kernels that simultaneously describe players' continuous motion dynamics along with their decisions, ball movements, and other discrete actions. Many such actions occur very sparsely in player $\times$ location space, so we use hierarchical models to share information across different players in the league and disjoint regions on the basketball court---a challenging problem given the scale of our data (over 400 players and 1 billion space-time observations) and the computational cost of inferential methods in spatial statistics. Our framework, in addition to offering valuable insight into individual players' behavior and decision-making, allows us to estimate the instantaneous expected point value of an NBA possession by averaging over all possible future possession paths.
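The hierarchical "sharing information" idea can be conveyed with a much smaller empirical-Bayes example: per-player, per-region action counts shrunk toward a league-wide rate under a Gamma-Poisson model. This is only a partial-pooling sketch with simulated numbers, not the dissertation's spatial model.

```python
# Empirical-Bayes partial pooling of simulated per-player, per-region action rates.
import numpy as np

rng = np.random.default_rng(1)
n_players, n_regions = 50, 10

true_rate = rng.gamma(shape=2.0, scale=0.05, size=(n_players, n_regions))     # actions per second
exposure = rng.integers(5, 2000, size=(n_players, n_regions)).astype(float)   # seconds observed
counts = rng.poisson(true_rate * exposure)

# Method-of-moments fit of a Gamma(alpha, beta) prior from the raw rates.
raw = counts / exposure
m, v = raw.mean(), raw.var()
prior_var = max(v - (counts / exposure ** 2).mean(), 1e-8)   # subtract the Poisson noise part
beta = m / prior_var
alpha = m * beta

# Posterior mean rates: cells with little playing time are pulled toward the league mean.
shrunk = (alpha + counts) / (beta + exposure)
print("RMSE raw:", np.sqrt(((raw - true_rate) ** 2).mean()),
      "RMSE shrunk:", np.sqrt(((shrunk - true_rate) ** 2).mean()))
```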
In Chapter 3, we investigate Gaussian process regression where inputs are subject to measurement error. For instance, in spatial statistics, input measurement errors occur when the geographical locations of observed data are not known exactly. Such sources of error are not special cases of ``nugget'' or microscale variation, and require alternative methods for both interpolation and parameter estimation. We discuss some theory for Kriging in this regime, as well as using Hybrid Monte Carlo to provide predictive distributions (and parameter estimates, if necessary). Through a simulation study and analysis of Northern Hemisphere temperature data from the summer of 2011, we show that appropriate methods for incorporating location measurement error are essential to reliable inference. / Statistics
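To make the input-measurement-error issue concrete, here is a small Kriging sketch in which the recorded training locations are noisy: the usual Gaussian process prediction is averaged over Monte Carlo draws of plausible true locations. This is a simple stand-in for the Hybrid Monte Carlo treatment described above; the kernel, noise levels, and data are all assumed for illustration.

```python
# Kriging with noisy input locations: naive prediction vs. a Monte Carlo average over
# plausible true locations (a crude approximation to integrating out the location error).
import numpy as np

rng = np.random.default_rng(2)

def sqexp(a, b, ell=0.5, tau2=1.0):
    return tau2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

n, sigma_loc, sigma_y = 40, 0.15, 0.05
s_true = np.sort(rng.uniform(0, 5, n))                 # unobserved true locations
s_obs = s_true + sigma_loc * rng.standard_normal(n)    # recorded (noisy) locations
y = np.sin(s_true) + sigma_y * rng.standard_normal(n)
s_new = np.linspace(0, 5, 101)

def krige(s_train):
    K = sqexp(s_train, s_train) + sigma_y ** 2 * np.eye(n)
    Ks = sqexp(s_new, s_train)
    mean = Ks @ np.linalg.solve(K, y)
    var = sqexp(s_new, s_new).diagonal() - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mean, var

m_naive, v_naive = krige(s_obs)                        # pretends the locations are exact

# Average the predictive distribution over draws of the true locations (mixture of Gaussians).
draws = [krige(s_obs + sigma_loc * rng.standard_normal(n)) for _ in range(200)]
means = np.array([d[0] for d in draws])
varis = np.array([d[1] for d in draws])
m_mix = means.mean(axis=0)
v_mix = varis.mean(axis=0) + means.var(axis=0)         # law of total variance
print("average predictive sd, naive vs. location-error aware:",
      np.sqrt(v_naive.mean()), np.sqrt(v_mix.mean()))
```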
2 |
Methods for Effectively Combining Group- and Individual-Level Data. Smoot, Elizabeth, 17 July 2015.
In observational studies researchers often have access to multiple sources of information but ultimately choose to apply well-established statistical methods that do not take advantage of the full range of information available. In this dissertation I discuss three methods that are able to incorporate this additional data and show how using each improves the quality of the analysis.
First, in Chapters 1 and 2, I focus on methods for improving estimator efficiency in studies in which both population-level (group) and individual-level data are available. In such settings, the hybrid design for ecological inference efficiently combines the two sources of information; however, in practice, maximizing the likelihood is often computationally intractable. I propose and develop an alternative, computationally efficient representation of the hybrid likelihood. I then demonstrate that this approximation incurs no penalty in terms of increased bias or reduced efficiency.
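A stylized version of combining the two data sources in a single likelihood is sketched below: a logistic model contributes an individual-level term for a small subsample and a group-level binomial term for the aggregate outcome counts, the latter using each group's average predicted risk. This crude plug-in (which also ignores the overlap between the subsample and the aggregate counts) is only meant to convey the idea, not the exact hybrid likelihood or its efficient representation.

```python
# Combining individual-level records with group-level outcome totals in one likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(3)
G, n_per_group, n_sampled = 30, 200, 10
beta_true = np.array([-1.0, 0.8])

# Full (unobserved) population: binary exposure, outcome from a logistic model.
x_pop = rng.binomial(1, 0.4, size=(G, n_per_group))
y_pop = rng.binomial(1, expit(beta_true[0] + beta_true[1] * x_pop))

# Observed data: aggregate counts for all groups plus a small individual-level subsample.
group_cases = y_pop.sum(axis=1)
group_exposed = x_pop.sum(axis=1)
idx = rng.choice(n_per_group, n_sampled, replace=False)
x_ind, y_ind = x_pop[:, idx].ravel(), y_pop[:, idx].ravel()

def negloglik(beta):
    # Individual-level logistic contribution.
    p_i = np.clip(expit(beta[0] + beta[1] * x_ind), 1e-12, 1 - 1e-12)
    ll_ind = np.sum(y_ind * np.log(p_i) + (1 - y_ind) * np.log1p(-p_i))
    # Group-level contribution: binomial with the group-average risk implied by the margins.
    p_bar = (group_exposed * expit(beta[0] + beta[1])
             + (n_per_group - group_exposed) * expit(beta[0])) / n_per_group
    p_bar = np.clip(p_bar, 1e-12, 1 - 1e-12)
    ll_grp = np.sum(group_cases * np.log(p_bar)
                    + (n_per_group - group_cases) * np.log1p(-p_bar))
    return -(ll_ind + ll_grp)

fit = minimize(negloglik, x0=np.zeros(2), method="BFGS")
print("true beta:", beta_true, "combined-data estimate:", fit.x)
```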
Second, in Chapters 3 and 4, I highlight the problem of applying standard analyses to outcome-dependent sampling schemes in settings in which study units are cluster-correlated. I demonstrate that incorporating known outcome totals into the likelihood via inverse probability weights results in valid estimation and inference. I further discuss the applicability of outcome-dependent sampling schemes in resource-limited settings, specifically to the analysis of national ART programs in sub-Saharan Africa. I propose the cluster-stratified case-control study as a valid and logistically reasonable study design in such resource-poor settings, discuss balanced versus unbalanced sampling techniques, and address the practical trade-off between logistical considerations and the statistical efficiency of cluster-stratified case-control versus case-control studies.
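The weighting idea can be sketched in a few lines for the simplest (unclustered) case: under case-control sampling with known outcome totals, each sampled unit is weighted by the inverse of its sampling probability, and the weighted logistic likelihood recovers population-level parameters. The simulated data below are illustrative, and the chapters address the harder cluster-correlated setting.

```python
# Inverse-probability-weighted logistic regression under outcome-dependent sampling.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(4)
N = 50_000
x = rng.standard_normal(N)
alpha_true, beta_true = -3.0, 0.7                      # rare outcome
y = rng.binomial(1, expit(alpha_true + beta_true * x))

# Outcome-dependent sample: all cases plus an equal number of controls.
cases = np.flatnonzero(y == 1)
controls = rng.choice(np.flatnonzero(y == 0), size=cases.size, replace=False)
samp = np.concatenate([cases, controls])
# Known outcome totals give the sampling probabilities, hence the weights.
w = np.where(y[samp] == 1, 1.0, (y == 0).sum() / cases.size)

def neg_weighted_loglik(theta):
    p = np.clip(expit(theta[0] + theta[1] * x[samp]), 1e-12, 1 - 1e-12)
    return -np.sum(w * (y[samp] * np.log(p) + (1 - y[samp]) * np.log1p(-p)))

fit = minimize(neg_weighted_loglik, x0=np.zeros(2), method="BFGS")
print("true (alpha, beta):", (alpha_true, beta_true), "IPW estimate:", fit.x)
```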
Finally, in Chapter 5, I demonstrate the benefit of incorporating the full range of possible outcomes into an observational data analysis, as opposed to running the analysis on a pre-selected set of outcomes. Testing all possible outcomes for associations with the exposure inherently incorporates negative controls into the analysis and further validates a study's statistically significant results. I apply this technique to an investigation of the relationship between particulate air pollution and hospital admission causes. / Biostatistics
3 |
Extensions of Randomization-Based Methods for Causal Inference. Lee, Joseph Jiazong, 17 July 2015.
In randomized experiments, the random assignment of units to treatment groups justifies many of the traditional analysis methods for evaluating causal effects. Specifying subgroups of units for further examination after observing outcomes, however, may partially nullify any advantages of randomized assignment when data are analyzed naively. Some previous statistical literature has treated all post-hoc analyses homogeneously as entirely invalid and thus uninterpretable. Alternative analysis methods and the extent of the validity of such analyses remain largely unstudied. Here Chapter 1 proposes a novel, randomization-based method that generates valid post-hoc subgroup p-values, provided we know exactly how the subgroups were constructed. If we do not know the exact subgrouping procedure, our method may still place helpful bounds on the significance level of estimated effects. Chapter 2 extends the proposed methodology to generate valid posterior predictive p-values for partially post-hoc subgroup analyses, i.e., analyses that compare existing experimental data --- from which a subgroup specification is derived --- to new, subgroup-only data. Both chapters are motivated by pharmaceutical examples in which subgroup analyses played pivotal and controversial roles. Chapter 3 extends our randomization-based methodology to more general randomized experiments with multiple testing and nuisance unknowns. The results are valid familywise tests that are doubly advantageous, in terms of statistical power, over traditional methods. We apply our methods to data from the United States Job Training Partnership Act (JTPA) Study, where our analyses lead to different conclusions regarding the significance of estimated JTPA effects. In all chapters, we investigate the operating characteristics and demonstrate the advantages of our methods through a series of simulations. / Statistics
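The core idea of Chapter 1, that a post-hoc subgroup analysis can be made valid when the exact subgrouping rule is known by replaying that rule inside every re-randomization, can be sketched as follows. The data, selection rule, and test statistic are all made up for illustration.

```python
# Randomization test that re-applies a data-driven subgroup selection rule in each draw.
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.standard_normal(n)                                # baseline covariate
z = rng.permutation(np.repeat([0, 1], n // 2))            # completely randomized assignment
y = 0.3 * x + rng.standard_normal(n)                      # sharp null holds: no treatment effect

def selected_subgroup_effect(z, y, x):
    """Data-driven rule: split at the covariate median and report the half with the
    larger estimated effect; return that subgroup's difference in means."""
    best = -np.inf
    for mask in (x > np.median(x), x <= np.median(x)):
        est = y[mask & (z == 1)].mean() - y[mask & (z == 0)].mean()
        best = max(best, est)
    return best

obs = selected_subgroup_effect(z, y, x)

# Re-randomize assignments, hold outcomes fixed (sharp null), and replay the same rule.
null_draws = np.array([selected_subgroup_effect(rng.permutation(z), y, x)
                       for _ in range(2000)])
print("valid post-hoc subgroup p-value:", np.mean(null_draws >= obs))
```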
4 |
Ordinal Outcome Prediction and Treatment Selection in Personalized Medicine. Shen, Yuanyuan, 01 May 2017.
In personalized medicine, two important tasks are predicting disease risk and selecting appropriate treatments for individuals based on their baseline information. The dissertation focuses on providing improved risk prediction for ordinal outcome data and proposing a score-based test to identify informative markers for treatment selection. In Chapter 1, we take up the first problem and propose a disease risk prediction model for ordinal outcomes. Traditional ordinal outcome models leave out intermediate models, which may lead to suboptimal prediction performance; they also do not allow for non-linear covariate effects. To overcome these limitations, a continuation ratio kernel machine (CRKM) model is proposed both to let the data reveal the underlying model and to capture potential non-linear effects among predictors, so that prediction accuracy is maximized. In Chapter 2, we seek to develop a kernel machine (KM) score test that can efficiently identify markers that are predictive of treatment difference. This new approach overcomes the shortcomings of the standard Wald test, which is scale-dependent and only takes into account linear effects among predictors. To do this, we propose a model-free score test statistic and implement it within the KM framework. Simulations and real data applications demonstrate the advantage of our methods over the Wald test. In Chapter 3, building on the procedure proposed in Chapter 2, we further add a sparsity assumption on the predictors to account for the real-world problem of sparse signals. We incorporate the generalized higher criticism (GHC) to threshold the signals in a group while maintaining high detection power. A comprehensive comparison of the procedures in Chapters 2 and 3 demonstrates the advantages and disadvantages of the different procedures under different scenarios. / Biostatistics
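The continuation-ratio construction underlying the CRKM model can be illustrated with a plain (non-kernel) fit: each ordinal outcome is expanded into conditional binary events of the form "stop at level k given reaching level k," and a single logistic regression with level indicators is fit to the expanded data. The kernel machine version replaces the linear predictor; everything below is a simulated illustration.

```python
# Continuation-ratio data expansion and a plain logistic fit (no kernel machine).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n, K = 500, 4                                   # ordinal outcome with levels 0, 1, 2, 3
x = rng.standard_normal((n, 2))
latent = x @ np.array([1.0, -0.5]) + rng.logistic(size=n)
y = np.digitize(latent, np.quantile(latent, [0.25, 0.5, 0.75]))

# One record per level k that a subject "reaches", with event = stopped exactly at k.
rows, events, levels = [], [], []
for i in range(n):
    for k in range(min(y[i], K - 2) + 1):
        rows.append(x[i])
        events.append(1 if y[i] == k else 0)
        levels.append(k)

X_long = np.column_stack([np.array(rows), np.eye(K - 1)[np.array(levels)]])
clf = LogisticRegression(C=1e6, max_iter=2000).fit(X_long, events)   # near-unpenalized fit

# Rebuild P(Y = k | x) for a new point from the fitted conditional probabilities.
x_new = np.array([0.5, -0.5])
cond = [clf.predict_proba(np.concatenate([x_new, np.eye(K - 1)[k]])[None, :])[0, 1]
        for k in range(K - 1)]
probs, surv = [], 1.0
for c in cond:
    probs.append(surv * c)
    surv *= 1 - c
probs.append(surv)
print("predicted class probabilities:", np.round(probs, 3), "(sum =", round(sum(probs), 3), ")")
```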
5 |
Essays in Causal Inference and Public Policy. Feller, Avi Isaac, 17 July 2015.
This dissertation addresses statistical methods for understanding treatment effect variation in randomized experiments, both in terms of variation across pre-treatment covariates and variation across post-randomization intermediate outcomes. These methods are then applied to data from the National Head Start Impact Study (HSIS), a large-scale randomized evaluation of the federally funded preschool program, which has become an important part of the policy debate in early childhood education.
Chapter 2 proposes a randomization-based approach for testing for the presence of treatment effect variation not explained by observed covariates. The key challenge in using this approach is the fact that the average treatment effect, generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. We explore potential solutions and advocate for a method that guarantees valid tests in finite samples despite this nuisance. We also show how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. We finally apply this method to the HSIS and find that there is indeed significant unexplained treatment effect variation.
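A simplified sketch of the testing idea is given below: for each candidate constant effect, the missing potential outcomes are imputed, a Fisher randomization test is run with a statistic sensitive to effect variation (a shifted Kolmogorov-Smirnov distance here), and the largest p-value over a grid of candidate effects is reported. This plug-in grid version only illustrates how the nuisance average effect enters; the chapter's procedure maximizes over a confidence interval with an explicit correction to guarantee finite-sample validity.

```python
# Randomization test for treatment effect variation with the average effect as a nuisance.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
n = 200
z = rng.permutation(np.repeat([0, 1], n // 2))
y0 = rng.standard_normal(n)
tau_i = 0.5 + 0.8 * rng.standard_normal(n)         # heterogeneous unit-level effects
y = np.where(z == 1, y0 + tau_i, y0)

def ks_stat(y, z, tau):
    # Distance between treated outcomes shifted by tau and control outcomes.
    return ks_2samp(y[z == 1] - tau, y[z == 0]).statistic

p_values = []
for tau in np.linspace(0.0, 1.0, 21):               # grid of candidate constant effects
    y0_imp = np.where(z == 1, y - tau, y)            # impute the science table under constant tau
    obs = ks_stat(y, z, tau)
    draws = []
    for _ in range(500):
        z_new = rng.permutation(z)
        y_new = np.where(z_new == 1, y0_imp + tau, y0_imp)
        draws.append(ks_stat(y_new, z_new, tau))
    p_values.append(np.mean(np.array(draws) >= obs))

print("maximum p-value over the candidate constant effects:", max(p_values))
```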
Chapter 3 leverages model-based principal stratification to assess treatment effect variation across an intermediate outcome in the HSIS. In particular, we estimate differential impacts of Head Start by alternative care setting, the care that children would receive in the absence of the offer to enroll in Head Start. We find strong, positive short-term effects of Head Start on receptive vocabulary for those Compliers who would otherwise be in home-based care. By contrast, we find no meaningful impact of Head Start on vocabulary for those Compliers who would otherwise be in other center-based care. Our findings suggest that alternative care type is a potentially important source of variation in Head Start.
Chapter 4 reviews the literature on the use of principal score methods, which rely on predictive covariates rather than outcomes for estimating principal causal effects. We clarify the role of the Principal Ignorability assumption in this approach and show that there are in fact two versions: Strong and Weak Principal Ignorability. We then explore several approaches proposed in the literature and assess their finite-sample properties via simulation. Finally, we propose some extensions to the case of two-sided noncompliance and apply these ideas to the HSIS, finding mixed results. / Statistics
6 |
Exploring the Role of Randomization in Causal Inference. Ding, Peng, 17 July 2015.
This manuscript comprises three self-contained chapters on topics in causal inference, all within the randomization inference framework (Neyman, 1923; Fisher, 1935a; Rubin, 1978).
Chapter 1. Under the potential outcomes framework, causal effects are defined as comparisons between potential outcomes under treatment and control. To infer causal effects from randomized experiments, Neyman proposed to test the null hypothesis of zero average causal effect (Neyman's null), and Fisher proposed to test the null hypothesis of zero individual causal effect (Fisher's null). Although the subtle difference between Neyman's null and Fisher's null has caused much controversy and confusion for both theoretical and practical statisticians, a careful comparison between the two approaches has been lacking in the literature for more than eighty years. I fill in this historical gap by making a theoretical comparison between them and highlighting an intriguing paradox that has not been recognized by previous researchers. Logically, Fisher's null implies Neyman's null. It is therefore surprising that, in actual completely randomized experiments, rejection of Neyman's null does not imply rejection of Fisher's null for many realistic situations, including the case with constant causal effect. Furthermore, I show that this paradox also exists in other commonly-used experiments, such as stratified experiments, matched-pair experiments, and factorial experiments. Asymptotic analyses, numerical examples, and real data examples all support this surprising phenomenon. Besides its historical and theoretical importance, this paradox also leads to useful practical implications for modern researchers.
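A minimal illustration of the two tests being compared is sketched below: on one simulated completely randomized experiment with a constant effect, it computes Neyman's p-value (difference in means with the conservative variance estimate) and the Fisher randomization p-value for the sharp null. The settings are illustrative and are not tuned to reproduce the paradox itself.

```python
# Neyman's test of zero average effect vs. the Fisher randomization test of the sharp null.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
n = 100
z = rng.permutation(np.repeat([0, 1], n // 2))
y0 = rng.standard_normal(n)
y = np.where(z == 1, y0 + 0.4, y0)                         # constant causal effect of 0.4

yt, yc = y[z == 1], y[z == 0]
tau_hat = yt.mean() - yc.mean()

# Neymanian test: conservative variance estimate plus a normal approximation.
se = np.sqrt(yt.var(ddof=1) / yt.size + yc.var(ddof=1) / yc.size)
p_neyman = 2 * norm.sf(abs(tau_hat) / se)

# Fisher randomization test of the sharp null of zero effect for every unit.
draws = []
for _ in range(5000):
    z_new = rng.permutation(z)
    draws.append(y[z_new == 1].mean() - y[z_new == 0].mean())
p_fisher = np.mean(np.abs(draws) >= abs(tau_hat))

print("Neyman p-value:", round(p_neyman, 4), "Fisher p-value:", round(float(p_fisher), 4))
```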
Chapter 2. Causal inference in completely randomized treatment-control studies with binary outcomes is discussed from Fisherian, Neymanian and Bayesian perspectives, using the potential outcomes framework. A randomization-based justification of Fisher's exact test is provided. Arguing that the crucial assumption of constant causal effect is often unrealistic, and holds only for extreme cases, some new asymptotic and Bayesian inferential procedures are proposed. The proposed procedures exploit the intrinsic non-additivity of unit-level causal effects, can be applied to linear and non-linear estimands, and dominate the existing methods, as verified theoretically and also through simulation studies.
Chapter 3. Recent literature has underscored the critical role of treatment effect variation in estimating and understanding causal effects. This approach, however, is in contrast to much of the foundational research on causal inference; Neyman, for example, avoided such variation through his focus on the average treatment effect and his definition of the confidence interval. In this chapter, I extend the Neymanian framework to explicitly allow both for treatment effect variation explained by covariates, known as the systematic component, and for unexplained treatment effect variation, known as the idiosyncratic component. This perspective enables estimation and testing of impact variation without imposing a model on the marginal distributions of potential outcomes, with the workhorse approach of regression with interaction terms being a special case. My approach leads to two practical results.
First, I combine estimates of systematic impact variation with sharp bounds on overall treatment variation to obtain bounds on the proportion of total impact variation explained by a given model---this is essentially an $R^2$ for treatment effect variation. Second, by using covariates to partially account for the correlation of potential outcomes problem, I exploit this perspective to sharpen the bounds on the variance of the average treatment effect estimate itself. As long as the treatment effect varies across observed covariates, the resulting bounds are sharper than the current sharp bounds in the literature. I apply these ideas to a large randomized evaluation in educational research, showing that these results are meaningful in practice. / Statistics
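The "sharp bounds on overall treatment variation" ingredient has a compact plug-in illustration: with the two marginal outcome distributions fixed (here, the empirical distributions of the treated and control arms, taken to have equal sizes for simplicity), the variance of the unit-level effects is smallest under the comonotonic coupling of the potential outcomes and largest under the anti-comonotonic coupling. The numbers below are simulated, and the chapter's combination of these bounds with the systematic component is not reproduced.

```python
# Coupling-based bounds on Var(Y(1) - Y(0)) from the two marginal outcome distributions.
import numpy as np

rng = np.random.default_rng(9)
m = 250
y_treat = np.sort(1.0 + 1.5 * rng.standard_normal(m))   # empirical marginal under treatment
y_ctrl = np.sort(rng.standard_normal(m))                 # empirical marginal under control

var_lower = np.var(y_treat - y_ctrl)         # comonotone coupling: sorted paired with sorted
var_upper = np.var(y_treat - y_ctrl[::-1])   # anti-comonotone coupling: sorted with reversed
print("bounds on Var(Y(1) - Y(0)):", round(var_lower, 3), "to", round(var_upper, 3))
```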
7 |
Three Aspects of Biostatistical Learning Theory. Neykov, Matey, 17 July 2015.
In the present dissertation we consider three classical problems in biostatistics and statistical learning - classification, variable selection and statistical inference.
Chapter 2 is dedicated to multi-class classification. We characterize a class of loss functions which we deem relaxed Fisher consistent, whose local minimizers not only recover the Bayes rule but also the exact conditional class probabilities. Our class encompasses previously studied classes of loss functions, and includes non-convex functions, which are known to be less susceptible to outliers. We propose a generic greedy functional gradient-descent minimization algorithm for boosting weak learners, which works with any loss function in our class. We show that the boosting algorithm achieves a geometric rate of convergence in the case of a convex loss. In addition we provide numerical studies and a real data example which serve to illustrate that the algorithm performs well in practice.
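A compact functional gradient boosting sketch in the spirit of the generic algorithm is given below: regression stumps are fit to the negative gradient of a loss and added to the class-wise score functions. The multinomial log-loss is used here as one convex, Fisher-consistent choice; the loss, learning rate, stump depth, and simulated data are all illustrative, not the dissertation's.

```python
# Generic functional gradient boosting for multi-class classification with stumps.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(10)
n, p, K = 600, 5, 3
X = rng.standard_normal((n, p))
true_logits = np.column_stack([X[:, 0] + X[:, 1] ** 2, X[:, 2] - X[:, 0], 0.5 * X[:, 3]])
y = np.array([rng.choice(K, p=np.exp(l) / np.exp(l).sum()) for l in true_logits])
Y = np.eye(K)[y]                                   # one-hot targets

def softmax(F):
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

F = np.zeros((n, K))                               # boosted score functions, one per class
rate = 0.1
for _ in range(100):                               # boosting rounds
    residual = Y - softmax(F)                      # negative gradient of the log-loss
    for k in range(K):
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual[:, k])
        F[:, k] += rate * tree.predict(X)          # greedy functional gradient step

print("training accuracy after boosting:",
      round(float(np.mean(softmax(F).argmax(axis=1) == y)), 3))
```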
In Chapter 3, we provide insights on the behavior of sliced inverse regression in a high-dimensional setting under a single index model. We analyze two algorithms: a thresholding based algorithm known as diagonal thresholding and an L1 penalization algorithm - semidefinite programming, and show that they achieve optimal (up to a constant) sample size in terms of support recovery in the case of standard Gaussian predictors. In addition, we look into the performance of the linear regression LASSO in single index models with correlated Gaussian designs. We show that under certain restrictions on the covariance and signal, the linear regression LASSO can also enjoy optimal sample size in terms of support recovery. Our analysis extends existing results on LASSO's variable selection capabilities for linear models.
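The diagonal thresholding idea admits a short illustration: slice the response, average the predictors within each slice, and keep the coordinates whose slice means vary the most (the diagonal of the SIR matrix). Standard Gaussian predictors and a sparse single index model are simulated to match the setting described above; the oracle cutoff on the number of selected coordinates is for display only.

```python
# Diagonal-thresholding sliced inverse regression for support recovery in a single index model.
import numpy as np

rng = np.random.default_rng(11)
n, p, H = 1000, 200, 10                      # sample size, dimension, number of slices
support = np.array([3, 17, 58, 124])         # true sparse support of the index
beta = np.zeros(p)
beta[support] = 1.0 / np.sqrt(support.size)

X = rng.standard_normal((n, p))
y = np.exp(X @ beta) + 0.1 * rng.standard_normal(n)     # y = f(X beta) + noise

# Slice by the order statistics of y and average X within each slice.
order = np.argsort(y)
slice_means = np.array([X[idx].mean(axis=0) for idx in np.array_split(order, H)])

# Diagonal of the SIR matrix: variability of the slice means, coordinate by coordinate.
diag = slice_means.var(axis=0)
cutoff = np.sort(diag)[-support.size]        # oracle cutoff, for illustration only
print("true support:", support, "recovered:", np.flatnonzero(diag >= cutoff))
```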
Chapter 4 develops a general inferential framework for testing and constructing confidence intervals for high-dimensional estimating equations. Such a framework has a variety of applications and allows us to provide tests and confidence regions for parameters estimated by algorithms such as the Dantzig Selector, CLIME and LDP, among others, none of which had previously been equipped with inferential procedures. / Biostatistics
8 |
On Causal Inference for Ordinal Outcomes. Lu, Jiannan, 04 December 2015.
This dissertation studies the problem of causal inference for ordinal outcomes. Chapter 1 focuses on the sharp null hypothesis of no treatment effect on all experimental units, and develops a systematic procedure for closed-form construction of sequences of alternative hypotheses in increasing orders of their departures from the sharp null hypothesis. The resulting construction procedure helps assess the power of randomization tests with ordinal outcomes. Chapter 2 proposes two new causal parameters, i.e., the probabilities that the treatment is beneficial and strictly beneficial for the experimental units, and derives their sharp bounds using only the marginal distributions, without imposing any assumptions on the joint distribution of the potential outcomes. Chapter 3 generalizes the framework in Chapter 2 to address noncompliance. / Statistics
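The Chapter 2 bounds have a convenient numerical counterpart: with only the two marginal distributions of the ordinal outcome fixed, the probability that treatment is strictly beneficial can be bounded by optimizing over all joint distributions (couplings) with those marginals, a small transportation linear program. The marginals below are made up, and the chapter's closed-form bounds are recovered here only numerically.

```python
# Sharp bounds on P(Y(1) > Y(0)) given only the marginal distributions, via linear programming.
import numpy as np
from scipy.optimize import linprog

p1 = np.array([0.1, 0.2, 0.3, 0.4])   # assumed marginal distribution of Y(1) over 4 levels
p0 = np.array([0.3, 0.3, 0.2, 0.2])   # assumed marginal distribution of Y(0)
K = p1.size

# Objective: P(Y(1) > Y(0)) = sum of joint probabilities over cells (i, j) with i > j.
c = np.array([1.0 if i > j else 0.0 for i in range(K) for j in range(K)])

# Equality constraints: row sums equal p1, column sums equal p0.
A_eq = np.zeros((2 * K, K * K))
for i in range(K):
    for j in range(K):
        A_eq[i, i * K + j] = 1.0          # row i sums to p1[i]
        A_eq[K + j, i * K + j] = 1.0      # column j sums to p0[j]
b_eq = np.concatenate([p1, p0])

lower = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
upper = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
print("sharp bounds on P(Y(1) > Y(0)):", round(lower, 3), "to", round(upper, 3))
```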
9 |
Topics in Bayesian Inference for Causal Effects. Garcia Horton, Viviana, 04 December 2015.
This manuscript addresses two topics in Bayesian inference for causal effects.
1) Treatment noncompliance is frequent in clinical trials, and because the treatment actually received may be different from that assigned, comparisons between groups as randomized will no longer assess the effect of the treatment received.
To address this complication, we create latent subgroups based on the potential outcomes of treatment received and focus on the subgroup of compliers, where under certain assumptions the estimands of causal effects of assignment can be interpreted as causal effects of receipt of treatment.
We propose estimands of causal effects for right-censored time-to-event endpoints, and discuss a framework to estimate those causal effects that relies on modeling survival times as parametric functions of pre-treatment variables.
We demonstrate a Bayesian estimation strategy that multiply imputes the missing data from posterior predictive distributions, illustrated with a randomized clinical trial involving breast cancer patients.
Finally, we establish a connection with the commonly used parametric proportional hazards and accelerated failure time models, and briefly discuss the consequences of relaxing the assumption of independent censoring.
2) Bayesian inference for causal effects based on data obtained from ignorable assignment mechanisms can be sensitive to the model specified for the data.
Ignorability is defined with respect to specific models for an assignment mechanism and data, which we call the ``true'' data-generating models, generally unknown to the statistician; these, in turn, determine a true posterior distribution for a causal estimand of interest.
On the other hand, the statistician poses a set of models to conduct the analysis, which we call the ``statistician's'' models; a posterior distribution for the causal estimand can be obtained assuming these models.
Let $\Delta_M$ denote the difference between the true models and the statistician's models, and let $\Delta_D$ denote the difference between the true posterior distribution and the statistician's posterior distribution (for a specific estimand).
For fixed $\Delta_M$ and fixed sample size, $\Delta_D$ varies more with data-dependent assignment mechanisms than with data-free assignment mechanisms.
We illustrate this through a sequence of examples of $\Delta_M$ under various ignorable assignment mechanisms, namely complete randomization, rerandomization, and the finite selection model design.
In each case, we create the 95\% posterior interval for an estimand under a statistician's model, and then compute its coverage probability for the correct posterior distribution; this Bayesian coverage probability is our choice of measure $\Delta_D$.
The objective of these examples is to provide insights into the ranges of data models for which Bayesian inference for causal effects from datasets obtained through ignorable assignment mechanisms is approximately valid from the Bayesian perspective, and how these validities are influenced by data-dependent assignment mechanisms. / Statistics
10 |
'Time for a New Angle!': Unravel the Mystery of Split-Plot Designs via the Potential Outcomes Prism. Zhao, Anqi, 25 July 2017.
This manuscript investigates two different approaches, namely the Neymanian randomization-based (Neyman, 1923) method and the Bayesian model-based (Rubin, 1978) method, toward causal inference for 2-by-2 split-plot designs (Jones and Nachtsheim, 2009), both under the potential outcomes framework (Neyman, 1923; Rubin, 1974, 1978, 2005).
Chapters 1 -- 5. Given two 2-level factors of interest, a 2-by-2 split-plot design (a) takes each of the 2-by-2 = 4 possible factorial combinations as a treatment, (b) identifies one factor as 'whole-plot,' (c) divides the experimental units into blocks, and (d) assigns the treatments in such a way that all units within the same block receive the same level of the whole-plot factor. Assuming the potential outcomes framework, we propose in Chapters 1 -- 5 a randomization-based estimation procedure for causal inference under such designs. Sampling variances of the point estimates are derived in closed form as linear combinations of the between- and within-block covariances of the potential outcomes. Results are compared to those under complete randomization as measures of design efficiency. Interval estimates are constructed based on conservative estimates of the sampling variances, and their frequentist coverage properties are evaluated via simulation. Superiority over existing model-based alternatives is reported under a variety of settings for both binary and continuous outcomes.
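For readers unfamiliar with the design, the randomization scheme and the difference-in-means point estimates can be sketched as follows; the potential outcome model and sizes are made up, and the closed-form variance results described above are not reproduced.

```python
# Randomization and point estimation for a simulated 2-by-2 split-plot experiment.
import numpy as np

rng = np.random.default_rng(12)
n_blocks, units_per_block = 20, 8

# Potential outcomes for the four treatment combinations (A, B) of every unit.
block_effect = np.repeat(rng.standard_normal(n_blocks), units_per_block)
base = block_effect + rng.standard_normal(n_blocks * units_per_block)
Y = {(a, b): base + 1.0 * a + 0.5 * b + 0.3 * a * b for a in (0, 1) for b in (0, 1)}

# Split-plot randomization: whole blocks get a level of A; units within blocks get levels of B.
A_block = np.zeros(n_blocks, dtype=int)
A_block[rng.choice(n_blocks, n_blocks // 2, replace=False)] = 1
A = np.repeat(A_block, units_per_block)
B = np.concatenate([rng.permutation(np.repeat([0, 1], units_per_block // 2))
                    for _ in range(n_blocks)])
y_obs = np.array([Y[(a, b)][i] for i, (a, b) in enumerate(zip(A, B))])

def arm_mean(a, b):
    return y_obs[(A == a) & (B == b)].mean()

est_A = (arm_mean(1, 0) + arm_mean(1, 1)) / 2 - (arm_mean(0, 0) + arm_mean(0, 1)) / 2
est_B = (arm_mean(0, 1) + arm_mean(1, 1)) / 2 - (arm_mean(0, 0) + arm_mean(1, 0)) / 2
est_AB = (arm_mean(1, 1) - arm_mean(1, 0)) - (arm_mean(0, 1) - arm_mean(0, 0))
print("estimated A, B, and interaction contrasts:",
      round(est_A, 3), round(est_B, 3), round(est_AB, 3))
```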
Chapter 6. Causal inference compares the differences in outcomes over a particular set of experimental units. Whereas the randomization-based Neymanian inference focuses on the experimental units directly involved in the study, the introduction of the Bayesian inferential framework provides a principled way to extend such finite population concerns to the super-population (Rubin, 1978). We outline in this chapter the explicit procedure for analyzing 2-by-2 split-plot designs under this framework, and illustrate the various technical issues in the actual implementation via examples. / Statistics