31. Marginal false discovery rate approaches to inference on penalized regression models. Miller, Ryan (01 August 2018)
Data containing large numbers of variables are becoming increasingly common, and sparsity-inducing penalized regression methods, such as the lasso, have become a popular analysis tool for these datasets due to their ability to perform variable selection naturally. However, quantifying the importance of the variables selected by these models is a difficult task. These difficulties are compounded by the tendency of the most predictive models, for example those chosen using procedures like cross-validation, to include substantial numbers of noise variables with no real relationship to the outcome. To address the task of performing inference on penalized regression models, this thesis proposes false discovery rate approaches for a broad class of penalized regression models. This work includes the development of an upper bound for the number of noise variables in a model, as well as local false discovery rate approaches that quantify the likelihood of each individual selection being a false discovery. These methods are applicable to a wide range of penalties, such as the lasso, elastic net, SCAD, and MCP; a wide range of models, including linear regression, generalized linear models, and Cox proportional hazards models; and are also extended to the group regression setting under the group lasso penalty. In addition to studying these methods through numerous simulation studies, their practical utility is demonstrated using real data from several high-dimensional genome-wide association studies.
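
The flavor of the upper-bound idea can be sketched numerically. By the lasso's optimality conditions, a standardized null feature enters the model only if |x_j'r|/n exceeds the tuning parameter, an event with probability roughly 2Φ(-√n·λ/σ); summing over at most p null features bounds the expected number of false selections. The Python sketch below implements this rough bound; the function name, the variance estimate, and the independence and standardization assumptions are illustrative, not the thesis's exact estimator:

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LassoCV

def mfdr_upper_bound(X, y):
    """Rough upper bound on the proportion of noise variables among lasso
    selections. Sketch only: assumes standardized columns of X and uses p
    itself as a bound on the number of null features."""
    n, p = X.shape
    fit = LassoCV(cv=10).fit(X, y)
    selected = np.flatnonzero(fit.coef_)
    resid = y - fit.predict(X)
    # crude residual-based noise scale with a degrees-of-freedom adjustment
    sigma = np.std(resid) * np.sqrt(n / max(n - len(selected) - 1, 1))
    # P(|x_j' r / n| > lambda) for a null feature with standardized x_j
    tail = 2 * norm.sf(np.sqrt(n) * fit.alpha_ / sigma)
    expected_false = p * tail
    return min(1.0, expected_false / max(len(selected), 1)), selected
```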

32. Semiparametric regression analysis of zero-inflated data. Liu, Hai (01 July 2009)
Zero-inflated data abound in ecological studies as well as in other scientific and quantitative fields. Nonparametric regression with a zero-inflated response may be studied via the zero-inflated generalized additive model (ZIGAM). ZIGAM assumes that the conditional distribution of the response variable belongs to the zero-inflated one-parameter exponential family, a probabilistic mixture of the zero atom and the one-parameter exponential family, where the zero atom accounts for an excess of zeroes in the data. We propose the constrained zero-inflated generalized additive model (COZIGAM) for analyzing zero-inflated data, with the further assumption that the probability of non-zero-inflation is some monotone function of the (non-zero-inflated) exponential family distribution mean. When the latter assumption holds, the new approach provides a unified framework for modeling zero-inflated data which is more parsimonious and efficient than the unconstrained ZIGAM. We develop an iterative algorithm for model estimation based on the penalized likelihood approach, and derive formulas for constructing confidence intervals for the maximum penalized likelihood estimator. Some asymptotic properties, including the consistency of the regression function estimator and the limiting distribution of the parametric estimator, are derived. We also propose a Bayesian model selection criterion for choosing between the unconstrained and the constrained ZIGAMs. We consider several useful extensions of the COZIGAM, including imposing additive-component-specific proportional and partial constraints, and incorporating threshold effects to account for regime-shift phenomena. The new methods are illustrated with both simulated data and real applications. An R package, COZIGAM, has been developed for model fitting and model selection with zero-inflated data.
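
To make the mixture concrete, the likelihood of a zero-inflated Poisson model, a parametric special case of the zero-inflated one-parameter exponential family underlying ZIGAM, can be written down directly. The Python sketch below substitutes linear predictors for ZIGAM's smooth additive terms and shares one design matrix between the two parts; the function name and these simplifications are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, expit

def zip_negloglik(theta, X, y):
    """Negative log-likelihood of a zero-inflated Poisson model: with
    probability 1-p the response is a degenerate zero, with probability p
    it is Poisson(mu). Both parts use linear predictors on X."""
    k = X.shape[1]
    p = expit(X @ theta[:k])        # probability of the non-degenerate part
    mu = np.exp(X @ theta[k:])      # Poisson mean
    logpois = -mu + y * np.log(mu) - gammaln(y + 1)
    ll = np.where(y == 0,
                  np.log((1 - p) + p * np.exp(-mu)),  # zero atom or Poisson zero
                  np.log(p) + logpois)
    return -ll.sum()

# e.g.: res = minimize(zip_negloglik, np.zeros(2 * X.shape[1]),
#                      args=(X, y), method="BFGS")
```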

33. Bayesian Sparse Learning for High Dimensional Data. Shi, Minghui (January 2011)
In this thesis, we develop Bayesian sparse learning methods for high-dimensional data analysis. Two important topics are related to the idea of sparse learning: variable selection and factor analysis. We start with the Bayesian variable selection problem in regression models. One challenge in Bayesian variable selection is to search the huge model space adequately while identifying high posterior probability regions. In past decades, the main focus has been on the use of Markov chain Monte Carlo (MCMC) algorithms for these purposes. In the first part of this thesis, instead of using MCMC, we propose a new computational approach based on sequential Monte Carlo (SMC), which we refer to as particle stochastic search (PSS). We illustrate PSS through applications to linear regression and probit models.

Besides Bayesian stochastic search algorithms, there is a rich literature on shrinkage and variable selection methods for high-dimensional regression and classification with vector-valued parameters, such as the lasso (Tibshirani, 1996) and the relevance vector machine (Tipping, 2001). Compared with Bayesian stochastic search algorithms, these methods do not account for model uncertainty but are more computationally efficient. In the second part of this thesis, we generalize these ideas to matrix-valued parameters and focus on developing an efficient variable selection method for multivariate regression. We propose a Bayesian shrinkage model (BSM) and an efficient algorithm for learning the associated parameters.

In the third part of this thesis, we focus on factor analysis, which has been widely used in unsupervised learning. One central problem in factor analysis is the determination of the number of latent factors. We propose Bayesian model selection criteria for selecting the number of latent factors based on a graphical factor model. As illustrated in Chapter 4, the proposed method achieves good performance in correctly selecting the number of factors in several different settings. As for applications, we implement the graphical factor model for several purposes, such as covariance matrix estimation, latent factor regression, and classification.
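
The scale of the model-space problem in the first part is easy to appreciate with a toy computation: under Zellner's g-prior the marginal likelihood of each of the 2^p candidate models is available in closed form (Liang et al., 2008), so for small p one can enumerate exactly the posterior model weights that a stochastic search such as PSS must approximate when p is large. A hedged Python sketch with illustrative helper names, not the PSS algorithm itself:

```python
import numpy as np
from itertools import combinations

def log_marginal(X, y, cols, g):
    """Log Bayes factor of model `cols` against the intercept-only model
    under Zellner's g-prior (Liang et al., 2008)."""
    n = len(y)
    if not cols:
        return 0.0
    yc = y - y.mean()
    Xg = X[:, cols] - X[:, cols].mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xg, yc, rcond=None)
    r2 = 1 - np.sum((yc - Xg @ beta) ** 2) / np.sum(yc ** 2)
    k = len(cols)
    return 0.5 * (n - 1 - k) * np.log(1 + g) \
        - 0.5 * (n - 1) * np.log(1 + g * (1 - r2))

def enumerate_models(X, y, g=None):
    """Posterior weights of all 2^p models; feasible only for small p."""
    n, p = X.shape
    g = g or n
    models = [list(c) for size in range(p + 1)
              for c in combinations(range(p), size)]
    logw = np.array([log_marginal(X, y, m, g) for m in models])
    w = np.exp(logw - logw.max())
    return models, w / w.sum()
```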

34. Penalized method based on representatives and nonparametric analysis of gap data. Park, Soyoun (14 September 2010)
When there are a large number of predictors and few observations, building a regression model to explain the behavior of a response variable, such as a patient's medical condition, is very challenging. This is a "p ≫ n" variable selection problem encountered often in modern applied statistics and data mining. Chapter one of this thesis proposes a rigorous procedure which groups predictors into clusters of highly correlated variables, selects a representative from each cluster, and uses a subset of the representatives for regression modeling. The proposed Penalized method based on Representatives (PR) extends the lasso to p ≫ n data with highly correlated variables, building a sparse, practically interpretable model while maintaining prediction quality. Moreover, we provide the PR-Sequential Grouped Regression (PR-SGR) to make computation of the PR procedure efficient. Simulation studies show the proposed method outperforms existing methods such as the lasso/LARS. A real-life example from a mental health diagnosis illustrates the applicability of the PR-SGR. In the second part of the thesis, we study the analysis of time-to-event data, called gap data, in which missing time intervals (gaps) may occur prior to the first observed event time. If a gap occurs prior to the first observed event, then the first observed event may or may not be the first true event. This incomplete knowledge makes gap data different from the well-studied regular interval-censored data. We propose a Non-Parametric Estimate for the Gap data (NPEG) to estimate the survival function of the first true event time, derive its analytic properties, and demonstrate its performance in simulations. We also extend the Imputed Empirical Estimating method (IEE), an existing nonparametric method for gap data with at most one gap, to handle gap data with multiple gaps.
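
The representatives idea admits a compact sketch: cluster predictors by absolute correlation, pick one representative per cluster, then run the lasso on the representatives only. The Python code below makes illustrative choices (average linkage, the within-cluster predictor most correlated with y as representative) and is not the exact PR-SGR algorithm:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LassoCV

def representative_lasso(X, y, corr_cut=0.7):
    """Cluster predictors whose pairwise |correlation| exceeds corr_cut,
    keep one representative per cluster, then fit the lasso on them."""
    R = np.corrcoef(X, rowvar=False)
    dist = 1 - np.abs(R)                       # correlation-based distance
    condensed = dist[np.triu_indices_from(dist, k=1)]
    Z = linkage(condensed, method="average")
    labels = fcluster(Z, t=1 - corr_cut, criterion="distance")
    reps = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in members]
        reps.append(members[int(np.argmax(scores))])  # closest to y
    fit = LassoCV(cv=5).fit(X[:, reps], y)
    return np.array(reps), fit
```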

35. Mixture distributions with application to microarray data analysis. Lynch, O'Neil (01 June 2009)
The main goal in analyzing microarray data is to determine the genes that are differentially expressed across two types of tissue samples or samples obtained under two experimental conditions. In this dissertation we propose two methods to determine differentially expressed genes. In the penalized normal mixture model (PMMM), we penalized both the variance and the mixing proportion parameters simultaneously to determine genes that are differentially expressed. The variance parameter was penalized so that the log-likelihood is bounded, while the mixing proportion parameter was penalized so that its estimates do not lie on the boundary of the parameter space. The null distribution of the likelihood ratio test statistic (LRTS) was simulated so that we could perform a hypothesis test for the number of components of the penalized normal mixture model. In addition to simulating this null distribution, we showed that the maximum likelihood estimates are asymptotically normal, a necessary first step toward proving the asymptotic null distribution of the LRTS. This result is a significant contribution to the field of normal mixture models.

The modified p-value approach for detecting differentially expressed genes is also discussed in this dissertation. The modified p-value approach was implemented so that a hypothesis test for the number of components can be conducted using the modified likelihood ratio test. In the modified p-value approach we penalized the mixing proportion so that its estimates do not lie on the boundary of the parameter space. The null distribution of the LRTS was simulated so that the number of components of the uniform-beta mixture model can be determined. Finally, both methods, the penalized normal mixture model and the modified p-value approach, were applied to simulated and real data.
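
Simulating the LRTS null distribution follows a standard parametric-bootstrap recipe, shown below in Python with plain (unpenalized) maximum likelihood fits from scikit-learn; the thesis's penalties on the variance and mixing proportion are omitted from this sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lrts_null_distribution(y, n_boot=500, seed=0):
    """Parametric-bootstrap approximation of the null distribution of the
    likelihood ratio statistic for one vs. two normal components."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).reshape(-1, 1)
    mu, sd = y.mean(), y.std()
    stats = []
    for _ in range(n_boot):
        yb = rng.normal(mu, sd, size=y.shape)  # simulate under H0: one component
        g1 = GaussianMixture(1).fit(yb)
        g2 = GaussianMixture(2, n_init=5).fit(yb)
        # score() returns mean log-likelihood per observation
        stats.append(2 * (g2.score(yb) - g1.score(yb)) * len(yb))
    return np.array(stats)   # compare the observed LRTS to these quantiles
```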

36. New Results in ℓ1 Penalized Regression. Roualdes, Edward A. (01 January 2015)
Here we consider penalized regression methods and extend the results surrounding the ℓ1 norm penalty. We address a more recent development that generalizes previous methods by penalizing a linear transformation of the coefficients of interest instead of penalizing the coefficients themselves. We introduce an approximate algorithm to fit this generalization and a fully Bayesian hierarchical model that is a direct analogue of the frequentist version. A number of benefits derive from the Bayesian perspective, most notably choice of the tuning parameter and a natural means of estimating the variation of estimates, a notoriously difficult task in the frequentist formulation. We then introduce Bayesian trend filtering, which exemplifies the benefits of our Bayesian version. Bayesian trend filtering is shown to be an empirically strong technique for fitting univariate nonparametric regressions. Through a simulation study, we show that Bayesian trend filtering reduces prediction error and attains more accurate coverage probabilities than the frequentist method. We then apply Bayesian trend filtering to real data sets, where our method is quite competitive against a number of other popular nonparametric methods.
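
For reference, the frequentist trend filtering objective that the Bayesian version mirrors is ℓ1-penalized least squares on discrete differences of the fitted values. A minimal sketch, assuming the cvxpy package is available:

```python
import numpy as np
import cvxpy as cp

def trend_filter(y, lam, order=1):
    """L1 trend filtering: penalize the (order+1)-th discrete differences
    of the fit, so order=1 yields piecewise-linear estimates."""
    n = len(y)
    D = np.diff(np.eye(n), n=order + 1, axis=0)   # difference operator
    beta = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(y - beta)
                            + lam * cp.norm1(D @ beta))
    cp.Problem(objective).solve()
    return beta.value
```

The Bayesian analogue replaces the ℓ1 penalty with a Laplace-type prior on D @ beta, which is what allows the tuning parameter to be given a hyperprior and credible bands to fall out of the posterior.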

37. Response Adaptive Designs in the Presence of Mismeasurement. Li, Xuan (January 2012)
Response adaptive randomization represents a major advance in clinical trial methodology that helps balance the benefits of the collective and the benefits of the individual and improves efficiency without undermining the validity and integrity of the clinical research. Response adaptive designs use information so far accumulated from the trial to modify the randomization procedure and deliberately bias treatment allocation in order to assign more patients to the potentially better treatment. No attention has been paid to incorporating the problem of errors-in-variables in adaptive clinical trials. In this work, some important issues and methods of response adaptive design of clinical trials in the presence of mismeasurement are examined. We formulate response adaptive designs when the dichotomous response may be misclassified. We consider the optimal allocations under various objectives, investigate the asymptotically best response adaptive randomization procedure, and discuss effects of misclassification on the optimal allocation. We derive explicit expressions for the variance-penalized criterion with misclassified binary responses and propose a new target proportion of treatment allocation under the criterion. A real-life clinical trial and some related simulation results are also presented.
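
The effect of misclassification on a response adaptive design can be illustrated with a small simulation: a sequential design targeting the allocation √pA/(√pA + √pB) that can only compute it from observed (misclassified) success rates. The design choice, error rates, and smoothing constants below are illustrative, not those analyzed in the thesis:

```python
import numpy as np

def adaptive_trial(pA, pB, fn=0.10, fp=0.05, n=500, burn=20, seed=1):
    """Toy sequential maximum-likelihood design under misclassification:
    fn/fp are false negative / false positive rates applied to the true
    binary responses. Returns the realized proportion allocated to arm A."""
    rng = np.random.default_rng(seed)
    succ, tot = np.zeros(2), np.zeros(2)
    for i in range(n):
        if i < burn:                         # equal randomization to start
            arm = i % 2
        else:
            q = (succ + 0.5) / (tot + 1.0)   # smoothed observed success rates
            rho = np.sqrt(q[0]) / (np.sqrt(q[0]) + np.sqrt(q[1]))
            arm = 0 if rng.random() < rho else 1
        true_resp = rng.random() < (pA if arm == 0 else pB)
        obs = (true_resp and rng.random() > fn) or \
              (not true_resp and rng.random() < fp)
        succ[arm] += obs
        tot[arm] += 1
    return tot[0] / n
```

Comparing the output against the error-free target √pA/(√pA + √pB) shows how misclassification shifts the realized allocation, which is the phenomenon the corrected optimal allocations in this work address.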

38. Contributions to statistical learning and its applications in personalized medicine. Valencia Arboleda, Carlos Felipe (16 May 2013)
This dissertation is, in general, about finding stable solutions to statistical models with very large numbers of parameters and analyzing their asymptotic statistical properties. In particular, it centers on the study of regularization methods based on penalized estimation. These procedures find an estimator as the solution of an optimization problem balancing the fit to the data against the plausibility of the estimate. The first chapter studies a smoothness regularization estimator for an infinite-dimensional parameter in an exponential family model with functional predictors. We focus on the reproducing kernel Hilbert space approach and show that, regardless of the generality of the method, minimax optimal convergence rates are achieved. In order to derive the asymptotic analysis of the estimator, we develop a simultaneous diagonalization tool for two positive definite operators: the kernel operator and the operator defined by the second Fréchet derivative of the expected data fit functional. Using the proposed simultaneous diagonalization tool, sharper bounds on the minimax rates are obtained. The second chapter studies the statistical properties of the method of regularization using radial basis functions in the context of linear inverse problems. Regularization here serves two purposes: creating a stable solution for the inverse problem and preventing over-fitting in the nonparametric estimation of the functional target. Different degrees of ill-posedness in the inversion of the operator A are considered: mildly and severely ill-posed. We also study different types of radial basis kernels, classified by the strength of the penalization norm: Gaussian, multiquadric, and spline-type kernels. The third chapter deals with the Individualized Treatment Rule (ITR) problem and analyzes its solution through discriminant analysis. In the ITR problem, treatment is assigned based on the individual patient's prognostic covariates in order to maximize some reward function. Data generated from a randomized clinical trial are considered. Maximizing the empirical value function is an NP-hard computational problem. We consider estimating the decision rule directly by maximizing the expected value, using a surrogate function to make the optimization problem computationally feasible (convex programming). Necessary and sufficient conditions for infinite sample consistency of the surrogate function are found for different scenarios: binary treatment selection, treatment selection with withholding, and multi-treatment selection.
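
One well-known instance of the surrogate idea, related to but distinct from the discriminant analysis route taken in this chapter, is outcome-weighted learning (Zhao et al., 2012), which recasts value maximization as weighted classification of the received treatment and replaces the 0-1 loss with the hinge loss. A hedged Python sketch, assuming rewards R shifted to be positive and known randomization probabilities pi:

```python
import numpy as np
from sklearn.svm import SVC

def owl_rule(X, A, R, pi):
    """Outcome-weighted learning surrogate for the ITR problem: classify
    the received treatment A from covariates X, weighting each patient by
    the inverse-probability-weighted reward R / pi. The fitted classifier's
    predictions are the estimated treatment rule."""
    w = R / pi                       # large weight = outcome worth imitating
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, A, sample_weight=w)   # hinge loss makes the problem convex
    return clf                       # clf.predict(x_new) recommends a treatment
```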

39. Essays on Estimation Methods for Factor Models and Structural Equation Models. Jin, Shaobo (January 2015)
This thesis, which consists of four papers, is concerned with estimation methods in factor analysis and structural equation models. New estimation methods are proposed and investigated. In Paper I, an approximation of penalized maximum likelihood (ML) is introduced to fit an exploratory factor analysis model. Approximated penalized ML continuously and efficiently shrinks the factor loadings towards zero. It naturally factorizes a covariance matrix or a correlation matrix, and it is applicable to both orthogonal and oblique structures. Paper II, a simulation study, investigates the properties of approximated penalized ML for an orthogonal factor model. Different combinations of penalty terms and tuning parameter selection methods are examined, and differences between factorizing a covariance matrix and factorizing a correlation matrix are explored. It is shown that approximated penalized ML frequently improves on the traditional estimation-rotation procedure. In Paper III we focus on pseudo-ML for multi-group data. Data from different groups are pooled and normal theory is used to fit the model. It is shown that pseudo-ML produces consistent estimators of factor loadings and that it is numerically easier than multi-group ML. However, normal theory is not applicable for estimating standard errors, so a sandwich-type estimator of standard errors is derived. Paper IV examines properties of the recently proposed polychoric instrumental variable (PIV) estimators for ordinal data through a simulation study. PIV is compared with conventional estimation methods (unweighted least squares and diagonally weighted least squares). PIV produces accurate estimates of factor loadings and factor covariances in a correctly specified confirmatory factor analysis model, and accurate estimates of loadings and coefficient matrices in a correctly specified structural equation model. If the model is misspecified, the robustness of PIV depends on model complexity, the underlying distribution, and the choice of instrumental variables.
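
The flavor of approximated penalized ML can be sketched directly: replace each |loading| in the penalty by the smooth surrogate √(loading² + ε) and hand the penalized ML discrepancy to a quasi-Newton optimizer. The Python sketch below is a minimal illustration (it ignores rotation and identification issues, and its penalty scaling is an arbitrary choice), not the estimator studied in Paper I:

```python
import numpy as np
from scipy.optimize import minimize

def penalized_efa(S, n_obs, k, lam=0.05, eps=1e-6, seed=0):
    """Penalized ML for a k-factor model Sigma = L L' + diag(psi), with
    |l_ij| approximated by sqrt(l_ij^2 + eps) so the objective is smooth."""
    p = S.shape[0]
    rng = np.random.default_rng(seed)

    def objective(theta):
        L = theta[:p * k].reshape(p, k)      # factor loadings
        psi = np.exp(theta[p * k:])          # unique variances, kept positive
        Sigma = L @ L.T + np.diag(psi)
        _, logdet = np.linalg.slogdet(Sigma)
        # ML discrepancy log|Sigma| + tr(Sigma^{-1} S), up to constants
        disc = logdet + np.trace(np.linalg.solve(Sigma, S))
        pen = lam * np.sum(np.sqrt(L ** 2 + eps))   # smoothed L1 on loadings
        return 0.5 * n_obs * disc + n_obs * pen

    theta0 = np.concatenate([0.1 * rng.standard_normal(p * k), np.zeros(p)])
    res = minimize(objective, theta0, method="L-BFGS-B")
    return res.x[:p * k].reshape(p, k), np.exp(res.x[p * k:])
```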