51 |
A Concave Pairwise Fusion Approach to Clustering of Multi-Response Regression and Its Robust Extensions
Chen, Chen, 0000-0003-1175-3027 January 2022
Solution-path convex clustering is combined with concave penalties by Ma and Huang (2017) to reduce clustering bias. Their method was introduced in the setting of single-response regression to handle heterogeneity. Such heterogeneity may come from either the regression intercepts or the regression slopes. The procedure, realized by the alternating direction method of multipliers (ADMM) algorithm, can simultaneously identify the grouping structure of observations and estimate regression coefficients.
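To make the bias argument concrete, here is a minimal sketch comparing a convex (L1) shrinkage step with a concave one. The abstract does not say which concave penalty is used, so the minimax concave penalty (MCP) is an assumption chosen for illustration, and the scalar, unit-step setting below is a simplification of the ADMM update rather than the authors' algorithm. The function names are hypothetical.

```python
def mcp_penalty(t, lam, gamma):
    """Minimax concave penalty (MCP) at t, for lam >= 0 and gamma > 1."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # flat beyond gamma * lam: no further penalty


def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - t)**2 + lam * |t| (the L1 / convex case)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def mcp_threshold(z, lam, gamma):
    """Minimizer of 0.5 * (z - t)**2 + mcp_penalty(t, lam, gamma), gamma > 1."""
    if abs(z) <= gamma * lam:
        return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return z  # large differences are left unshrunk


# z plays the role of a difference between two observations' intercepts.
print(soft_threshold(5.0, 1.0))      # 4.0: L1 shrinks a large true gap
print(mcp_threshold(5.0, 1.0, 3.0))  # 5.0: MCP leaves it alone
print(mcp_threshold(0.5, 1.0, 3.0))  # 0.0: a small gap is fused to zero
```

Both operators fuse small differences to zero, which is what groups observations together; only the concave one stops shrinking once the difference is clearly nonzero.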
In the first part of our work, we extend this procedure to multi-response regression. We propose models that handle heterogeneity in either the regression intercepts or the regression slopes. We combine the ADMM algorithm with group-wise concave penalties to compute solutions for these models. Our work improves model performance in both clustering accuracy and estimation accuracy. We also demonstrate the necessity of such an extension by showing that performance improves greatly when information in the multi-dimensional response space is used.
In the second part, we introduce robust versions of the proposed method, using two approaches to handle outliers or long-tailed distributions. The first replaces the squared loss with a robust loss, such as the absolute loss or the Huber loss. The second characterizes and removes the effects of outliers through a mean-shift vector. We demonstrate that these robust solutions outperform the squared-loss-based method when outliers are present or the underlying distribution is long-tailed. / Statistics
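The three losses named above can be compared on a single residual. This is a generic sketch, not the thesis's implementation; the default threshold `delta=1.345` is a conventional tuning constant and an assumption here, since the abstract does not state one.

```python
def squared_loss(r):
    """Squared loss: grows quadratically, so outliers dominate the fit."""
    return 0.5 * r * r


def absolute_loss(r):
    """Absolute loss: grows linearly in the residual."""
    return abs(r)


def huber_loss(r, delta=1.345):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)


for r in (0.5, 10.0):
    print(r, squared_loss(r), absolute_loss(r), huber_loss(r))
```

For a small residual the Huber and squared losses coincide; for a large one the Huber loss grows only linearly, which limits the influence of an outlying observation.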
|
52 |
Variable Selection and Parameter Estimation Using a Continuous and Differentiable Approximation to the L0 Penalty Function
VanDerwerken, Douglas Nielsen 10 March 2011
L0 penalized likelihood procedures like Mallows' Cp, AIC, and BIC directly penalize for the number of variables included in a regression model. This is a straightforward approach to the problem of overfitting, and these methods are now part of every statistician's repertoire. However, these procedures have been shown to sometimes produce unstable parameter estimates because of the L0 penalty's discontinuity at zero. One proposed alternative, seamless-L0 (SELO), utilizes a continuous penalty function that mimics L0 and allows for stable estimates. Like other similar methods (e.g. LASSO and SCAD), SELO produces sparse solutions because the penalty function is non-differentiable at the origin. Because these penalized likelihoods are singular (non-differentiable) at zero, there is no closed-form solution for the extremum of the objective function. We propose a continuous and everywhere-differentiable penalty function that can have an arbitrarily steep slope in a neighborhood of zero, thus mimicking the L0 penalty while allowing for a nearly closed-form solution for the beta-hat vector. Because our function is not singular at zero, beta-hat will have no zero-valued components, although some will have been shrunk arbitrarily close to zero. We employ a BIC-selected tuning parameter, used in the shrinkage step, to perform zero-thresholding as well. We call the resulting vector of coefficients the ShrinkSet estimator. It is comparable to SELO in terms of model performance (selecting the truly nonzero coefficients, overall MSE, etc.), but we believe it to be more intuitive and simpler to compute. We provide strong evidence that the estimator enjoys favorable asymptotic properties, including the oracle property.
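The contrast between the three kinds of penalty can be sketched numerically. The abstract does not give the proposed penalty, so `smooth_l0_surrogate` below is a hypothetical stand-in (a generic everywhere-differentiable function that rises steeply near zero), not the ShrinkSet penalty. The SELO form is written as it is commonly stated in the literature, with `tau` as its shape parameter.

```python
import math


def l0_penalty(beta, lam):
    """L0 penalty on one coefficient: a jump of size lam at zero."""
    return lam if beta != 0.0 else 0.0


def selo_penalty(beta, lam, tau):
    """SELO penalty: continuous, but with a kink (no derivative) at zero."""
    a = abs(beta)
    return (lam / math.log(2.0)) * math.log(a / (a + tau) + 1.0)


def smooth_l0_surrogate(beta, lam, delta):
    """Hypothetical smooth stand-in: differentiable everywhere, steep near 0."""
    b2 = beta * beta
    return lam * b2 / (b2 + delta * delta)


for beta in (0.0, 0.01, 0.1, 1.0):
    print(beta,
          l0_penalty(beta, 1.0),
          round(selo_penalty(beta, 1.0, 0.01), 4),
          round(smooth_l0_surrogate(beta, 1.0, 0.01), 4))
```

Smaller values of `tau` or `delta` make either curve hug the L0 step more closely. The smooth surrogate has zero slope exactly at the origin, which is why an estimator built on such a function shrinks coefficients toward zero without setting them exactly to zero.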
|
53 |
Performances of different estimation methods for generalized linear mixed models.
Biswas, Keya 08 May 2015
Generalized linear mixed models (GLMMs) have become extremely popular in recent years. The main computational problem in parameter estimation for GLMMs is that, in contrast to linear mixed models, closed analytical expressions for the likelihood are not available. To overcome this problem, several approaches have been proposed in the literature. For this study we have used one quasi-likelihood approach, penalized quasi-likelihood (PQL), and two integral approaches: Laplace and adaptive Gauss-Hermite quadrature (AGHQ) approximation. Our primary objective was to measure the performance of each estimation method. AGHQ approximation is more accurate than Laplace approximation, but slower. So the question is when Laplace approximation is adequate, versus when AGHQ approximation provides a significantly more accurate result. We have run two simulations using PQL, Laplace and AGHQ approximations with different numbers of quadrature points for varying random effect standard deviation (Ɵ) and number of replications per cluster. The performance of these three methods was measured based on the root mean square error (RMSE) and bias. Based on the simulated data, we have found that, both for smaller values of Ɵ with a small number of replications and for larger values of Ɵ with a larger number of replications, the RMSE of the PQL method is much higher than that of the Laplace and AGHQ approximations. However, for intermediate values of Ɵ ranging from 0.63 to 3.98, regardless of the number of replications per cluster, both Laplace and AGHQ approximations gave similar estimates. But when both the number of replications and Ɵ became small, increasing the number of quadrature points increased the RMSE values, indicating that the Laplace approximation performs better than the AGHQ method. When the random effect standard deviation is large, e.g. Ɵ=10, and the number of replications is small, the Laplace RMSE value is larger than that of the AGHQ approximation. Increasing the number of quadrature points decreases the RMSE values. This indicates that AGHQ performs better in this situation. The difference in RMSE between PQL vs Laplace and AGHQ vs Laplace is approximately 12% and 10% respectively.
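The relationship between the two integral approaches can be illustrated on one cluster of a random-intercept logistic model, where the Laplace approximation is AGHQ with a single quadrature point. This is a sketch under stated assumptions, not the code used in the study: the model, the function names, the data `y`, and the fixed intercept `beta0` are all hypothetical, and the quadrature rules are hard-coded to keep the example dependency-free.

```python
import math

# Gauss-Hermite nodes and weights for the weight function exp(-x**2).
_SP = math.sqrt(math.pi)
GH_RULES = {
    1: ([0.0], [_SP]),
    3: ([-math.sqrt(1.5), 0.0, math.sqrt(1.5)], [_SP / 6.0, 2.0 * _SP / 3.0, _SP / 6.0]),
    5: ([-2.020182870456086, -0.958572464613819, 0.0,
         0.958572464613819, 2.020182870456086],
        [0.019953242059046, 0.393619323152241, 0.945308720482942,
         0.393619323152241, 0.019953242059046]),
}


def log_integrand(b, y, beta0, sigma):
    """log of [prod_j P(y_j | b)] * Normal(b; 0, sigma^2) for logit P = beta0 + b."""
    eta = beta0 + b
    loglik = sum(yj * eta - math.log1p(math.exp(eta)) for yj in y)
    logprior = -0.5 * (b / sigma) ** 2 - 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return loglik + logprior


def mode_and_scale(y, beta0, sigma, iters=50):
    """Newton iterations for the mode of the integrand and its curvature scale."""
    b, m = 0.0, len(y)
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-(beta0 + b)))
        grad = sum(y) - m * p - b / sigma ** 2
        hess = -m * p * (1.0 - p) - 1.0 / sigma ** 2
        b -= grad / hess
    p = 1.0 / (1.0 + math.exp(-(beta0 + b)))
    return b, 1.0 / math.sqrt(m * p * (1.0 - p) + 1.0 / sigma ** 2)


def aghq_marginal_likelihood(y, beta0, sigma, n_points):
    """Adaptive Gauss-Hermite approximation; n_points=1 is the Laplace method."""
    nodes, weights = GH_RULES[n_points]
    b_hat, s = mode_and_scale(y, beta0, sigma)
    total = 0.0
    for x, w in zip(nodes, weights):
        b = b_hat + math.sqrt(2.0) * s * x  # nodes recentred and rescaled at the mode
        total += w * math.exp(x * x + log_integrand(b, y, beta0, sigma))
    return math.sqrt(2.0) * s * total


def brute_force_marginal_likelihood(y, beta0, sigma, half_width=12.0, steps=24000):
    """Fine midpoint rule, used only as a reference value."""
    h = 2.0 * half_width * sigma / steps
    start = -half_width * sigma
    return h * sum(math.exp(log_integrand(start + (i + 0.5) * h, y, beta0, sigma))
                   for i in range(steps))


y = [1, 1, 0, 1]  # hypothetical binary responses in one cluster
reference = brute_force_marginal_likelihood(y, 0.0, 1.0)
for n in (1, 3, 5):
    print(n, aghq_marginal_likelihood(y, 0.0, 1.0, n), reference)
```

The example only measures how well each rule approximates one cluster's likelihood integral. It says nothing about the RMSE of the resulting parameter estimates, which is what the simulations above compare.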
In addition, we have tested the relative performance and the accuracy between two different packages of R (lme4, glmmML) and SAS (PROC GLIMMIX) based on real data. Our results suggested that all of them perform well in terms of accuracy, precision and convergence rates. In most cases, glmmML was found to be much faster than lme4 package and SAS. The only difference was found in the Contraception data where the required computational time for both R packages was exactly the same. The difference in required computational times for these two platforms decreases as the number of quadrature points increases. / Thesis / Master of Science (MSc)
|
54 |
Modeling of High-Dimensional Clinical Longitudinal Oxygenation Data from Retinopathy of Prematurity
Margevicius, Seunghee P. 01 June 2018
No description available.
|
55 |
Semi-Parametric Test Based on Spline Smoothing for Genetic Association Studies Under Stratified Populations
Zhang, Qi 03 April 2007
No description available.
|
56 |
Two Essays on Single-index Models
Wu, Zhou 24 September 2008
No description available.
|
57 |
Essays on High-dimensional Nonparametric Smoothing and Its Applications to Asset Pricing
Wu, Chaojiang 25 October 2013
No description available.
|
58 |
STATISTICAL METHODS FOR VARIABLE SELECTION IN THE CONTEXT OF HIGH-DIMENSIONAL DATA: LASSO AND EXTENSIONS
Yang, Xiao Di 10 1900
With the advance of technology, the collection and storage of data has become routine. Huge amounts of data are increasingly produced from biological experiments. The advent of DNA microarray technologies has enabled scientists to measure expressions of tens of thousands of genes simultaneously. Single nucleotide polymorphisms (SNPs) are being used in genetic association studies with a wide range of phenotypes, for example, complex diseases. These high-dimensional problems are becoming more and more common. The "large p, small n" problem, in which there are more variables than samples, is currently a challenge that many statisticians face. Penalized variable selection is an effective way to deal with the "large p, small n" problem. In particular, the Lasso (least absolute shrinkage and selection operator) proposed by Tibshirani has become an effective method for this type of problem. The Lasso works well for covariates that can be treated individually. When the covariates are grouped, it does not work well. Elastic net, group lasso, group MCP and group bridge are extensions of the Lasso. Group lasso enforces sparsity at the group level, rather than at the level of the individual covariates. Group bridge and group MCP produce sparse solutions both at the group level and at the level of the individual covariates within a group. Our simulation study shows that the group lasso forces complete grouping, group MCP encourages grouping to a rather slight extent, and group bridge is somewhere in between. If one expects the proportion of nonzero group members to be greater than one-half, group lasso may be a good choice; otherwise group MCP would be preferred. If one expects this proportion to be close to one-half, one may wish to use group bridge. A real data analysis is also conducted on genetic variation (SNP) data to find associations between SNPs and West Nile disease. / Master of Science (MSc)
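The difference between sparsity at the coefficient level and at the group level shows up in the basic shrinkage step of each method. The sketch below applies both steps to a made-up coefficient vector. It is illustrative only: the function names are hypothetical, and the group-size weights that a group lasso usually carries are omitted for brevity.

```python
import math


def soft_threshold(z, lam):
    """Shrink one coefficient toward zero by lam; set it to zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_step(z, lam):
    """Lasso shrinkage: every coefficient is thresholded on its own."""
    return [soft_threshold(zj, lam) for zj in z]


def group_lasso_step(z, groups, lam):
    """Group lasso shrinkage: each group is scaled, or zeroed, as a whole."""
    out = [0.0] * len(z)
    for idx in groups:
        norm = math.sqrt(sum(z[j] ** 2 for j in idx))
        scale = max(1.0 - lam / norm, 0.0) if norm > 0.0 else 0.0
        for j in idx:
            out[j] = scale * z[j]
    return out


z = [3.0, 0.2, 0.1, 0.3]   # hypothetical unpenalized coefficients
groups = [[0, 1], [2, 3]]  # two groups of two covariates each
print(lasso_step(z, 0.5))                # only the first coefficient survives
print(group_lasso_step(z, groups, 0.5))  # first group kept whole, second dropped
```

In the group version the weak second member of the first group stays nonzero because its group is selected as a unit, which matches the "complete grouping" behaviour described above.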
|
59 |
Semiparametric Methods for the Generalized Linear Model
Chen, Jinsong 01 July 2010
The generalized linear model (GLM) is a popular model in many research areas. In the GLM, each outcome of the dependent variable is assumed to be generated from a particular distribution function in the exponential family. The mean of the distribution depends on the independent variables. The link function provides the relationship between the linear predictor and the mean of the distribution function. In this dissertation, two semiparametric extensions of the GLM will be developed. In the first part of this dissertation, we have proposed a new model, called a semiparametric generalized linear model with a log-concave random component (SGLM-L). In this model, the estimate of the distribution of the random component has a nonparametric form while the estimate of the systematic part has a parametric form. In the second part of this dissertation, we have proposed a model, called a generalized semiparametric single-index mixed model (GSSIMM). A nonparametric component with a single index is incorporated into the mean function in the generalized linear mixed model (GLMM) assuming that the random component is following a parametric distribution.
In the first part of this dissertation, since most of the literature on the GLM deals with a parametric random component, we relax the parametric distribution assumption for the random component of the GLM and impose a log-concave constraint on the distribution. An iterative numerical algorithm for computing the estimators in the SGLM-L is developed. We construct a log-likelihood ratio test for inference. In the second part of this dissertation, we use a single-index model to generalize the GLMM so that a linear combination of covariates enters the model via a nonparametric mean function, because the linear model in the GLMM is not complex enough to capture the underlying relationship between the response and its associated covariates. The marginal likelihood is approximated using the Laplace method. A penalized quasi-likelihood approach is proposed to estimate the nonparametric function and the parameters, including the single-index coefficients, in the GSSIMM. We estimate variance components using marginal quasi-likelihood. Asymptotic properties of the estimators are developed using an idea similar to that of Yu (2008). A simulation example is carried out to compare the performance of the GSSIMM with that of the GLMM. We demonstrate the advantage of our approach using a study of the association between daily air pollutants and daily mortality, adjusted for temperature and wind speed, in various counties of North Carolina. / Ph. D.
|
60 |
Assessment of Penalized Regression for Genome-wide Association Studies
Yi, Hui 27 August 2014
The data from genome-wide association studies (GWAS) in humans are still predominantly analyzed using single marker association methods. As an alternative to Single Marker Analysis (SMA), all or subsets of markers can be tested simultaneously. This approach requires a form of Penalized Regression (PR) as the number of SNPs is much larger than the sample size. Here we review PR methods in the context of GWAS, extend them to perform penalty parameter and SNP selection by False Discovery Rate (FDR) control, and assess their performance (including penalties incorporating linkage disequilibrium) in comparison with SMA. PR methods were compared with SMA on realistically simulated GWAS data, consisting of genotype data from single and multiple chromosomes and a continuous phenotype, and on real data. Based on our comparisons, our analytic FDR criterion may currently be the best approach to SNP selection using PR for GWAS. We found that PR with FDR control provides substantially more power than SMA with genome-wide type-I error control but somewhat less power than SMA with Benjamini-Hochberg FDR control. PR controlled the FDR conservatively while SMA-BH may not achieve FDR control in all situations. Differences among PR methods seem quite small when the focus is on variable selection with FDR control. Incorporating LD into PR by adapting penalties developed for covariates measured on graphs can improve power but also generate more false positives or wider regions for follow-up. We recommend using the Elastic Net with a mixing weight for the Lasso penalty near 0.5 as the best method. / Ph. D.
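The two kinds of error control applied to single marker analysis can be contrasted on a short list of p-values. The sketch below implements the standard Benjamini-Hochberg step-up rule next to a Bonferroni cutoff, used here as a simple stand-in for genome-wide type-I error control. The p-values are invented for illustration and the function names are hypothetical; none of this is taken from the dissertation.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns one reject flag per test."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k_max = rank  # largest rank whose p-value clears its threshold
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Family-wise error control: every test is held to alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Made-up marker p-values.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(benjamini_hochberg(p)))  # 2 markers selected under FDR control
print(sum(bonferroni(p)))          # 1 marker selected under family-wise control
```

FDR control admits more discoveries than family-wise control on the same p-values, which is the source of the power difference reported above.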
|