Return to search

Contributions to Structured Variable Selection Towards Enhancing Model Interpretation and Computation Efficiency

The advances in data-collecting technologies provides great opportunities to access large sample-size data sets with high dimensionality. Variable selection is an important procedure to extract useful knowledge from such complex data. While in many real-data applications, appropriate selection of variables should facilitate the model interpretation and computation efficiency. It is thus important to incorporate domain knowledge of underlying data generation mechanism to select key variables for improving the model performance. However, general variable selection techniques, such as the best subset selection and the Lasso, often do not take the underlying data generation mechanism into considerations. This thesis proposal aims to develop statistical modeling methodologies with a focus on the structured variable selection towards better model interpretation and computation efficiency. Specifically, this thesis proposal consists of three parts: an additive heredity model with coefficients incorporating the multi-level data, a regularized dynamic generalized linear model with piecewise constant functional coefficients, and a structured variable selection method within the best subset selection framework.

In Chapter 2, an additive heredity model is proposed for analyzing mixture-of-mixtures (MoM) experiments. The MoM experiment is different from the classical mixture experiment in that the mixture component in MoM experiments, known as the major component, is made up of sub-components, known as the minor components. The proposed model considers an additive structure to inherently connect the major components with the minor components. To enable a meaningful interpretation for the estimated model, we apply the hierarchical and heredity principles by using the nonnegative garrote technique for model selection. The performance of the additive heredity model was compared to several conventional methods in both unconstrained and constrained MoM experiments. The additive heredity model was then successfully applied in a real problem of optimizing the Pringlestextsuperscript{textregistered} potato crisp studied previously in the literature.

In Chapter 3, we consider the dynamic effects of variables in the generalized linear model such as logistic regression. This work is motivated from the engineering problem with varying effects of process variables to product quality caused by equipment degradation. To address such challenge, we propose a penalized dynamic regression model which is flexible to estimate the dynamic coefficient structure. The proposed method considers modeling the functional coefficient parameter as piecewise constant functions. Specifically, under the penalized regression framework, the fused lasso penalty is adopted for detecting the changes in the dynamic coefficients. The group lasso penalty is applied to enable a sparse selection of variables. Moreover, an efficient parameter estimation algorithm is also developed based on alternating direction method of multipliers. The performance of the dynamic coefficient model is evaluated in numerical studies and three real-data examples.

In Chapter 4, we develop a structured variable selection method within the best subset selection framework. In the literature, many techniques within the LASSO framework have been developed to address structured variable selection issues. However, less attention has been spent on structured best subset selection problems. In this work, we propose a sparse Ridge regression method to address structured variable selection issues. The key idea of the proposed method is to re-construct the regression matrix in the angle of experimental designs. We employ the estimation-maximization algorithm to formulate the best subset selection problem as an iterative linear integer optimization (LIO) problem. the mixed integer optimization algorithm as the selection step. We demonstrate the power of the proposed method in various structured variable selection problems. Moverover, the proposed method can be extended to the ridge penalized best subset selection problems. The performance of the proposed method is evaluated in numerical studies. / Doctor of Philosophy / The advances in data-collecting technologies provides great opportunities to access large sample-size data sets with high dimensionality. Variable selection is an important procedure to extract useful knowledge from such complex data. While in many real-data applications, appropriate selection of variables should facilitate the model interpretation and computation efficiency. It is thus important to incorporate domain knowledge of underlying data generation mechanism to select key variables for improving the model performance.

However, general variable selection techniques often do not take the underlying data generation mechanism into considerations. This thesis proposal aims to develop statistical modeling methodologies with a focus on the structured variable selection towards better model interpretation and computation efficiency. The proposed approaches have been applied to real-world problems to demonstrate their model performance.

Identiferoai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/96767
Date07 February 2020
CreatorsShen, Sumin
ContributorsStatistics, Deng, Xinwei, Franck, Christopher T., Kim, Inyoung, Jin, Ran
PublisherVirginia Tech
Source SetsVirginia Tech Theses and Dissertation
Detected LanguageEnglish
TypeDissertation
FormatETD, application/pdf, application/pdf
RightsIn Copyright, http://rightsstatements.org/vocab/InC/1.0/

Page generated in 0.0021 seconds