Robust Feature Screening Procedures for Mixed Type of Data

High-dimensional data are frequently collected in many fields of scientific research and technological development. Traditional best subset selection, which uses penalized L_0 regularization, is computationally too expensive for many modern statistical applications. Many variable selection approaches based on various forms of penalized least squares or penalized likelihood have been developed to select significant variables and estimate their effects simultaneously in high-dimensional statistical inference. However, in modern applications in areas such as genomics and proteomics, ultra-high dimensional data are often collected, where the dimension of the data may grow exponentially with the sample size. In such problems, regularization methods can become computationally unstable or even infeasible. To deal with ultra-high dimensionality, Fan and Lv (2008) proposed a variable screening procedure via correlation learning to reduce dimensionality in sparse ultra-high dimensional models. Since then, many authors have further developed the procedure and applied it to various statistical models. However, they all focused on a single type of predictor; that is, the predictors are either all continuous or all discrete. In practice, we often collect mixed type of data, which contain both continuous and discrete predictors. For example, in genetic studies, we can collect information on both gene expression profiles and single nucleotide polymorphism (SNP) genotypes. Furthermore, outliers are often present in the observations due to experimental errors and other reasons, and the true trend underlying the data might not follow the parametric models assumed in many existing screening procedures. Hence, a screening procedure that is robust against outliers and model misspecification is desired. In my dissertation, I propose a robust feature screening procedure for mixed type of data.
To gain insight into screening for individual types of data, I first study feature screening procedures for a single type of data in Chapter 2, based on marginal quantities. For each type of data, new feature screening procedures are proposed, and simulation studies are performed to compare their performance with that of existing procedures. The aim is to identify the best robust screening procedure for each type of data. In Chapter 3, I combine these best screening procedures to form the robust feature screening procedure for mixed type of data. Its performance is assessed by simulation studies, and I further illustrate the proposed procedure by the analysis of a real example.

Ph. D.

In modern applications in areas such as genomics and proteomics, ultra-high dimensional data are often collected, where the dimension of the data may grow exponentially with the sample size. To deal with ultra-high dimensionality, Fan and Lv (2008) proposed a variable screening procedure via correlation learning to reduce dimensionality in sparse ultra-high dimensional models. Since then, many authors have further developed the procedure and applied it to various statistical models. However, they all focused on a single type of predictor; that is, the predictors are either all continuous or all discrete. In practice, we often collect mixed type of data, which contain both continuous and discrete predictors. Furthermore, outliers are often present in the observations due to experimental errors and other reasons. Hence, a screening procedure that is robust against outliers and model misspecification is desired. In my dissertation, I propose a robust feature screening procedure for mixed type of data. I first study feature screening procedures for a single type of data, based on marginal quantities. For each type of data, new feature screening procedures are proposed, and simulation studies are performed to compare their performance with that of existing procedures. The aim is to identify the best robust screening procedure for each type of data. I then combine these best screening procedures to form the robust feature screening procedure for mixed type of data. Its performance is assessed by simulation studies and the analysis of real examples.
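
As a rough illustration of the correlation-learning idea attributed above to Fan and Lv (2008), the following is a minimal sketch of marginal screening for continuous predictors: rank predictors by the absolute value of their marginal Pearson correlation with the response and keep the top d. It is not the robust procedure proposed in the dissertation, which the abstract does not specify. The function name `marginal_correlation_screen`, the simulated data, and the cutoff d = n / log(n) are assumptions made for illustration only.

```python
import numpy as np

def marginal_correlation_screen(X, y, d):
    """Keep the d predictors with the largest |Pearson correlation| with y.

    Hypothetical helper for illustration; not the dissertation's method.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each predictor
    yc = y - y.mean()                # center the response
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; give them correlation 0.
    corr = np.divide(Xc.T @ yc, denom,
                     out=np.zeros(X.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(corr))  # largest |correlation| first
    return order[:d], corr

# Simulated sparse model (assumed): only predictors 0 and 1 are active.
rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(n)

d = int(n / np.log(n))               # assumed cutoff, chosen for illustration
selected, corr = marginal_correlation_screen(X, y, d)
```

The Pearson correlation used here is sensitive to outliers and is defined for continuous predictors only, which is the kind of limitation the abstract gives as motivation for robust screening of mixed type of data.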

feature screening

mixed type of data

Identifier	oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/73709
Date	16 December 2016
Creators	Sun, Jinhui
Contributors	Statistics, Du, Pang, Deng, Xinwei, Hong, Yili, Kim, Inyoung
Publisher	Virginia Tech
Source Sets	Virginia Tech Theses and Dissertation
Detected Language	English
Type	Dissertation
Format	ETD, application/pdf
Rights	In Copyright, http://rightsstatements.org/vocab/InC/1.0/
