1. Variable Screening Methods in Multi-Category Problems for Ultra-High Dimensional Data. Zeng, Yue (January 2017)
Variable screening techniques are fast, crude methods that scan high-dimensional data and reduce the dimension before a refined variable selection method is applied. Their marginal analysis feature makes them computationally feasible for ultra-high dimensional problems. However, most existing screening methods for classification are designed only for binary problems; there is a lack of a comprehensive study of variable screening for multi-class classification. This research aims to fill that gap by developing variable screening for multi-class problems, meeting the needs of high-dimensional classification. The work has useful applications in cancer studies, medicine, engineering, and biology. We propose and investigate new and effective screening methods for multi-class classification problems of two types. The first conducts screening for multiple binary classification problems separately and then aggregates the selected variables. The second conducts screening for the multi-class problem directly. For each method we investigate important issues such as the choice of classification algorithm, variable ranking, and model size determination. We implement various selection criteria and compare their performance. Extensive simulation studies comparing the proposed methods with existing ones show that the new methods are promising. We further apply the proposed methods to four cancer studies. R code has been developed for each method.
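The marginal, one-variable-at-a-time flavor of such screening is easy to sketch. The thesis provides R code; the Python sketch below ranks variables by a one-way ANOVA F-statistic, which is an assumed marginal utility for multi-class data, not necessarily the ranking criterion used in the thesis:

```python
import numpy as np

def anova_screen(X, y, d):
    """Rank each variable by a one-way ANOVA F-statistic across the
    classes in y, and keep the top d -- a marginal screening pass for
    multi-class data."""
    classes = np.unique(y)
    n, p = X.shape
    k = len(classes)
    grand = X.mean(axis=0)
    scores = np.empty(p)
    for j in range(p):
        between = sum(np.sum(y == c) * (X[y == c, j].mean() - grand[j]) ** 2
                      for c in classes) / (k - 1)
        within = sum(np.sum((X[y == c, j] - X[y == c, j].mean()) ** 2)
                     for c in classes) / (n - k)
        scores[j] = between / within
    return np.argsort(scores)[::-1][:d]
```

For the first type of method above (separate binary screenings), the same ranking could be computed per one-vs-rest subproblem and the selected sets merged.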

2. Variable screening and graphical modeling for ultra-high dimensional longitudinal data. Zhang, Yafei (02 July 2019)
Ultra-high dimensional variable selection is of great importance in statistical research, and independence screening is a powerful tool for selecting important variables when there are massive numbers of candidates. Commonly used independence screening procedures are based on single-replicate data and are not applicable to longitudinal data. This motivates us to propose a new Sure Independence Screening (SIS) procedure to bring the dimension from ultra-high down to a relatively large scale, similar to or smaller than the sample size. In Chapter 2, we provide two types of SIS and their iterative extensions (iterative SIS) to enhance finite-sample performance. An upper bound on the number of variables to be included is derived, and assumptions are given under which sure screening holds. The proposed procedures are assessed by simulations, and an application to a study on systemic lupus erythematosus illustrates their practical use. After the variable screening step, we explore the relationships among the selected variables. Graphical models are commonly used to explore the association network for a set of variables, which could be genes or other objects under study. However, the graphical models currently in use are designed only for single-replicate data rather than longitudinal data. In Chapter 3, we propose a penalized likelihood approach to identify the edges in a conditional independence graph for longitudinal data. We use pairwise coordinate descent combined with second-order cone programming to optimize the penalized likelihood and estimate the parameters. Furthermore, we extend the nodewise regression method to the longitudinal data case. Simulation and real data analysis exhibit the competitive performance of the penalized likelihood method. / Doctor of Philosophy / Longitudinal data have received a considerable amount of attention in health science studies. The information from this type of data can help with disease detection and control. In addition, a graph of disease-related factors can be built to represent their relationships with each other. In this dissertation, we develop a framework to find, among thousands of factors in longitudinal data, the important factor(s) related to a disease. We also develop a graphical method that shows the relationships among the important factors identified by the screening. In practice, combining these two methods identifies important factors for a disease as well as the relationships among them, and thus provides a deeper understanding of the disease.
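As a toy illustration of the conditional-independence graphs discussed above, the sketch below estimates a graph from single-replicate data by inverting the sample covariance and thresholding partial correlations; the `thresh` cutoff is an assumption, and the thesis's penalized likelihood and longitudinal extensions are not reproduced here:

```python
import numpy as np

def partial_corr_graph(X, thresh=0.2):
    """Crude conditional-independence graph: invert the sample
    covariance, convert the precision matrix to partial correlations,
    and threshold their magnitudes. A single-replicate stand-in for the
    penalized-likelihood longitudinal estimator described above."""
    S = np.cov(X, rowvar=False)
    K = np.linalg.inv(S)            # precision matrix
    d = np.sqrt(np.diag(K))
    P = -K / np.outer(d, d)         # partial correlations
    np.fill_diagonal(P, 0.0)
    return np.abs(P) > thresh       # boolean adjacency matrix
```

On chain-structured data (x0 -> x1 -> x2), the estimated graph should link consecutive variables but leave x0 and x2 unconnected given x1.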

3. Variable screening method using statistical sensitivity analysis in RBDO. Bae, Sangjune (01 May 2012)
A variable screening method is introduced to reduce the computational cost caused by the curse of dimensionality in high-dimensional RBDO problems. The screening method considers the output variance of the constraint functions and uses hypothesis testing to screen for the necessary variables. The method is applicable to implicit as well as explicit functions. A suitable number of samples to obtain consistent test results is calculated. Three examples are demonstrated with the detailed variable screening procedure and RBDO results.
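A minimal version of variance-based screening can be sketched as follows; the one-at-a-time sampling scheme and the `frac` cutoff below are illustrative stand-ins for the test-of-hypothesis filter used in the thesis:

```python
import numpy as np

def screen_by_variance(g, means, stds, n=2000, frac=0.05, seed=0):
    """One-at-a-time variance screening: vary each input about its mean
    while holding the others fixed at their means, and keep the inputs
    whose induced output variance exceeds a fraction of the total output
    variance. A simplified stand-in for the hypothesis-testing filter."""
    rng = np.random.default_rng(seed)
    k = len(means)
    base = rng.normal(means, stds, size=(n, k))
    total_var = np.var(g(base))
    keep = []
    for i in range(k):
        pts = np.tile(means, (n, 1))
        pts[:, i] = rng.normal(means[i], stds[i], n)
        if np.var(g(pts)) > frac * total_var:
            keep.append(i)
    return keep
```

Here `g` plays the role of a (possibly implicit) constraint function evaluated by simulation, which is why only point evaluations are used.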

4. Contributions to variable selection for mean modeling and variance modeling in computer experiments. Adiga, Nagesh (17 January 2012)
This thesis consists of two parts. The first part reviews Variable Search, a variable selection procedure for mean modeling. The second part deals with variance modeling for robust parameter design in computer experiments.
In the first chapter, the Variable Search (VS) technique developed by Shainin (1988) is reviewed. VS has received considerable attention from experimenters in industry. It uses the experimenters' knowledge about the process, in terms of good and bad settings and their importance. In this technique, a few experiments are first conducted at the best and worst settings of the variables to ascertain that the two are indeed different. Experiments are then conducted sequentially in two stages, swapping and capping, to determine the significance of the variables one at a time. Finally, after all significant variables have been identified, the model is fit and the best settings are determined.
The VS technique has not been analyzed thoroughly. In this thesis, we analyze each stage of the method mathematically. Each stage is formulated as a hypothesis test, and its performance is expressed in terms of the model parameters. The performance of the full VS technique is then expressed as a function of the performance of each stage. On this basis, it can be compared with traditional techniques.
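The swapping stage can be caricatured in a few lines. In the sketch below, each variable is swapped from its best to its worst level while the others stay at the best setting, and a variable is flagged when the mean shift clears a z-sigma noise band; the test function, the settings, and the decision rule are all illustrative, not Shainin's exact procedure:

```python
import numpy as np

def swap_stage(f, best, worst, noise_sd, n_rep=5, z=5.0, seed=0):
    """Sketch of a VS-style swapping stage: swap one variable at a time
    from its best to its worst level, replicate the noisy measurement,
    and flag the variable if the mean response shifts by more than a
    z-sigma band of the replicate noise."""
    rng = np.random.default_rng(seed)
    y_best = np.mean([f(best) + rng.normal(0, noise_sd)
                      for _ in range(n_rep)])
    important = []
    for i in range(len(best)):
        x = list(best)
        x[i] = worst[i]
        y_swap = np.mean([f(x) + rng.normal(0, noise_sd)
                          for _ in range(n_rep)])
        if abs(y_swap - y_best) > z * noise_sd / np.sqrt(n_rep):
            important.append(i)
    return important
```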
The second and third chapters deal with variance modeling for robust parameter design in computer experiments. Computer experiments based on engineering models can be used to explore process behavior when physical experiments (e.g., fabrication of nanoparticles) are costly or time-consuming. Robust parameter design (RPD) is a key technique for improving process repeatability. The absence of replicates in computer experiments (e.g., under a space-filling design (SFD)) makes locating the RPD solution challenging. Recently, there have been studies of RPD issues in computer experiments (e.g., Bates et al. (2005), Chen et al. (2006), Dellino et al. (2010, 2011), Giovagnoli and Romano (2008)). The transmitted variance model (TVM) proposed by Shoemaker and Tsui (1993) for physical experiments can also be applied in computer simulations. The approaches above rely heavily on the estimated mean model, because they obtain expressions for the variance directly from mean models or use them to generate replicates. Variance modeling based on some form of replicates relies on the estimated mean model to a lesser extent. To the best of our knowledge, there is no rigorous research on the variance modeling needed for RPD in computer experiments.
We develop procedures for identifying variance models. First, we explore procedures for grouping pseudo replicates for variance modeling; a formal variance change-point procedure is developed to determine the replicate groups rigorously. Next, the variance model is identified and estimated through a three-step variable selection procedure. Properties of the proposed method are investigated under various conditions through analytical and empirical studies. In particular, the impact of correlated responses on performance is discussed.
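A bare-bones variance change-point step, in the spirit of the grouping procedure above, can be sketched with a Gaussian profile likelihood; the single-change-point setting and the minimum segment length are simplifying assumptions:

```python
import numpy as np

def variance_changepoint(y):
    """Locate a single variance change-point in an ordered sequence by
    Gaussian profile likelihood: try every split, fit a separate
    variance to each segment, and keep the split with the highest
    likelihood. Means are assumed constant; only the variance changes."""
    n = len(y)
    best_t, best_ll = None, -np.inf
    for t in range(5, n - 5):          # keep both segments estimable
        v1, v2 = np.var(y[:t]), np.var(y[t:])
        ll = -0.5 * (t * np.log(v1) + (n - t) * np.log(v2))
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t
```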

5. The Family of Conditional Penalized Methods with Their Application in Sufficient Variable Selection. Xie, Jin (01 January 2018)
When scientists know in advance that some features (variables) are important in modeling data, these features should be kept in the model. How can we use this prior information to effectively find other important features? This dissertation provides a solution using such prior information. We propose the Conditional Adaptive Lasso (CAL) estimator to exploit this knowledge. By choosing a meaningful conditioning set, namely the prior information, CAL shows better performance in both variable selection and model estimation. We also propose the Sufficient Conditional Adaptive Lasso Variable Screening (SCAL-VS) and Conditioning Set Sufficient Conditional Adaptive Lasso Variable Screening (CS-SCAL-VS) algorithms based on CAL. Asymptotic and oracle properties are proved. Simulations, especially for large-p-small-n problems, are performed with comparisons to other existing methods. We further extend the linear model setup to generalized linear models (GLMs). Instead of least squares, we consider the likelihood function with an L1 penalty, i.e., penalized likelihood methods. We propose the Generalized Conditional Adaptive Lasso (GCAL) for generalized linear models, and then further extend the method to any penalty term satisfying certain regularity conditions, namely the Conditionally Penalized Estimate (CPE). Asymptotic and oracle properties are shown. Four corresponding sufficient variable screening algorithms are proposed. Simulation examples are evaluated with comparisons to existing methods. GCAL is also evaluated on a real leukemia data set.
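The core CAL idea, exempting a known-important conditioning set from penalization so it always stays in the model, can be sketched with a plain (non-adaptive) lasso; the coordinate-descent implementation and the zero-penalty rule below are illustrative, and the thesis's adaptive weights are not reproduced:

```python
import numpy as np

def conditional_lasso(X, y, lam, keep, n_iter=200):
    """Lasso by cyclic coordinate descent in which variables listed in
    `keep` (the prior conditioning set) get zero penalty, so they are
    never shrunk out of the model. Objective: (1/2n)||y - Xb||^2 plus a
    per-coefficient L1 penalty."""
    n, p = X.shape
    w = np.where(np.isin(np.arange(p), keep), 0.0, lam)
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - n * w[j], 0.0) / col_ss[j]
    return beta
```

Note that the penalized coefficients are shrunk toward zero by roughly `lam`, while coefficients in the conditioning set are estimated without bias from the penalty.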

6. Efficient variable screening method and confidence-based method for reliability-based design optimization. Cho, Hyunkyoo (01 May 2014)
The objectives of this study are (1) to develop an efficient variable screening method for reliability-based design optimization (RBDO) and (2) to develop a new RBDO method that incorporates a confidence level for problems with limited input data. The research effort involves: (1) development of a partial output variance concept for variable screening; (2) development of an effective variable screening sequence; (3) development of an estimation method for the confidence level of a reliability output; and (4) development of a design sensitivity method for the confidence level.
In the RBDO process, surrogate models are frequently used to reduce the number of simulations, because each analysis of a simulation model takes a great deal of computational time. On the other hand, to obtain accurate surrogate models, we have to limit the dimension of the RBDO problem and thus mitigate the curse of dimensionality. It is therefore desirable to develop an efficient and effective variable screening method to reduce the dimension of the RBDO problem. In this study, output variance is found to be critical for identifying important variables in the RBDO process. A partial output variance, an efficient approximation based on the univariate dimension reduction method (DRM), is proposed to calculate the output variance efficiently. For screening, the variables that have larger partial output variances are selected as important. To decide which variables are important, hypothesis testing is used so that possible errors are contained at a user-specified error level. An appropriate number of samples for calculating the partial output variance is also proposed, and a quadratic interpolation method is studied in detail to calculate the output variance efficiently. Using numerical examples, the performance of the proposed variable screening method is verified: it finds important variables efficiently and effectively.
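The univariate-DRM flavor of a partial output variance can be sketched as follows: each input is varied along Gauss-Hermite quadrature nodes with the rest fixed at their means, and the weighted variance of the outputs is taken. The node count and Gaussian-input assumption are illustrative, not the thesis's exact construction:

```python
import numpy as np

def partial_variance(g, mu, sigma, n_quad=5):
    """Per-input partial output variance in the univariate dimension
    reduction spirit: vary one input along probabilists' Gauss-Hermite
    nodes (exact for Gaussian inputs and low-degree polynomials) while
    the other inputs are held at their means."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)
    weights = weights / weights.sum()       # normalize to probabilities
    k = len(mu)
    pv = np.empty(k)
    for i in range(k):
        x = np.tile(mu, (n_quad, 1))
        x[:, i] = mu[i] + sigma[i] * nodes
        gi = g(x)
        mean_i = weights @ gi
        pv[i] = weights @ (gi - mean_i) ** 2
    return pv
```

A screening rule would then keep the inputs whose `pv` entries are significantly large, e.g., via the hypothesis test described above.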
Reliability analysis and RBDO require an exact input probabilistic model to obtain an accurate reliability output and RBDO optimum design. However, in practical engineering problems, often only limited input data are available to generate the input probabilistic model. Insufficient input data induce uncertainty in the input probabilistic model, and this uncertainty causes the RBDO optimum to lose its confidence level. It is therefore necessary to treat the reliability output, defined as the probability of failure, as following a probability distribution. The probability distribution of the reliability output is obtained from consecutive conditional probabilities of the input distribution type and parameters using a Bayesian approach. The approximate conditional probabilities are obtained under reasonable assumptions, and Monte Carlo simulation is applied to calculate the probability of the reliability output in practice. A confidence-based RBDO (C-RBDO) problem is formulated using the derived probability of the reliability output; its probabilistic constraint is modified to include both the target reliability output and the target confidence level. Finally, the design sensitivity of the confidence level, the new probabilistic constraint, is derived to support an efficient optimization process. Using numerical examples, the accuracy of the developed design sensitivity is verified, and it is confirmed that C-RBDO optimum designs incorporate appropriate conservativeness according to the given input data.
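The notion of a confidence level on a reliability output can be illustrated with a conjugate-prior toy model, which is a simplification of the thesis's Bayesian treatment of input distribution type and parameters: with a flat prior and `n_fail` failures observed in `n_trials` Monte Carlo runs, the failure probability has a Beta posterior, from which a confidence level against a target follows directly.

```python
import numpy as np

def failure_prob_confidence(n_fail, n_trials, target_pf,
                            n_draws=100_000, seed=0):
    """Treat the failure probability itself as random: under a flat
    Beta(1, 1) prior, n_fail failures in n_trials give a
    Beta(n_fail + 1, n_trials - n_fail + 1) posterior. Returns the
    posterior probability (confidence) that the true failure
    probability is at or below the target."""
    rng = np.random.default_rng(seed)
    draws = rng.beta(n_fail + 1, n_trials - n_fail + 1, n_draws)
    return np.mean(draws <= target_pf)
```

In a C-RBDO-style constraint, one would then require this confidence to exceed a target level rather than constraining a point estimate of the failure probability.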

7. Uncovering Structure in High-Dimensions: Networks and Multi-task Learning Problems. Kolar, Mladen (01 July 2013)
Extracting knowledge and providing insight into the complex mechanisms underlying noisy high-dimensional data sets is of utmost importance in many scientific domains. Statistical modeling has become ubiquitous in the analysis of high-dimensional functional data in search of a better understanding of cognitive mechanisms, in the exploration of large-scale gene regulatory networks in the hope of developing drugs for lethal diseases, and in the prediction of stock market volatility in the hope of beating the market. Statistical analysis of such high-dimensional data sets is possible only if the estimation procedure exploits the hidden structures underlying the data.
This thesis develops flexible estimation procedures, with provable theoretical guarantees, for uncovering unknown hidden structures underlying the data-generating process. Of particular interest are procedures that can be used on high-dimensional data sets where the number of samples n is much smaller than the ambient dimension p. Learning in high dimensions is difficult due to the curse of dimensionality; however, special problem structure can make inference possible. Because of its importance for scientific discovery, we emphasize consistent structure recovery throughout the thesis. Particular focus is given to two important problems: semi-parametric estimation of networks and feature selection in multi-task learning.