Parametric classification and variable selection by the minimum integrated squared error criterion

This thesis presents a robust solution to the classification and variable selection problem when the dimension of the data, or number of predictor variables, may greatly exceed the number of observations. When faced with the problem of classifying objects given many measured attributes of the objects, the goal is to build a model that makes the most accurate predictions using only the most meaningful subset of the available measurements. The introduction of ℓ1-regularized model fitting has inspired many approaches that perform model fitting and variable selection simultaneously. If parametric models are employed, the standard approach is some form of regularized maximum likelihood estimation. While this is an asymptotically efficient procedure under very general conditions, it is not robust. Outliers can negatively impact both estimation and variable selection. Moreover, outliers can be very difficult to identify as the number of predictor variables becomes large. Minimizing the integrated squared error, or L2 error, while less efficient, has been shown to generate parametric estimators that are robust to a fair amount of contamination in several contexts. In this thesis, we present a novel robust parametric regression model for the binary classification problem based on L2 distance, the logistic L2 estimator (L2E). To perform simultaneous model fitting and variable selection among correlated predictors in the high-dimensional setting, an elastic net penalty is introduced. A fast computational algorithm for minimizing the elastic net penalized logistic L2E loss is derived, and results on the algorithm's global convergence properties are given. Through simulations we demonstrate the utility of the penalized logistic L2E at robustly recovering sparse models from high-dimensional data in the presence of outliers and inliers. Results on real genomic data are also presented.
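The abstract does not state the loss itself, so the following is only a minimal sketch of what an elastic net penalized logistic L2E objective could look like, not the thesis's implementation. It assumes the usual L2E criterion for a discrete model (the sum of squared model probabilities minus twice the model probability of the observed label), which for a Bernoulli response reduces to a squared difference between the label and the fitted probability. All function names are hypothetical, the penalty uses the common lambda/alpha parameterization as an assumption, and the majorization-minimization algorithm mentioned in the abstract is not shown.

```python
import math


def sigmoid(u):
    """Numerically stable logistic function F(u) = 1 / (1 + exp(-u))."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def l2e_criterion(p, y):
    """Assumed L2E criterion for one Bernoulli observation.

    p is the model probability that y == 1. The criterion is
    sum_k p_k^2 - 2 * p_observed, which equals 2 * (y - p)**2 - 1.
    """
    p_observed = p if y == 1 else 1.0 - p
    return p ** 2 + (1.0 - p) ** 2 - 2.0 * p_observed


def logistic_l2e_loss(theta, X, y):
    """Half the mean squared gap between labels and fitted probabilities.

    This differs from the averaged l2e_criterion only by an affine
    rescaling, so both have the same minimizers.
    """
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * x for t, x in zip(theta, xi)))
        total += (yi - p) ** 2
    return total / (2.0 * len(y))


def elastic_net_penalty(theta, lam, alpha):
    """lam * (alpha * ||theta||_1 + (1 - alpha) / 2 * ||theta||_2^2).

    Assumed parameterization; every coefficient is penalized here,
    whereas an intercept is often left unpenalized in practice.
    """
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t * t for t in theta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


def penalized_logistic_l2e(theta, X, y, lam=0.1, alpha=0.5):
    """Objective to be minimized over theta (the minimizer is not shown)."""
    return logistic_l2e_loss(theta, X, y) + elastic_net_penalty(theta, lam, alpha)
```

Under this assumed form, each observation contributes at most 1/(2n) to the loss however badly it is misclassified, whereas its negative log-likelihood contribution is unbounded. That boundedness is one way to see why a criterion of this kind can be less sensitive to outliers than maximum likelihood.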

Pure sciences

Parametric classification

Variable selection

Error criterion

Logistic regression

Minimum distance estimation

Majorization-minimization

Statistics

Identifier	oai:union.ndltd.org:RICE/oai:scholarship.rice.edu:1911/70219
Date	January 2012
Contributors	Scott, David W.
Source Sets	Rice University
Language	English
Detected Language	English
Type	Thesis, Text
Format	98 p., application/pdf
