New perspectives in cross-validation

Appealing due to its universality, cross-validation is a ubiquitous tool for model tuning and selection. At its core, cross-validation proposes to split the data (potentially several times), and alternately use some of the data for fitting a model and the rest for testing the model. This produces a reliable estimate of the risk, although many questions remain concerning how best to compare such estimates across different models. Despite its widespread use, many theoretical problems remain unanswered for cross-validation, particularly in high-dimensional regimes where bias issues are non-negligible. We first provide an asymptotic analysis of the cross-validated risk in relation to the train-test split risk for a large class of estimators under stability conditions. This asymptotic analysis is expressed in the form of a central limit theorem, and allows us to characterize the speed-up of the cross-validation procedure for general parametric M-estimators. In particular, we show that when the loss used for fitting differs from that used for evaluation, k-fold cross-validation may offer a reduction in variance smaller (or larger) than k. We then turn our attention to the high-dimensional regime (where the number of parameters is comparable to the number of observations). In such a regime, k-fold cross-validation exhibits asymptotic bias, and hence increasing the number of folds is of interest. We study the extreme case of leave-one-out cross-validation, and show that, for generalized linear models under smoothness conditions, it is a consistent estimate of the risk at the optimal rate. Given the large computational requirements of leave-one-out cross-validation, we finally consider the problem of obtaining a fast approximation to the leave-one-out cross-validation estimator (ALO). We propose a general strategy for deriving formulas for such ALO estimators for penalized generalized linear models, and apply it to many common estimators such as the LASSO, SVM, and nuclear norm minimization. The performance of such approximations is evaluated on simulated and real datasets.
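
As a rough illustration of the splitting procedure described in the abstract, the following Python sketch estimates the squared-error risk of ridge regression by k-fold cross-validation, and compares brute-force leave-one-out (k = n) with the classical closed-form shortcut for ridge. It is not the ALO method developed in the thesis: the shortcut shown is the standard hat-matrix identity, which is exact only for ridge-type linear smoothers with a fixed penalty and no intercept. The function names, the simulated data, and the penalty value are assumptions made for this example.

```python
import numpy as np


def kfold_risk(X, y, k, ridge=0.0, seed=0):
    """Estimate squared-error risk of ridge regression by k-fold CV."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    # Randomly partition the n observations into k folds.
    folds = np.array_split(rng.permutation(n), k)
    sq_errors = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        X_tr, y_tr = X[train_mask], y[train_mask]
        # Fit on the training folds only.
        beta = np.linalg.solve(X_tr.T @ X_tr + ridge * np.eye(p), X_tr.T @ y_tr)
        # Evaluate on the held-out fold.
        sq_errors[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
    return sq_errors.mean()


def loo_shortcut(X, y, ridge=0.0):
    """Leave-one-out risk for ridge from a single fit (hat-matrix identity)."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T)
    resid = y - H @ y
    # LOO residual_i = in-sample residual_i / (1 - H_ii).
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)


# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

risk_5fold = kfold_risk(X, y, k=5, ridge=1.0)
risk_loo = kfold_risk(X, y, k=n, ridge=1.0)   # n refits
risk_shortcut = loo_shortcut(X, y, ridge=1.0)  # one fit
print(risk_5fold, risk_loo, risk_shortcut)
```

Brute-force leave-one-out requires n refits, whereas the shortcut needs a single fit; this cost gap is what motivates approximate leave-one-out formulas for estimators where no exact identity is available.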

https://doi.org/10.7916/d8-3z39-7v31

Statistics

Statistics--Methodology

Statistics--Models

Identifier	oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/d8-3z39-7v31
Date	January 2020
Creators	Zhou, Wenda
Source Sets	Columbia University
Language	English
Detected Language	English
Type	Theses
