The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics

Small sample size is a prevalent problem in genomics and proteomics today.
The bootstrap, a resampling method that aims to use the available data more efficiently,
is one approach to overcoming the problem of limited sample size. This dissertation
studies the application of bootstrap to two problems of supervised learning with small
sample data: estimation of the misclassification error of Gaussian discriminant analysis,
and the bagging ensemble classification method.
Estimating the misclassification error of discriminant analysis is a classical problem in
pattern recognition and has many important applications in biomedical research. Bootstrap
error estimation has been shown empirically to be one of the best estimation methods in
terms of root mean squared error. In the first part of this work, we conduct a detailed
analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA)
classification rule under Gaussian populations. We derive the exact formulas of the first
and the second moment of the zero bootstrap and the convex bootstrap estimators, as well
as their cross moments with the resubstitution estimator and the true error. Based on these
results, we obtain the exact formulas of the bias, the variance, and the root mean squared
error of the deviation from the true error of these bootstrap estimators. This includes the
moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weight
for unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all
the expressions involve Gaussian distributions, whereas in the multivariate case, the results
are written in terms of bivariate doubly non-central F distributions.
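
As an illustration of the estimators named above, the following is a minimal numerical sketch, not the analytical derivation of the dissertation. It assumes a two-class LDA rule with pooled covariance and equal priors, a non-stratified bootstrap, and a zero bootstrap computed as the pooled error count over the points left out of each bootstrap sample. The convex bootstrap is taken as (1 - w) * resubstitution + w * zero bootstrap, with w = 0.632 giving the .632 estimator. All function names are hypothetical.

```python
import numpy as np

def fit_lda(X, y):
    """Two-class LDA with pooled covariance and equal priors (assumed)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled sample covariance; a tiny ridge keeps it invertible in small samples.
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
    S = S + 1e-8 * np.eye(X.shape[1])
    w = np.linalg.solve(S, m1 - m0)
    b = -0.5 * w @ (m0 + m1)
    return w, b

def predict_lda(model, X):
    w, b = model
    return (X @ w + b > 0).astype(int)

def bootstrap_estimates(X, y, n_boot=200, weight=0.632, seed=0):
    """Return (resubstitution, zero bootstrap, convex bootstrap) error estimates."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Resubstitution: train and test on the full sample.
    resub = float(np.mean(predict_lda(fit_lda(X, y), X) != y))
    errors, tested = 0, 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample n points with replacement
        out = np.setdiff1d(np.arange(n), idx)     # points left out of this sample
        if out.size == 0 or np.unique(y[idx]).size < 2:
            continue                              # skip degenerate replicates
        model = fit_lda(X[idx], y[idx])
        errors += int(np.sum(predict_lda(model, X[out]) != y[out]))
        tested += out.size
    zero = errors / tested
    convex = (1.0 - weight) * resub + weight * zero
    return resub, zero, convex
```

Calling `bootstrap_estimates(X, y)` on an `(n, d)` feature array `X` and a 0/1 label vector `y` returns the three estimates; changing `weight` moves the convex estimator between resubstitution (`weight=0`) and the zero bootstrap (`weight=1`).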
In the second part of this work, we conduct an extensive empirical investigation of
bagging, which is an application of bootstrap to ensemble classification. We investigate
the performance of bagging in the classification of small-sample gene-expression data and
protein-abundance mass spectrometry data, as well as the accuracy of small-sample error
estimation with this ensemble classification rule. We observed that, under t-test and
RELIEF filter-based feature selection, bagging generally does a good job of improving
the performance of unstable, overfitting classifiers, such as CART decision trees and neural
networks, but that improvement was not sufficient to beat the performance of single stable,
non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or
3-nearest neighbors. Furthermore, the ensemble method did not improve the performance
of these stable classifiers significantly. We give an explicit definition of the out-of-bag estimator
that is intended to remove estimator bias, by formulating carefully how the error
count is normalized, and investigate the performance of error estimation for bagging of
common classification rules, including LDA, 3NN, and CART, applied to both synthetic
and real patient data, using common error estimators such as resubstitution,
leave-one-out, cross-validation, basic bootstrap, .632 bootstrap, .632+ bootstrap,
bolstering, and semi-bolstering, in addition to the out-of-bag estimator. The results from the
numerical experiments indicated that the performance of the out-of-bag estimator is very
similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically
biased. The performance of the other estimators is consistent with their performance
with the corresponding single classifiers, as reported in other studies. The results of this
work are expected to provide helpful guidance to practitioners who are interested in applying
the bootstrap in supervised learning applications.
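
The out-of-bag idea discussed above can be sketched as follows. This is one common form of the estimator and is not necessarily the definition given in the dissertation: each point is classified by a majority vote of only those ensemble members whose bootstrap sample did not contain it, and the error count is normalized by the number of points that received at least one such vote. The 3NN base rule, the tie-breaking toward class 0, and all function names are assumptions made for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-nearest-neighbor rule for 0/1 labels (Euclidean distance)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def bagging_oob_error(X, y, n_models=51, k=3, seed=0):
    """Return (out-of-bag error estimate, number of points with an OOB vote)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    votes_for_1 = np.zeros(n)   # OOB votes for class 1, per point
    n_votes = np.zeros(n)       # total OOB votes, per point
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)        # bootstrap sample for this member
        oob = np.setdiff1d(np.arange(n), idx)   # points this member never saw
        if oob.size == 0:
            continue
        votes_for_1[oob] += knn_predict(X[idx], y[idx], X[oob], k)
        n_votes[oob] += 1
    voted = n_votes > 0
    # Majority vote among OOB members only; ties go to class 0 (assumption).
    oob_pred = (votes_for_1[voted] / n_votes[voted] > 0.5).astype(int)
    # Normalize by the number of points that actually received a vote.
    return float(np.mean(oob_pred != y[voted])), int(voted.sum())
```

Because every prediction used in the count comes from classifiers that did not train on the point being tested, the estimate behaves like a held-out error, which is consistent with the abstract's observation that it performs similarly to leave-one-out.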

http://hdl.handle.net/1969.1/ETD-TAMU-2011-05-9114

Out-of-Bag Estimation

Ensemble Methods

Genomics

Proteomics

Identifier	oai:union.ndltd.org:tamu.edu/oai:repository.tamu.edu:1969.1/ETD-TAMU-2011-05-9114
Date	2011 May 1900
Creators	Vu, Thang
Contributors	Braga-Neto, Ulisses
Source Sets	Texas A and M University
Language	en_US
Detected Language	English
Type	thesis, text
Format	application/pdf
