High-throughput technologies for rapid measurement of vast numbers of biological
variables offer the potential for highly discriminatory diagnosis and prognosis;
however, high dimensionality together with small samples creates the need for feature
selection, while at the same time making feature-selection algorithms less reliable.
Feature selection is required to avoid overfitting, and the combinatorial nature of the
problem demands a suboptimal feature-selection algorithm.
In this dissertation, we show through three different approaches that feature
selection is problematic in small-sample settings. First, we examined the feature-ranking
performance of several kinds of error estimators for different classification rules, by
considering all feature subsets and using two measures of performance. The results
show that the ranking of feature sets is strongly affected by inaccurate error estimation. Secondly,
since enumerating all feature subsets is computationally impossible in practice, a
suboptimal feature-selection algorithm is often employed to find from a large set of
potential features a small subset with which to classify the samples. If error estimation
is required for a feature-selection algorithm, then the impact of error estimation can
be greater than the choice of algorithm. Lastly, we took a regression approach by
comparing the classification errors for the optimal feature sets and the errors for
the feature sets found by feature-selection algorithms. Our study shows that it is
unlikely that feature selection will yield a feature set whose error is close to that of
the optimal feature set, and the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist.
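
The gap between exhaustive and suboptimal search described above can be illustrated with a minimal sketch. It is not the dissertation's experimental code: the function names, the use of sequential forward selection as the suboptimal algorithm, and the error values in `toy_error` are all assumptions made up for illustration. The toy error is arranged so that two individually weak features are jointly the best pair, which a greedy search never discovers.

```python
from itertools import combinations
from math import comb


def exhaustive_search(n_features, k, error):
    """Return the size-k subset with the smallest error by checking all C(n, k) subsets."""
    return min(combinations(range(n_features), k), key=error)


def sequential_forward_selection(n_features, k, error):
    """Greedy search: repeatedly add the single feature that gives the lowest error."""
    selected = ()
    while len(selected) < k:
        candidates = [
            tuple(sorted(selected + (f,)))
            for f in range(n_features)
            if f not in selected
        ]
        selected = min(candidates, key=error)
    return selected


def toy_error(subset):
    """Hypothetical error values (illustrative only, not data from the dissertation)."""
    single = {0: 0.30, 1: 0.35, 2: 0.36, 3: 0.40}
    if set(subset) == {1, 2}:
        # Features 1 and 2 are weak alone but strong together.
        return 0.05
    return min(single[f] for f in subset) - 0.02 * (len(subset) - 1)


best = exhaustive_search(4, 2, toy_error)
greedy = sequential_forward_selection(4, 2, toy_error)

print("optimal subset:", best, "error:", toy_error(best))
print("greedy subset: ", greedy, "error:", toy_error(greedy))
# Exhaustive search is feasible here only because the problem is tiny.
print("subsets of size 5 among 50 features:", comb(50, 5))
```

In this sketch the greedy search commits to feature 0 in its first step and therefore cannot reach the pair (1, 2), even though that pair has a much lower error. In practice the true error is unknown and must be estimated from a small sample, which adds a further source of discrepancy that this toy example does not model.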
Identifier | oai:union.ndltd.org:tamu.edu/oai:repository.tamu.edu:1969.1/5796 |
Date | 17 September 2007 |
Creators | Sima, Chao |
Contributors | Dougherty, Edward R. |
Publisher | Texas A&M University |
Source Sets | Texas A and M University |
Language | en_US |
Detected Language | English |
Type | Book, Thesis, Electronic Dissertation, text |
Format | 7214430 bytes, electronic, application/pdf, born digital |