Improving Feature Selection Techniques for Machine Learning

Feature selection is a commonly used data preprocessing technique in machine learning. It identifies important features and removes irrelevant, redundant, or noisy ones to reduce the dimensionality of the feature space, which improves the efficiency, accuracy, and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria. However, it has been discovered that no single criterion is best for all applications.

We proposed a hybrid feature selection framework based on genetic algorithms (GAs) that employs a target learning algorithm to evaluate features, which makes it a wrapper method. We call it the hybrid genetic feature selection (HGFS) framework. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. The experiments on genomic data demonstrate that it is a robust and effective approach that can find subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm.

A common characteristic of text categorization tasks is multi-label classification with a great number of features, which makes wrapper methods time-consuming and impractical. For this setting we proposed a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and are distributed most differently across classes. The approach is compared with two well-known feature selection methods in experiments on two standard text corpora. The experiments show that RSFV achieves equal or better performance than the others in many cases.
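To make the wrapper idea concrete, the following is a minimal sketch of generic GA-based wrapper feature selection. It is not the HGFS framework itself, whose details (including how it combines multiple selection criteria) are not given in this abstract. Everything here is an illustrative assumption: the synthetic data, the nearest-centroid classifier standing in for the target learning algorithm, the size penalty in the fitness function, and all names and parameter values.

```python
import random

def make_data(rng, n=120, n_informative=3, n_noise=12):
    """Synthetic two-class data; only the first n_informative columns carry signal."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(2.0 * label, 1.0) for _ in range(n_informative)]
        row += [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(row)
        y.append(label)
    return X, y

def centroid_accuracy(X_tr, y_tr, X_te, y_te, feats):
    """Held-out accuracy of a nearest-centroid classifier restricted to `feats`."""
    if not feats:
        return 0.0
    cents = {}
    for c in (0, 1):
        rows = [x for x, t in zip(X_tr, y_tr) if t == c]
        cents[c] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    correct = 0
    for x, t in zip(X_te, y_te):
        dist = {c: sum((x[f] - m) ** 2 for f, m in zip(feats, cents[c]))
                for c in cents}
        correct += (min(dist, key=dist.get) == t)
    return correct / len(y_te)

def fitness(mask, data, size_penalty=0.01):
    """Wrapper fitness: learner accuracy minus an assumed penalty per feature."""
    feats = [i for i, bit in enumerate(mask) if bit]
    return centroid_accuracy(*data, feats) - size_penalty * len(feats)

def ga_select(data, n_features, rng, pop_size=20, generations=15, p_mut=0.05):
    """Evolve boolean feature masks; return the indices of the best mask found."""
    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=lambda m: fitness(m, data))
    for _ in range(generations):
        nxt = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(nxt) < pop_size:
            # Tournament selection of two parents
            p1 = max(rng.sample(pop, 3), key=lambda m: fitness(m, data))
            p2 = max(rng.sample(pop, 3), key=lambda m: fitness(m, data))
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation
            child = [(not b) if rng.random() < p_mut else b for b in child]
            nxt.append(child)
        pop = nxt
        best = max(pop, key=lambda m: fitness(m, data))
    return [i for i, b in enumerate(best) if b]

rng = random.Random(0)
X, y = make_data(rng)
data = (X[:80], y[:80], X[80:], y[80:])  # train / held-out split
selected = ga_select(data, n_features=len(X[0]), rng=rng)
print("selected feature indices:", selected)
```

Because every candidate subset is scored by training and testing the learner, the cost grows with the number of fitness evaluations, which is the reason the abstract gives for preferring a filter measure on large text corpora.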

Identifier: oai:union.ndltd.org:GEORGIA/oai:digitalarchive.gsu.edu:cs_diss-1026
Date: 27 November 2007
Creators: Tan, Feng
Publisher: Digital Archive @ GSU
Source Sets: Georgia State University
Detected Language: English
Type: text
Format: application/pdf
Source: Computer Science Dissertations
