Global ETD Search

Return to search

Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics

In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalance data learning is of great importance and challenge in many real applications. Dealing with a minority class normally needs new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation.We propose a new ensemble learning framework—Diversified Ensemble Classifiers for Imbal-anced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reversely data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalance data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalance learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods DBEG-ensemble and DECIDL-DBEG are then designed to improve the power of imbalance learning. Experiments show that these two methods are comparable to the state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle—active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalance learning, suggesting the DECIDL framework is very robust and flexible.Lastly, we apply the proposed learning methods to a real-world bioinformatics problem—protein methylation prediction. Extensive computational results show that the DECIDL method does perform very well for the imbalanced data mining task. Importantly, the experimental results have confirmed our new contributions on this particular data learning problem.

Machine learning

Classification

Imbalanced data learning

Diversified ensemble

Active learning

Artificial data generation

Bioinformatics

Protein methylation

Identifer	oai:union.ndltd.org:GEORGIA/oai:scholarworks.gsu.edu:cs_diss-1060
Date	07 May 2011
Creators	Ding, Zejin
Publisher	ScholarWorks @ Georgia State University
Source Sets	Georgia State University
Detected Language	English
Type	text
Format	application/pdf
Source	Computer Science Dissertations

Page generated in 0.0018 seconds

Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics

Description

Links & Downloads

Tags

Additional Fields