1 |
Beyond the Boundaries of SMOTE: A Framework for Manifold-based Synthetic OversamplingBellinger, Colin January 2016 (has links)
Within machine learning, the problem of class imbalance refers to the scenario in which one or more classes is significantly outnumbered by the others. In the most extreme case, the minority class is not only significantly outnumbered by the majority class, but it also considered to be rare, or absolutely imbalanced. Class imbalance appears in a wide variety of important domains, ranging from oil spill and fraud detection, to text classification and medical diagnosis. Given this, it has been deemed as one of the ten most important research areas in data mining, and for more than a decade now the machine learning community has been coming together in an attempt to unequivocally solve the problem.
The fundamental challenge in the induction of a classifier from imbalanced training data is in managing the prediction bias. The current state-of-the-art methods deal with this by readjusting misclassification costs or by applying resampling methods. In cases of absolute imbalance, these methods are insufficient; rather, it has been observed that we need more training examples. The nature of class imbalance, however, dictates that additional examples cannot be acquired, and thus, synthetic oversampling becomes the natural choice.
We recognize the importance of selecting algorithms with assumptions and biases that are appropriate for the properties of the target data, and argue that this is of absolute importance when it comes to developing synthetic oversampling methods because a large generative leap must be made from a relatively small training set. In particular, our research into gamma-ray spectral classification has demonstrated the benefits of incorporating prior knowledge of conformance to the manifold assumption into the synthetic oversampling algorithms.
We empirically demonstrate the negative impact of the manifold property on the state-of-the-art methods, and propose a framework for manifold-based synthetic oversampling. We algorithmically present the generic form of the framework and demonstrate formalizations of it with PCA and the denoising autoencoder. Through use of the helix and swiss roll datasets, which are standards in the manifold learning community, we visualize and qualitatively analyze the benefits of our proposed framework. Moreover, we unequivocally show the framework to be superior on three real-world gamma-ray spectral datasets and on sixteen benchmark UCI datasets in general. Specifically, our results demonstrate that the framework for manifold-based synthetic oversampling produces higher area under the ROC results than the current state-of-the-art and degrades less on data that conforms to the manifold assumption.
|
2 |
SCUT-DS: Methodologies for Learning in Imbalanced Data StreamsOlaitan, Olubukola January 2018 (has links)
The automation of most of our activities has led to the continuous production of data that arrive in the form of fast-arriving streams. In a supervised learning setting, instances in these streams are labeled as belonging to a particular class. When the number of classes in the data stream is more than two, such a data stream is referred to as a multi-class data stream. Multi-class imbalanced data stream describes the situation where the instance distribution of the classes is skewed, such that instances of some classes occur more frequently than others. Classes with the frequently occurring instances are referred to as the majority classes, while the classes with instances that occur less frequently are denoted as the minority classes.
Classification algorithms, or supervised learning techniques, use historic instances to build models, which are then used to predict the classes of unseen instances. Multi-class imbalanced data stream classification poses a great challenge to classical classification algorithms. This is due to the fact that traditional algorithms are usually biased towards the majority classes, since they have more examples of the majority classes when building the model. These traditional algorithms yield low predictive accuracy rates for the minority instances and need to be augmented, often with some form of sampling, in order to improve their overall performances.
In the literature, in both static and streaming environments, most studies focus on the binary class imbalance problem. Furthermore, research in multi-class imbalance in the data stream environment is limited. A number of researchers have proceeded by transforming a multi-class imbalanced setting into multiple binary class problems. However, such a transformation does not allow the stream to be studied in the original form and may introduce bias. The research conducted in this thesis aims to address this research gap by proposing a novel online learning methodology that combines oversampling of the minority classes with cluster-based majority class under-sampling, without decomposing the data stream into multiple binary sets. Rather, sampling involves continuously selecting a balanced number of instances across all classes for model building. Our focus is on improving the rate of correctly predicting instances of the minority classes in multi-class imbalanced data streams, through the introduction of the Synthetic Minority Over-sampling Technique (SMOTE) and Cluster-based Under-sampling - Data Streams (SCUT-DS) methodologies. In this work, we dynamically balance the classes by utilizing a windowing mechanism during the incremental sampling process. Our SCUT-DS algorithms are evaluated using six different types of classification techniques, followed by comparing their results against a state-of-the-art algorithm. Our contributions are tested using both synthetic and real data sets. The experimental results show that the approaches developed in this thesis yield high prediction rates of minority instances as contained in the multiple minority classes within a non-evolving stream.
|
Page generated in 0.1069 seconds