1 |
PATTERN RECOGNITION IN CLASS IMBALANCED DATASETS / Siddique, Nahian A, 01 January 2016
Class-imbalanced datasets constitute a significant portion of the machine learning problems of interest, where recognizing the 'rare class' is the primary objective for most applications. Traditional linear machine learning algorithms are often not effective in recognizing the rare class. In this research work, a specifically optimized feed-forward artificial neural network (ANN) is proposed and developed to train from moderately to highly imbalanced datasets.
The proposed methodology addresses the difficulty of the classification task in multiple stages: optimizing the training dataset, modifying the kernel function used to generate the Gram matrix, and optimizing the NN structure. First, the training dataset is extracted from the available sample set through an iterative process of selective under-sampling. Then, the proposed artificial NN includes a kernel function optimizer that conformally transforms the kernel functions to enhance class boundaries specifically for imbalanced datasets. Finally, a single-hidden-layer weighted neural network structure is proposed to train models from the imbalanced dataset. The proposed NN architecture is derived to classify any binary dataset effectively, even at a very high imbalance ratio, given appropriate parameter tuning and a sufficient number of processing elements.
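The under-sampling stage can be illustrated with a minimal sketch. Note the assumptions: this is plain random under-sampling, not the iterative selective procedure the abstract describes (its selection criterion is not given here), and the function name `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class is as small as the rarest one.

    Assumption: plain random under-sampling, used here as a simple stand-in
    for the thesis's iterative selective under-sampling.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Sample without replacement so no observation is duplicated.
        out_x.extend(rng.sample(xs, target))
        out_y.extend([y] * target)
    return out_x, out_y


# Hypothetical toy data: 95 majority samples and 5 rare-class samples.
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5
X_bal, y_bal = random_undersample(X, y)
print(Counter(y_bal))
```

Every rare-class sample is kept, so the balanced set is only as large as twice the rare class; a selective scheme would instead choose which majority samples to keep.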
The effectiveness of the proposed method is tested on accuracy-based performance metrics, achieving close to or above 90% on several imbalanced datasets of a generic nature, and is compared with state-of-the-art methods. The proposed model is also used to classify a 25 GB computed tomographic colonography database to test its applicability to big data. The effectiveness of under-sampling and kernel optimization for training the NN model from the modified kernel Gram matrix, which represents the imbalanced data distribution, is also analyzed experimentally. Computation time analysis shows the feasibility of the system for practical purposes. The report concludes with a discussion of the prospects of the developed model and suggestions for further development in this direction.
|
2 |
A Model Fusion Based Framework For Imbalanced Classification Problem with Noisy Dataset / January 2014
Data imbalance and data noise often coexist in real-world datasets. Data imbalance degrades the classifier's recognition power on the minority class, while data noise provides inaccurate information and thus misleads the classifier. Because of these differences, data imbalance and data noise have been treated separately in the data mining field. Yet such an approach ignores their mutual effects and, as a result, may lead to new problems. A desirable solution is to tackle the two issues jointly. Noting the complementary nature of generative and discriminative models, this research proposes a unified model-fusion-based framework to handle imbalanced classification with noisy datasets.
The phase I study focuses on the imbalanced classification problem. A generative classifier, the Gaussian Mixture Model (GMM), is studied; it can learn the distribution of the imbalanced data to improve discrimination power on imbalanced classes. By fusing this knowledge into a cost SVM (cSVM), the CSG method is proposed. Experimental results show the effectiveness of CSG in dealing with imbalanced classification problems.
The phase II study expands the research scope to include noisy datasets in the imbalanced classification problem. A model-fusion-based framework, K Nearest Gaussian (KNG), is proposed. KNG employs a generative modeling method, GMM, to model the training data as Gaussian mixtures and form adjustable confidence regions that are less sensitive to data imbalance and noise. Motivated by the K-nearest neighbor algorithm, the neighboring Gaussians are used to classify the testing instances. Experimental results show that the KNG method greatly outperforms traditional classification methods on imbalanced classification problems with noisy datasets.
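The idea of voting among neighboring Gaussians can be sketched as follows. This is an illustration under stated assumptions, not the KNG algorithm itself: the mixture components are taken as already fitted, covariances are assumed diagonal, the distance is a variance-scaled Euclidean distance, and all names and values (`classify_by_nearest_gaussians`, the toy components) are hypothetical.

```python
import math
from collections import Counter


def gaussian_distance(x, mean, var):
    """Variance-scaled distance from x to a component (diagonal covariance assumed)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def classify_by_nearest_gaussians(x, components, k=3):
    """Majority vote over the k components closest to x.

    `components` is a list of (mean, variance, class_label) tuples that are
    assumed to come from a mixture model fitted beforehand.
    """
    ranked = sorted(components, key=lambda c: gaussian_distance(x, c[0], c[1]))
    votes = Counter(label for _, _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical pre-fitted components: two broad majority, three tight minority.
components = [
    ((0.0, 0.0), (1.0, 1.0), "majority"),
    ((1.0, 0.5), (1.0, 1.0), "majority"),
    ((4.0, 4.0), (0.25, 0.25), "minority"),
    ((4.5, 3.5), (0.25, 0.25), "minority"),
    ((5.0, 4.5), (0.25, 0.25), "minority"),
]
print(classify_by_nearest_gaussians((4.2, 4.1), components, k=3))
```

Because the vote is taken over components rather than individual training points, a class with few observations can still contribute as many voters as a class with many.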
The phase III study addresses feature selection and parameter tuning for the KNG algorithm. To further improve the performance of KNG, a Particle Swarm Optimization based method (PSO-KNG) is proposed. PSO-KNG encodes model parameters and data features in the same particle vector and can therefore search for the best combination of features and parameters jointly. The experimental results show that PSO can greatly improve the performance of KNG, with better accuracy and much lower computational cost. / Dissertation/Thesis / Doctoral Dissertation Industrial Engineering 2014
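A minimal sketch of placing feature choices and a model parameter in one particle vector is given below. Everything specific is an assumption made for illustration: the number of features, the single parameter `k` and its range, the threshold decoding, the PSO coefficients, and the `toy_fitness` function, which stands in for the classifier performance that PSO-KNG would actually evaluate.

```python
import random

N_FEATURES = 4  # hypothetical number of candidate features


def decode(particle):
    """Split one particle into a feature mask and one model parameter."""
    mask = [p > 0.5 for p in particle[:N_FEATURES]]
    k = 1 + int(round(particle[N_FEATURES] * 9))  # hypothetical parameter in 1..10
    return mask, k


def toy_fitness(particle):
    """Stand-in for validated classifier performance (higher is better)."""
    mask, k = decode(particle)
    target_mask = [True, False, True, False]
    matches = sum(m == t for m, t in zip(mask, target_mask))
    return matches - abs(k - 5) / 10


def pso(fitness, dim, n_particles=20, n_iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO on the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia plus pulls toward the personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


best, best_fit = pso(toy_fitness, dim=N_FEATURES + 1)
print(decode(best), best_fit)
```

Since the mask and the parameter live in one vector, each velocity update moves both at once, which is what allows the search over features and parameters to be joint rather than sequential.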
|
3 |
Handling Imbalanced Data Classification With Variational Autoencoding And Random Under-Sampling Boosting / Ludvigsen, Jesper, January 2020
In this thesis, three pre-processing methods for imbalanced classification data are compared. Variational Autoencoder (VAE), Random Under-Sampling Boosting (RUSBoost), and a hybrid of the two are applied to three imbalanced classification data sets with different class imbalances. A logistic regression (LR) model is fitted to each pre-processed data set, and the pre-processing methods are evaluated based on its classification performance. All three methods show indications of different advantages when handling class imbalances. For each pre-processed data set, the LR model is better at correctly classifying minority class observations than an LR model fitted to the original class-imbalanced data sets. In terms of overall classification performance, both VAE and RUSBoost show improved results, while the hybrid method performs worse for the moderately imbalanced data and best for the highly imbalanced data.
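The distinction between overall performance and performance on minority class observations can be made concrete with a small sketch. The labels and predictions below are invented for illustration and are not results from the thesis; the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of all observations classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true `positive` observations that were predicted as such."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    total = sum(t == positive for t in y_true)
    return hits / total if total else 0.0


# Invented example: 95 majority (0) and 5 minority (1) observations.
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100                      # never predicts the minority class
some_minority = [0] * 90 + [1] * 5 + [1] * 4 + [0]  # finds 4 of 5, with 5 false alarms

for name, y_pred in [("always_majority", always_majority),
                     ("some_minority", some_minority)]:
    print(name, accuracy(y_true, y_pred), recall(y_true, y_pred))
```

In this invented case the second predictor has slightly lower accuracy but far higher minority recall, which is why a model can improve on the minority class while its overall score moves in either direction.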
|
4 |
Neonatal Sepsis Detection With Random Forest Classification for Heavily Imbalanced Data / Osman Abubaker, Ayman, January 2022
Neonatal sepsis is associated with most cases of mortality in the neonatal intensive care unit. Major challenges in detecting sepsis using suitable biomarkers have led people to look for alternative approaches in the form of machine learning techniques, which have recently seen increasing use in healthcare and other sectors. In this project, Random Forest classification was performed on a sepsis data set provided by Karolinska Hospital. We focused particularly on tackling class imbalance in the data using sampling and cost-sensitive techniques. We compare the classification performance of Random Forests in six different setups: four using oversampling and undersampling techniques, one using cost-sensitive learning, and one basic Random Forest. Performance with the oversampling techniques was better, and these setups could identify more sepsis patients than the others. The overall performance was also good, making the methods potentially useful in practice. / Bachelor's degree project in electrical engineering 2022, KTH, Stockholm
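Two of the imbalance-handling ideas named above, oversampling and cost-sensitive weighting, can be sketched briefly. These are assumptions for illustration only: the abstract does not say which oversampling technique or which costs were used, so the sketch shows plain random oversampling and the common "balanced" weight heuristic, n_samples / (n_classes * class_count). Function names and data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw with replacement to fill the gap to the majority size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (the 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Hypothetical toy data: 95 non-sepsis (0) and 5 sepsis (1) records.
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5
X_over, y_over = random_oversample(X, y)
print(Counter(y_over))
print(balanced_class_weights(y))
```

Oversampling changes the training set itself, while class weights leave the data untouched and instead make errors on the rare class more expensive during training; resampling should be applied to training data only, never to the evaluation set.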
|