Spelling suggestions: "subject:"imbalanced data set"" "subject:"imabalanced data set""
1 |
SCUT-DS: Methodologies for Learning in Imbalanced Data StreamsOlaitan, Olubukola January 2018 (has links)
The automation of most of our activities has led to the continuous production of data that arrive in the form of fast-arriving streams. In a supervised learning setting, instances in these streams are labeled as belonging to a particular class. When the number of classes in the data stream is more than two, such a data stream is referred to as a multi-class data stream. Multi-class imbalanced data stream describes the situation where the instance distribution of the classes is skewed, such that instances of some classes occur more frequently than others. Classes with the frequently occurring instances are referred to as the majority classes, while the classes with instances that occur less frequently are denoted as the minority classes.
Classification algorithms, or supervised learning techniques, use historic instances to build models, which are then used to predict the classes of unseen instances. Multi-class imbalanced data stream classification poses a great challenge to classical classification algorithms. This is due to the fact that traditional algorithms are usually biased towards the majority classes, since they have more examples of the majority classes when building the model. These traditional algorithms yield low predictive accuracy rates for the minority instances and need to be augmented, often with some form of sampling, in order to improve their overall performances.
In the literature, in both static and streaming environments, most studies focus on the binary class imbalance problem. Furthermore, research in multi-class imbalance in the data stream environment is limited. A number of researchers have proceeded by transforming a multi-class imbalanced setting into multiple binary class problems. However, such a transformation does not allow the stream to be studied in the original form and may introduce bias. The research conducted in this thesis aims to address this research gap by proposing a novel online learning methodology that combines oversampling of the minority classes with cluster-based majority class under-sampling, without decomposing the data stream into multiple binary sets. Rather, sampling involves continuously selecting a balanced number of instances across all classes for model building. Our focus is on improving the rate of correctly predicting instances of the minority classes in multi-class imbalanced data streams, through the introduction of the Synthetic Minority Over-sampling Technique (SMOTE) and Cluster-based Under-sampling - Data Streams (SCUT-DS) methodologies. In this work, we dynamically balance the classes by utilizing a windowing mechanism during the incremental sampling process. Our SCUT-DS algorithms are evaluated using six different types of classification techniques, followed by comparing their results against a state-of-the-art algorithm. Our contributions are tested using both synthetic and real data sets. The experimental results show that the approaches developed in this thesis yield high prediction rates of minority instances as contained in the multiple minority classes within a non-evolving stream.
|
2 |
Utveckling av beslutsstöd för kreditvärdighetArvidsson, Martin, Paulsson, Eric January 2013 (has links)
The aim is to develop a new decision-making model for credit-loans. The model will be specific for credit applicants of the OKQ8 bank, becauseit is based on data of earlier applicants of credit from the client (the bank). The final model is, in effect, functional enough to use informationabout a new applicant as input, and predict the outcome to either the good risk group or the bad risk group based on the applicant’s properties.The prediction may then lay the foundation for the decision to grant or deny credit loan. Because of the skewed distribution in the response variable, different sampling techniques are evaluated. These include oversampling with SMOTE, random undersampling and pure oversampling in the form of scalar weighting of the minority class. It is shown that the predictivequality of a classifier is affected by the distribution of the response, and that the oversampled information is not too redundant. Three classification techniques are evaluated. Our results suggest that a multi-layer neural network with 18 neurons in a hidden layer, equippedwith an ensemble technique called boosting, gives the best predictive power. The most successful model is based on a feed forward structure andtrained with a variant of back-propagation using conjugate-gradient optimization. Two other models with a good prediction quality are developed using logistic regression and a decision tree classifier, but they do not reach thelevel of the network. However, the results of these models are used to answer the question regarding which customer properties are importantwhen determining credit risk. Two examples of important customer properties are income and the number of earlier credit reports of the applicant. Finally, we use the best classification model to predict the outcome of a set of applicants declined by the existent filter. The results show that thenetwork model accepts over 60 % of the applicants who had previously been denied credit. This may indicate that the client’s suspicionsregarding that the existing model is too restrictive, in fact are true.
|
3 |
Μηχανική μάθηση σε ανομοιογενή δεδομένα / Machine learning in imbalanced data setsΛυπιτάκη, Αναστασία Δήμητρα Δανάη 07 July 2015 (has links)
Οι αλγόριθμοι μηχανικής μάθησης είναι επιθυμητό να είναι σε θέση να γενικεύσουν για οποιασδήποτε κλάση με ίδια ακρίβεια. Δηλαδή σε ένα πρόβλημα δύο κλάσεων - θετικών και αρνητικών περιπτώσεων - ο αλγόριθμος να προβλέπει με την ίδια ακρίβεια και τα θετικά και τα αρνητικά παραδείγματα. Αυτό είναι φυσικά η ιδανική κατάσταση. Σε πολλές εφαρμογές οι αλγόριθμοι καλούνται να μάθουν από ένα σύνολο στοιχείων, το οποίο περιέχει πολύ περισσότερα παραδείγματα από τη μια κλάση σε σχέση με την άλλη. Εν γένει, οι επαγωγικοί αλγόριθμοι είναι σχεδιασμένοι να ελαχιστοποιούν τα σφάλματα. Ως συνέπεια οι κλάσεις που περιέχουν λίγες περιπτώσεις μπορούν να αγνοηθούν κατά ένα μεγάλο μέρος επειδή το κόστος λανθασμένης ταξινόμησης της υπερ-αντιπροσωπευόμενης κλάσης ξεπερνά το κόστος λανθασμένης ταξινόμησης της μικρότερη κλάση. Το πρόβλημα των ανομοιογενών συνόλων δεδομένων εμφανίζεται και σε πολλές πραγματικές εφαρμογές όπως στην ιατρική διάγνωση, στη ρομποτική, στις διαδικασίες βιομηχανικής παραγωγής, στην ανίχνευση λαθών δικτύων επικοινωνίας, στην αυτοματοποιημένη δοκιμή του ηλεκτρονικού εξοπλισμού, και σε πολλές άλλες περιοχές.
Η παρούσα διπλωματική εργασία με τίτλο ‘Μηχανική Μάθηση με Ανομοιογενή Δεδομένα’ (Machine Learning with Imbalanced Data) αναφέρεται στην επίλυση του προβλήματος αποδοτικής χρήσης αλγορίθμων μηχανικής μάθησης σε ανομοιογενή/ανισοκατανεμημένα δεδομένα. Η διπλωματική περιλαμβάνει μία γενική περιγραφή των βασικών αλγορίθμων μηχανικής μάθησης και των μεθόδων αντιμετώπισης του προβλήματος ανομοιογενών δεδομένων. Παρουσιάζεται πλήθος αλγοριθμικών τεχνικών διαχείρισης ανομοιογενών δεδομένων, όπως οι αλγόριθμοι AdaCost, Cost Senistive Boosting, Metacost και άλλοι. Παρατίθενται οι μετρικές αξιολόγησης των μεθόδων Μηχανικής Μάθησης σε ανομοιογενή δεδομένα, όπως οι καμπύλες διαχείρισης λειτουργικών χαρακτηριστικών (ROC curves), καμπύλες ακρίβειας (PR curves) και καμπύλες κόστους.
Στο τελευταίο μέρος της εργασίας προτείνεται ένας υβριδικός αλγόριθμος που συνδυάζει τις τεχνικές OverBagging και Rotation Forest. Συγκρίνεται ο προτεινόμενος αλγόριθμος σε ένα σύνολο ανομοιογενών δεδομένων με άλλους αλγόριθμους και παρουσιάζονται τα αντίστοιχα πειραματικά αποτελέσματα που δείχνουν την καλύτερη απόδοση του προτεινόμενου αλγόριθμου. Τελικά διατυπώνονται τα συμπεράσματα της εργασίας και δίνονται χρήσιμες ερευνητικές κατευθύνσεις. / Machine Learning (ML) algorithms can generalize for every class with the same accuracy. In a problem of two classes, positive (true) and negative (false) cases-the algorithm can predict with the same accuracy the positive and negative examples that is the ideal case. In many applications ML algorithms are used in order to learn from data sets that include more examples from the one class in relationship with another class. In general inductive algorithms are designed in such a way that they can minimize the occurred errors. As a conclusion the classes that contain some cases can be ignored in a large percentage since the cost of the false classification of the super-represented class is greater than the cost of false classification of lower class. The problem of imbalanced data sets is occurred in many ‘real’ applications, such as medical diagnosis, robotics, industrial development processes, communication networks error detection, automated testing of electronic equipment and in other related areas.
This dissertation entitled ‘Machine Learning with Imbalanced Data’ is referred to the solution of the problem of efficient use of ML algorithms with imbalanced data sets. The thesis includes a general description of basic ML algorithms and related methods for solving imbalanced data sets. A number of algorithmic techniques for handling imbalanced data sets is presented, such as Adacost, Cost Sensitive Boosting, Metacost and other algorithms. The evaluation metrics of ML methods for imbalanced datasets are presented, including the ROC (Receiver Operating Characteristic) curves, the PR (Precision and Recall) curves and cost curves.
A new hybrid ML algorithm combining the OverBagging and Rotation Forest algorithms is introduced and the proposed algorithmic procedure is compared with other related algorithms by using the WEKA operational environment. Experimental results demonstrate the performance superiority of the proposed algorithm. Finally, the conclusions of this research work are presented and several future research directions are given.
|
Page generated in 0.0813 seconds