21

Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics

DING, ZEJIN 07 May 2011 (has links)
In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalanced data learning is of great importance and presents a real challenge in many applications. Dealing with a minority class normally requires new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation. We propose a new ensemble learning framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reverse data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalanced learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to the state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle: active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting that the DECIDL framework is robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem, protein methylation prediction. Extensive computational results show that the DECIDL method performs very well on this imbalanced data mining task. Importantly, the experimental results confirm our new contributions to this particular data learning problem.
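The under-bagging baseline mentioned in this abstract is straightforward to sketch. The following is a minimal illustration (not a reproduction of DECIDL), assuming scikit-learn is available and using synthetic data from make_classification as a stand-in for the benchmark sets; labels are 0 for the majority class and 1 for the minority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a highly skewed benchmark set (about 5 % minority).
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0.01, random_state=0)

def under_bagging(X, y, n_estimators=25, random_state=0):
    """Under-bagging: every base tree sees all minority examples plus an
    equally sized random sample of the majority class."""
    rng = np.random.default_rng(random_state)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    committee = []
    for _ in range(n_estimators):
        sampled_majority = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sampled_majority])
        committee.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return committee

def predict_vote(committee, X):
    """Majority vote over the ensemble committee."""
    votes = np.stack([clf.predict(X) for clf in committee])
    return (votes.mean(axis=0) >= 0.5).astype(int)

committee = under_bagging(X, y)
print(predict_vote(committee, X[:10]))
```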
22

Amélioration des procédures adaptatives pour l'apprentissage supervisé des données réelles / Improving adaptive methods of supervised learning for real data

Bahri, Emna 08 December 2010 (has links)
L'apprentissage automatique doit faire face à différentes difficultés lorsqu'il est confronté aux particularités des données réelles. En effet, ces données sont généralement complexes, volumineuses, de nature hétérogène, de sources variées, souvent acquises automatiquement. Parmi les difficultés les plus connues, on citera les problèmes liés à la sensibilité des algorithmes aux données bruitées et le traitement des données lorsque la variable de classe est déséquilibrée. Le dépassement de ces problèmes constitue un véritable enjeu pour améliorer l'efficacité du processus d'apprentissage face à des données réelles. Nous avons choisi dans cette thèse de réfléchir à des procédures adaptatives du type boosting qui soient efficaces en présence de bruit ou en présence de données déséquilibrées. Nous nous sommes intéressés, d'abord, au contrôle du bruit lorsque l'on utilise le boosting. En effet, les procédures de boosting ont beaucoup contribué à améliorer l'efficacité des procédures de prédiction en data mining, sauf en présence de données bruitées. Dans ce cas, un double problème se pose : le sur-apprentissage des exemples bruités et la détérioration de la vitesse de convergence du boosting. Face à ce double problème, nous proposons AdaBoost-Hybride, une adaptation de l'algorithme Adaboost fondée sur le lissage des résultats des hypothèses antérieures du boosting, qui a donné des résultats expérimentaux très satisfaisants. Ensuite, nous nous sommes intéressés à un autre problème ardu, celui de la prédiction lorsque la distribution de la classe est déséquilibrée. C'est ainsi que nous proposons une méthode adaptative du type boosting fondée sur la classification associative qui a l'intérêt de permettre la focalisation sur des petits groupes de cas, ce qui est bien adapté aux données déséquilibrées. Cette méthode repose sur 3 contributions : FCP-Growth-P, un algorithme supervisé de génération des itemsets de classe fréquents dérivé de FP-Growth dans lequel est introduit une condition d'élagage fondée sur les contre-exemples pour la spécification des règles, W-CARP, une méthode de classification associative qui a pour but de donner des résultats au moins équivalents à ceux des approches existantes pour un temps d'exécution beaucoup plus réduit, enfin CARBoost, une méthode de classification associative adaptative qui utilise W-CARP comme classifieur faible. Dans un chapitre applicatif spécifique consacré à la détection d'intrusion, nous avons confronté les résultats de AdaBoost-Hybride et de CARBoost à ceux des méthodes de référence (données KDD Cup 99). / Machine learning has to face various difficulties when confronted with the particularities of real data. Indeed, these data are generally complex, voluminous and heterogeneous, coming from a variety of sources and often acquired automatically. Among the best-known difficulties are the sensitivity of the algorithms to noisy data and the handling of data whose class variable is unbalanced. Overcoming these problems is a real challenge for improving the effectiveness of the learning process on real data. In this thesis we have chosen to work on adaptive (boosting) procedures that remain effective in the presence of noise or of unbalanced data. First, we are interested in making boosting robust to noise. Boosting procedures have contributed greatly to improving the predictive power of classifiers in data mining, but they suffer in the presence of noisy data. In this case a double problem arises: (1) over-fitting of the noisy examples and (2) a deterioration of the convergence rate of boosting. To address these two problems, we propose AdaBoost-Hybrid, an adaptation of the AdaBoost algorithm based on smoothing with the outputs of the previous boosting hypotheses, which gave very satisfactory experimental results. Then, we turn to another difficult problem, prediction when the class distribution is unbalanced. We propose an adaptive boosting method based on associative classification; the interest of using association rules is that they allow the focus to be placed on small groups of cases, which is well suited to unbalanced data. This method relies on three contributions: (1) FCP-Growth-P, a supervised algorithm for generating frequent class itemsets, derived from FP-Growth by introducing a pruning condition based on counter-examples for the specification of rules; (2) W-CARP, an associative classification method which aims to give results at least equivalent to those of existing approaches in a much shorter execution time; (3) CARBoost, an adaptive associative classification method that uses W-CARP as a weak classifier. Finally, in a chapter devoted to the specific application of intrusion detection, we compare the results of AdaBoost-Hybrid and CARBoost with those of the reference methods (KDD Cup 99 data).
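For readers unfamiliar with the boosting machinery being adapted here, the sketch below shows a plain AdaBoost skeleton (binary labels in {-1, +1}) and marks the point where AdaBoost-Hybrid's smoothing over previous hypotheses would intervene; the hybrid step itself and the associative classifiers (FCP-Growth-P, W-CARP, CARBoost) are not reproduced. Synthetic, mildly noisy data stands in for the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data as a stand-in; labels recoded to {-1, +1}.
X, y01 = make_classification(n_samples=1000, weights=[0.7], flip_y=0.05, random_state=0)
y = 2 * y01 - 1

def adaboost(X, y, n_rounds=20):
    """Plain AdaBoost with decision stumps as weak learners."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # AdaBoost-Hybrid would blend `pred` with the committee built so far
        # (smoothing over previous hypotheses) before this weight update,
        # which damps the influence of noisy examples.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def predict(stumps, alphas, X):
    """Sign of the alpha-weighted committee vote."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)

stumps, alphas = adaboost(X, y)
print((predict(stumps, alphas, X) == y).mean())
```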
23

Exploring Alarm Data for Improved Return Prediction in Radios : A Study on Imbalanced Data Classification

Färenmark, Sofia January 2023 (has links)
The global tech company Ericsson has been tracking the return rate of their products for over 30 years, using it as a key performance indicator (KPI). These KPIs play a critical role in making sound business decisions, identifying areas for improvement, and planning. To enhance the customer experience, the company highly values the ability to predict the number of returns in advance each month. However, predicting returns is a complex problem affected by multiple factors that determine when radios are returned. Analysts at the company have observed indications of a potential correlation between alarm data and the number of returns. This paper aims to address the need for better prediction models to improve return rate forecasting for radios, utilizing alarm data. The alarm data, which is stored in an internal database, includes logs of activated alarms at various sites, along with technical and logistical information about the products, as well as the historical records of returns. The problem is approached as a classification task, where radios are classified as either "return" or "no return" for a specific month, using the alarm dataset as input. However, due to the significantly smaller number of returned radios compared to the distributed ones, the dataset suffers from a heavy class imbalance. The class imbalance problem has garnered considerable attention in the field of machine learning in recent years, as traditional classification models struggle to identify patterns in the minority class of imbalanced datasets. Therefore, a specific method that addresses the class imbalance problem was required to construct an effective prediction model for returns. To this end, this paper adopts a systematic approach inspired by similar problems. It applies the feature selection methods LASSO and Boruta, along with the resampling technique SMOTE, and evaluates various classifiers including the Support Vector Machine (SVM), Random Forest classifier (RFC), Decision Tree (DT), and a Neural Network (NN) with weights to identify the best-performing model. As accuracy is not a suitable evaluation metric for imbalanced datasets, the AUC and AUPRC values were calculated for all models to assess the impact of feature selection, weights, resampling techniques, and the choice of classifier. The best model was determined to be the NN with weights, achieving a median AUC value of 0.93 and a median AUPRC value of 0.043. Likewise, both the LASSO+SVM+SMOTE and LASSO+RFC+SMOTE models demonstrated similar performance, with median AUC values of 0.92 and 0.93, and median AUPRC values of 0.038 and 0.041, respectively. The baseline AUPRC value for this data set was 0.005. Furthermore, the results indicated that resampling techniques are necessary for successful classification of the minority class. Thorough pre-processing and a balanced split between the test and training sets are crucial before applying resampling, as this technique is sensitive to noisy data. While feature selection improved performance to some extent, it could also lead to unreadable results due to noise. The choice of classifier had a smaller impact on model performance than resampling and feature selection.
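One of the compared pipelines (LASSO-style feature selection, SMOTE applied only inside each training fold, then a classifier, scored by AUC and average precision) can be sketched as follows. This assumes the imbalanced-learn package for SMOTE and its resampling-aware Pipeline, uses an L1-penalised logistic regression as the LASSO selector, and substitutes synthetic data for the proprietary alarm dataset.

```python
import numpy as np
from imblearn.over_sampling import SMOTE        # imbalanced-learn package
from imblearn.pipeline import Pipeline          # applies SMOTE only while fitting
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the alarm features: about 1 % of radios are returned.
X, y = make_classification(n_samples=10000, n_features=30, n_informative=8,
                           weights=[0.99], flip_y=0, random_state=0)

pipeline = Pipeline([
    # L1-penalised logistic regression acts as a LASSO-style feature selector.
    ("select", SelectFromModel(LogisticRegression(penalty="l1",
                                                  solver="liblinear", C=0.1))),
    ("smote", SMOTE(random_state=0)),           # resampling stays inside the training folds
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

scores = cross_validate(pipeline, X, y, cv=5,
                        scoring=["roc_auc", "average_precision"])
print("median AUC:  ", np.median(scores["test_roc_auc"]).round(3))
print("median AUPRC:", np.median(scores["test_average_precision"]).round(3))
```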
24

Machine Learning for Improving Detection of Cooling Complications : A case study / Maskininlärning för att förbättra detektering av kylproblem

Bruksås Nybjörk, William January 2022 (has links)
The growing market for cold chain pharmaceuticals requires reliable and flexible logistics solutions that ensure the quality of the drugs. These pharmaceuticals must be kept cool to retain their function and effect. Therefore, it is of greatest concern to keep these drugs within the specified temperature interval. Temperature controllable containers are a common logistic solution for cold chain pharmaceutical freight. One of the leading manufacturers of these containers provides lease and shipment services while also regularly assessing the cooling function. A method is applied for detecting cooling issues and preventing impaired containers from being sent to customers. However, the method tends to misclassify containers, missing some faulty containers while also classifying functional containers as faulty. This thesis aims to investigate and identify the dependent variables associated with cooling performance, and then apply machine learning to evaluate whether recall and precision can be improved. An improvement could lead to faster response, less waste and even more reliable freight, which could be vital for both companies and patients. The labeled dataset has a binary outcome (no cooling issues, cooling issues) and is heavily imbalanced, since the containers are of high quality and undergo frequent testing and maintenance; therefore, only a small fraction has cooling issues. After analyzing the data, extensive deviations were identified which suggested that the labeled data was misclassified. The believed misclassification was corrected and compared to the original data. A Random Forest classifier in combination with random oversampling and threshold tuning resulted in the best performance for the corrected class labels. Recall reached 86% and precision 87%, which is a very promising result. A Random Forest classifier in combination with random oversampling resulted in the best score for the original class labels. Recall reached 77% and precision 44%, which is much lower than for the adjusted class labels but still a valid result in the context of the believed extent of misclassification. Power output variables, compressor error variables and the standard deviation of the inside temperature showed a clear connection to cooling complications. Clear links could also be found to the critical cases where the set temperature could not be met. These cases could therefore be easily detected but were harder to prevent since they often appeared without warning. / Den växande marknaden för läkemedel beroende av kylkedja kräver pålitliga och agila logistiska lösningar som försäkrar kvaliteten hos läkemedlen. Dessa läkemedel måste förbli kylda för att behålla funktion och effekt. Därför är det av största vikt att hålla läkemedlen inom det angivna temperaturintervallet. Temperaturkontrollerade containrar är en vanlig logistisk lösning vid kylkedjefrakt av läkemedel. En av de ledande tillverkarna av dessa containrar tillhandahåller uthyrning och frakttjänster av dessa medan de också regelbundet bedömer containrarnas kylfunktion. En metod används för att detektera kylproblem och förhindra skadade containrar från att nå kund. Dock så tenderar denna metod att missklassificera containrar genom att missa vissa containrar med kylproblem och genom att klassificera fungerande containrar som skadade.
Den här uppsatsen har som syfte att identifiera beroende variabler kopplade mot kylprestandan och därefter undersöka om maskininlärning kan användas för att förbättra återkallelse och precisionsbetyg gällande containrar med kylproblem. En förbättring kan leda till snabbare respons, mindre resursslöseri och ännu pålitligare frakt vilket kan vara vitalt för både företag som patienter. Ett märkt dataset tillhandahålls och detta har ett binärt utfall (inga kylproblem, kylproblem). Datasetet är kraftigt obalanserat då containrar har en hög kvalité och genomgår frekvent testning och underhåll. Därför har enbart en liten del av containrarna kylproblem. Efter att ha analyserat datan så kunde omfattande avvikelser upptäckas vilket antydde på grov miss-klassificering. Den trodda missklassificeringen korrigerades och jämfördes med den originella datan. En Random Forest klassificerare i kombination med slumpmässig översampling och tröskeljustering gav det bästa resultatet för det korrigerade datasetet. En återkallelse på 86% och en precision på 87% nåddes, vilket var ett lovande resultat. En Random Forest klassificerare i kombination med slumpmässig översampling gav det bästa resultatet för det originella datasetet. En återkallelse på 77% och en precision på 44% nåddes. Detta var mycket lägre än det justerade datasetet men det presenterade fortfarande godkända resultat med åtanke på den trodda missklassificeringen. Variabler baserade på uteffekt, kompressorfel och standardavvikelse av innetemperatur hade tydliga kopplingar mot kylproblem. Tydliga kopplingar kunde även identifieras bland de kritiska fallen där temperaturen ej kunde bibehållas. Dessa fall kunde därmed lätt detekteras men var svårare att förhindra då dessa ofta uppkom utan förvarning.
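The combination reported as best for the corrected labels (random oversampling, a Random Forest, and a tuned decision threshold) can be sketched as below. It assumes the imbalanced-learn package for RandomOverSampler, tunes the threshold on a held-out validation split via the precision-recall curve, and uses synthetic data in place of the container measurements.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler   # imbalanced-learn package
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the container data: about 3 % have cooling issues.
X, y = make_classification(n_samples=5000, weights=[0.97], flip_y=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  stratify=y, random_state=0)

# Oversample only the training split so the validation data stays untouched.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_res, y_res)

# Tune the decision threshold on the validation split instead of using 0.5.
proba = forest.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = np.argmax(f1[:-1])            # the last PR point has no threshold
print("chosen threshold:", thresholds[best].round(3),
      "recall:", recall[best].round(2), "precision:", precision[best].round(2))
```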
25

Anomaly Detection in Categorical Data with Interpretable Machine Learning : A random forest approach to classify imbalanced data

Yan, Ping January 2019 (has links)
Metadata refers to "data about data", which contains information needed to understand the process of data collection. In this thesis, we investigate if metadata features can be used to detect broken data and how a tree-based interpretable machine learning algorithm can be used for an effective classification. The goal of this thesis is two-fold. Firstly, we apply a classification schema using metadata features for detecting broken data. Secondly, we generate the feature importance rate to understand the model's logic and reveal the key factors that lead to broken data. The given task from the Swedish automotive company Veoneer is a typical problem of learning from an extremely imbalanced data set, with 97 percent of the data belonging to the healthy class and only 3 percent belonging to the broken class. Furthermore, the whole data set contains only categorical variables on nominal scales, which brings challenges to the learning algorithm. The notion of handling the imbalance problem for continuous data is relatively well-studied, but for categorical data, the solution is not straightforward. In this thesis, we propose a combination of tree-based supervised learning and hyper-parameter tuning to identify the broken data from a large data set. Our methods are composed of three phases: data cleaning, which eliminates ambiguous and redundant instances, followed by the supervised learning algorithm with random forest; lastly, we applied a random search for hyper-parameter optimization on the random forest model. Our results show empirically that the tree-based ensemble method together with a random search for hyper-parameter optimization improved random forest performance in terms of the area under the ROC curve. The model surpassed an acceptable classification result and showed that metadata features are capable of detecting broken data and providing an interpretable result by identifying the key features for the classification model.
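A compact sketch of the pipeline described here (one-hot encoding of the nominal metadata, a random forest, and a random search over its hyper-parameters scored by ROC AUC) is given below. The column names, parameter ranges and randomly generated labels are illustrative stand-ins for the Veoneer data, not its actual schema.

```python
import numpy as np
import pandas as pd
from scipy.stats import randint
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic nominal metadata with about 3 % "broken" records (hypothetical columns).
rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "sensor_type": rng.choice(["A", "B", "C", "D"], n),
    "firmware": rng.choice(["v1", "v2", "v3"], n),
    "site": rng.choice([f"site_{i}" for i in range(20)], n),
})
y = (rng.random(n) < 0.03).astype(int)          # 1 = broken data

model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), list(X.columns))])),
    ("forest", RandomForestClassifier(class_weight="balanced", random_state=0)),
])

search = RandomizedSearchCV(
    model,
    param_distributions={
        "forest__n_estimators": randint(50, 300),
        "forest__max_depth": randint(3, 25),
        "forest__min_samples_leaf": randint(1, 20),
    },
    n_iter=10, scoring="roc_auc", cv=5, random_state=0)
search.fit(X, y)

# Per-column importances provide the interpretability angle of the thesis.
importances = search.best_estimator_["forest"].feature_importances_
names = search.best_estimator_["encode"].get_feature_names_out()
top = np.argsort(importances)[::-1][:5]
print(search.best_params_)
print(list(zip(names[top], importances[top].round(3))))
```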
26

Investigation of training data issues in ensemble classification based on margin concept : application to land cover mapping / Investigation des problèmes des données d'apprentissage en classification ensembliste basée sur le concept de marge : application à la cartographie d'occupation du sol

Feng, Wei 19 July 2017 (has links)
La classification a été largement étudiée en apprentissage automatique. Les méthodes d’ensemble, qui construisent un modèle de classification en intégrant des composants d’apprentissage multiples, atteignent des performances plus élevées que celles d’un classifieur individuel. La précision de classification d’un ensemble est directement influencée par la qualité des données d’apprentissage utilisées. Cependant, les données du monde réel sont souvent affectées par les problèmes de bruit d’étiquetage et de déséquilibre des données. La marge d'ensemble est un concept clé en apprentissage d'ensemble. Elle a été utilisée aussi bien pour l'analyse théorique que pour la conception d'algorithmes d'apprentissage automatique. De nombreuses études ont montré que la performance de généralisation d'un classifieur ensembliste est liée à la distribution des marges de ses exemples d'apprentissage. Ce travail se focalise sur l'exploitation du concept de marge pour améliorer la qualité de l'échantillon d'apprentissage et ainsi augmenter la précision de classification de classifieurs sensibles au bruit, et pour concevoir des ensembles de classifieurs efficaces capables de gérer des données déséquilibrées. Une nouvelle définition de la marge d'ensemble est proposée. C'est une version non supervisée d'une marge d'ensemble populaire. En effet, elle ne requière pas d'étiquettes de classe. Les données d'apprentissage mal étiquetées sont un défi majeur pour la construction d'un classifieur robuste que ce soit un ensemble ou pas. Pour gérer le problème d'étiquetage, une méthode d'identification et d'élimination du bruit d'étiquetage utilisant la marge d'ensemble est proposée. Elle est basée sur un algorithme existant d'ordonnancement d'instances erronées selon un critère de marge. Cette méthode peut atteindre un taux élevé de détection des données mal étiquetées tout en maintenant un taux de fausses détections aussi bas que possible. Elle s'appuie sur les valeurs de marge des données mal classifiées, considérant quatre différentes marges d'ensemble, incluant la nouvelle marge proposée. Elle est étendue à la gestion de la correction du bruit d'étiquetage qui est un problème plus complexe. Les instances de faible marge sont plus importantes que les instances de forte marge pour la construction d'un classifieur fiable. Un nouvel algorithme, basé sur une fonction d'évaluation de l'importance des données, qui s'appuie encore sur la marge d'ensemble, est proposé pour traiter le problème de déséquilibre des données. Cette méthode est évaluée, en utilisant encore une fois quatre différentes marges d'ensemble, vis à vis de sa capacité à traiter le problème de déséquilibre des données, en particulier dans un contexte multi-classes. En télédétection, les erreurs d'étiquetage sont inévitables car les données d'apprentissage sont typiquement issues de mesures de terrain. Le déséquilibre des données d'apprentissage est un autre problème fréquent en télédétection. Les deux méthodes d'ensemble proposées, intégrant la définition de marge la plus pertinente face à chacun de ces deux problèmes majeurs affectant les données d'apprentissage, sont appliquées à la cartographie d'occupation du sol. / Classification has been widely studied in machine learning. Ensemble methods, which build a classification model by integrating multiple component learners, achieve higher performances than a single classifier. The classification accuracy of an ensemble is directly influenced by the quality of the training data used. 
However, real-world data often suffers from class noise and class imbalance problems. Ensemble margin is a key concept in ensemble learning. It has been applied to both the theoretical analysis and the design of machine learning algorithms. Several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples. This work focuses on exploiting the margin concept to improve the quality of the training set and therefore to increase the classification accuracy of noise sensitive classifiers, and to design effective ensemble classifiers that can handle imbalanced datasets. A novel ensemble margin definition is proposed. It is an unsupervised version of a popular ensemble margin. Indeed, it does not involve the class labels. Mislabeled training data is a challenge to face in order to build a robust classifier whether it is an ensemble or not. To handle the mislabeling problem, we propose an ensemble margin-based class noise identification and elimination method based on an existing margin-based class noise ordering. This method can achieve a high mislabeled instance detection rate while keeping the false detection rate as low as possible. It relies on the margin values of misclassified data, considering four different ensemble margins, including the novel proposed margin. This method is extended to tackle the class noise correction which is a more challenging issue. The instances with low margins are more important than safe samples, which have high margins, for building a reliable classifier. A novel bagging algorithm based on a data importance evaluation function relying again on the ensemble margin is proposed to deal with the class imbalance problem. In our algorithm, the emphasis is placed on the lowest margin samples. This method is evaluated using again four different ensemble margins in addressing the imbalance problem especially on multi-class imbalanced data. In remote sensing, where training data are typically ground-based, mislabeled training data is inevitable. Imbalanced training data is another problem frequently encountered in remote sensing. Both proposed ensemble methods involving the best margin definition for handling these two major training data issues are applied to the mapping of land covers.
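The unsupervised flavour of the ensemble margin mentioned in the abstract, a margin computed from the committee's votes alone without class labels, can be illustrated as follows with a plain bagging ensemble. This is only an illustrative sketch, not the thesis's exact margin definitions or its noise-filtering and bagging algorithms; synthetic data replaces the remote sensing samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic imbalanced data as a stand-in for the land cover samples.
X, y = make_classification(n_samples=2000, weights=[0.9], flip_y=0.05, random_state=0)

def unsupervised_margin(bagging, X):
    """Per-sample margin without class labels: the vote fraction of the most
    voted class minus the vote fraction of the second most voted class."""
    votes = np.stack([est.predict(X) for est in bagging.estimators_])  # (n_trees, n_samples)
    margins = np.empty(votes.shape[1])
    for i, column in enumerate(votes.T):
        counts = np.sort(np.bincount(column.astype(int)))[::-1]
        second = counts[1] if len(counts) > 1 else 0
        margins[i] = (counts[0] - second) / len(column)
    return margins

bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)
margins = unsupervised_margin(bagging, X)

# Low-margin training samples are the natural candidates for label-noise
# inspection or for extra emphasis when the classes are imbalanced.
print("lowest-margin sample indices:", np.argsort(margins)[:10])
```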
27

SCUT-DS: Methodologies for Learning in Imbalanced Data Streams

Olaitan, Olubukola January 2018 (has links)
The automation of most of our activities has led to the continuous production of data that arrive in the form of fast-arriving streams. In a supervised learning setting, instances in these streams are labeled as belonging to a particular class. When the number of classes in the data stream is more than two, such a data stream is referred to as a multi-class data stream. Multi-class imbalanced data stream describes the situation where the instance distribution of the classes is skewed, such that instances of some classes occur more frequently than others. Classes with the frequently occurring instances are referred to as the majority classes, while the classes with instances that occur less frequently are denoted as the minority classes. Classification algorithms, or supervised learning techniques, use historic instances to build models, which are then used to predict the classes of unseen instances. Multi-class imbalanced data stream classification poses a great challenge to classical classification algorithms. This is due to the fact that traditional algorithms are usually biased towards the majority classes, since they have more examples of the majority classes when building the model. These traditional algorithms yield low predictive accuracy rates for the minority instances and need to be augmented, often with some form of sampling, in order to improve their overall performances. In the literature, in both static and streaming environments, most studies focus on the binary class imbalance problem. Furthermore, research in multi-class imbalance in the data stream environment is limited. A number of researchers have proceeded by transforming a multi-class imbalanced setting into multiple binary class problems. However, such a transformation does not allow the stream to be studied in the original form and may introduce bias. The research conducted in this thesis aims to address this research gap by proposing a novel online learning methodology that combines oversampling of the minority classes with cluster-based majority class under-sampling, without decomposing the data stream into multiple binary sets. Rather, sampling involves continuously selecting a balanced number of instances across all classes for model building. Our focus is on improving the rate of correctly predicting instances of the minority classes in multi-class imbalanced data streams, through the introduction of the Synthetic Minority Over-sampling Technique (SMOTE) and Cluster-based Under-sampling - Data Streams (SCUT-DS) methodologies. In this work, we dynamically balance the classes by utilizing a windowing mechanism during the incremental sampling process. Our SCUT-DS algorithms are evaluated using six different types of classification techniques, followed by comparing their results against a state-of-the-art algorithm. Our contributions are tested using both synthetic and real data sets. The experimental results show that the approaches developed in this thesis yield high prediction rates of minority instances as contained in the multiple minority classes within a non-evolving stream.
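A static, single-batch sketch of the kind of balancing SCUT-DS performs on each window (SMOTE oversampling of the minority classes combined with cluster-based under-sampling of the majority classes) is given below. It assumes the imbalanced-learn package, uses k-means centroids for the under-sampling, balances every class to the median class size, and omits the streaming, windowing and incremental aspects entirely; the data are synthetic.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE        # imbalanced-learn package
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

def scut_style_balance(X, y, random_state=0):
    """Balance a multi-class batch: replace classes above the median class
    size with k-means centroids, then SMOTE the classes below it.
    (Assumes every minority class has more samples than SMOTE's k_neighbors.)"""
    counts = Counter(y)
    target = int(np.median(list(counts.values())))

    X_parts, y_parts = [], []
    for cls, n in counts.items():
        X_cls = X[y == cls]
        if n > target:
            km = KMeans(n_clusters=target, n_init=10, random_state=random_state).fit(X_cls)
            X_parts.append(km.cluster_centers_)          # cluster-based under-sampling
            y_parts.append(np.full(target, cls))
        else:
            X_parts.append(X_cls)
            y_parts.append(np.full(n, cls))
    X_mid, y_mid = np.vstack(X_parts), np.concatenate(y_parts)

    small = {cls: target for cls, n in Counter(y_mid).items() if n < target}
    if small:                                            # SMOTE the remaining minorities
        X_mid, y_mid = SMOTE(sampling_strategy=small,
                             random_state=random_state).fit_resample(X_mid, y_mid)
    return X_mid, y_mid

# Synthetic four-class imbalanced batch standing in for one stream window.
X, y = make_classification(n_samples=2000, n_classes=4, n_informative=6,
                           weights=[0.7, 0.15, 0.1, 0.05], random_state=0)
X_bal, y_bal = scut_style_balance(X, y)
print("before:", Counter(y), "after:", Counter(y_bal))
```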
28

Analysing and predicting differences between methylated and unmethylated DNA sequence features

Ali, Isse January 2015 (has links)
DNA methylation is involved in various biological phenomena, and its dysregulation has been demonstrated to be correlated with a number of human disease processes, including cancers, autism, and autoimmune, mental health and neuro-degenerative ones. It has become important and useful to characterise and model these biological phenomena in order to understand the mechanism of such occurrences, in relation to both health and disease. An attempt has previously been made to map DNA methylation across human tissues; however, the means of distinguishing between methylated, unmethylated and differentially-methylated groups using DNA sequence features remains unclear. The aim of this study is therefore, firstly, to investigate DNA methylation classes and predict them based on DNA sequence features, and secondly, to further identify methylation-associated DNA sequence features and distinguish methylation differences between males and females in relation to both healthy and diseased statuses. This research is conducted on three samples within nine biological feature sub-sets extracted from DNA sequence patterns (human genome database). Two samples contain classes (methylated, unmethylated and differentially-methylated) within a total of 642 samples with 3,809 attributes derived from four human chromosomes, i.e. chromosomes 6, 20, 21 and 22. The third sample covers all human chromosomes and encompasses 1,628 individuals, from which 1,505 CpG loci (features) were extracted using hierarchical clustering (heatmap) with pairwise correlation distance, after which feature selection methods were applied. From this analysis, the author extracts 47 features associated with gender and age, with 17 revealing significant methylation differences between males and females. Methylation class prediction was performed with a K-nearest Neighbour classifier combined with ten-fold cross-validation. Since some of the data were severely imbalanced (i.e., split into sub-classes), and it is established that direct analysis in machine learning is biased towards the majority class, the author proposes Modified-Leave-One-Out (MLOO) cross-validation and AdaBoost methods to tackle these issues, with the aim of producing a balanced outcome and limiting the bias interference from inter-differences of the classes involved, which provided potential predictive accuracies between 75% and 100%, based on the DNA sequence context.
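A baseline corresponding to the K-nearest Neighbour classifier with ten-fold cross-validation described above can be sketched as follows; the Modified-Leave-One-Out scheme and the AdaBoost adjustment proposed in the thesis are not reproduced, and synthetic data replaces the sequence-derived features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three imbalanced methylation classes, 642 samples.
X, y = make_classification(n_samples=642, n_features=20, n_informative=8,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Balanced accuracy is reported because plain accuracy rewards ignoring
# the minority class on imbalanced data.
scores = cross_val_score(knn, X, y, cv=cv, scoring="balanced_accuracy")
print(scores.mean().round(3), "+/-", scores.std().round(3))
```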
29

Utveckling av beslutsstöd för kreditvärdighet / Development of decision support for creditworthiness

Arvidsson, Martin, Paulsson, Eric January 2013 (has links)
The aim is to develop a new decision-making model for credit loans. The model will be specific to credit applicants of the OKQ8 bank, because it is based on data about earlier credit applicants from the client (the bank). The final model is, in effect, functional enough to use information about a new applicant as input and predict the outcome as either the good risk group or the bad risk group based on the applicant's properties. The prediction may then lay the foundation for the decision to grant or deny a credit loan. Because of the skewed distribution of the response variable, different sampling techniques are evaluated. These include oversampling with SMOTE, random undersampling and pure oversampling in the form of scalar weighting of the minority class. It is shown that the predictive quality of a classifier is affected by the distribution of the response, and that the oversampled information is not too redundant. Three classification techniques are evaluated. Our results suggest that a multi-layer neural network with 18 neurons in a hidden layer, equipped with an ensemble technique called boosting, gives the best predictive power. The most successful model is based on a feed-forward structure and trained with a variant of back-propagation using conjugate-gradient optimization. Two other models with good prediction quality are developed using logistic regression and a decision tree classifier, but they do not reach the level of the network. However, the results of these models are used to answer the question of which customer properties are important when determining credit risk. Two examples of important customer properties are income and the number of earlier credit reports of the applicant. Finally, we use the best classification model to predict the outcome for a set of applicants declined by the existing filter. The results show that the network model accepts over 60% of the applicants who had previously been denied credit. This may indicate that the client's suspicion that the existing model is too restrictive is in fact true.
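The winning configuration, a feed-forward network with a single hidden layer of 18 neurons trained on resampled data, can be sketched roughly as below. The lbfgs solver stands in for the conjugate-gradient training used in the thesis, the boosting of networks is omitted, SMOTE from the imbalanced-learn package represents the oversampling step, and the applicant data is replaced by a synthetic set.

```python
from imblearn.over_sampling import SMOTE        # imbalanced-learn package
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the applicant data: about 10 % bad-risk cases.
X, y = make_classification(n_samples=3000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=0)

# Balance the training data only, then fit the 18-neuron hidden layer network.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
net = MLPClassifier(hidden_layer_sizes=(18,), solver="lbfgs",
                    max_iter=1000, random_state=0).fit(X_res, y_res)

print(classification_report(y_test, net.predict(X_test)))
```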
30

Machine Learning for Classification of Temperature Controlled Containers Using Heavily Imbalanced Data / Maskininlärning för klassificering av temperatur reglerbara containrar genom användande av extremt obalanserad data

Ranjith, Adam January 2022 (has links)
Temperature controllable containers are used frequently to transport pharmaceutical cargo all around the world. One of the leading manufacturers of these containers has a method for detecting containers with a faulty cooling system before making a shipment. However, the problem with this method is that the model tends to misclassify containers. Hence, this thesis investigates whether machine learning would make the classification of containers more accurate. There is a complication, however: the data set is extremely imbalanced. If machine learning can be used to improve container manufacturers' fault detection systems, it would imply less damaged and delayed pharmaceutical cargo, which could be vital. Various combinations of machine learning classifiers and techniques for handling the imbalance were tested in order to find the best one. The Random Forest classifier with oversampling was the best performing combination and performed about as well as the company's current method, with a recall score of 92% and a precision score of 34%. Previously there were no known papers on machine learning for classification of temperature controllable containers; other manufacturing companies could now favourably use the concepts and methods presented in this thesis to enhance the effectiveness of their fault detection systems and consequently improve the overall shipping efficiency of pharmaceutical cargo. / Temperatur reglerbara containrar används frekvent inom medicinsk transport runt om i hela världen. Ett ledande företag som är tillverkare av dessa containrar använder sig av en metod för att upptäcka containrar med ett felaktigt kylsystem redan innan de hunnit ut på en transport. Denna metod är fungerande men inte perfekt då den tenderar att felaktigt klassificera containrar. Detta examensarbete är en utredande avhandling för att ta reda på om maskininlärning kan användas för att förbättra klassificeringen av containrar. Det finns dock ett problem, data setet är extremt obalanserat. Om maskininlärning kan användas för att förbättra felsökningssystemen hos tillverkare av temperatur reglerbara containrar skulle det innebära mindre förstörda samt mindre försenade medicinska transporter vilket kan vara livsavgörande. Ett urval av kombinationer mellan maskininlärningsmodeller och tekniker för att hantera obalanserad data testades för att avgöra vilken som är optimal. Klassificeraren Random Forest ihop med över-sampling resulterade i bäst prestanda, ungefär lika bra som företagets nuvarande metod. Tidigare har det inte funnits några kända rapporter om användning av maskininlärning för att klassificera temperatur reglerbara containrar. Nu kan dock andra tillverkare av containrar använda sig av koncept och metoder som presenterades i avhandlingen för att optimera deras felsökningssystem och således förbättra den allmänna effektiviteten inom medicinsk transport.
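The sweep over combinations of resampling techniques and classifiers described in this abstract can be organised as in the sketch below, which reports recall and precision for each pair. It assumes the imbalanced-learn package and replaces the container data with a synthetic imbalanced set; the specific samplers and classifiers listed are illustrative, not the exact set evaluated in the thesis.

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE   # imbalanced-learn package
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the container data: roughly 3 % faulty units.
X, y = make_classification(n_samples=4000, weights=[0.97], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

samplers = {"oversample": RandomOverSampler(random_state=0),
            "SMOTE": SMOTE(random_state=0),
            "undersample": RandomUnderSampler(random_state=0)}
classifiers = {"random forest": RandomForestClassifier(n_estimators=200, random_state=0),
               "logistic regression": LogisticRegression(max_iter=1000)}

# Resample only the training split, then score every sampler/classifier pair.
for s_name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)
    for c_name, clf in classifiers.items():
        pred = clf.fit(X_res, y_res).predict(X_te)
        print(f"{s_name:>11} + {c_name:<19}"
              f" recall={recall_score(y_te, pred):.2f}"
              f" precision={precision_score(y_te, pred, zero_division=0):.2f}")
```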
