51

Exploring Alarm Data for Improved Return Prediction in Radios : A Study on Imbalanced Data Classification

Färenmark, Sofia January 2023 (has links)
The global tech company Ericsson has been tracking the return rate of its products for over 30 years, using it as a key performance indicator (KPI). These KPIs play a critical role in making sound business decisions, identifying areas for improvement, and planning. To enhance the customer experience, the company highly values the ability to predict the number of returns each month in advance. However, predicting returns is a complex problem affected by multiple factors that determine when radios are returned. Analysts at the company have observed indications of a potential correlation between alarm data and the number of returns. This paper aims to address the need for better prediction models to improve return rate forecasting for radios, utilizing alarm data. The alarm data, stored in an internal database, includes logs of activated alarms at various sites, technical and logistical information about the products, and historical records of returns. The problem is approached as a classification task, where radios are classified as either "return" or "no return" for a specific month, using the alarm dataset as input. However, because far fewer radios are returned than distributed, the dataset suffers from a heavy class imbalance. The class imbalance problem has garnered considerable attention in the field of machine learning in recent years, as traditional classification models struggle to identify patterns in the minority class of imbalanced datasets. A method that specifically addresses the class imbalance problem was therefore required to construct an effective prediction model for returns. This paper adopts a systematic approach inspired by similar problems: it applies the feature selection methods LASSO and Boruta along with the resampling technique SMOTE, and evaluates several classifiers, including a support vector machine (SVM), a random forest classifier (RFC), a decision tree (DT), and a neural network (NN) with class weights, to identify the best-performing model. As accuracy is not a suitable evaluation metric for imbalanced datasets, AUC and AUPRC values were calculated for all models to assess the impact of feature selection, weights, resampling techniques, and the choice of classifier. The best model was the NN with class weights, achieving a median AUC of 0.93 and a median AUPRC of 0.043. The LASSO+SVM+SMOTE and LASSO+RFC+SMOTE models demonstrated similar performance, with median AUC values of 0.92 and 0.93 and median AUPRC values of 0.038 and 0.041, respectively. The baseline AUPRC value for this dataset was 0.005. Furthermore, the results indicated that resampling techniques are necessary for successful classification of the minority class. Thorough pre-processing and a balanced split between the test and training sets are crucial before applying resampling, as the technique is sensitive to noisy data. While feature selection improved performance to some extent, it could also produce unreliable results in the presence of noise. The choice of classifier had a smaller impact on model performance than resampling and feature selection.
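The kind of pipeline this abstract describes (LASSO-style feature selection, SMOTE resampling of the training set only, and AUC/AUPRC evaluation) can be sketched roughly as below. This is a minimal illustration on synthetic data, not the thesis's actual code; using L1-penalised logistic regression as the LASSO selector is an assumption.

```python
# Minimal sketch of a LASSO + SMOTE + classifier pipeline with AUC/AUPRC
# evaluation. Assumes scikit-learn and imbalanced-learn; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for the alarm dataset: ~1% positive ("return") class.
X, y = make_classification(n_samples=20_000, n_features=40, weights=[0.99],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# LASSO-style feature selection via L1-penalised logistic regression.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

# Resample the training set only; the test set keeps its natural imbalance.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr_sel, y_tr)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_res, y_res)

scores = clf.predict_proba(X_te_sel)[:, 1]
print("AUC:  ", roc_auc_score(y_te, scores))
print("AUPRC:", average_precision_score(y_te, scores))  # baseline = positive rate
```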
52

Deep Contrastive Metric Learning to Detect Polymicrogyria in Pediatric Brain MRI

Zhang, Lingfeng 28 November 2022 (has links)
Polymicrogyria (PMG) is a brain disease that mainly occurs in the pediatric brain. Severe PMG can cause seizures, delayed development, and a series of other problems. For this reason, it is critical to identify PMG effectively and start treatment early. Radiologists typically identify PMG through magnetic resonance imaging scans. In this study, we create and release a pediatric MRI dataset (named the PPMR dataset) including PMG cases and controls from the Children's Hospital of Eastern Ontario (CHEO), Ottawa, Canada. The difference between PMG MRIs and control MRIs is subtle, and the true distribution of the disease's features is unknown. Hence, we propose a novel center-based deep contrastive metric learning loss function (named the cDCM loss) to deal with this difficult problem. Cross-entropy-based loss functions do not lead to models that generalize well on small and imbalanced datasets with partially known distributions. We conduct exhaustive experiments on a modified CIFAR-10 dataset to demonstrate the efficacy of our proposed loss function compared to cross-entropy-based loss functions and the state-of-the-art Deep SAD loss function. Additionally, based on our proposed loss function, we customize a deep learning model that integrates dilated convolution, squeeze-and-excitation blocks, and feature fusion for our PPMR dataset, achieving 92.01% recall. Since our suggested method is a computer-aided tool to assist radiologists in selecting potential PMG MRIs, 55.04% precision is acceptable. To the best of our knowledge, this research is the first to apply machine learning techniques to identify PMG from MRI alone, and our method achieves better results than baseline methods.
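The abstract does not spell out the cDCM loss. As a rough illustration of the general idea of center-based contrastive metric learning, the hedged PyTorch sketch below pulls each embedding toward a learnable center for its class and pushes it away from the nearest other class's center by a margin; the thesis's actual formulation may differ.

```python
# Generic center-based contrastive loss sketch (not the thesis's exact cDCM
# loss): learnable per-class centers, pull-to-own-center plus a margin-based
# push from the nearest other center.
import torch
import torch.nn as nn

class CenterContrastiveLoss(nn.Module):
    def __init__(self, embed_dim: int, num_classes: int = 2, margin: float = 1.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.margin = margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        dists = torch.cdist(embeddings, self.centers)          # (batch, classes)
        pos = dists.gather(1, labels.unsqueeze(1)).squeeze(1)  # own-center distance
        # Mask out the own class, then take the nearest other center.
        masked = dists.scatter(1, labels.unsqueeze(1), float("inf"))
        neg = masked.min(dim=1).values
        # Pull toward own center; push the nearest other center past the margin.
        return (pos.pow(2) + torch.relu(self.margin - neg).pow(2)).mean()

# Usage sketch: loss = CenterContrastiveLoss(embed_dim=128)(model(x), y)
```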
53

Machine Learning for Improving Detection of Cooling Complications : A case study / Maskininlärning för att förbättra detektering av kylproblem

Bruksås Nybjörk, William January 2022 (has links)
The growing market for cold chain pharmaceuticals requires reliable and flexible logistics solutions that ensure the quality of the drugs. These pharmaceuticals must remain cool to retain their function and effect, so it is of greatest concern to keep them within the specified temperature interval. Temperature-controllable containers are a common logistics solution for cold chain pharmaceutical freight. One of the leading manufacturers of these containers provides lease and shipment services while also regularly assessing the cooling function. A method is applied to detect cooling issues and prevent impaired containers from being sent to customers. However, the method tends to misclassify containers, missing some faulty containers while also classifying functional containers as faulty. This thesis aims to investigate and identify the variables associated with cooling performance, and then apply machine learning to evaluate whether recall and precision can be improved. An improvement could lead to faster response, less waste, and even more reliable freight, which could be vital for both companies and patients. The labeled dataset has a binary outcome (no cooling issues, cooling issues) and is heavily imbalanced, since the containers are of high quality and undergo frequent testing and maintenance; only a small fraction have cooling issues. Analysis of the data revealed extensive deviations suggesting that the labeled data was misclassified. The suspected misclassification was corrected and compared to the original data. A Random Forest classifier combined with random oversampling and threshold tuning gave the best performance for the corrected class labels: recall reached 86% and precision 87%, a very promising result. A Random Forest classifier combined with random oversampling gave the best score for the original class labels: recall reached 77% and precision 44%, much lower than for the adjusted class labels but still a valid result given the believed extent of misclassification. Power output variables, compressor error variables, and the standard deviation of the inside temperature were found to have clear connections to cooling complications. Clear links could also be found to the critical cases where the set temperature could not be met. These cases could therefore be easily detected but were harder to prevent, since they often appeared without warning.
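A hedged sketch of the combination this abstract reports working best, random oversampling plus decision-threshold tuning on a Random Forest, is shown below. The data, F1-based threshold criterion, and parameter values are illustrative assumptions, not taken from the thesis.

```python
# Random oversampling + threshold tuning with a Random Forest (sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10_000, weights=[0.97], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Oversample the minority class in the training data only.
X_res, y_res = RandomOverSampler(random_state=1).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_res, y_res)

# Tune the decision threshold instead of using the default 0.5: here we pick
# the threshold maximising F1 over the precision-recall curve. (In practice
# this should be done on a held-out validation split, not the test set.)
proba = clf.predict_proba(X_te)[:, 1]
prec, rec, thresholds = precision_recall_curve(y_te, proba)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = thresholds[np.argmax(f1[:-1])]   # last curve point has no threshold
y_pred = (proba >= best).astype(int)
print(f"chosen threshold: {best:.2f}, positives flagged: {y_pred.sum()}")
```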
54

Strong interaction between two co-rotating vortices in rotating and stratified flows

Bambrey, Ross R. January 2007 (has links)
In this study we investigate the interactions between two co-rotating vortices. These vortices are subject to rapid rotation and stable stratification, such as are found in planetary atmospheres and oceans. By conducting a large number of simulations of vortex interactions, we intend to provide an overview of the interactions that could occur in geophysical turbulence. We consider a wide parameter space covering the vortices' height-to-width aspect ratios, their volume ratios, and the vertical offset between them. The vortices are initially separated in the horizontal so that they reside at an estimated margin of stability, and are then allowed to evolve for approximately 20 vortex revolutions. We find that the most commonly observed interaction under the quasi-geostrophic (QG) regime is partial merger, where only part of the smaller vortex is incorporated into the larger, stronger vortex. At the same time, a large number of filamentary and small-scale structures are generated during the interaction. We find that, despite the proliferation of small-scale structures, the self-induced vortex energy exhibits a mean `inverse cascade' to larger-scale structures. Interestingly, we observe a range of intermediate-scale structures that are preferentially sheared out during the interactions, leaving two vortex populations, one of large-scale vortices and one of small-scale vortices. We take a subset of the parameter space used for the QG study and perform simulations using a non-hydrostatic model. This system, free of the layer-wise two-dimensional constraints and geostrophic balance of the QG model, allows for the generation of inertia-gravity waves and ageostrophic advection. The study of the interactions between two co-rotating, non-hydrostatic vortices is performed over four different Rossby numbers, two positive and two negative, allowing comparison of cyclonic and anticyclonic interactions. We find that a greater amount of wave-like activity is generated during interactions in anticyclonic situations, and we see distinct qualitative differences between the interactions in the cyclonic and anticyclonic regimes.
55

Development of artificial intelligence-based in-silico toxicity models : data quality analysis and model performance enhancement through data generation

Malazizi, Ladan January 2008 (has links)
Toxic compounds, such as pesticides, are routinely tested against a range of aquatic, avian, and mammalian species as part of the registration process. The need to reduce dependence on animal testing has led to increasing interest in alternative methods such as in silico modelling. QSAR (Quantitative Structure-Activity Relationship) models are already in use for predicting physicochemical properties, environmental fate, eco-toxicological effects, and specific biological endpoints for a wide range of chemicals. Data plays an important role in modelling QSARs and in analysing the results of toxicity testing processes. This research addresses a number of issues in predictive toxicology. One issue is the problem of data quality. Although a large amount of toxicity data is available from online sources, this data may contain unreliable samples and may be of low quality. Its presentation also might not be consistent across sources, which makes the information difficult to access, interpret, and compare. To address this issue we started with a detailed investigation and experimental work on the DEMETRA data. The DEMETRA datasets were produced by the EC-funded project DEMETRA. Based on the investigation, experiments, and the results obtained, the author identified a number of data quality criteria in order to provide a solution for data evaluation in the toxicology domain. An algorithm has also been proposed to assess data quality before modelling. Another issue considered in the thesis is missing values in toxicology datasets. The least-squares method for a paired dataset and serial correlation for a single-version dataset provided solutions to the problem in two different situations, and a procedural algorithm using these two methods has been proposed to overcome the problem of missing values. A further issue addressed in this thesis is the modelling of multi-class datasets with severely imbalanced class distributions. Imbalanced data affects the performance of classifiers during the classification process. We have shown that as long as we understand how class members are constructed in the dimensional space of each cluster, we can reform the distribution and provide more domain knowledge for the classifier.
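The thesis does not detail its least-squares imputation for paired datasets; one plausible reading, sketched below under that assumption, is to fit a least-squares line between two paired endpoints on the complete cases and predict the missing member of each pair. All variable names and data here are hypothetical.

```python
# Hedged sketch: least-squares imputation for a paired dataset. A linear fit
# on complete pairs predicts missing values of one endpoint from the other.
import numpy as np

rng = np.random.default_rng(0)
endpoint_a = rng.normal(3.0, 1.0, 200)                    # e.g. log LC50, species A
endpoint_b = 0.8 * endpoint_a + rng.normal(0, 0.3, 200)   # correlated paired endpoint
endpoint_b[::10] = np.nan                                 # simulate missing values

# Fit y = m*x + c by least squares using complete pairs only.
mask = ~np.isnan(endpoint_b)
m, c = np.polyfit(endpoint_a[mask], endpoint_b[mask], deg=1)

# Impute missing endpoint_b values from their paired endpoint_a values.
endpoint_b[~mask] = m * endpoint_a[~mask] + c
print(f"imputed {np.count_nonzero(~mask)} values with y = {m:.2f}x + {c:.2f}")
```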
56

Anomaly Detection in Categorical Data with Interpretable Machine Learning : A random forest approach to classify imbalanced data

Yan, Ping January 2019 (has links)
Metadata refers to "data about data", which contains information needed to understand the process of data collection. In this thesis, we investigate whether metadata features can be used to detect broken data and how a tree-based interpretable machine learning algorithm can be used for effective classification. The goal of this thesis is two-fold. Firstly, we apply a classification schema using metadata features to detect broken data. Secondly, we generate feature importance rates to understand the model's logic and reveal the key factors that lead to broken data. The given task, from the Swedish automotive company Veoneer, is a typical problem of learning from an extremely imbalanced data set, with 97 percent of the data belonging to the healthy class and only 3 percent to the broken class. Furthermore, the whole data set contains only categorical variables on nominal scales, which brings challenges for the learning algorithm. Handling the imbalance problem for continuous data is relatively well studied, but for categorical data the solution is not straightforward. In this thesis, we propose a combination of tree-based supervised learning and hyper-parameter tuning to identify the broken data in a large data set. Our method is composed of three phases: data cleaning, which eliminates ambiguous and redundant instances; supervised learning with a random forest; and lastly a random search for hyper-parameter optimization of the random forest model. Our results show empirically that the tree-based ensemble method together with a random search for hyper-parameter optimization improves random forest performance in terms of the area under the ROC curve. The model achieved an acceptable classification result and showed that metadata features are capable of detecting broken data and providing an interpretable result by identifying the key features for the classification model.
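A minimal sketch of this setup, a random forest with a random hyper-parameter search over one-hot-encoded categorical features, might look like the following. The feature names, search ranges, and class-weighting choice are illustrative assumptions, not taken from the thesis.

```python
# Random forest + randomized hyper-parameter search on purely categorical data.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(2)
X = pd.DataFrame({
    "sensor_type": rng.choice(["radar", "camera", "lidar"], 5000),  # hypothetical
    "site": rng.choice(list("ABCDEF"), 5000),                       # hypothetical
})
y = (rng.random(5000) < 0.03).astype(int)   # ~3% "broken" class

pipe = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"),
          ["sensor_type", "site"])])),
    ("rf", RandomForestClassifier(class_weight="balanced", random_state=2)),
])

# Random search over a small illustrative grid, scored by area under the ROC.
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "rf__n_estimators": [100, 200, 400],
        "rf__max_depth": [None, 5, 10, 20],
        "rf__min_samples_leaf": [1, 2, 5],
    },
    n_iter=10, scoring="roc_auc", cv=3, random_state=2)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```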
57

SEX COMPOSITION AND FEMALE OFFENDING: UNDER THE IMPACT OF THE ONE-CHILD POLICY

Wang, Ting 01 January 2018 (has links)
This dissertation explores the mechanisms behind increasing female crime in China through the effect of the one-child policy, which is treated herein as a natural experiment. Data reveal that women's share of documented crime increased dramatically after the mid-1990s, when the first one-child generation reached the age of legal responsibility. This change reflects the interplay of behavioral change and a net-widening effect. The increasing criminality of the one-child generation is attributable to the gap between the individual's expectations of gender equality, reshaped by the unique socialization practices under the policy, and a stubbornly unequal gender hierarchy in society. As a result, one-child-generation women, who disproportionately suffer the resulting strains, are more likely to become involved in property and occupational crime as alternative means to fulfill their aspirations for economic success. Additionally, the policy affects not only the individual gender roles of only children but also their peers who have siblings, through the intermediary of a culture shift. The policy has therefore changed the behavior of a whole new generation through the process of socialization and the lag in structural change. The net-widening effect is another pathway from the unequal gender structure and ideologies to increasing female crime. Moral panic associated with the emergence of diverse forms of female offenses led to an inordinate degree of adverse attention focused upon one-child-generation women by criminal justice professionals. The increased criminalization of sexuality brought an increasing number of one-child-generation women into conflict with the law, usually for prostituting themselves for easy money to fulfill their material aspirations. Consequently, one-child-generation female offenders are perceived as "doubly deviant", from the law and from the socially prescribed ideology of gender, and are therefore punished more harshly than before by the criminal justice system. This dissertation not only explores an understudied country in criminological research but also seeks to apply its findings broadly to explain the increase in female crime that has been observed worldwide. It disentangles the theoretical controversy in explaining the increase in women's share of crime by embedding the argument in a multidimensional gender role repertoire.
58

Investigation of training data issues in ensemble classification based on margin concept : application to land cover mapping / Investigation des problèmes des données d'apprentissage en classification ensembliste basée sur le concept de marge : application à la cartographie d'occupation du sol

Feng, Wei 19 July 2017 (has links)
Classification has been widely studied in machine learning. Ensemble methods, which build a classification model by integrating multiple component learners, achieve higher performance than a single classifier. The classification accuracy of an ensemble is directly influenced by the quality of the training data used. However, real-world data often suffers from class noise and class imbalance problems. The ensemble margin is a key concept in ensemble learning. It has been applied to both the theoretical analysis and the design of machine learning algorithms. Several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples. This work focuses on exploiting the margin concept to improve the quality of the training set, and therefore to increase the classification accuracy of noise-sensitive classifiers, and to design effective ensemble classifiers that can handle imbalanced datasets. A novel ensemble margin definition is proposed. It is an unsupervised version of a popular ensemble margin: it does not involve the class labels. Mislabeled training data is a challenge to face in order to build a robust classifier, whether it is an ensemble or not. To handle the mislabeling problem, we propose an ensemble margin-based class noise identification and elimination method built on an existing margin-based class noise ordering. This method can achieve a high mislabeled-instance detection rate while keeping the false detection rate as low as possible. It relies on the margin values of misclassified data, considering four different ensemble margins, including the newly proposed one. The method is extended to tackle class noise correction, a more challenging issue. Instances with low margins are more important than safe samples, which have high margins, for building a reliable classifier. A novel bagging algorithm based on a data importance evaluation function, again relying on the ensemble margin, is proposed to deal with the class imbalance problem. In our algorithm, the emphasis is placed on the lowest-margin samples. This method is evaluated, again using four different ensemble margins, on the imbalance problem, especially on multi-class imbalanced data. In remote sensing, where training data are typically ground-based, mislabeled training data is inevitable, and imbalanced training data is another frequently encountered problem. Both proposed ensemble methods, each involving the margin definition best suited to the corresponding training data issue, are applied to the mapping of land covers.
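In the spirit of the label-free margin this abstract proposes, the sketch below computes an unsupervised ensemble margin for a random forest: the normalised difference between the vote counts of the most-voted and second-most-voted classes, which requires no true labels. The exact definition in the thesis may differ; this is an assumption-based illustration.

```python
# Unsupervised ensemble margin sketch: vote spread across a forest's trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           random_state=3)
forest = RandomForestClassifier(n_estimators=100, random_state=3).fit(X, y)

# Per-tree predictions for each sample: shape (n_samples, n_trees).
votes = np.stack([tree.predict(X) for tree in forest.estimators_], axis=1)
counts = np.stack([(votes == c).sum(axis=1) for c in forest.classes_], axis=1)

# Margin = (votes for top class - votes for runner-up) / total votes.
sorted_counts = np.sort(counts, axis=1)
top, second = sorted_counts[:, -1], sorted_counts[:, -2]
margin = (top - second) / counts.sum(axis=1)   # in [0, 1]; low = uncertain

# Low-margin instances are candidates for noise filtering or extra emphasis.
print("lowest-margin instance indices:", np.argsort(margin)[:10])
```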
59

Random forest on imbalanced data: an application to churn modeling in health insurance / Random forest em dados desbalanceados: uma aplicação na modelagem de churn em seguro saúde

Lento, Gabriel Carneiro 27 March 2017 (has links)
In this work we study churn in health insurance, that is, predicting which clients will cancel the product or service within a preset time frame. Traditionally, the probability that a client will cancel the service is modeled using logistic regression. Recently, modern machine learning techniques have become popular in churn modeling, with applications in telecommunications, banking, and car insurance, among other areas. One of the big challenges in this problem is that only a fraction of all customers cancel the service, meaning that we have to deal with highly imbalanced class probabilities. Under-sampling and over-sampling techniques have been used to overcome this issue. We use random forests, which are ensembles of decision trees in which each tree fits a subsample of the data constructed using either under-sampling or over-sampling. We compare distinct specifications of random forests using various metrics that are robust to imbalanced classes, both in-sample and out-of-sample. We observe that random forests using imbalanced random subsamples with fewer observations than the original sample present the best overall performance. Random forests also outperform the classical logistic regression often used by health insurance companies to model churn.
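A rough sketch of the approach this abstract favors, a forest whose trees are fit on under-sampled bootstrap subsamples smaller than the original data, is given below using imbalanced-learn's BalancedRandomForestClassifier. The data and parameters are illustrative; the thesis's actual features and sampling ratios are not reproduced.

```python
# Random forest over under-sampled bootstrap subsamples (sketch).
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.ensemble import BalancedRandomForestClassifier

X, y = make_classification(n_samples=20_000, weights=[0.95], random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

# Each tree is trained on a bootstrap sample where the majority class is
# under-sampled, i.e. each subsample has fewer observations than the original.
clf = BalancedRandomForestClassifier(n_estimators=300, random_state=4)
clf.fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]
print("AUC:  ", roc_auc_score(y_te, scores))    # imbalance-robust metrics
print("AUPRC:", average_precision_score(y_te, scores))
```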
60

Cost-sensitive boosting : a unified approach

Nikolaou, Nikolaos January 2016 (has links)
In this thesis we provide a unifying framework for two decades of work in an area of machine learning known as cost-sensitive boosting algorithms. This area is concerned with the fact that most real-world prediction problems are asymmetric, in the sense that different types of errors incur different costs. Adaptive Boosting (AdaBoost) is one of the most well-studied and utilised algorithms in the field of machine learning, with a rich theoretical depth as well as practical uptake across numerous industries. However, its inability to handle asymmetric tasks has been the subject of much criticism. As a result, numerous cost-sensitive modifications of the original algorithm have been proposed, each with its own motivations and its own claims to superiority. Through a thorough analysis of the literature from 1997 to 2016, we find 15 distinct cost-sensitive boosting variants, discounting minor variations. We critique the literature using four powerful theoretical frameworks: Bayesian decision theory, the functional gradient descent view, margin theory, and probabilistic modelling. From each framework, we derive a set of properties that must be obeyed by boosting algorithms. We find that only 3 of the published AdaBoost variants are consistent with the rules of all the frameworks, and even they require their outputs to be calibrated to achieve this. Experiments on 18 datasets, across 21 degrees of cost asymmetry, all support the hypothesis, showing that once calibrated, the three variants perform equivalently, outperforming all others. Our final recommendation, based on theoretical soundness, simplicity, flexibility, and performance, is to use the original AdaBoost algorithm, albeit with a shifted decision threshold and calibrated probability estimates. The conclusion is that novel cost-sensitive boosting algorithms are unnecessary if proper calibration is applied to the original.
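The recommendation at the end of this abstract translates into a short recipe: calibrate plain AdaBoost's scores into probabilities, then shift the decision threshold according to the cost ratio. With false-positive cost c_FP and false-negative cost c_FN, the Bayes-optimal threshold on P(y=1|x) is c_FP / (c_FP + c_FN). The sketch below assumes scikit-learn and illustrative costs.

```python
# Calibrated AdaBoost with a cost-shifted decision threshold (sketch).
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)

c_fp, c_fn = 1.0, 5.0                      # assume missing a positive is 5x worse
threshold = c_fp / (c_fp + c_fn)           # approx. 0.167 instead of the default 0.5

# Calibrate AdaBoost's outputs into probability estimates (Platt scaling here).
model = CalibratedClassifierCV(AdaBoostClassifier(n_estimators=200),
                               method="sigmoid", cv=3)
model.fit(X_tr, y_tr)

# Predict positive whenever the calibrated probability exceeds the cost ratio.
y_pred = (model.predict_proba(X_te)[:, 1] >= threshold).astype(int)
print("positives flagged:", y_pred.sum())
```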
