1 |
A Combined Approach to Handle Multi-class Imbalanced Data and to Adapt Concept Drifts using Machine Learning
Tumati, Saini, 05 October 2021
No description available.
|
2 |
Classification de bases de données déséquilibrées par des règles de décomposition / Handling imbalanced datasets by reconstruction rules in decomposition schemes
D'Ambrosio, Roberto, 07 March 2014
Imbalance among class priors arises in a large number of domains and makes conventional learning algorithms less effective at predicting samples that belong to minority classes. We aim to develop a reconstruction rule suited to multiclass skewed data. To this end, we use classification reliability, which conveys useful information about the quality of each classification act. Within the One-per-Class decomposition scheme, we design a novel reconstruction rule, Reconstruction Rule by Selection, which uses classifier reliabilities, crisp labels, and a priori distributions to compute the final decision. Tests show that system performance improves with this rule compared with well-established reconstruction rules. We also investigate reconstruction rules in the Error Correcting Output Code (ECOC) decomposition framework. Inspired by a statistical reconstruction rule designed for the One-per-Class and Pair-Wise Coupling decomposition approaches, we developed a rule that applies softmax regression to the reliability outputs in order to estimate the final classification. Results show that this choice improves performance with respect to the existing statistical rule and to well-established reconstruction rules. On the topic of reliability estimation, we note that little attention has been given to efficient posterior estimation in the boosting framework. For this reason, we develop an efficient posterior estimator by boosting Nearest Neighbors. Using the Universal Nearest Neighbours classifier, we prove that a sub-class of surrogate losses exists whose minimization yields simple and statistically efficient estimators of Bayes posteriors.
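To make the idea of a reconstruction rule in a One-per-Class scheme concrete, here is a minimal sketch of a common reliability-based baseline. It is not the Reconstruction Rule by Selection described in the abstract (which also uses a priori distributions); the function name `reconstruct_one_per_class` and the example values are assumptions made for illustration only.

```python
def reconstruct_one_per_class(crisp_labels, reliabilities):
    """Combine K binary one-per-class outputs into a single class index.

    crisp_labels[k] is 1 if binary classifier k claims the sample for
    class k, else 0. reliabilities[k] in [0, 1] is how much that
    classifier's decision is trusted. Both inputs are hypothetical.
    """
    positives = [k for k, label in enumerate(crisp_labels) if label == 1]
    if positives:
        # Several classifiers may claim the sample: trust the most reliable one.
        return max(positives, key=lambda k: reliabilities[k])
    # No classifier claimed the sample: pick the class whose rejection
    # is the least reliable.
    return min(range(len(crisp_labels)), key=lambda k: reliabilities[k])


# Two classifiers claim the sample; the more reliable claim wins.
conflict = reconstruct_one_per_class([0, 1, 1], [0.9, 0.6, 0.8])
# No classifier claims the sample; the weakest rejection wins.
no_claim = reconstruct_one_per_class([0, 0, 0], [0.9, 0.3, 0.7])
```

The sketch only shows where reliabilities and crisp labels enter the final decision; a rule tailored to skewed data would additionally weigh class priors.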
|
3 |
Developing and Evaluating Methods for Mitigating Sample Selection Bias in Machine Learning
Pelayo Ramirez, Lourdes, Unknown Date
No description available.
|
4 |
Deep Contrastive Metric Learning to Detect Polymicrogyria in Pediatric Brain MRI
Zhang, Lingfeng, 28 November 2022
Polymicrogyria (PMG) is a brain disease that mainly occurs in the pediatric brain. Severe PMG causes seizures, delayed development, and a series of other problems. For this reason, it is critical to identify PMG effectively and start treatment early. Radiologists typically identify PMG through magnetic resonance imaging (MRI) scans. In this study, we create and release a pediatric MRI dataset (named the PPMR dataset) that includes PMG cases and controls from the Children's Hospital of Eastern Ontario (CHEO), Ottawa, Canada. The difference between PMG MRIs and control MRIs is subtle, and the true distribution of the features of the disease is unknown. Hence, we propose a novel center-based deep contrastive metric learning loss function (named cDCM Loss) to deal with this difficult problem. Cross-entropy-based loss functions do not lead to models that generalize well on small and imbalanced datasets with partially known distributions. We conduct exhaustive experiments on a modified CIFAR-10 dataset to demonstrate the efficacy of our proposed loss function compared to cross-entropy-based loss functions and the state-of-the-art Deep SAD loss function. Additionally, based on our proposed loss function, we customize a deep learning model structure that integrates dilated convolution, squeeze-and-excitation blocks, and feature fusion for our PPMR dataset, and it achieves 92.01% recall. Since our suggested method is a computer-aided tool to assist radiologists in selecting potential PMG MRIs, 55.04% precision is acceptable. To the best of our knowledge, this research is the first to apply machine learning techniques to identify PMG from MRI alone, and our method achieves better results than baseline methods.
|
5 |
Optimising Machine Learning Models for Imbalanced Swedish Text Financial Datasets: A Study on Receipt Classification: Exploring Balancing Methods, Naive Bayes Algorithms, and Performance Tradeoffs
Hu, Li Ang; Ma, Long, January 2023
This thesis investigates imbalanced Swedish text financial datasets, specifically receipt classification using machine learning models. The study explores the effectiveness of under-sampling and over-sampling methods for Naive Bayes algorithms, in collaboration with Fortnox for a controlled experiment. The balancing methods are compared on accuracy, Matthews correlation coefficient (MCC), F1 score, precision, and recall. The findings contribute to Swedish text classification and provide insights into balancing methods. The thesis examines balancing methods and parameter tuning of machine learning models for imbalanced datasets. Multinomial Naive Bayes (MultiNB) algorithms in natural language processing (NLP) are studied, with potential application in image classification for assessing industrial thin component deformation. Experiments show that balancing methods significantly affect MCC and recall, with a tradeoff among recall, MCC, and accuracy. Smaller alpha values generally improve accuracy. Applying Tomek's link-removal algorithm (developed in 1976 by Ivan Tomek) first and then the Synthetic Minority Oversampling Technique (SMOTE), a combination referred to as TomekSMOTE, yields promising accuracy improvements. Due to time constraints, training with the reverse order, over-sampling with SMOTE and then cleaning with Tomek links (SMOTETomek), is incomplete. The thesis finds that the best MCC is achieved when $\alpha$ is 0.01 on imbalanced datasets.
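The tradeoff among accuracy, recall, and MCC mentioned above can be illustrated with a minimal sketch that computes these metrics from binary confusion counts. The function name `binary_metrics` and the counts used below are illustrative assumptions and are not taken from the thesis.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, F1, and MCC from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it is not inflated by a large majority class.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical data: 95 majority samples and 5 minority samples.
# A model that always predicts the majority class looks accurate...
majority_only = binary_metrics(tp=0, fp=0, fn=5, tn=95)
# ...but has accuracy 0.95 with recall 0.0 and MCC 0.0.

# A model that finds 4 of the 5 minority samples at the cost of
# 10 false alarms has lower accuracy but a positive MCC.
rebalanced = binary_metrics(tp=4, fp=10, fn=1, tn=85)
```

Comparing the two dictionaries shows why accuracy alone can be misleading on imbalanced data: the second model is less accurate yet more useful for the minority class.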
|
6 |
Predicting Customer Satisfaction in the Context of Last-Mile Delivery using Supervised and Automatic Machine Learning
Höggren, Carl, January 2022
The prevalence of online shopping has steadily risen in the last few years. In response to these changes, last-mile delivery services have emerged that enable goods to reach customers within a shorter timeframe than traditional logistics providers offer. However, decreased lead times bring greater exposure to risks that directly influence customer satisfaction. Against this background, this report investigates the extent to which Supervised and Automatic Machine Learning can be leveraged to extract the features that have the highest explanatory power for customer ratings. The implementation suggests that the Random Forest Classifier outperforms both the Multi-Layer Perceptron and the Support Vector Machine in predicting customer ratings on a highly imbalanced version of the dataset, while AutoML outperforms the other models when the dataset is undersampled. Using Permutation Feature Importance and Shapley Additive Explanations, it was further concluded that whether the delivery is on time, whether the delivery is executed within the stated time window, and whether the delivery is executed during the morning, afternoon, or evening are the main drivers of customer ratings.
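The undersampling step mentioned above can be sketched as plain random undersampling, in which every class is reduced to the size of the smallest one. The abstract does not say which undersampling variant was used, so this is an assumption; the function name `random_undersample` and the toy data are likewise hypothetical.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class has as many as the rarest."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    # Target size: the number of samples in the smallest class.
    n_min = min(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        for sample in rng.sample(group, n_min):
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical ratings data: 90 "satisfied" (0) and 10 "unsatisfied" (1).
toy_samples = list(range(100))
toy_labels = [0] * 90 + [1] * 10
balanced_samples, balanced_labels = random_undersample(toy_samples, toy_labels)
```

After the call, both classes contain 10 samples each. The price of the balance is that most majority-class samples are discarded, which is one reason model rankings can differ between the imbalanced and the undersampled versions of a dataset.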
|
7 |
Apprentissage supervisé de données déséquilibrées par forêt aléatoire / Supervised learning of imbalanced datasets using random forest
Thomas, Julien, 12 February 2009
The problem of imbalanced datasets in supervised learning emerged relatively recently, once data mining became a technology widely used in industry. Assisted medical diagnosis and the detection of fraud, of abnormal phenomena, or of specific elements in satellite imagery are examples of industrial applications based on supervised learning from imbalanced datasets. The goal of our work is to adapt different elements of the supervised learning process to this problem. We also try to address the specific performance requirements often associated with imbalanced datasets, such as a high recall rate for the minority class. This need is reflected in our main application, the development of software to help radiologists detect breast cancer. To this end, we propose new methods that modify three different stages of a learning process. First, at the sampling stage, we propose, when bagging is used, to replace classic bootstrap sampling with guided sampling. Our techniques, FUNSS and LARSS, use neighbourhood properties to select individuals. Second, for the representation space, our contribution is a feature construction method adapted to imbalanced datasets. This method, the FuFeFa algorithm, is based on the discovery of predictive association rules. Finally, at the stage where the base classifiers of a bagging ensemble are aggregated, we propose to optimize the majority vote by weighting it. For this, we introduce a new quantitative measure of model performance, PRAGMA, which takes into account user-specific needs regarding the recall and precision rates of each class.
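The weighted majority vote at the aggregation stage can be sketched as follows. This is a generic illustration: the weights below are arbitrary placeholders and are not computed with PRAGMA, whose definition is not given in the abstract. The function name `weighted_vote` is also an assumption.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions[i] is the label predicted by base classifier i and
    weights[i] is the (hypothetical) weight assigned to that classifier.
    """
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


votes = ["negative", "negative", "positive"]

# With equal weights this reduces to a plain majority vote.
plain = weighted_vote(votes, [1.0, 1.0, 1.0])
# A single classifier with a high weight can overturn the majority.
weighted = weighted_vote(votes, [0.2, 0.2, 0.9])
```

In the first call the two "negative" votes win; in the second, the single "positive" vote carries more total weight (0.9 against 0.4). Weighting classifiers by a measure that rewards minority-class recall is one way such a vote can be tilted toward the minority class.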
|
8 |
A methodology for improving computed individual regressions predictions. / Uma metodologia para melhorar predições individuais de regressões.
Matsumoto, Élia Yathie, 23 October 2015
This research proposes a methodology to improve the individual prediction values computed by an existing regression model without changing either its parameters or its architecture. In other words, we are interested in achieving more accurate results by adjusting the calculated regression prediction values, without modifying or rebuilding the original regression model. Our proposal is to adjust the regression prediction values using individual reliability estimates that indicate whether a single regression prediction is likely to produce an error considered critical by the user of the regression. The proposed method was tested in three sets of experiments using three different types of data. The first set of experiments worked with synthetically produced data, the second with cross-sectional data from the public UCI Machine Learning Repository, and the third with time series data from ISO-NE (Independent System Operator in New England). The experiments with synthetic data were performed to verify how the method behaves in controlled situations. In this case, the experiments achieved the best results on clean, artificially produced datasets, and results progressively worsened as random elements were added. The experiments with real data extracted from UCI and ISO-NE were done to investigate the applicability of the methodology in the real world. The proposed method was able to improve regression prediction values in about 95% of the experiments with real data.
|