11 |
A Classification Framework for Imbalanced Data
Phoungphol, Piyaphol 18 December 2013 (has links)
As information technology advances, the demand for reliable and highly accurate predictive models is increasing across many domains. Traditional classification algorithms can be limited in their performance on highly imbalanced data sets. In this dissertation, we study two common problems that arise when training data is imbalanced, and propose effective algorithms to solve them.
Firstly, we investigate the problem of building a multi-class classification model from an imbalanced class distribution. We develop an effective technique to improve the performance of the model by formulating the problem as a multi-class SVM with the objective of maximizing the G-mean value. A ramp loss function is used to simplify and solve the resulting optimization problem. Experimental results on multiple real-world datasets confirm that our new method can effectively solve the multi-class classification problem when the datasets are highly imbalanced.
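The G-mean maximized above is the geometric mean of per-class recalls, which collapses to zero if any class is entirely missed, so it rewards balanced performance across skewed classes. A minimal sketch of the metric itself (the toy labels are illustrative, not from the dissertation):

```python
from math import prod

def g_mean(y_true, y_pred, classes):
    # geometric mean of per-class recalls
    recalls = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_c = sum(1 for t in y_true if t == c)
        recalls.append(tp / n_c)
    return prod(recalls) ** (1.0 / len(recalls))

# skewed toy labels: class 0 dominates; one class-1 point is misclassified
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 2]
print(round(g_mean(y_true, y_pred, [0, 1, 2]), 3))  # 0.794
```

Note that plain accuracy for the same predictions would be 0.9, hiding the 50% recall on the minority class that the G-mean exposes.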
Secondly, we explore the problem of learning a global classification model from distributed data sources with privacy constraints. In this setting, not only do the data sources have different class distributions, but combining the data into one central repository is also prohibited. We propose a privacy-preserving framework for building a global SVM from distributed data sources. Our new framework avoids constructing a global kernel matrix by mapping non-linear inputs to a linear feature space and then solving a distributed linear SVM on these virtual points. Our method solves both the imbalance and privacy problems while achieving the same level of accuracy as a regular SVM.
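The kernel-avoidance idea can be illustrated with an explicit feature map: if inputs are mapped so that inner products in the new space equal kernel values, a linear SVM on the mapped points never needs the global kernel matrix. A toy sketch with a degree-2 polynomial kernel (the dissertation's actual mapping is not specified here; this only shows the general principle):

```python
from itertools import product
from math import isclose

def poly2_features(x):
    # explicit degree-2 polynomial feature map: phi(x) lists x_i * x_j for all i, j,
    # so dot(phi(x), phi(y)) == (x . y) ** 2 without ever forming a kernel matrix
    return [xi * xj for xi, xj in product(x, repeat=2)]

x, y = [1.0, 2.0], [3.0, 0.5]
kernel_value = sum(a * b for a, b in zip(x, y)) ** 2
linear_value = sum(a * b for a, b in zip(poly2_features(x), poly2_features(y)))
print(isclose(kernel_value, linear_value))  # True
```

Each site can apply such a map locally and share only the mapped "virtual points" needed by a linear solver, which is what makes the distributed, privacy-aware formulation possible.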
Finally, we extend our framework to handle high-dimensional data by utilizing Generalized Multiple Kernel Learning to select a sparse combination of features and kernels. This new model produces a smaller set of features, but yields much higher accuracy.
|
12 |
Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics
Ding, Zejin 07 May 2011 (has links)
In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalanced data learning is of great importance and challenge in many real applications. Dealing with a minority class normally requires new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation. We propose a new ensemble learning framework—Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reverse data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalanced learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to the state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle—active learning.
By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting that the DECIDL framework is very robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem—protein methylation prediction. Extensive computational results show that the DECIDL method performs very well on this imbalanced data mining task. Importantly, the experimental results have confirmed our new contributions on this particular data learning problem.
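The abstract does not spell out its artificial example generation method, so as a hedged illustration only, a common baseline for data balancing of this kind is SMOTE-style interpolation between pairs of minority-class points:

```python
import random

def interpolate_minority(minority, n_new, seed=0):
    # SMOTE-style balancing: each synthetic point is a + u * (b - a)
    # for a random pair (a, b) of minority examples and u ~ Uniform(0, 1),
    # so new points lie on segments between existing minority points
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        u = rng.random()
        synthetic.append([ai + u * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 1.0], [0.5, 2.0]]
new_points = interpolate_minority(minority, 4)
print(len(new_points))  # 4
```

Because each synthetic point lies between two real minority examples, every coordinate stays within the observed range of the minority class, which keeps the balanced training set plausible.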
|
13 |
Técnicas para o problema de dados desbalanceados em classificação hierárquica / Techniques for the problem of imbalanced data in hierarchical classification
Victor Hugo Barella 24 July 2015 (has links)
Recent advances in science and technology have enabled the growth of data in quantity and availability. Along with this explosion of generated information comes the need to analyze data to discover new and useful knowledge. Fields that aim to extract knowledge and useful information from large datasets, such as Machine Learning (ML) and Data Mining (DM), have therefore become major research opportunities. However, some limitations can reduce the accuracy of traditional algorithms in these fields, for example the imbalance of class samples in a dataset. To mitigate this problem, several alternatives have been the target of research in recent years, such as techniques for artificial data balancing, algorithm modifications, and new approaches for imbalanced data. An area little explored from the data-imbalance perspective is hierarchical classification, in which classes are organized into hierarchies, normally as a tree or a DAG (Directed Acyclic Graph). The goal of this work was to investigate the limitations of, and ways to minimize, the effects of imbalanced data in hierarchical classification problems. The experiments show that the characteristics of the hierarchical classes must be taken into account when deciding whether to apply techniques for imbalanced data in hierarchical classification.
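The conclusion that hierarchical class characteristics must be considered can be made concrete by measuring imbalance per hierarchy node rather than globally: a problem may look mild overall yet be severely skewed inside one subtree. A toy sketch with a hypothetical class tree (names and counts are illustrative, not from the thesis):

```python
from collections import Counter

# hypothetical class hierarchy (parent -> children) and skewed leaf labels
tree = {"root": ["A", "B"], "A": ["A1", "A2"], "B": []}
labels = ["A1"] * 90 + ["A2"] * 5 + ["B"] * 5

def subtree_counts(tree, labels):
    # count how many examples fall inside each node's subtree
    parent = {c: p for p, children in tree.items() for c in children}
    counts = Counter()
    for leaf in labels:
        node = leaf
        while node in parent:      # walk up to (but not including) the root
            counts[node] += 1
            node = parent[node]
    return counts

def imbalance_per_node(tree, labels):
    # ratio of largest to smallest non-empty child subtree at each internal node
    counts = subtree_counts(tree, labels)
    ratios = {}
    for node, children in tree.items():
        sizes = [counts[c] for c in children if counts[c] > 0]
        if len(sizes) > 1:
            ratios[node] = max(sizes) / min(sizes)
    return ratios

print(imbalance_per_node(tree, labels))  # {'root': 19.0, 'A': 18.0}
```

Per-node ratios like these are what would drive the decision to apply (or skip) a balancing technique at each level of the hierarchy.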
|
14 |
Application of Machine Learning Strategies to Improve the Prediction of Changes in the Airline Network Topology
Aleksandra Dervisevic (9873020) 18 December 2020 (links)
<div><p>Predictive modeling allows us to analyze historical patterns to forecast future events. When the data available for this analysis is imbalanced or skewed, many challenges arise. The lack of sensitivity towards the class with less data hinders the sought-after predictive capabilities of the model. Such imbalanced datasets are found across many different fields, including medical imaging, insurance claims and financial fraud. The objective of this thesis is to identify the challenges of, and the means to assess, applying machine learning to transportation data that is imbalanced and has only one independent variable. </p><p>Airlines undergo a decision-making process on air route addition or deletion in order to adjust the services offered with respect to demand and cost, amongst other criteria. This process greatly affects the topology of the network, and results in a continuously evolving Air Traffic Network (ATN). Organizations like the Federal Aviation Administration (FAA) are interested in the network transformation and the influence airlines have as stakeholders. For this reason, they attempt to model the criteria used by airlines to modify routes. The goal is to predict trends and dependencies observed in the network evolution by understanding the relation between the number of passengers per flight leg, the single independent variable, and the airline's decision to keep or eliminate that route, the dependent variable. Research to date has used optimization-based methods and machine learning algorithms to model airlines' decision-making process on air route addition and deletion, but these studies demonstrate less than 50% accuracy. </p><p>In particular, two machine learning (ML) algorithms are examined: Sparse Gaussian Classification (SGC) and Deep Neural Networks (DNN). SGC is the extension of Gaussian Process Classification models to large datasets.
These models use Gaussian Processes (GPs), which are proven to perform well in binary classification problems. DNN uses multiple layers of probabilities between the input and output layers. It is one of the most popular ML algorithms currently in use, so the results obtained using SGC were compared to the DNN model. </p><p>At first glance, these two models appear to perform equally well, both giving a high accuracy of 97.77%. However, post-processing the results with a simple Bayes classifier and using the metrics appropriate for models trained on imbalanced datasets reveals otherwise. Both SGC and DNN produced predictions with 1% precision and 20% recall, an F1 score of 0.02, and an AUC (Area Under the Curve) of 0.38 and 0.31 respectively. The low F1 score indicates the classifiers are not performing accurately, and the AUC values confirm the inability of the models to differentiate between the classes. This is probably due to the existing interaction and competition of the airlines in the market, which is not captured by the models. Interestingly enough, the behavior of the two models is very different across the range of threshold values: the SGC model more effectively captured the low confidence of these predictions. In order to validate the models, stratified K-fold cross-validation was run. </p>The future application of Gaussian Processes in model-building for decision-making will depend on a clear understanding of their limitations and of the imbalanced datasets used in the process, the central purpose of this thesis. Future steps in this investigation include further analysis of the training data as well as the exploration of variable-optimization algorithms. The tuning process of the SGC model could be improved by utilizing optimal hyperparameters and inducing inputs.</div>
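The relationship between the reported numbers can be checked directly: with 1% precision and 20% recall the F1 score is about 0.02, and with a rare positive class a trivial majority predictor already reaches accuracy near 97.8%. A quick arithmetic sketch (the 2.2% positive rate is an assumption chosen to match the reported accuracy, not a figure from the thesis):

```python
def f1(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.01, 0.20), 3))  # 0.019 -- the ~0.02 F1 score reported above

# why raw accuracy misleads here: assuming ~2.2% of routes are positives,
# always predicting the majority class scores ~97.8% accuracy with zero recall
n_routes, n_positive = 10000, 220
majority_accuracy = (n_routes - n_positive) / n_routes
print(majority_accuracy)  # 0.978
```

This is exactly why precision, recall, F1 and AUC, rather than accuracy, are the appropriate yardsticks for the imbalanced route-change problem.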
|
15 |
Purchase Probability Prediction : Predicting likelihood of a new customer returning for a second purchase using machine learning methods
Alstermark, Olivia, Stolt, Evangelina January 2021 (links)
When a company evaluates a customer as a potential prospect, one of the key questions to answer is whether the customer will generate profit in the long run. A possible step towards answering this question is to predict the likelihood of the customer returning to the company after the initial purchase. The aim of this master thesis is to investigate the possibility of using machine learning techniques to predict the likelihood of a new customer returning for a second purchase within a certain time frame. To investigate to what degree machine learning techniques can be used to predict probability of return, a number of different model setups of Logistic Lasso, Support Vector Machine and Extreme Gradient Boosting are tested. Model development is performed to ensure well-calibrated probability predictions and to possibly overcome the difficulty arising from an imbalanced ratio of returning and non-returning customers. Throughout the thesis work, a number of actions are taken in order to account for data protection. One such action is to add noise to the response feature, ensuring that the true fraction of returning and non-returning customers cannot be derived. To further guarantee data protection, axis values of evaluation plots are removed and evaluation metrics are scaled. Nevertheless, it is perfectly possible to select the superior model out of all investigated models. The results obtained show that the best performing model is a Platt-calibrated Extreme Gradient Boosting model, which has much higher performance than the other models with regard to the considered evaluation metrics, while also providing predicted probabilities of high quality. Further, the results indicate that the setups investigated to account for imbalanced data do not improve model performance.
The main conclusion is that it is possible to obtain probability predictions of high quality for new customers returning to a company for a second purchase within a certain time frame, using machine learning techniques. This provides a powerful tool for a company when evaluating potential prospects.
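Platt calibration, as used for the winning model above, fits a sigmoid that maps raw classifier scores to well-calibrated probabilities on held-out data. A minimal pure-Python sketch of the idea with hypothetical scores (a production pipeline would typically use a library implementation such as scikit-learn's CalibratedClassifierCV rather than this hand-rolled fit):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def platt_scale(scores, labels, lr=0.5, steps=5000):
    # Platt scaling: fit p = sigmoid(a * score + b) on held-out data
    # by gradient descent on the log-loss
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s / n
            grad_b += err / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# hypothetical raw model scores on a held-out set, with true outcomes
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
a, b = platt_scale(scores, labels)
p_low, p_high = sigmoid(a * -2.0 + b), sigmoid(a * 2.0 + b)
print(p_low < 0.2 < 0.8 < p_high)  # True: scores map to usable probabilities
```

The fitted sigmoid preserves the model's ranking while reshaping its scores, which is why calibration improves probability quality without changing which customers rank as most likely to return.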
|
16 |
Comparative Data Analytic Approach for Detection of Diabetes
Sood, Radhika January 2018 (links)
No description available.
|
17 |
Classification of imbalanced disparate medical data using ontology / Klassificering av Obalanserad Medicinsk Data med Ontologier
Karlsson, Ludvig, Wilhelm Kopp Sundin, Gustav January 2023 (links)
Through the digitization of healthcare, large volumes of data are generated and stored in healthcare operations. Today, a multitude of platforms and digital infrastructures are used for the storage and management of data. These systems lack a common ontology, which limits the interoperability between datasets. Limited interoperability affects various areas of healthcare, for instance the sharing of data between entities and the possibilities for aggregated machine learning research incorporating distributed data. This study examines how a random forest classifier performs on two datasets consisting of phase III clinical trial studies on small-cell lung cancer, where the datasets do not share a common ontology. The performance is then compared to the same classifier's performance on one dataset formed by joining the two aforementioned sets under a common ontology. The study does not show unambiguous results indicating that a single ontology yields better performance for the random forest classifier. In addition, the conditions for Swedish primary-care providers to undergo a transition to a new platform for data storage are discussed, together with areas for future research.
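The common-ontology step amounts to mapping each dataset's local field names onto shared terms before the two trials' records can be joined for training. A minimal sketch with hypothetical field names (the study's actual ontology and variables are not reproduced here):

```python
# hypothetical field names from two trials mapped onto one shared ontology
ONTOLOGY_MAP = {
    "pt_age": "age", "AGE_YRS": "age",
    "tumor_stage": "stage", "stg": "stage",
}

def harmonize(record, mapping=ONTOLOGY_MAP):
    # rename dataset-specific keys to shared ontology terms; unmapped keys pass through
    return {mapping.get(k, k): v for k, v in record.items()}

trial_a_record = {"pt_age": 64, "tumor_stage": "III"}
trial_b_record = {"AGE_YRS": 58, "stg": "III"}
print(harmonize(trial_a_record) == {"age": 64, "stage": "III"})  # True
```

Once both records expose the same keys, they can be stacked into one training table for the random forest, which is the interoperability the study's shared ontology provides.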
|
18 |
Multi-Class Classification for Predicting Customer Satisfaction : Application of machine learning methods to predict customer satisfaction at IKEA
Backerholm, Stina, Börjesjö, Malin January 2023 (links)
Gaining a comprehensive understanding of the features that contribute to customer satisfaction after contact with IKEA's Remote Customer Meeting Points (RCMPs) is essential for implementing effective remedial measures in the future. The aim of this project is to investigate whether it is possible to find key features that influence customer satisfaction and to use these to predict customer satisfaction. The task has been approached as a multi-class classification problem, with the objective of classifying the observations into five distinct levels of customer satisfaction. The study utilized three models, Multinomial Logistic Regression, Random Forest, and Extreme Gradient Boosting, to investigate these possibilities. Based on the methods used and the available data, the results indicate that it is currently not feasible to accurately identify key features or predict customer satisfaction.
|
19 |
Machine Learning Models to Predict Cracking on Steel Slabs During Continuous Casting
Sibanda, Jacob January 2024 (links)
Surface defects in steel slabs during continuous casting pose significant challenges for quality control and product integrity in the steel industry. Predicting and classifying these defects accurately is crucial for ensuring product quality and minimizing production losses. This thesis investigates the effectiveness of machine learning models in predicting surface defects of varying severity levels (ordinal classes) during the primary cooling stage of continuous casting. The study evaluates four machine learning algorithms, namely XGBoost (main and baseline models), Decision Tree, and One-vs.-Rest Support Vector Machine (O-SVM), all trained with imbalanced defect class data. Model evaluation is conducted using a set of performance metrics, including precision, recall, F1-score, accuracy, macro-averaged Mean Absolute Error (MAE), Receiver Operating Characteristic (ROC) curves, Weighted Kappa and Ordinal Classification Index (OCI). Results indicate that the XGBoost main model demonstrates robust performance across most evaluation metrics, with high accuracy, precision, recall, and F1-score. Furthermore, incorporating temperature data from the primary cooling process inside the mold significantly enhances the predictive capabilities of machine learning models for defect prediction in continuous casting. Key process parameters associated with defect formation, such as tundish temperature, casting speed, stopper rod argon pressure, and submerged entry nozzle (SEN) argon flow, are identified as significant contributors to defect severity. Feature importance and SHAP (SHapley Additive exPlanations) analysis reveal insights into the relationship between process variables and defect formation. Challenges and trade-offs, including model complexity, interpretability, and computational efficiency, are discussed. Future research directions include further optimization and refinement of machine learning models and collaboration with industry stakeholders to develop tailored solutions for defect prediction and quality control in continuous casting processes.
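Macro-averaged MAE, one of the ordinal metrics listed above, averages the absolute class-distance error within each true class before averaging across classes, so rare severe-defect classes are not drowned out by the majority class. A small sketch with illustrative labels (not data from the thesis):

```python
def macro_mae(y_true, y_pred, classes):
    # macro-averaged MAE for ordinal labels: mean |predicted - true| class
    # distance per true class, then averaged over classes, so each severity
    # level carries equal weight regardless of how rare it is
    per_class = []
    for c in classes:
        errors = [abs(p - t) for t, p in zip(y_true, y_pred) if t == c]
        if errors:
            per_class.append(sum(errors) / len(errors))
    return sum(per_class) / len(per_class)

# illustrative severity levels 0..2, with the severe class 2 being rare
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(round(macro_mae(y_true, y_pred, [0, 1, 2]), 3))  # 0.667
```

A plain (micro) MAE over the same predictions would be 0.4, much lower, because the perfectly predicted majority class masks the large errors on the rare severe defects; the macro average is what exposes them.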
|
20 |
Comparison of Machine Learning Techniques when Estimating Probability of Impairment : Estimating Probability of Impairment through Identification of Defaulting Customers one year Ahead of Time / En jämförelse av maskininlärningstekniker för uppskattning av Probability of Impairment : Uppskattningen av Probability of Impairment sker genom identifikation av låntagare som inte kommer fullfölja sina återbetalningsskyldigheter inom ett år
Eriksson, Alexander, Långström, Jacob January 2019 (links)
Probability of Impairment, or Probability of Default, is the expected fraction of customers within a segment who will not fulfil their debt obligations and will instead go into Default. This is a key metric within banking for estimating the level of credit risk, where the current standard is to estimate Probability of Impairment using Linear Regression. In this paper we show how this metric can instead be estimated through a classification approach with machine learning. By using models trained to find which specific customers will go into Default within the upcoming year, based on Neural Networks and Gradient Boosting, the Probability of Impairment is shown to be more accurately estimated than when using Linear Regression. Additionally, these models offer numerous real-life applications internally within the banking sector. The new features of importance we found can be used to strengthen the models currently in use, and the ability to identify customers about to go into Default lets banks take the necessary actions ahead of time to cover otherwise unexpected risks.
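The classification approach yields a per-customer default probability, which can then be aggregated into a segment-level Probability of Impairment. A minimal sketch (the simple averaging rule and the numbers are assumptions for illustration, not taken from the paper):

```python
def segment_pd(probabilities):
    # segment-level Probability of Impairment as the mean of the
    # per-customer one-year default probabilities from a classifier
    return sum(probabilities) / len(probabilities)

# hypothetical per-customer default probabilities within one segment
customer_probs = [0.02, 0.05, 0.01, 0.30, 0.02]
print(round(segment_pd(customer_probs), 2))  # 0.08
```

Unlike a regression fitted directly on segment-level rates, this aggregation also singles out the individual high-risk customers (here the 0.30 one), which is the early-warning capability the paper highlights.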
|