Global ETD Search

11	Utveckling av beslutsstöd för kreditvärdighet
Arvidsson, Martin, Paulsson, Eric January 2013
The aim is to develop a new decision-making model for credit loans. The model is specific to credit applicants of the OKQ8 bank, because it is based on data from the client’s (the bank’s) earlier credit applicants. The final model takes information about a new applicant as input and predicts the outcome as either the good risk group or the bad risk group, based on the applicant’s properties. The prediction may then lay the foundation for the decision to grant or deny a credit loan. Because of the skewed distribution of the response variable, different sampling techniques are evaluated. These include oversampling with SMOTE, random undersampling, and pure oversampling in the form of scalar weighting of the minority class. It is shown that the predictive quality of a classifier is affected by the distribution of the response, and that the oversampled information is not too redundant. Three classification techniques are evaluated. Our results suggest that a multi-layer neural network with 18 neurons in a hidden layer, equipped with an ensemble technique called boosting, gives the best predictive power. The most successful model is based on a feed-forward structure and trained with a variant of back-propagation using conjugate-gradient optimization. Two other models with good prediction quality are developed using logistic regression and a decision tree classifier, but they do not reach the level of the network. However, the results of these models are used to answer the question of which customer properties are important when determining credit risk. Two examples of important customer properties are income and the number of earlier credit reports of the applicant. Finally, we use the best classification model to predict the outcome for a set of applicants declined by the existing filter. The results show that the network model accepts over 60% of the applicants who had previously been denied credit. This may indicate that the client’s suspicion that the existing model is too restrictive is in fact true.
Credit Scoring Data mining Imbalanced data sets Sampling techniques SMOTE Classification techniques Predictive modeling Other Computer and Information Science Annan data- och informationsvetenskap
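The SMOTE oversampling named in this abstract can be illustrated with a minimal sketch. It assumes numeric feature vectors and Euclidean distance; the function name `smote_sample`, its parameters, and the toy data are hypothetical and are not taken from the thesis.

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority
    neighbours (the core idea of SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Toy minority class with two features (illustrative values only)
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sample(minority)
```

Every synthetic point lies on a line segment between two existing minority points, which is why SMOTE adds less redundant information than simply duplicating rows.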
12	Design and Analysis of Techniques for Multiple-Instance Learning in the Presence of Balanced and Skewed Class Distributions
Wang, Xiaoguang January 2015
With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, the Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Existing knowledge discovery and data analysis techniques have shown great success in many real-world applications, such as applying Automatic Target Recognition (ATR) methods to detect targets of interest in imagery, drug activity prediction, and computer vision recognition. Among these techniques, Multiple-Instance (MI) learning differs from standard classification in that it takes a set of bags, each containing many instances, as input. The instances in each bag are not labeled; instead, the bags themselves are labeled. Much progress has been made in this area, but some topics remain uncovered. In this thesis, we focus on two topics of MI learning: (1) investigating the relationship between MI learning and other multiple-pattern learning methods, which include multi-view learning, data fusion methods, and multi-kernel SVM; (2) dealing with the class imbalance problem of MI learning. For the first topic, three different learning frameworks are presented for general MI learning. The first uses multiple-view approaches to deal with the MI problem, the second is a data fusion framework, and the third, which is an extension of the first, uses multiple-kernel SVM. Experimental results show that the approaches presented work well on the MI problem. The second topic is concerned with the imbalanced MI problem. Here we investigate the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. For this problem, we propose three solution frameworks: a data re-sampling framework, a cost-sensitive boosting framework, and an adaptive instance-weighted boosting SVM (named IB_SVM) for MI learning. Experimental results on both benchmark datasets and application datasets show that the proposed frameworks are effective solutions for the imbalanced problem of MI learning.
Multiple-Instance Learning Balanced Class Distributions Skewed Class Distributions Multi-View Data Fusion Multi-Kernel SVM SMOTE Cost-sensitive Boosting Instance-weighted Boosting SVM
13	A Comparative Review of SMOTE and ADASYN in Imbalanced Data Classification
Brandt, Jakob, Lanzén, Emil January 2021
In this thesis, the performance of two over-sampling techniques, SMOTE and ADASYN, is compared. The comparison is done on three imbalanced data sets using three different classification models and evaluation metrics, while varying the way the data is pre-processed. The results show that both SMOTE and ADASYN improve the performance of the classifiers in most cases. It is also found that SVM in conjunction with SMOTE performs better than with ADASYN as the degree of class imbalance increases. Furthermore, both SMOTE and ADASYN increase the relative performance of the Random Forest as the degree of class imbalance grows. However, no pre-processing method consistently outperforms the other in its contribution to better performance as the degree of class imbalance varies.
Machine learning supervised learning classification class imbalance over-sampling SMOTE ADASYN Sensitivity F-measure Matthews correlation coefficient Probability Theory and Statistics Sannolikhetsteori och statistik
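The three evaluation metrics listed in this record's keywords (sensitivity, F-measure, Matthews correlation coefficient) can all be computed from a binary confusion matrix. The sketch below uses the standard definitions; the counts are made-up illustrative values, not results from the thesis.

```python
import math

def sensitivity(tp, fn):
    """Recall of the positive (minority) class: TP / (TP + FN)."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical confusion-matrix counts for an imbalanced test set
tp, fn, fp, tn = 8, 2, 10, 80
sens = sensitivity(tp, fn)
f1 = f_measure(tp, fp, fn)
corr = mcc(tp, tn, fp, fn)
```

Unlike accuracy, none of these three scores can be driven up by simply predicting the majority class, which is presumably why they are preferred for imbalanced data.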
14	Classification of COVID-19 Using Synthetic Minority Over-Sampling and Transfer Learning
Ormos, Christian January 2020
The 2019 novel coronavirus has been proven to present several unique features on chest X-rays and CT scans that distinguish it from imaging of other pulmonary diseases, such as bacterial pneumonia and viral pneumonia unrelated to COVID-19. However, the key characteristics of a COVID-19 infection have proven challenging to detect with the human eye. The aim of this project is to explore whether it is possible to distinguish a patient with COVID-19 from a patient who is not suffering from the disease, based on posteroanterior chest X-ray images, using synthetic minority over-sampling and transfer learning. Furthermore, the report presents the mechanics of COVID-19, the dataset and models used, and the validity of the results.
transfer learning AI machine learning image recognition image augmentation covid-19 VGG MobileNet InceptionV3 SMOTE k-nn Computer and Information Sciences Data- och informationsvetenskap
15	Differential evolution technique on weighted voting stacking ensemble method for credit card fraud detection
Dolo, Kgaugelo Moses 12 1900
Differential Evolution is a population-based stochastic search technique for optimization that is powerful and efficient over a continuous space for solving differentiable and non-linear optimization problems. The weighted voting stacking ensemble method is an important technique that combines various classifier models. However, selecting the appropriate weights of the classifier models for the correct classification of transactions is a problem. This research study therefore aims to explore whether the Differential Evolution optimization method is a good approach for defining the weighting function. Manual and random selection of weights for voting on credit card transactions has previously been carried out. However, a large number of fraudulent transactions were not detected by the classifier models, which means that a technique to overcome the weaknesses of the classifier models is required. Thus, the problem of selecting the appropriate weights was viewed as a weight optimization problem in this study. The dataset was downloaded from the Kaggle competition data repository. Various machine learning algorithms cast weighted votes on the class of each transaction. The Differential Evolution optimization technique was used as the weighting function. In addition, the Synthetic Minority Oversampling Technique (SMOTE) and Safe Level Synthetic Minority Oversampling Technique (SL-SMOTE) oversampling algorithms were modified to preserve the definition of SMOTE while improving the performance. The results of this research study show that the Differential Evolution optimization method is a good weighting function, which can be adopted as a systematic weight function for the weighted voting stacking ensemble method of various classification methods. / School of Computing / M. Sc. (Computing)
Differential evolution Weighted voting Stacking ensemble method Class distribution Data distribution SMOTE Machine learning Big data Credit card fraud 364.163 Credit Card Fraud
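To make the role of the classifier weights concrete, here is a small sketch of weighted soft voting for a single transaction. The function name, the three hypothetical fraud probabilities, and both weight vectors are assumptions for illustration; in the thesis the weights are chosen by Differential Evolution, which is not implemented here.

```python
def weighted_vote(probabilities, weights, threshold=0.5):
    """Combine per-classifier fraud probabilities into one decision.

    Returns (label, score), where score is the weight-normalised average
    probability and label is 1 (fraud) when score >= threshold.
    """
    total = sum(weights)
    score = sum(w * p for w, p in zip(weights, probabilities)) / total
    return int(score >= threshold), score

# Hypothetical outputs of three classifiers for one transaction
probs = [0.9, 0.3, 0.2]

# Equal weights: the two weaker classifiers outvote the confident one
equal_label, equal_score = weighted_vote(probs, [1.0, 1.0, 1.0])

# Tuned weights (as an optimizer might produce) trust the first classifier more
tuned_label, tuned_score = weighted_vote(probs, [0.6, 0.2, 0.2])
```

The example shows why the weight choice matters: the same three predictions yield "not fraud" under equal weights and "fraud" under the tuned weights.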
16	Optimising Machine Learning Models for Imbalanced Swedish Text Financial Datasets: A Study on Receipt Classification : Exploring Balancing Methods, Naive Bayes Algorithms, and Performance Tradeoffs
Hu, Li Ang, Ma, Long January 2023
This thesis investigates imbalanced Swedish text financial datasets, specifically receipt classification using machine learning models. The study explores the effectiveness of under-sampling and over-sampling methods for Naive Bayes algorithms, collaborating with Fortnox for a controlled experiment. Evaluation metrics compare balancing methods with regard to accuracy, Matthews's correlation coefficient (MCC), F1 score, precision, and recall. Findings contribute to Swedish text classification, providing insights into balancing methods. The thesis report examines balancing methods and parameter tuning on machine learning models for imbalanced datasets. Multinomial Naive Bayes (MultiNB) algorithms in natural language processing (NLP) are studied, with potential application in image classification for assessing industrial thin component deformation. Experiments show that balancing methods significantly affect MCC and recall, with a recall-MCC-accuracy tradeoff. Smaller alpha values generally improve accuracy. Combining the Synthetic Minority Oversampling Technique (SMOTE) with Tomek links, the cleaning algorithm developed by Ivan Tomek in 1976, by applying Tomek first and then SMOTE (TomekSMOTE) yields promising accuracy improvements. Due to time constraints, training with the reverse order, over-sampling with SMOTE first and then cleaning with Tomek links (SMOTETomek), is incomplete. This thesis report finds that the best MCC is achieved when $\alpha$ is 0.01 on imbalanced datasets.
Imbalanced datasets Swedish text financial datasets Accuracy Matthews correlation coefficient Recall Multinomial Naive Bayes SMOTE TomekLinks Performance optimization Computer Sciences Datavetenskap (datalogi)
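A Tomek link is a pair of points from opposite classes that are each other's nearest neighbour; cleaning methods remove such pairs (or only their majority-class member) to sharpen the class boundary. The following is a minimal sketch assuming Euclidean distance and a brute-force neighbour search; the helper names and the one-dimensional toy data are hypothetical.

```python
def nearest(i, points):
    """Index of the nearest neighbour of points[i] (squared Euclidean distance)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
    )

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and labels[i] != labels[j] and nearest(j, points) == i:
            links.append((i, j))
    return links

# Toy data: points 1 and 2 sit on the class boundary
points = [(0.0,), (1.0,), (1.2,), (3.0,), (3.1,)]
labels = [0, 0, 1, 1, 1]
links = tomek_links(points, labels)
```

In this toy set only the pair (1, 2) qualifies: points 3 and 4 are also mutual nearest neighbours, but they share a label and so do not form a Tomek link.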
17	Predicting the Impact of Supply Chain Disruptions Using Statistical Analysis and Machine Learning / Prediktering av följderna från störningar i en försörjningskedja med användning av statistisk analys och maskininlärning
Andersson, Hannes, Sjöberg, John January 2023
The dairy business is vulnerable to supply chain disruptions, since large safety stocks to cover losses are not always a viable option; it is therefore crucial to maintain a smooth supply chain to ensure stable delivery accuracy. Disruptions are unpredictable and hard to avoid in the supply chain, especially in cases where production errors cause lost production volume. This thesis proposes the use of machine learning and statistical modelling together with data from Arla to predict when a shortage will occur and its duration, to allow proactive decision making that mitigates the consequences of the disruption. The aim of this thesis is to create one predictive model for delay and one for duration based on data from multiple products, and to explore how the features and methods used can capture the product-specific characteristics in the data and thereby improve the models. The model used for evaluating these factors was a random forest classifier, and permutation feature importance was used to determine the relevant features for the models. The issue of imbalanced data was handled by first grouping the data and then applying the oversampling method SMOTE. The two models were trained on different datasets: the duration model was trained on all disruptions, and the delay model was trained only on the subset where a shortage had occurred. One finding was that applying SMOTE yielded the best results. The best duration model had an accuracy of 62%, with precision and recall of 79% and 76% respectively for the majority class, but very low values for the other classes, with a combined average of 21% and 24%. The most important feature for the duration was the quotient describing the lost production. The best delay model had an accuracy of 62%, with more accurate predictions over all classes and an average precision and recall of 59% and 57%. The most important feature for the delay was how often a product is produced. / The dairy industry is vulnerable to supply chain disruptions because large safety stocks to cover losses are not always a feasible option; it is therefore crucial to maintain a smooth supply chain to ensure stable delivery levels. Disruptions are unpredictable and hard to avoid in a supply chain, especially in cases where production errors cause reduced production volume. This thesis proposes using machine learning and statistical modelling together with data from Arla to predict when a shortage will arise relative to the disruption, as well as the duration of the shortage, in order to enable proactive decisions that mitigate the consequences of the disruption. The goal of this thesis is to create one predictive model for delay and one for duration based on data from multiple products, and to investigate how the variables and methods used can capture product-specific characteristics in the data and thereby improve the model. The model used to evaluate these factors was a random forest classifier, and permutation feature importance was used to evaluate the variables used in the models. Imbalanced data was handled by first grouping the data and then applying the oversampling method SMOTE. The two models were trained on different data: the duration model was trained on all disruptions, and the delay model was trained only on the cases where a shortage had occurred. One conclusion was that applying SMOTE gave the best results. The best duration model had an accuracy of 62%, with precision and recall of 79% and 76% respectively for the majority class but much lower for the other classes, with an average precision and recall of 21% and 24%. The most important variable for the duration was the quotient describing the lost production. The best delay model had an accuracy of 62%, with more stable predictions across all classes and an average precision and recall of 59% and 57%. The most important variable for the delay was how often a product is produced.
Supply chain disruption SMOTE feature engineering machine learning random forest statistics applied mathematics Störning i försörjningskedja maskininlärning matematik statistik Other Mathematics Annan matematik
18	Neonatal Sepsis Detection With Random Forest Classification for Heavily Imbalanced Data
Osman Abubaker, Ayman January 2022
Neonatal sepsis is associated with most cases of mortality in the neonatal intensive care unit. Major challenges in detecting sepsis using suitable biomarkers have led people to look for alternative approaches in the form of Machine Learning techniques. In this project, Random Forest classification was performed on a sepsis data set provided by Karolinska Hospital. We particularly focused on tackling class imbalance in the data using sampling and cost-sensitive techniques. We compare the classification performance of Random Forests in six different setups: four using oversampling and undersampling techniques, one using cost-sensitive learning, and one basic Random Forest. The performance with the oversampling techniques was better, and these setups could identify more sepsis patients than the others. The overall performance was also good, making the methods potentially useful in practice. / Neonatal sepsis is the cause of the majority of mortality in neonatal intensive care. The difficulty of detecting sepsis using biomarkers has led many to look for alternative methods. Machine learning techniques are one such alternative, and their use has recently increased in healthcare and other sectors. In this project, the Random Forest classification algorithm was applied to a sepsis data set provided by Karolinska Hospital. We focused on handling the class imbalance in the data using various sampling and cost-sensitive methods. We compared the classification performance of Random Forest in six different setups: four used the sampling methods, one used a cost-sensitive method, and one was a plain Random Forest. The model's performance increased the most with the oversampling methods. The overall classification performance was also good, which makes Random Forests together with the sampling methods potentially useful in practice. / Bachelor's degree project in electrical engineering 2022, KTH, Stockholm
Random Forest Neonatal Sepsis Imbalanced Classification Cost-sensitive SMOTE ADASYN CNN Tomek-Links Elektroteknik och elektronik
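Cost-sensitive learning, one of the setups compared in this record, typically penalises errors on the rare class more heavily. One common way to set the penalties is the "balanced" heuristic, where each class weight is n_samples / (n_classes * class_count). The sketch below illustrates that heuristic only; it is an assumption for illustration and not necessarily the weighting used in the thesis, and the 90/10 label split is invented.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label vector: 90 non-sepsis (0) and 10 sepsis (1) cases
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
```

With these weights a misclassified minority case costs nine times as much as a misclassified majority case, so the total weight of each class is equal.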
19	[en] MACHINE LEARNING METHODS APPLIED TO PREDICTIVE MODELS OF CHURN FOR LIFE INSURANCE / [pt] MÉTODOS DE MACHINE LEARNING APLICADOS À MODELAGEM PREDITIVA DE CANCELAMENTOS DE CLIENTES PARA SEGUROS DE VIDA
THAIS TUYANE DE AZEVEDO 26 September 2018
[pt] The objective of this study was to explore the churn problem in life insurance, in the sense of predicting whether the customer will cancel the product within the next 6 months. Machine learning methods are currently gaining popularity for this type of analysis, becoming an alternative to the traditional method of modeling the probability of cancellation through logistic regression. In general, one of the challenges in this type of modeling is that the proportion of customers who cancel the service is relatively small. To address this, the study resorted to balancing techniques to treat the naturally imbalanced data set: undersampling and oversampling techniques and different combinations of the two were used and compared with one another. The data sets were used to train Bagging, Random Forest and Boosting models, and their results were compared with one another and with the results obtained from the Logistic Regression model. We observed that the modified SMOTE technique for balancing the data set, applied to the Bagging model, was the combination that gave the best results among the combinations explored. / [en] The purpose of this study is to explore the churn problem in life insurance, in the sense of predicting whether the client will cancel the product in the next 6 months. Currently, machine learning methods are becoming popular in this type of analysis, making them an alternative to the traditional method of modeling the probability of cancellation through logistic regression. In general, one of the challenges found in this type of modelling is that the proportion of clients who cancel the service is relatively small. For this reason, the study resorted to balancing techniques to treat the naturally unbalanced base: under-sampling and over-sampling techniques and different combinations of these two were used and compared with each other. The bases were used to train Bagging, Random Forest and Boosting models, and their results were compared with each other and with the results obtained through the Logistic Regression model. We observed that the modified SMOTE technique to balance the base, applied to the Bagging model, was the combination that presented the best results among the explored combinations.
[pt] APRENDIZADO DE MAQUINA [en] MACHINE LEARNING [pt] ARVORE DE DECISAO [en] DECISION TREE [pt] SEGURO DE VIDA [en] LIFE INSURANCE [pt] BOOSTING [en] BOOSTING [pt] PROPENSAO A CANCELAMENTO [en] CANCELLATION PROPENSITY [pt] BAGGING [en] BAGGING [pt] RANDOM FOREST [en] RANDOM FOREST [pt] DADO DESBALANCEADO [en] UNBALANCED DATA [pt] UNDER SAMPLING [en] UNDER SAMPLING [pt] OVER SAMPLING [en] OVER SAMPLING [pt] SMOTE [en] SMOTE
20	A Study on Comparison Websites in the Airline Industry and Using CART Methods to Determine Key Parameters in Flight Search Conversion / En studie av jämförelsehemsidor i flygbranschen och tillämpningen av CART metoder för att analysera nyckelparametrar i konvertering av flygsökningar.
Hansén, Jacob, Gustafsson, Axel January 2019
This bachelor thesis in applied mathematics and industrial engineering and management aimed to identify relationships between search parameters in flight comparison search engines and the exit conversion rate, while also investigating how the emergence of such comparison search engines has impacted the airline industry. To identify such relationships, several classification models were employed in conjunction with several sampling methods to produce a predictive model using the program R. To investigate the impact of the emergence of comparison websites, Porter's 5 forces and a SWOT analysis were employed to analyze the findings of a literature study and a qualitative interview. The classification models developed performed poorly with regard to several assessment metrics, which suggested that there was little to no significant relationship between the search parameters investigated and the exit conversion rate. Porter's 5 forces and the SWOT analysis suggested that the airline industry has become more competitive and that airlines which do not manage to adapt to this changing market environment will experience decreasing profitability. / This bachelor's thesis in applied mathematics and industrial engineering and management aimed to identify relationships between search parameters from flight search engines and the conversion rate for exits to an airline's website, while also investigating how the emergence of flight search engines has affected the airline industry for airlines. To identify such relationships, several classification models were applied together with sampling methods to build a predictive model in the program R. To investigate the impact of flight search engines, Porter's 5 forces and SWOT analysis were applied as theoretical frameworks to analyze information gathered through a literature study and an interview. The classification models that were built performed poorly with respect to several evaluation metrics, which suggested that there was little or no relationship between the search parameters investigated and the exit conversion rate. Porter's 5 forces and the SWOT analysis showed that the airline industry had become more exposed to competition and that airlines that fail to adapt to a changing environment will experience decreasing profitability.
True positives true negatives false positives false negatives Classification Trees Random Forest SMOTE ROSE ROC AUC LCC meta-search engine Online Travel Agency Gini impurity index Sann negativ sann positiv falsk positiv falsk negativ klassificationsträd Random Forest SMOTE ROSE ROC AUC jämförelsehemsida resebyrå Gini koefficient Probability Theory and Statistics Sannolikhetsteori och statistik
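The Gini impurity index listed among this record's keywords is the criterion CART-style classification trees commonly use to choose splits. A short sketch of the standard definition follows; the function names and the toy label lists (1 = search converted, 0 = not converted) are illustrative assumptions.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum over classes of p_c squared."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of a binary split; lower is better."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

# Toy node with two converted and two non-converted searches
parent = gini_impurity([1, 1, 0, 0])
# A split that separates the classes perfectly
perfect = split_impurity([1, 1], [0, 0])
# A split that leaves both children as mixed as the parent
useless = split_impurity([1, 0], [1, 0])
```

A tree-growing algorithm would prefer the first split because it reduces the weighted impurity from 0.5 to 0, while the second split gives no reduction at all.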
