61 |
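Previsão de falta de materiais no contexto de gestão inteligente de inventário: uma aplicação de aprendizado desbalanceado / Predicting material shortages in the context of intelligent inventory management: an application of imbalanced learning
Santis, Rodrigo Barbosa de (26 March 2018)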
Submitted by Geandra Rodrigues on 2018-06-19. Approved for entry into archive by Adriana Oliveira on 2018-06-27. Made available in DSpace on 2018-06-27.
Previous issue date: 2018-03-26 / CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Material backorder (or stockout) is a common supply chain problem that affects the service level and efficiency of an inventory system. Identifying the materials at highest risk of shortage before the event occurs offers a significant opportunity to improve a company's overall performance. The problem is difficult, however, because of the class imbalance between backordered and non-backordered items in inventory, which can reach ratios of 1 to 100. In this work, classification algorithms are investigated in order to propose a predictive model that fills this gap in the literature. Specific metrics, such as the area under the Receiver Operating Characteristic and precision-recall curves, together with sampling techniques and ensemble learning, are applied to the task. The proposed model was tested in two real case studies, which showed that adopting the tool can contribute to improving the service level in a supply chain.
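The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes hypothetical toy scores at the 1-to-100 ratio mentioned above and is not the thesis's own code; the function names `roc_auc` and `average_precision` are chosen here for illustration only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking (Mann-Whitney) form."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


# Hypothetical toy data: 2 backordered items among 200 others (a 1:100 ratio).
scores = [i / 1000 for i in range(202)]
labels = [0] * 202
labels[201] = 1  # one positive receives the highest score
labels[191] = 1  # the other is ranked 11th, behind 9 negatives

print(f"ROC AUC: {roc_auc(labels, scores):.4f}")
print(f"Average precision: {average_precision(labels, scores):.4f}")
```

On this toy ranking the ROC AUC is close to 0.98 while the average precision is only about 0.59: nine false alarms barely move the ROC curve when there are 200 negatives, but they cut precision sharply, which is one reason precision-recall metrics are often reported alongside ROC AUC for heavily imbalanced data.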
|
62 |
Analysing and predicting differences between methylated and unmethylated DNA sequence features
Ali, Isse (January 2015)
DNA methylation is involved in various biological phenomena, and its dysregulation has been shown to correlate with a number of human disease processes, including cancers, autism, and autoimmune, mental health and neurodegenerative disorders. Characterising and modelling these phenomena has therefore become important for understanding their mechanisms in both health and disease. An attempt has previously been made to map DNA methylation across human tissues; however, how to distinguish between methylated, unmethylated and differentially-methylated groups using DNA sequence features remains unclear. The aim of this study is therefore, firstly, to investigate DNA methylation classes and predict them from DNA sequence features and, secondly, to identify further methylation-associated DNA sequence features and to distinguish methylation differences between males and females in both healthy and diseased statuses. The research is conducted on three samples within nine biological feature subsets extracted from DNA sequence patterns (Human genome database). Two samples contain classes (methylated, unmethylated and differentially-methylated) within a total of 642 samples with 3,809 attributes derived from four human chromosomes, i.e. chromosomes 6, 20, 21 and 22. The third sample covers all human chromosomes and encompasses 1,628 individuals; from it, 1,505 CpG loci (features) were extracted using hierarchical clustering (heatmap) with pairwise correlation distance, followed by feature selection methods. From this analysis, the author extracts 47 features associated with gender and age, 17 of which reveal significant methylation differences between males and females.
Methylation classes were predicted with a K-nearest neighbour classifier combined with ten-fold cross-validation. Some of the data were severely imbalanced (i.e., existed in sub-classes), and it is well established that direct analysis in machine learning is biased towards the majority class. Hence, the author proposes a Modified Leave-One-Out (MLOO) cross-validation and AdaBoost methods to tackle these issues, with the aim of producing a balanced outcome and limiting the bias arising from differences between the classes involved. This yielded predictive accuracies between 75% and 100%, based on the DNA sequence context.
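One common way to keep cross-validation folds from losing the minority class is to stratify them. The sketch below shows that general idea under stated assumptions: it is not the MLOO procedure proposed in the thesis (which the abstract does not describe), and the labels and the helper name `stratified_folds` are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign sample indices to k folds class by class, in round-robin order."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # spread each class evenly over the folds
    return folds


# Hypothetical labels: 90 majority-class samples and 10 minority-class samples.
labels = ["methylated"] * 90 + ["unmethylated"] * 10
folds = stratified_folds(labels, k=10)

for number, fold in enumerate(folds, start=1):
    minority = sum(labels[i] == "unmethylated" for i in fold)
    print(f"fold {number}: {len(fold)} samples, {minority} minority")
```

Each of the ten folds receives nine majority samples and one minority sample, so every held-out fold contains the rare class. In practice the indices would usually be shuffled within each class before assignment.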
|
63 |
Utveckling av beslutsstöd för kreditvärdighet / Development of decision support for creditworthiness
Arvidsson, Martin; Paulsson, Eric (January 2013)
The aim is to develop a new decision-making model for credit loans. The model is specific to credit applicants of the OKQ8 bank, because it is based on data from earlier credit applicants of the client (the bank). The final model takes information about a new applicant as input and predicts the outcome as either the good risk group or the bad risk group, based on the applicant’s properties. The prediction may then lay the foundation for the decision to grant or deny a credit loan. Because of the skewed distribution of the response variable, different sampling techniques are evaluated. These include oversampling with SMOTE, random undersampling and pure oversampling in the form of scalar weighting of the minority class. It is shown that the predictive quality of a classifier is affected by the distribution of the response, and that the oversampled information is not too redundant. Three classification techniques are evaluated. Our results suggest that a multi-layer neural network with 18 neurons in a hidden layer, equipped with an ensemble technique called boosting, gives the best predictive power. The most successful model is based on a feed-forward structure and trained with a variant of back-propagation using conjugate-gradient optimization. Two other models with good prediction quality are developed using logistic regression and a decision tree classifier, but they do not reach the level of the network. However, the results of these models are used to answer the question of which customer properties are important when determining credit risk. Two examples of important customer properties are income and the number of earlier credit reports on the applicant. Finally, we use the best classification model to predict the outcome for a set of applicants declined by the existing filter. The results show that the network model accepts over 60% of the applicants who had previously been denied credit. This may indicate that the client’s suspicion that the existing model is too restrictive is in fact correct.
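The three balancing ideas named in this abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the thesis's implementation: `smote_like` only mimics the core interpolation step of SMOTE, and the data, seed and function names are hypothetical.

```python
import random


def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class."""
    return rng.sample(majority, n_keep)


def smote_like(minority, n_new, k, rng):
    """Create synthetic minority points by interpolating toward near neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base` (squared Euclidean distance).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical 2-D applicants: 40 "good risk" (majority), 4 "bad risk" (minority).
majority = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(40)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]

kept_majority = random_undersample(majority, 20, rng)
synthetic = smote_like(minority, 16, k=2, rng=rng)
balanced_minority = minority + synthetic

# Scalar weighting: leave the data unchanged and weight each minority example.
minority_weight = len(majority) / len(minority)

print(len(kept_majority), len(balanced_minority), minority_weight)
```

Because every synthetic point is a convex combination of two real minority points, it stays inside the region the minority class already occupies; this is the sense in which such oversampling adds less redundant information than simply duplicating rows.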
|
64 |
Handling Imbalanced Data Classification With Variational Autoencoding And Random Under-Sampling Boosting
Ludvigsen, Jesper (January 2020)
This thesis compares three pre-processing methods for imbalanced classification data. A Variational Autoencoder (VAE), Random Under-Sampling Boosting (RUSBoost) and a hybrid of the two are applied to three imbalanced classification data sets with different degrees of class imbalance. A logistic regression (LR) model is fitted to each pre-processed data set, and the pre-processing methods are evaluated on its classification performance. All three methods show indications of different advantages when handling class imbalance. For each pre-processed data set, the LR model is better at correctly classifying minority-class observations than an LR model fitted to the original imbalanced data sets. In terms of overall classification performance, both VAE and RUSBoost improve the classification results, while the hybrid method performs worse on the moderately imbalanced data and best on the highly imbalanced data.
|
65 |
Applying Machine Learning Methods to Predict the Outcome of Shots in Football
Hedar, Sara (January 2020)
The thesis investigates a publicly available dataset which covers more than three million events in football matches. The aim of the study is to train machine learning models capable of modeling the relationship between a shot event and its outcome, that is, to predict whether a football shot will result in a goal or not. By representing the shot in different ways, the aim is to draw conclusions regarding which elements of a shot allow for a good prediction of its outcome. The shot representation was varied both by including different numbers of events preceding the shot and by varying the set of features describing each event. The study shows that the performance of the machine learning models benefits from including events preceding the shot. The highest predictive performance was achieved by a long short-term memory neural network trained on the shot event and six events preceding the shot. The features found to have the largest positive impact on the shot events were the precision of the event, the position on the field and how the player was in contact with the ball. The size of the dataset was also evaluated, and the results suggest that it is sufficiently large for the size of the networks evaluated.
|
66 |
Credit Scoring using Machine Learning Approaches
Chitambira, Bornvalue (January 2022)
This project explores machine learning approaches that are used in credit scoring. We consider consumer credit scoring rather than corporate credit scoring. Our focus is on methods currently used in practice by banks, such as logistic regression and decision trees, and we compare their performance against machine learning approaches such as support vector machines (SVM), neural networks and random forests. In our models we address important issues such as dataset imbalance, model overfitting and calibration of model probabilities. The six machine learning methods we study are support vector machine, logistic regression, k-nearest neighbour, artificial neural networks, decision trees and random forests. We implement these models in Python and analyse their performance on a credit dataset with 30,000 observations from Taiwan, extracted from the University of California Irvine (UCI) machine learning repository.
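Calibration of model probabilities, mentioned above, is commonly checked with the Brier score and a reliability table. The following is a minimal sketch of those two standard diagnostics on hypothetical predictions; the abstract does not say which calibration checks the project actually used, and the function names are illustrative.

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def reliability_table(labels, probs, n_bins=5):
    """Per-bin (mean predicted probability, observed default rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows


# Hypothetical default labels (1 = default) and predicted default probabilities.
labels = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]
probs = [0.1, 0.1, 0.3, 0.3, 0.5, 0.5, 0.7, 0.9, 0.9, 0.9]

print(f"Brier score: {brier_score(labels, probs):.3f}")
for mean_p, rate, count in reliability_table(labels, probs):
    print(f"predicted {mean_p:.2f}  observed {rate:.2f}  n={count}")
```

A well-calibrated model shows observed rates close to the mean predicted probability in every bin. Resampling an imbalanced training set typically shifts predicted probabilities, which is one reason calibration is worth checking after balancing.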
|
67 |
Machine Learning for Classification of Temperature Controlled Containers Using Heavily Imbalanced Data / Maskininlärning för klassificering av temperatur reglerbara containrar genom användande av extremt obalanserad data
Ranjith, Adam (January 2022)
Temperature-controllable containers are frequently used to transport pharmaceutical cargo around the world. One of the leading manufacturers of these containers has a method for detecting containers with a faulty cooling system before a shipment is made. The method works but is not perfect, because it tends to misclassify containers. This thesis therefore investigates whether machine learning would make the classification of containers more accurate. One difficulty is that the data set is extremely imbalanced. If machine learning can be used to improve container manufacturers' fault detection systems, it would mean less damaged and delayed pharmaceutical cargo, which could be vital. Various combinations of machine learning classifiers and techniques for handling the imbalance were tested in order to find the best one. The Random Forest classifier with oversampling was the best-performing combination; it performed about as well as the company’s current method, with a recall of 92% and a precision of 34%. Previously, there were no known papers on machine learning for the classification of temperature-controllable containers. Other manufacturers can now use the concepts and methods presented in this thesis to enhance the effectiveness of their fault detection systems and consequently improve the overall shipping efficiency of pharmaceutical cargo.
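The reported recall of 92% and precision of 34% can be combined into single summary scores. The short calculation below is a worked illustration using the standard F-beta formula; the helper name `f_beta` is chosen here, and only the two percentages come from the abstract.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean; beta > 1 favours recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.34, 0.92  # values reported in the abstract

f1 = f_beta(precision, recall)
f2 = f_beta(precision, recall, beta=2.0)
# With precision p, each correctly flagged fault comes with (1 - p) / p false alarms.
false_alarms_per_hit = (1 - precision) / precision

print(f"F1 = {f1:.3f}")
print(f"F2 = {f2:.3f}")
print(f"False alarms per true fault flagged: {false_alarms_per_hit:.2f}")
```

This gives an F1 of roughly 0.50 and an F2 of roughly 0.69, with about 1.9 false alarms for every faulty container correctly flagged. Whether that trade-off is acceptable depends on the relative cost of a missed fault and an unnecessary inspection, which the abstract does not quantify.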
|
68 |
Improving classification accuracy for machine learning / 機械学習における分類精度の向上 / キカイ ガクシュウ ニオケル ブンルイ セイド ノ コウジョウ
鄭 弯弯, Wanwan Zheng (22 March 2021)
This thesis is organized into five chapters. Chapter 1 describes the current state, applications and structure of machine learning, and introduces the three problems addressed in this research. Chapter 2 proposes a way to improve the performance of feature selection methods on low-sample-size data. Chapter 3 empirically studies the effects of class imbalance and training data size on classifier accuracy. Chapter 4 addresses the problem of noise hindering classifier learning and proposes a fast class-noise detector based on multi-factor learning, in response to the shortcomings of existing noise detection algorithms: over-cleansing, high computational complexity and long response time. Chapter 5 summarizes the main results and discusses future work and prospects. / Doctor of Culture and Information Science / Doshisha University
|
69 |
Optimising Machine Learning Models for Imbalanced Swedish Text Financial Datasets: A Study on Receipt Classification: Exploring Balancing Methods, Naive Bayes Algorithms, and Performance Tradeoffs
Hu, Li Ang; Ma, Long (January 2023)
This thesis investigates imbalanced Swedish text financial datasets, specifically receipt classification using machine learning models. The study explores the effectiveness of under-sampling and over-sampling methods for Naive Bayes algorithms, in collaboration with Fortnox for a controlled experiment. The balancing methods are compared on accuracy, Matthews correlation coefficient (MCC), F1 score, precision and recall. The findings contribute to Swedish text classification by providing insights into balancing methods. The thesis examines balancing methods and parameter tuning of machine learning models for imbalanced datasets. Multinomial Naive Bayes (MultiNB) algorithms in natural language processing (NLP) are studied, with potential application in image classification for assessing industrial thin component deformation. Experiments show that balancing methods significantly affect MCC and recall, with a tradeoff between recall, MCC and accuracy. Smaller alpha values generally improve accuracy. Combining the Synthetic Minority Oversampling Technique (SMOTE) with Tomek's link-removal algorithm (developed in 1976 by Ivan Tomek), applying Tomek first and then SMOTE (TomekSMOTE), yields promising accuracy improvements. Due to time constraints, training with the reverse order, over-sampling with SMOTE first and then cleaning with Tomek links (SMOTETomek), is incomplete. The best MCC is achieved when alpha is 0.01 on imbalanced datasets.
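The tradeoff between accuracy, recall and MCC described above can be made concrete with two confusion matrices. The counts below are hypothetical and chosen only to illustrate the effect; they are not results from the thesis, and the dictionary names are illustrative.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fp, fn, tn):
    return tp / (tp + fn)


# Hypothetical test set: 1000 receipts, 50 of them in the minority category.
always_majority = dict(tp=0, fp=0, fn=50, tn=950)   # never predicts the minority
after_balancing = dict(tp=40, fp=60, fn=10, tn=890)  # trained on balanced data

for name, cm in [("always majority", always_majority),
                 ("after balancing", after_balancing)]:
    print(f"{name}: accuracy={accuracy(**cm):.3f} "
          f"recall={recall(**cm):.3f} MCC={mcc(**cm):.3f}")
```

In this illustration accuracy falls from 0.95 to 0.93 after balancing, while recall rises from 0 to 0.8 and MCC from 0 to about 0.54. Accuracy alone would prefer the classifier that never detects the minority class, which is why MCC is a useful companion metric on imbalanced data.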
|
70 |
The Effect of Variability Imbalance on Lead Time
Rokni, Mitra (January 2022)
This master's thesis investigates the impact of unbalanced variability on lead time using a simulation-based optimization approach. The first hypothesis is that variation in service time has a strong effect on lead time. The second is that placing the most variable station, in terms of CVp, at the end of the line increases lead time. To evaluate these hypotheses, Fact Analyzer Simulation Software version beta7 was used to simulate and optimize two different models. First, the effect of a line imbalanced in terms of service time on total lead time was investigated in a simple hypothetical production-line model. In the second part of the thesis, a real health care model was adopted from Frandsen and Engqvist’s project at Skaraborg Hospital (SkaS). By optimizing this model with the NSGA-II algorithm, the effects of the variance and mean of service time on the variance and mean of lead time were evaluated and compared. In both the hypothetical and the health care models, reducing the variance of service time did not decrease total lead time, indicating that the hypothesis should be rejected. Keywords: service time, CVp, lead time, imbalanced, variability, mean lead time, variance lead time, waiting time
|