  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

SCUT-DS: Methodologies for Learning in Imbalanced Data Streams

Olaitan, Olubukola January 2018 (has links)
The automation of most of our activities has led to the continuous production of data that arrive in the form of fast-arriving streams. In a supervised learning setting, instances in these streams are labeled as belonging to a particular class. When the number of classes in the data stream is more than two, such a data stream is referred to as a multi-class data stream. A multi-class imbalanced data stream is one in which the instance distribution of the classes is skewed, such that instances of some classes occur more frequently than others. Classes with the frequently occurring instances are referred to as the majority classes, while the classes with instances that occur less frequently are denoted as the minority classes. Classification algorithms, or supervised learning techniques, use historic instances to build models, which are then used to predict the classes of unseen instances. Multi-class imbalanced data stream classification poses a great challenge to classical classification algorithms, because traditional algorithms are usually biased towards the majority classes: they see more examples of the majority classes when building the model. These traditional algorithms yield low predictive accuracy for the minority instances and need to be augmented, often with some form of sampling, in order to improve their overall performance. In the literature, in both static and streaming environments, most studies focus on the binary class imbalance problem, and research on multi-class imbalance in the data stream environment is limited. A number of researchers have proceeded by transforming a multi-class imbalanced setting into multiple binary class problems. However, such a transformation does not allow the stream to be studied in its original form and may introduce bias.
The research conducted in this thesis aims to address this research gap by proposing a novel online learning methodology that combines oversampling of the minority classes with cluster-based majority class under-sampling, without decomposing the data stream into multiple binary sets. Rather, sampling involves continuously selecting a balanced number of instances across all classes for model building. Our focus is on improving the rate of correctly predicting instances of the minority classes in multi-class imbalanced data streams, through the introduction of the Synthetic Minority Over-sampling Technique (SMOTE) and Cluster-based Under-sampling - Data Streams (SCUT-DS) methodologies. In this work, we dynamically balance the classes by utilizing a windowing mechanism during the incremental sampling process. Our SCUT-DS algorithms are evaluated using six different types of classification techniques, followed by comparing their results against a state-of-the-art algorithm. Our contributions are tested using both synthetic and real data sets. The experimental results show that the approaches developed in this thesis yield high prediction rates of minority instances as contained in the multiple minority classes within a non-evolving stream.
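As a rough illustration of the balancing idea in this abstract (and not the SCUT-DS algorithm itself), the sketch below balances one window of a stream without splitting it into binary problems. All names (`balance_window`, `smote_point`, `nearest`) are hypothetical. Two simplifications are assumed: majority classes are reduced by plain random under-sampling as a stand-in for cluster-based under-sampling, and synthetic minority points are interpolated towards the single nearest same-class neighbour rather than one of k neighbours as in SMOTE. The target size per class is assumed to be the mean class size in the window.

```python
import random


def smote_point(a, b, rng):
    """Return a synthetic point on the line segment between a and b."""
    gap = rng.random()
    return [x + gap * (y - x) for x, y in zip(a, b)]


def nearest(point, others):
    """Return the member of others closest to point (squared Euclidean distance)."""
    return min(others, key=lambda o: sum((p - q) ** 2 for p, q in zip(point, o)))


def balance_window(window, rng):
    """Balance a window given as a list of (features, label) pairs.

    Assumption: every class is brought to the mean class size of the window.
    """
    by_class = {}
    for features, label in window:
        by_class.setdefault(label, []).append(features)
    target = round(len(window) / len(by_class))

    balanced = []
    for label, rows in by_class.items():
        if len(rows) > target:
            # Stand-in for cluster-based under-sampling: random selection.
            rows = rng.sample(rows, target)
        else:
            originals = list(rows)
            rows = list(rows)
            while len(rows) < target:
                base = rng.choice(originals)
                others = [r for r in originals if r is not base]
                if others:
                    rows.append(smote_point(base, nearest(base, others), rng))
                else:
                    # A class with a single instance can only be duplicated.
                    rows.append(list(base))
        balanced.extend((r, label) for r in rows)
    return balanced
```

In a streaming setting, a function like this would be called on each new window before the model is updated; how the window is sized and advanced is left open here.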
2

Evolutionary ensembles for imbalanced learning / Comitês evolucionários para aprendizado desbalanceado

Fernandes, Everlandio Rebouças Queiroz 13 August 2018 (has links)
In many real classification problems, the data set used for model induction is significantly imbalanced. This occurs when the number of examples of some classes is much lower than that of the other classes. Imbalanced datasets can compromise the performance of most classical classification algorithms. The classification models induced from such datasets usually present a strong bias towards the majority classes, tending to classify new instances as belonging to these classes. A commonly adopted strategy for dealing with this problem is to train the classifier on a balanced sample from the original dataset. However, this procedure can discard examples that could be important for better class discrimination, reducing classifier efficiency. On the other hand, in recent years several studies have shown that, in different scenarios, the strategy of combining several classifiers into structures known as ensembles is quite effective. This strategy has led to stable predictive accuracy and, in particular, to a greater generalization ability than that of the classifiers that make up the ensemble. This generalization power of classifier ensembles has been the focus of research in the imbalanced learning field in order to reduce the bias toward the majority classes, despite the complexity involved in generating efficient ensembles. Optimization meta-heuristics, such as evolutionary algorithms, have many applications in ensemble learning, although they are little used for this purpose. For example, evolutionary algorithms maintain a set of possible solutions and diversify these solutions, which helps to escape local optima. In this context, this thesis investigates and develops approaches to deal with imbalanced datasets, using ensembles of classifiers induced, by means of meta-heuristics, from samples taken from the original dataset.
More specifically, this thesis proposes three solutions based on evolutionary ensemble learning and a fourth proposal that uses a pruning mechanism based on dominance ranking, a common concept in multiobjective evolutionary algorithms. Experiments showed the potential of the developed solutions.
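The dominance ranking mentioned in this abstract can be illustrated with a minimal sketch of Pareto dominance. This is a generic illustration, not the pruning mechanism of the thesis: the function names are hypothetical, and the two objectives per classifier (for example an accuracy score and a diversity score, both to be maximised) are assumed for the example only.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the indices of the candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (accuracy, diversity) scores for four ensemble members.
candidate_scores = [(0.9, 0.2), (0.8, 0.5), (0.7, 0.4), (0.6, 0.9)]
kept = pareto_front(candidate_scores)
```

Here the third candidate is worse than the second on both objectives, so a pruning step based on the first front would drop it and keep the other three.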
4

Why are pulsars hard to find?

Lyon, Robert James January 2016 (has links)
Searches for pulsars during the past fifty years have been characterised by two problems that make discovery difficult: i) an increasing volume of data to be searched, and ii) an increasing number of 'candidate' pulsar detections arising from that data, requiring analysis. Whilst almost all candidates are caused by noise or interference, these are often indistinguishable from real pulsar detections. Deciding which candidates should be studied is therefore difficult; indeed, it has become known as the 'candidate selection problem'. This thesis presents an interdisciplinary study of the selection problem, with the aim of developing a new method able to mitigate it, specifically for future pulsar surveys undertaken with the Square Kilometre Array (SKA). Through a combination of critical literature evaluations, theoretical modelling exercises, and empirical investigations, the selection problem is described in depth here for the first time. It is shown to be characterised by the dominance of Gaussian distributed noise signals, a factor that no existing selection method accounts for. The study also reveals a significant trend in survey data rates, which suggests that candidate selection is transitioning from an off-line processing procedure to an on-line, real-time decision-making process. In response, a new real-time machine learning based method, the GH-VFDT, is introduced in this thesis. The results presented here show that a significant improvement in selection performance can be achieved using the GH-VFDT, which utilises a learning procedure optimised for data characterised by skewed class distributions. The principled development of new numerical features that maximise the separation between pulsars and Gaussian noise has also greatly improved GH-VFDT pulsar recall.
It is therefore concluded that the sub-optimal performance of existing selection systems is due to a combination of poor feature design, insensitivity to noise, and an inability to deal with skewed class distributions.
5

Statistical Learning with Imbalanced Data

Shipitsyn, Aleksey January 2017 (has links)
In this thesis, several sampling methods for Statistical Learning with imbalanced data have been implemented and evaluated with a new metric, imbalanced accuracy. Several modifications and new algorithms have been proposed for intelligent sampling: Border Links, Clean Border Undersampling, One-Sided Undersampling Modified, DBSCAN Undersampling, Class Adjusted Jittering, Hierarchical Cluster Based Oversampling, DBSCAN Oversampling, Fitted Distribution Oversampling, Random Linear Combinations Oversampling, and Center Repulsion Oversampling. A set of requirements for a satisfactory performance metric for imbalanced learning has been formulated, and a new metric for evaluating classification performance has been developed accordingly. The new metric is based on a combination of the worst class accuracy and the geometric mean. In the testing framework, the nonparametric Friedman test and the post hoc Nemenyi test have been used to assess the performance of classifiers, sampling algorithms, and combinations of classifiers and sampling algorithms on several data sets. A new approach to detecting algorithms with dominating and dominated performance has been proposed, together with a new way of visualizing the results in a network.
From experiments on simulated and several real data sets we conclude that: i) different classifiers are not equally sensitive to sampling algorithms, ii) sampling algorithms have different performance within specific classifiers, iii) oversampling algorithms perform better than undersampling algorithms, iv) Random Oversampling and Random Undersampling outperform many well-known sampling algorithms, v) our proposed algorithms Hierarchical Cluster Based Oversampling, DBSCAN Oversampling with FDO, and Class Adjusted Jittering perform much better than other algorithms, vi) a few good combinations of a classifier and sampling algorithm may boost classification performance, while a few bad combinations may spoil the performance, but the majority of combinations are not significantly different in performance.
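The two ingredients named for the new metric, worst class accuracy and geometric mean, can be computed as in the sketch below. The abstract does not say how the thesis combines them into "imbalanced accuracy", so the sketch deliberately stops at the two components. The function names are hypothetical, and the geometric mean is assumed to be taken over per-class recalls, which is the usual reading in imbalanced learning.

```python
import math
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of correctly predicted instances for each true class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def worst_class_accuracy(y_true, y_pred):
    """Recall of the class the classifier handles worst."""
    return min(per_class_recall(y_true, y_pred).values())


def geometric_mean_recall(y_true, y_pred):
    """Geometric mean of the per-class recalls (assumed definition)."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return math.prod(recalls) ** (1.0 / len(recalls))
```

Both quantities fall to zero if any single class is never predicted correctly, which is why they are less forgiving than plain accuracy on skewed data.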
6

Learning in the Presence of Skew and Missing Labels Through Online Ensembles and Meta-reinforcement Learning

Vafaie, Parsa 07 September 2021 (has links)
Data streams are large sequences of data, possibly endless and temporally ordered, that are commonplace in Internet of Things (IoT) applications such as intrusion detection in computer networking, fraud detection in financial institutions, real-time tumor tracking in radiotherapy and social media analysis. Algorithms learning from such streams need to be able to construct near real-time models that continuously adapt to potential changes in patterns, in order to retain high performance throughout the stream. It follows that there are numerous challenges involved in supervised learning (or so-called classification) in such environments. One of the challenges in learning from streams is multi-class imbalance, in which the rates of instances in the different class labels differ substantially. Notably, classification algorithms may become biased towards the classes with more frequent instances, sacrificing the performance of the less frequent or so-called minority classes. Further, minority instances often arrive infrequently and in bursts, making accurate model construction problematic. For example, network intrusion detection systems must be able to distinguish between normal traffic and multiple minority classes corresponding to a variety of different types of attacks. Further, having labels for all instances is often infeasible, since labels may be missing or late-arriving. For instance, when learning from a stream to detect network intrusions, the true label for all instances might not be available, or it might take time until the label is made available, especially for new types of attacks. In this thesis, we contribute to the advancement of online learning from evolving streams by focusing on the above-mentioned areas of multi-class imbalance and missing labels. First, we introduce a multi-class online ensemble algorithm designed to maintain a balanced performance over all classes.
Specifically, our approach samples instances with replacement while dynamically increasing the weights of under-represented classes, in order to produce models that benefit all classes. Our experimental results show that our online ensemble method performs well against multi-class imbalanced data in various datasets. We further continue our study by introducing an approach to dealing with missing labels that utilizes both labelled and unlabelled data to increase a model’s performance. That is, our method utilizes labelled data for pseudo-labelling unlabelled instances, allowing the model to perform better in environments where labels are scarce. More specifically, our approach features a meta-reinforcement learning agent, trained on multiple-source streams, that can effectively select the prediction of a K nearest neighbours (K-NN) classifier as the label for unlabelled instances. Extensive experiments on benchmark datasets demonstrate the value and effectiveness of our approach and confirm that our method outperforms the state of the art.
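Sampling with replacement in an online setting is commonly done by presenting each arriving instance to a base learner k times, with k drawn from a Poisson distribution (the online bagging scheme of Oza and Russell). The sketch below shows one way a class-dependent rate could favour under-represented classes. It is an assumption-laden illustration, not the algorithm of the thesis: `ClassWeighter`, its rule (largest class count divided by the count of the arriving class) and the helper `poisson` are all hypothetical.

```python
import math
import random
from collections import Counter


class ClassWeighter:
    """Tracks class counts seen so far and returns a sampling rate per instance."""

    def __init__(self):
        self.counts = Counter()

    def rate(self, label):
        # Assumed rule: rarer classes get a proportionally larger Poisson rate.
        self.counts[label] += 1
        return max(self.counts.values()) / self.counts[label]


def poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def replay_count(weighter, label, rng):
    """How many times one ensemble member should train on this instance."""
    return poisson(weighter.rate(label), rng)
```

Each ensemble member would call `replay_count` independently for every arriving instance, so members see different resampled versions of the same stream.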
7

Multi-Class Imbalanced Learning for Time Series Problem : An Industrial Case Study

Andersson, Melanie January 2020 (has links)
Classification problems with multiple classes and imbalanced sample sizes present a different challenge from binary classification problems. Methods have been proposed to handle imbalanced learning; however, most of them are specifically designed for binary classification problems. Multi-class imbalance imposes additional challenges when applied to time series classification problems, such as weather classification. In this thesis, we introduce, apply and evaluate a new algorithm for handling multi-class imbalanced problems involving time series data. Our proposed algorithm is designed to handle both multi-class imbalance and time series classification problems and is inspired by the Imbalanced Fuzzy-Rough Ordered Weighted Average Nearest Neighbor Classification algorithm. The feasibility of our proposed algorithm is studied through an empirical evaluation performed on a telecom use case at Ericsson, Sweden, where data from commercial microwave links is used for weather classification. Our proposed algorithm is compared to the model currently used at Ericsson, which is a one-dimensional convolutional neural network, as well as to three other deep learning models. The empirical evaluation indicates that the performance of our proposed algorithm for weather classification is comparable to that of the current solution. Our proposed algorithm and the current solution are the two best performing models of the study.
8

Overcoming the Curse of Missing and Noisy Data in Computational Drug Design

Meng, Fanwang January 2022 (has links)
Machine learning (ML) has enjoyed great success in chemistry and drug design, from designing synthetic pathways to drug screening and biomolecular property prediction. However, the generalizability and robustness of an ML model require high-quality training data, which is often difficult to obtain, especially when the training data is acquired from experimental measurements. While one can always discard all data associated with noisy and/or missing values, this often results in discarding invaluable data. This thesis presents mathematical techniques to address this problem and applies them to problems in molecular medicinal chemistry. In Chapter 1, we indicate that the missing-data problem can be expressed as a matrix completion problem, and we point out how frequently matrix completion problems arise in (bio)chemical problems. Next, in Chapter 2, we use matrix completion to impute the missing values in protein-NMR data, and use this as a stepping-stone for understanding protein allostery. This chapter also uses several other techniques from statistical data analysis and machine learning, including denoising (from robust principal component analysis), latent feature identification from singular-value decomposition, and residue clustering by a Gaussian mixture model. In Chapter 3, Δ-learning is used to predict the free energies of hydration (Δ𝐺). The aim of this study is to correct estimated hydration energies from low-level quantum chemistry calculations using continuum solvation models without significant additional computation. Extensive feature engineering, 8 different regression algorithms, and Gaussian process regression (38 different kernels) were used to construct the predictive models. The optimal model gives an MAE of 0.6249 kcal/mol and an RMSE of 1.0164 kcal/mol. Chapter 4 provides an open-source computational tool, Procrustes, to find the maximum similarities between metrics.
Some examples are also given to show how to use Procrustes for chemical and biological problems. Finally, in Chapters 5 and 6, a database for permeability of the blood-brain barrier (BBB) was curated, and combined with resampling strategies to form predictive models. The resulting models have promising performance and are released along with a computational tool, B3clf, for its evaluation. / Thesis / Doctor of Science (PhD)
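The Δ-learning idea in Chapter 3 amounts to learning the difference between a cheap estimate and a reference value, then adding the learned correction back onto the cheap estimate. The sketch below shows that structure with the simplest possible model, a one-variable least-squares line. Everything in it is assumed for illustration: the function names, the single numeric descriptor, and the linear form of the correction. The thesis itself reports using several regression algorithms and Gaussian process regression.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_delta_model(descriptor, low_level, reference):
    """Fit the correction (reference minus low-level estimate) as a function of a descriptor."""
    residuals = [ref - low for ref, low in zip(reference, low_level)]
    return fit_line(descriptor, residuals)


def predict(model, descriptor_value, low_level_value):
    """Corrected prediction: low-level estimate plus the learned correction."""
    slope, intercept = model
    return low_level_value + slope * descriptor_value + intercept
```

The design point is that the model only has to capture the error of the low-level calculation, which is often smoother and smaller in magnitude than the target quantity itself.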
9

Imbalanced Learning and Feature Extraction in Fraud Detection with Applications / Obalanserade Metoder och Attribut Aggregering för Upptäcka Bedrägeri, med Appliceringar

Jacobson, Martin January 2021 (has links)
This thesis deals with fraud detection in a real-world environment with datasets coming from Svenska Handelsbanken. The goal was to investigate how well machine learning can classify fraudulent transactions and how new additional features affect classification. The models used were EFSVM, RUTSVM, CS-SVM, ELM, MLP, Decision Tree, Extra Trees, and Random Forests. To determine the best results, the Matthews Correlation Coefficient was used as the performance metric, which has been shown to have a medium bias for imbalanced datasets. Each model could deal with highly imbalanced datasets, which are common in fraud detection. The best results were achieved with Random Forest and Extra Trees. No significance tests were performed, because the datasets were relatively small, so small differences in results are not reliable. The best scores were around 0.4 for the real-world datasets, though the score itself says little, as it is more a reflection of the dataset’s separability. These scores were obtained when using aggregated features rather than the standard raw dataset. Recall was around 0.88-0.93, with precision increasing by 34.4%-67%, resulting in a large decrease in false positives. Evaluation results differed greatly from the test runs, showing either a substantial increase or a substantial decrease. Two explanations are discussed: a large distribution change in the evaluation set, and the possibility that the 100% increase in sample size for evaluation made the tests unrepresentative of performance. Feature aggregation was a central topic of this thesis, with the main focus on behaviour features, which can describe the patterns and habits of customers. There were five categories of these: sender’s fraud history, sender’s transaction history, sender’s time transaction history, sender’s history to receiver, and receiver’s history. Of these, the first gave the largest performance increase and the top score; the others showed less potential, and most did not improve the results.
Further studies are needed before discarding these features, to be certain they do not improve performance. Together with the data aggregation, a tool for visualizing high-dimensional data (t-SNE) was used to great success. It gave an early understanding of what newly added features would bring to classification. For the best dataset, it could be seen that a new sub-cluster of transactions had been created, leading to the belief that classification scores could improve, which they did. Feature selection and PCA reduction techniques were also studied; PCA showed good results and increased performance, while feature selection showed no conclusive improvements. Over- and under-sampling were used and neither improved the scores, though under-sampling maintained the results, which is of interest as the dataset grows.
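For reference, the Matthews Correlation Coefficient used as the performance metric in this thesis is computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; the function name, the convention of returning 0.0 when the denominator is zero, and the example counts are assumptions of this illustration and are not taken from the thesis.

```python
import math


def matthews_corrcoef(tp, fp, fn, tn):
    """MCC from confusion-matrix counts; 0.0 is returned when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical fraud-detection outcome: 100 frauds among 1000 transactions.
score = matthews_corrcoef(tp=90, fp=10, fn=10, tn=890)

# A classifier that never flags fraud reaches 90% accuracy here but MCC 0.0.
never_flags = matthews_corrcoef(tp=0, fp=0, fn=100, tn=900)
```

The second call shows why the metric suits imbalanced data: predicting only the majority class is rewarded by accuracy but not by MCC.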
10

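Material backorder prediction in the context of intelligent inventory management: an application of imbalanced learning / Previsão de falta de materiais no contexto de gestão inteligente de inventário: uma aplicação de aprendizado desbalanceado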

Santis, Rodrigo Barbosa de 26 March 2018 (has links)
Material backorder (or stockout) is a common supply chain problem, impacting the service level and effectiveness of an inventory system. Identifying the materials with the highest risk of shortage prior to its occurrence presents a significant opportunity to improve a company’s overall performance. However, the complexity of this sort of problem is high, due to the class imbalance between missing and non-missing items in inventory, which can reach proportions of 1 to 100. In this work, machine learning classifiers are investigated in order to propose a predictive model that fills this gap in the literature. Specific metrics such as the area under the Receiver Operating Characteristic and precision-recall curves, as well as sampling techniques and ensemble learning, are employed for this task. The proposed model was tested in two real case studies, in which it was verified that the use of the tool may contribute to improving the service level in the supply chain. / CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
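The area under the ROC curve mentioned in this abstract has a simple pairwise interpretation: it is the probability that a randomly chosen positive item (here, a backordered material) receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The function name and the example scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores for six items; label 1 marks an actual backorder.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.1]
example_labels = [1, 1, 0, 1, 0, 0]
example_auc = roc_auc(example_scores, example_labels)
```

Because the value depends only on how positives rank against negatives, it does not change when the class ratio shifts, which is one reason it is reported alongside precision-recall curves for problems with ratios near 1 to 100.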
