Global ETD Search

1	Um estudo de limpeza em base de dados desbalanceada e com sobreposição de classes / A study of data cleaning in imbalanced databases with class overlap Machado, Emerson Lopes 04 1900 Dissertation (master's), Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2007. This work investigates methods for improving the accuracy of classifiers built from data sets with class imbalance (a highly skewed class distribution, which produces rare classes) and within-class imbalance (a highly skewed distribution of attribute values within a class, which produces rare cases). Standard classification algorithms are strongly affected by both kinds of imbalance: the models they generate are biased toward the majority classes (or cases) at the expense of underrepresented ones. Models built from imbalanced data sets generally have low accuracy on the minority classes, which is a problem when the class of interest is one of them. Likewise, rare cases are likely to be ignored by the learning algorithm, which is a problem when they belong to the class (or classes) of interest. A new method, Cluster-based Smote, is proposed. It combines Cluster-based Oversampling (oversampling by replicating positive cases, guided by clusters) with SMOTE (Synthetic Minority Oversampling Technique, oversampling by generating synthetic cases). Cluster-based Oversampling improves the learning of small disjuncts, which are usually related to rare cases, but overfits the model to the training set. SMOTE generates new synthetic cases instead of replicating existing ones, which addresses the overfitting caused by replication, but it does not emphasize rare cases. Cluster-based Smote gave better results than either method applied alone on eight of the nine data sets used. A second approach proposed in this research aims to reduce the class overlap that applying SMOTE may cause. The main idea is to guide SMOTE with unsupervised learning (clustering). The method implemented under this approach, C-clear, improved significantly on SMOTE on three of the nine data sets and tied on the rest. A new data cleaning method based on unsupervised learning was also proposed and incorporated into C-clear. The cleaning improved the results on only one data set, possibly because its cleaning parameters were poorly chosen. Learning these parameters from the data is left as future work. Data mining (Computing) Class imbalance Class overlap SMOTE Cluster-based Oversampling Cluster-based Smote C-clear
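The SMOTE step that this abstract builds on can be illustrated with a minimal sketch. It shows only the basic idea of interpolating between a minority case and one of its nearest minority neighbours. It is not the thesis's Cluster-based Smote or C-clear, whose clustering and cleaning details the abstract does not give. The function name `smote_sketch`, its parameters, and the toy data are assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority cases by interpolating towards neighbours.

    Illustrative sketch only: brute-force neighbour search, numeric
    features, and at least two minority cases are assumed.
    """
    rng = random.Random(seed)
    n = len(minority)
    k = min(k, n - 1)  # a case cannot have more neighbours than other cases

    # For each minority case, find the indices of its k nearest minority cases.
    neighbours = []
    for i, point in enumerate(minority):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(point, minority[j]),
        )
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.randrange(n)              # pick a minority case at random
        other = rng.choice(neighbours[base])  # pick one of its neighbours
        gap = rng.random()                   # position along the segment
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(minority[base], minority[other])]
        )
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sketch(minority, n_synthetic=10, k=2)
```

Because every synthetic case lies on a segment between two existing minority cases, SMOTE avoids exact replication, but it can place new cases in regions shared with the majority class, which is the overlap problem the second approach in the abstract targets.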
2	Credit Card Transaction Fraud Detection Using Neural Network Classifiers / Detektering av bedrägliga korttransaktioner m.h.a neurala nätverk Nazeriha, Ehsan January 2023 With the increasing use of credit card payments, credit card fraud has also increased, so a fast and accurate fraud detection system is vital for banks. To address fraud detection, several machine learning classifiers were designed and trained on a credit card transaction dataset. The dataset is heavily imbalanced, however, consisting mostly of normal transactions, which harms the performance of the algorithms. To resolve this, the generative methods Generative Adversarial Network (GAN), Variational Autoencoder (VAE) and Synthetic Minority Oversampling Technique (SMOTE) were used to generate synthetic samples for the minority class and obtain a more balanced dataset. The main purpose of this study is to evaluate the generative methods and to investigate the impact of their generated minority samples on the classifiers. The results indicate that GAN does not outperform the other generative methods: the samples generated by VAE were the most effective for three of the five classifiers. The validation and histograms of the generated samples also indicate that the VAE samples capture the distribution of the data better than those from SMOTE and GAN. A suggested improvement is to perform data engineering on the dataset before training, for instance using correlation analysis to determine which features have the greatest impact on classification, dropping the less important ones, and training the generative methods and classifiers on the trimmed-down samples. GAN Deep Learning Variational Autoencoder Anomaly Detection SMOTE Computer and Information Sciences
3	Applying the Wrapper Approach for Auto Discovery of Under-Sampling and Over-Sampling Percentages on Skewed Datasets Joshi, Ajay D 03 November 2004 Machine learning applications are plagued by the imbalance among class sizes in many real-world datasets. A dataset is said to be skewed or imbalanced when its classes are very unequally represented. A naïve classifier learned from such a skewed dataset is biased towards the majority classes, which make up most of the samples in the dataset. As a result, accuracy on the minority classes suffers. In many real-world applications, such as network intrusion detection and cancer detection from mammography images, the events of interest are very rare and the cost of not detecting them is very high. Hence it is very important to improve accuracy on the minority classes. It has been proposed previously that under-sampling the majority classes can reduce the bias of the learned classifier, and that over-sampling the minority classes, especially with SMOTE (Synthetic Minority Over-sampling TEchnique), can boost classifier accuracy on minority classes. But the question remains of how much under-sampling and over-sampling should be done for a particular learning algorithm and dataset. We present a wrapper approach that searches for the under-sampling and over-sampling (i.e. SMOTE) percentages for a particular learning algorithm on a given skewed dataset. We compare classifiers built on the wrapper-selected under-sampled and SMOTEd datasets with classifiers built on the original datasets and show a statistically significant improvement in accuracy on the minority classes. This demonstrates the efficacy of the wrapper approach in searching for the under-sampling and over-sampling percentages. Further, it provides an automated method for selecting the number of synthetic examples to create. 
Machine Learning Data Mining SMOTE RIPPER imbalance C4.5 F-value American Studies Arts and Humanities
4	Utilização de técnicas de inteligência artificial para classificação de crianças cardiopatas em base de dados desbalanceadas / Using artificial intelligence techniques to classify children with heart disease in imbalanced databases Tavares, Thiago Ribeiro 31 January 2013 Cardiovascular diseases are the leading cause of death in Brazil and worldwide. Among them, congenital heart disease, a cardiac malformation present from birth, affects 8 to 10 of every 1000 live births, and approximately one third of those affected need treatment in the first year of life. Numerous studies show that the earlier the diagnosis is established, the greater the chances of successful treatment. Caring for children with suspected heart disease generates a large amount of information, but distinguishing normal from pathological signs and symptoms at the very start, for example when the appointment is scheduled, can be key to speeding up care. For some time Artificial Intelligence, and more specifically the subfield of Data Mining, has been used as a medical decision support tool in several specialties, including cardiology. Although most applications in this context use Decision Trees for classification because of their interpretability and support for rule extraction, Support Vector Machines (SVM) have shown greater generalization power and better results in many applications. However, this kind of black-box algorithm does not produce explicit knowledge that a physician, as the domain expert, can interpret. This work proposes a medical decision support system that helps detect heart disease in children from initial data such as gender, weight, height and the presence of murmurs, with the aim of prioritizing their medical care. Techniques for handling imbalanced databases, such as SMOTE and weighted SVM, were used to improve on the results of conventional classifiers. In addition, rules could be extracted from the results obtained by the SVM. According to the specialists, the results make the decision support system viable, and it can be incorporated into clinical practice to improve the quality of the services provided. Data mining in medicine Weighted SVM Imbalanced databases SMOTE Decision tree Diagnosis support systems
5	The Impact of Real Big Data on our Future and Risk Identification Al-Shouiliy, Khaldoon 27 September 2020 No description available. Computer Science AzureML Jungle forest P2P SMOTE-3D Big data Breast Cancer
6	Enhancing Telecom Churn Prediction: Adaboost with Oversampling and Recursive Feature Elimination Approach Tran, Long Dinh 01 June 2023 Churn prediction is a critical task for businesses that want to retain valuable customers. This paper presents a comprehensive study of churn prediction in the telecom sector using 15 approaches, including popular algorithms such as Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, and AdaBoost. The study is divided into three sets of experiments, each taking a different approach to building the churn prediction model. In the first set, the model is built on the original training set. In the second, the training set is oversampled to address class imbalance. In the third, oversampling is combined with recursive feature selection to improve performance further. The results show that the AdaBoost (Adaptive Boosting) classifier with oversampling and recursive feature selection outperforms the other 14 techniques. It ranks highest on all three evaluation metrics: recall (0.841), f1-score (0.655), and roc_auc (0.793), indicating that the proposed approach predicts churn effectively and provides valuable insight into customer behavior. Churn Prediction Unbalanced Datasets Oversampling SMOTE Recursive Feature Selection RFE Machine Learning
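The reported recall and F1-score are linked through precision, which the abstract does not state. The short sketch below solves the F1 definition, F1 = 2PR / (P + R), for precision. It assumes that both figures refer to the same class and the same evaluation set, and that the reported values are rounded, so the result is only an implied approximation and not a figure from the study. The function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    return f1 * recall / (2 * recall - f1)


# Values as reported in the abstract (assumed to be rounded).
recall, f1 = 0.841, 0.655
implied_precision = precision_from_f1(f1, recall)
print(round(implied_precision, 3))
```

Under these assumptions the implied precision is a little above one half, which suggests that the high recall comes with a substantial share of false alarms among predicted churners.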
7	Detecting Fraudulent User Behaviour : A Study of User Behaviour and Machine Learning in Fraud Detection Gerdelius, Patrik, Sjönneby, Hugo January 2024 This study aims to build a machine learning model and investigate how well it detects fraudulent user behaviour on an e-commerce platform. User data was analysed to identify and extract the critical features that distinguish regular users from fraudulent ones. Two types of user data were used, Event Data and Screen Data, spanning four weeks. Principal Component Analysis (PCA) was applied to the Screen Data to reduce its dimensionality, and feature engineering was carried out on both Event Data and Screen Data. A Random Forest model, a supervised ensemble method, was used for classification. The data was imbalanced because frauds were far fewer than regular users, so two balancing methods were used: oversampling (SMOTE) and changing the probability threshold (PT) of the classification model. The best result was achieved on the resampled data with the threshold set to 0.4: the model correctly identified 80.88% of actual frauds, while 0.73% of regular users were falsely predicted to be frauds. Although this result is promising, its validity is open to question because the model may have overfitted the data set; one indication is that the result was significantly less accurate without resampling. Nevertheless, the overall conclusion is that the study indicates it is possible to distinguish frauds from regular users, with or without resampling. For future research, it would be interesting to use data covering a longer period and to train the model on real-time data to counter changes in fraudulent behaviour. Fraud Detection User Behaviour Random Forest PCA SMOTE Computer Sciences
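The effect of moving the probability threshold can be illustrated with a small sketch. It computes the share of frauds caught (true positive rate) and the share of regular users falsely flagged (false positive rate) at two cutoffs. The scores and labels are invented toy values, not data from the study, and the helper `rates_at_threshold` is a hypothetical name.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (true positive rate, false positive rate) for a score cutoff.

    A case is flagged as fraud when its predicted probability is at or
    above the threshold. Labels: 1 = fraud, 0 = regular user.
    """
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if label == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    return tp / (tp + fn), fp / (fp + tn)


# Toy predicted fraud probabilities and true labels (illustrative only).
scores = [0.95, 0.45, 0.30, 0.42, 0.10, 0.05, 0.20, 0.35]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

default_rates = rates_at_threshold(scores, labels, 0.5)
lowered_rates = rates_at_threshold(scores, labels, 0.4)
```

In this toy data, lowering the cutoff from 0.5 to 0.4 catches one more fraud at the cost of one false alarm, which is the kind of trade-off the two percentages in the abstract describe.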
8	The Application of Synthetic Signals for ECG Beat Classification Brown, Elliot Morgan 01 September 2019 A brief overview of electrocardiogram (ECG) properties and the characteristics of various cardiac conditions is given. Two different models are used to generate synthetic ECG signals. Domain knowledge is used to create synthetic examples of 16 different heart beat types with these models. Other techniques for synthesizing ECG signals are explored. Various machine learning models with different combinations of real and synthetic data are used to classify individual heart beats. The performance of the different methods and models is compared, and synthetic data is shown to be useful in beat classification. ECG synthetic data SMOTE signals classification machine learning neural networks Mathematics Physical Sciences and Mathematics
9	Adaptation des techniques actuelles de scoring aux besoins d'une institution de crédit : le CFCAL-Banque / Adaptation of current scoring techniques to the needs of a credit institution : the Crédit Foncier et Communal d'Alsace et de Lorraine (CFCAL-banque) Kouassi, Komlan Prosper 26 July 2013 Financial institutions face a variety of risks in the course of their activities, including credit, market and operational risk. These risks are not only related to the nature of the activities they perform, but also depend on predictable external factors. The instability of these factors makes the institutions vulnerable to financial risks that they must, in order to survive, be able to identify, analyze, quantify and manage appropriately. Among these risks, credit risk is the one banks fear most because of its ability to trigger a systemic crisis. The probability that an individual moves from a non-risky state to a risky state is thus central to many economic questions. In credit institutions, this issue translates into the probability that a borrower moves from a state of “good risk” to a state of “bad risk”. To quantify it, credit institutions increasingly rely on credit-scoring models. This thesis focuses on current credit-scoring techniques tailored to the needs of one credit institution, CFCAL-banque, which specializes in mortgage-backed loans. In particular, we present two nonparametric models (SVM and GAM) and compare their classification performance with that of the logit model traditionally used in banks. Our results show that SVMs perform best if only overall predictive ability is considered. However, they give lower sensitivities than the logit and GAM models; in other words, they are worse at predicting defaulting borrowers. At the present stage of our research, we recommend GAM models: although their overall predictive ability is lower than that of SVMs, they give more balanced sensitivities, specificities and predictive performance. This thesis is not exhaustive with regard to scoring techniques for credit risk management. By highlighting targeted credit-scoring models, adapting and applying them to real mortgage data, and comparing their classification performance, it makes an empirical and methodological contribution to research on scoring models for credit risk management. Credit risk Credit-scoring Probability of default Separating hyperplane Support vector machines (SVM) SMOTE technique Scoring with GAM Smooth backfitting 332.7
10	SCUT-DS: Methodologies for Learning in Imbalanced Data Streams Olaitan, Olubukola January 2018 The automation of most of our activities has led to the continuous production of data that arrive in the form of fast-arriving streams. In a supervised learning setting, instances in these streams are labeled as belonging to a particular class. When the number of classes in the data stream is more than two, such a data stream is referred to as a multi-class data stream. Multi-class imbalanced data stream describes the situation where the instance distribution of the classes is skewed, such that instances of some classes occur more frequently than others. Classes with the frequently occurring instances are referred to as the majority classes, while the classes with instances that occur less frequently are denoted as the minority classes. Classification algorithms, or supervised learning techniques, use historic instances to build models, which are then used to predict the classes of unseen instances. Multi-class imbalanced data stream classification poses a great challenge to classical classification algorithms. This is due to the fact that traditional algorithms are usually biased towards the majority classes, since they have more examples of the majority classes when building the model. These traditional algorithms yield low predictive accuracy rates for the minority instances and need to be augmented, often with some form of sampling, in order to improve their overall performances. In the literature, in both static and streaming environments, most studies focus on the binary class imbalance problem. Furthermore, research in multi-class imbalance in the data stream environment is limited. A number of researchers have proceeded by transforming a multi-class imbalanced setting into multiple binary class problems. However, such a transformation does not allow the stream to be studied in the original form and may introduce bias. 
The research conducted in this thesis aims to address this research gap by proposing a novel online learning methodology that combines oversampling of the minority classes with cluster-based majority class under-sampling, without decomposing the data stream into multiple binary sets. Rather, sampling involves continuously selecting a balanced number of instances across all classes for model building. Our focus is on improving the rate of correctly predicting instances of the minority classes in multi-class imbalanced data streams, through the introduction of the Synthetic Minority Over-sampling Technique (SMOTE) and Cluster-based Under-sampling - Data Streams (SCUT-DS) methodologies. In this work, we dynamically balance the classes by utilizing a windowing mechanism during the incremental sampling process. Our SCUT-DS algorithms are evaluated using six different types of classification techniques, followed by comparing their results against a state-of-the-art algorithm. Our contributions are tested using both synthetic and real data sets. The experimental results show that the approaches developed in this thesis yield high prediction rates of minority instances as contained in the multiple minority classes within a non-evolving stream. Multi-class Imbalanced Learning Imbalanced data sets Data streams Classification Imbalanced Learning Sampling Cluster-based Under-sampling Synthetic Oversampling Augmenting Minority Examples Online Learning SMOTE-based Oversampling
