Global ETD Search

11	"Novas abordagens em aprendizado de máquina para a geração de regras, classes desbalanceadas e ordenação de casos" / "New approaches in machine learning for rule generation, class imbalance and rankings" Ronaldo Cristiano Prati 07 July 2006 (has links) Algoritmos de aprendizado de máquina são frequentemente os mais indicados em uma grande variedade de aplicações de mineração dados. Entretanto, a maioria das pesquisas em aprendizado de máquina refere-se ao problema bem definido de encontrar um modelo (geralmente de classificação) de um conjunto de dados pequeno, relativamente bem preparado para o aprendizado, no formato atributo-valor, no qual os atributos foram previamente selecionados para facilitar o aprendizado. Além disso, o objetivo a ser alcançado é simples e bem definido (modelos de classificação precisos, no caso de problemas de classificação). Mineração de dados propicia novas direções para pesquisas em aprendizado de máquina e impõe novas necessidades para outras. Com a mineração de dados, algoritmos de aprendizado estão quebrando as restrições descritas anteriormente. Dessa maneira, a grande contribuição da área de aprendizado de máquina para a mineração de dados é retribuída pelo efeito inovador que a mineração de dados provoca em aprendizado de máquina. Nesta tese, exploramos alguns desses problemas que surgiram (ou reaparecem) com o uso de algoritmos de aprendizado de máquina para mineração de dados. Mais especificamente, nos concentramos seguintes problemas: Novas abordagens para a geração de regras. Dentro dessa categoria, propomos dois novos métodos para o aprendizado de regras. No primeiro, propomos um novo método para gerar regras de exceção a partir de regras gerais. No segundo, propomos um algoritmo para a seleção de regras denominado Roccer. Esse algoritmo é baseado na análise ROC. Regras provêm de um grande conjunto externo de regras e o algoritmo proposto seleciona regras baseado na região convexa do gráfico ROC. Proporção de exemplos entre as classes. Investigamos vários aspectos relacionados a esse tópico. Primeiramente, realizamos uma série de experimentos em conjuntos de dados artificiais com o objetivo de testar nossa hipótese de que o grau de sobreposição entre as classes é um fator complicante em conjuntos de dados muito desbalanceados. Também executamos uma extensa análise experimental com vários métodos (alguns deles propostos neste trabalho) para balancear artificialmente conjuntos de dados desbalanceados. Finalmente, investigamos o relacionamento entre classes desbalanceadas e pequenos disjuntos, e a influência da proporção de classes no processo de rotulação de exemplos no algoritmo de aprendizado de máquina semi-supervisionado Co-training. Novo método para a combinação de rankings. Propomos um novo método, chamado BordaRank, para construir ensembles de rankings baseado no método de votação borda count. BordaRank pode ser aplicado em qualquer problema de ordenação binária no qual vários rankings estejam disponíveis. Resultados experimentais mostram uma melhora no desempenho com relação aos rankings individuais, alem de um desempenho comparável com algoritmos mais sofisticados que utilizam a predição numérica, e não rankings, para a criação de ensembles para o problema de ordenação binária. / Machine learning algorithms are often the most appropriate algorithms for a great variety of data mining applications. However, most machine learning research to date has mainly dealt with the well-circumscribed problem of finding a model (generally a classifier) given a single, small and relatively clean dataset in the attribute-value form, where the attributes have previously been chosen to facilitate learning. Furthermore, the end-goal is simple and well-defined, such as accurate classifiers in the classification problem. Data mining opens up new directions for machine learning research, and lends new urgency to others. With data mining, machine learning is now removing each one of these constraints. Therefore, machine learning's many valuable contributions to data mining are reciprocated by the latter's invigorating effect on it. In this thesis, we explore this interaction by proposing new solutions to some problems due to the application of machine learning algorithms to data mining applications. More specifically, we contribute to the following problems. New approaches to rule learning. In this category, we propose two new methods for rule learning. In the first one, we propose a new method for finding exceptions to general rules. The second one is a rule selection algorithm based on the ROC graph. Rules come from an external larger set of rules and the algorithm performs a selection step based on the current convex hull in the ROC graph. Proportion of examples among classes. We investigated several aspects related to this issue. Firstly, we carried out a series of experiments on artificial data sets in order to verify our hypothesis that overlapping among classes is a complicating factor in highly skewed data sets. We also carried out a broadly experimental analysis with several methods (some of them proposed by us) that artificially balance skewed datasets. Our experiments show that, in general, over-sampling methods perform better than under-sampling methods. Finally, we investigated the relationship between class imbalance and small disjuncts, as well as the influence of the proportion of examples among classes in the process of labelling unlabelled cases in the semi-supervised learning algorithm Co-training. New method for combining rankings. We propose a new method called BordaRanking to construct ensembles of rankings based on borda count voting, which could be applied whenever only the rankings are available. Results show an improvement upon the base-rankings constructed by taking into account the ordering given by classifiers which output continuous-valued scores, as well as a comparable performance with the fusion of such scores. aprendizado de máquina classes desbalanceadas combinação de rankings geração de regras class imbalance ensemble of rankings machine learning rule learning
12	Machine Learning-based Prediction and Characterization of Drug-drug Interactions Yella, Jaswanth January 2018 (has links) No description available. Computer Science Machine Learning Drug-Drug Interactions Similarity-based learning Class Imbalance Pharmacovigilance Supervised Learning
13	A systematic study of the class imbalance problem in convolutional neural networks Buda, Mateusz January 2017 (has links) In this study, we systematically investigate the impact of class imbalance on classification performance of convolutional neural networks and compare frequently used methods to address the issue. Class imbalance refers to significantly different number of examples among classes in a training set. It is a common problem that has been comprehensively studied in classical machine learning, yet very limited systematic research is available in the context of deep learning. We define and parameterize two representative types of imbalance, i.e. step and linear. Using three benchmark datasets of increasing complexity, MNIST, CIFAR-10 and ImageNet, we investigate the effects of imbalance on classification and perform an extensive comparison of several methods to address the issue: oversampling, undersampling, two-phase training, and thresholding that compensates for prior class probabilities. Our main evaluation metric is area under the receiver operating characteristic curve (ROC AUC) adjusted to multi-class tasks since overall accuracy metric is associated with notable difficulties in the context of imbalanced data. Based on results from our experiments we conclude that (i) the effect of class imbalance on classification performance is detrimental and increases with the extent of imbalance and the scale of a task; (ii) the method of addressing class imbalance that emerged as dominant in almost all analyzed scenarios was oversampling; (iii) oversampling should be applied to the level that totally eliminates the imbalance, whereas undersampling can perform better when the imbalance is only removed to some extent; (iv) thresholding should be applied to compensate for prior class probabilities when overall number of properly classified cases is of interest; (v) as opposed to some classical machine learning models, oversampling does not necessarily cause overfitting of convolutional neural networks. / I den här studien undersöker vi systematiskt effekten av klassobalans på prestandan för klassificering hos konvolutionsnätverk och jämför vanliga metoder för att åtgärda problemet. Klassobalans avser betydlig ojämvikt hos antalet exempel per klass i ett träningsset. Det är ett vanligt problem som har studerats utförligt inom maskininlärning, men tillgången av systematisk forskning inom djupinlärning är starkt begränsad. Vi definerar och parametriserar två representiva typer av obalans, steg och linjär. Med hjälpav tre dataset med ökande komplexitet, MNIST, CTFAR-10 och ImageNet, undersöker vi effekterna av obalans på klassificering och utför en omfattande jämförelse av flera metoder för att åtgärda problemen: översampling, undersampling, tvåfasträning och avgränsning för tidigare klass-sannolikheter. Vår huvudsakliga utvärderingsmetod är arean under mottagarens karaktäristiska kurva (ROC AUC) justerat för multi-klass-syften, eftersom den övergripande noggrannheten är förenad med anmärkningsvärda svårigheter i samband med obalanserade data. Baserat på experimentens resultat drar vi slutsatserna att (i) effekten av klassens obalans påklassificeringprestanda är skadlig och ökar med mängden obalans och omfattningen av uppgiften; (ii) metoden att ta itu med klassobalans som framträdde som dominant i nästan samtliga analyserade scenarier var översampling; (iii) översampling bör tillämpas till den nivå som helt eliminerar obalansen, medan undersampling kan prestera bättre när obalansen bara avlägsnas i en viss utsträckning; (iv) avgränsning bör tillämpas för att kompensera för tidigare sannolikheter när det totala antalet korrekt klassificerade fall är av intresse; (v) i motsats till hos vissa klassiska maskininlärningsmodeller orsakar översampling inte nödvändigtvis överanpassning av konvolutionsnätverk. Class Imbalance Convolutional Neural Networks Deep Learning Image Classification Computer Systems Datorsystem
14	Towards Fairness-Aware Online Machine Learning from Imbalanced Data Streams Sadeghi, Farnaz 10 August 2023 (has links) Online supervised learning from fast-evolving imbalanced data streams has applications in many areas. That is, the development of techniques that are able to handle highly skewed class distributions (or 'class imbalance') is an important area of research in domains such as manufacturing, the environment, and health. Solutions should be able to analyze large repositories in near real-time and provide accurate models to describe rare classes that may appear infrequently or in bursts while continuously accommodating new instances. Although numerous online learning methods have been proposed to handle binary class imbalance, solutions suitable for multi-class streams with varying degrees of imbalance in evolving streams have received limited attention. To address this knowledge gap, the first contribution of this thesis introduces the Online Learning from Imbalanced Multi-Class Streams through Dynamic Sampling (DynaQ) algorithm for learning in such multi-class imbalanced settings. Our approach utilizes a queue-based learning method that dynamically creates an instance queue for each class. The number of instances is balanced by maintaining a queue threshold and removing older samples during training. In addition, new and rare classes are dynamically added to the training process as they appear. Our experimental results confirm a noticeable improvement in minority-class detection and classification performance. A comparative evaluation shows that the DynaQ algorithm outperforms the state-of-the-art approaches. Our second contribution in this thesis focuses on fairness-aware learning from imbalanced streams. Our work is motivated by the observation that the decisions made by online learning algorithms may negatively impact individuals or communities. Indeed, the development of approaches to handle these concerns is an active area of research in the machine learning community. However, most existing methods process the data in offline settings and are not directly suitable for online learning from evolving data streams. Further, these techniques fail to take the effects of class imbalance, on fairness-aware supervised learning into account. In addition, recent fairness-aware online learning supervised learning approaches focus on one sensitive attribute only, which may lead to subgroup discrimination. In a fair classification, the equality of fairness metrics across multiple overlapping groups must be considered simultaneously. In our second contribution, we thus address the combined problem of fairness-aware online learning from imbalanced evolving streams, while considering multiple sensitive attributes. To this end, we introduce the Multi-Sensitive Queue-based Online Fair Learning (MQ-OFL) algorithm, an online fairness-aware approach, which maintains valid and fair models over evolving streams. MQ-OFL changes the training distribution in an online fashion based on both stream imbalance and discriminatory behavior of the model evaluated over the historical stream. We compare our MQ-OFL method with state-of-art studies on real-world datasets and present comparative insights on the performance. Our final contribution focuses on explainability and interpretability in fairness-aware online learning. This research is guided by the concerns raised due to the black-box nature of models, concealing internal logic from users. This lack of transparency poses practical and ethical challenges, particularly when these algorithms make decisions in finance, healthcare, and marketing domains. These systems may introduce biases and prejudices during the learning phase by utilizing complex machine learning algorithms and sensitive data. Consequently, decision models trained on such data may make unfair decisions and it is important to realize such issues before deploying the models. To address this issue, we introduce techniques for interpreting the outcomes of fairness-aware online learning. Through a case study predicting income based on features such as ethnicity, biological sex, age, and education level, we demonstrate how our fairness-aware learning process (MQ-OFL) maintains a balance between accuracy and discrimination trade-off using global and local surrogate models. Online Machine Learning Multi-class Imbalance Data stream Concept Drift Fairness-aware Classification Multi-sensitive Attribute Explainable Fairness Model Interpretability
15	Neural Networks for Predictive Maintenance on Highly Imbalanced Industrial Data Montilla Tabares, Oscar January 2023 (has links) Preventive maintenance plays a vital role in optimizing industrial operations. However, detecting equipment needing such maintenance using available data can be particularly challenging due to the class imbalance prevalent in real-world applications. The datasets gathered from equipment sensors primarily consist of records from well-functioning machines, making it difficult to identify those on the brink of failure, which is the main focus of preventive maintenance efforts. In this study, we employ neural network algorithms to address class imbalance and cost sensitivity issues in industrial scenarios for preventive maintenance. Our investigation centers on the "APS Failure in the Scania Trucks Data Set," a binary classification problem exhibiting significant class imbalance and cost sensitivity issues—a common occurrence across various fields. Inspired by image detection techniques, we introduce a novel loss function called Focal loss to traditional neural networks, combined with techniques like Cost-Sensitive Learning and Threshold Calculation to enhance classification accuracy. Our study's novelty is adapting image detection techniques to tackle the class imbalance problem within a binary classification task. Our proposed method demonstrates improvements in addressing the given optimization problem when confronted with these issues, matching or surpassing existing machine learning and deep learning techniques while maintaining computational efficiency. Our results indicate that class imbalance can be addressed without relying on conventional sampling techniques, which typically come at the cost of increased computational cost (oversampling) or loss of critical information (undersampling). In conclusion, our proposed method presents a promising approach for addressing class imbalance and cost sensitivity issues in industrial datasets heavily affected by these phenomena. It contributes to developing preventive maintenance solutions capable of enhancing the efficiency and productivity of industrial operations by detecting machines in need of attention: this discovery process we term predictive maintenance. The artifact produced in this study showcases the utilization of Focal Loss, Cost-Sensitive Learning, and Threshold Calculation to create reliable and effective predictive maintenance solutions for real-world applications. This thesis establishes a method that contributes to the body of knowledge in binary classification within machine learning, specifically addressing the challenges mentioned above. Our research findings have broader implications beyond industrial classification tasks, extending to other fields, such as medical or cybersecurity classification problems. The artifact (code) is at: https://shorturl.at/lsNSY Class Imbalance Cost Sensitivity Cost-Sensitive Learning Focal Loss Binary Classification Machine Learning Deep Learning Computer Sciences Datavetenskap (datalogi)
16	<b>GOING FOR IT ALL: IDENTIFICATION OF ENVIRONMENTAL RISK FACTORS AND PREDICTION OF GESTATIONAL DIABETES MELLITUS USING MULTI-LEVEL LOGISTIC REGRESSION IN THE PRESENCE OF CLASS IMBALANCE</b> Carolina Gonzalez Canas (17593284) 11 December 2023 (has links) <p dir="ltr">Gestational Diabetes Mellitus (GDM) is defined as glucose intolerance with first onset during pregnancy in women without previous history of diabetes. The global prevalence of GDM oscillates between 2% and 17%, varying across countries and ethnicities. In the United States (U.S.), every year up to 13% of pregnancies are affected by this disease. Several risk factors for GDM are well established, such as race, age and BMI, while additional factors have been proposed that could affect the risk of developing the disease; some of them are modifiable, such as diet, while others are not, such as environmental factors.</p><p dir="ltr">Taking effective preventive actions against GDM require the early identification of women at highest risk. A crucial task to this end is the establishment of factors that increase the probabilities of developing the disease. These factors are both individual characteristics and choices and likely include environmental conditions.</p><p dir="ltr">The first part of the dissertation focuses on examining the relationship between food insecurity and GDM by using the National Health and Nutrition Examination Survey (NHANES), which has a representative sample of the U.S. population. The aim of this analysis is to determine a national estimate of the impact of food environment on the likelihood of developing GDM stratified by race and ethnicity. A survey weighted logistic regression model is used to assess these relationships which are described using odds ratios.</p><p dir="ltr">The goal of the second part of this research is to determine whether a woman’s risk of developing GDM is affected by her environment, also referred to in this work as level 2 variables. For that purpose, Medicaid claims information from Indiana was analyzed using a multilevel logistic regression model with sample balancing to improve the class imbalance ratio.</p><p dir="ltr">Finally, for the third part of this dissertation, a simulation study was performed to examine the impact of balancing on the prediction quality and inference of model parameters when using multilevel logistic regression models. Data structure and generating model for the data were informed by the findings from the second project using the Medicaid data. This is particularly relevant for medical data that contains measurements at the individual level combined with other data sources measured at the regional level and both prediction and model interpretation are of interest.</p> Industrial engineering Multilevel Modeling Gestational Diabetes Class-imbalance NHANES database Logistic multilevel modeling undersampling strategies oversampling strategies
17	Benevolent and Malevolent Adversaries: A Study of GANs and Face Verification Systems Nazari, Ehsan 22 November 2023 (has links) Cybersecurity is rapidly evolving, necessitating inventive solutions for emerging challenges. Deep Learning (DL), having demonstrated remarkable capabilities across various domains, has found a significant role within Cybersecurity. This thesis focuses on benevolent and malevolent adversaries. For the benevolent adversaries, we analyze specific applications of DL in Cybersecurity contributing to the enhancement of DL for downstream tasks. Regarding the malevolent adversaries, we explore the question of how resistant to (Cyber) attacks is DL and show vulnerabilities of specific DL-based systems. We begin by focusing on the benevolent adversaries by studying the use of a generative model called Generative Adversarial Networks (GAN) to improve the abilities of DL. In particular, we look at the use of Conditional Generative Adversarial Networks (CGAN) to generate synthetic data and address issues with imbalanced datasets in cybersecurity applications. Imbalanced classes can be a significant issue in this field and can lead to serious problems. We find that CGANs can effectively address this issue, especially in more difficult scenarios. Then, we turn our attention to using CGAN with tabular cybersecurity problems. However, visually assessing the results of a CGAN is not possible when we are dealing with tabular cybersecurity data. To address this issue, we introduce AutoGAN, a method that can train a GAN on both image-based and tabular data, reducing the need for human inspection during GAN training. This opens up new opportunities for using GANs with tabular datasets, including those in cybersecurity that are not image-based. Our experiments show that AutoGAN can achieve comparable or even better results than other methods. Finally, we shift our focus to the malevolent adversaries by looking at the robustness of DL models in the context of automatic face recognition. We know from previous research that DL models can be tricked into making incorrect classifications by adding small, almost unnoticeable changes to an image. These deceptive manipulations are known as adversarial attacks. We aim to expose new vulnerabilities in DL-based Face Verification (FV) systems. We introduce a novel attack method on FV systems, called the DodgePersonation Attack, and a system for categorizing these attacks based on their specific targets. We also propose a new algorithm that significantly improves upon a previous method for making such attacks, increasing the success rate by more than 13%. Cybersecurity Machine Learning Computer Vision and Pattern Recognition Generative Adversarial Networks Face Verification systems Class Imbalance Problem Adversarial Attacks
18	Prediction of Large for Gestational Age Infants in Ethnically Diverse Datasets Using Machine Learning Techniques. Development of 3rd Trimester Machine Learning Prediction Models and Identification of Important Features Using Dimensionality Reduction Techniques Sabouni, Sumaia January 2023 (has links) University of Bradford through the International Development Fund / The full text will be available at the end of the embargo: 21st June 2025 Large for gestational age Gestational diabetes Macrosomia Obesity Machine learning Prediction Class imbalance Algorithms Ethnically diverse datasets
19	Σχεδιασμός και υλοποίηση πολυκριτηριακής υβριδικής μεθόδου ταξινόμησης βιολογικών δεδομένων με χρήση εξελικτικών αλγορίθμων και νευρωνικών δικτύων Σκρεπετός, Δημήτριος 09 October 2014 (has links) Δύσκολα προβλήματα ταξινόμησης από τον χώρο της Βιοπληροφορικής όπως η πρόβλεψη των microRNA γονιδιών και η πρόβλεψη των πρωτεϊνικών αλληλεπιδράσεων (Protein- Protein Interactions) απαιτούν ισχυρούς ταξινομητές οι οποίοι θα πρέπει να έχουν καλή ακρίβεια ταξινόμησης, να χειρίζονται ελλιπείς τιμές, να είναι ερμηνεύσιμοι, και να μην πάσχουν από το πρόβλημα ανισορροπίας κλάσεων. Ένας ευρέως χρησιμοποιούμενος ταξινομητής είναι τα νευρωνικά δίκτυα, τα οποία ωστόσο χρειάζονται προσδιορισμό της αρχιτεκτονικής τους και των λοιπών παραμέτρων τους, ενώ και οι αλγόριθμοι εκμάθησής τους συνήθως συγκλίνουν σε τοπικά ελάχιστα. Για τους λόγους αυτούς, προτείνεται μία πολυκριτηριακή εξελικτική μέθοδος η οποία βασίζεται στους εξελικτικούς αλγορίθμους ώστε να βελτιστοποιήσει πολλά από τα προαναφερθέντα κριτήρια απόδοσης των νευρωνικών δικτύων, να βρει επίσης την βέλτιση αρχιτεκτονική καθώς και ένα ολικό ελάχιστο για τα συναπτικά τους βάρη. Στην συνέχεια, από τον πληθυσμό που προκύπτει χρησιμοποιούμε το σύνολό του ώστε να επιτύχουμε την ταξινόμηση. / Hard classification problems of the area of Bioinformatics, like microRNA prediction and PPI prediction, demand powerful classifiers which must have good prediction accuracy, handle missing values, be interpretable, and not suffer from the class imbalance problem. One wide used classifier is neural networks, which need definition of their architecture and their other parameters, while their training algorithms usually converge to local minima. For those reasons, we suggest a multi-objective evolutionary method, which is based to evolutionary algorithms in order to optimise many of the aforementioned criteria of the performance of a neural network, and also find the optimised architecture and a global minimum for its weights. Then, from the ensuing population, we use it as an ensemble classifier in order to perform the classification. Νευρωνικά δίκτυα Ταξινόμηση Ελλιπείς τιμές 570.285 Evolutionary algorithms Neural networks Classification microRNA Protein-protein Interactioni Missing values Class imbalance problem
20	A Comparative Review of SMOTE and ADASYN in Imbalanced Data Classification Brandt, Jakob, Lanzén, Emil January 2021 (has links) In this thesis, the performance of two over-sampling techniques, SMOTE and ADASYN, is compared. The comparison is done on three imbalanced data sets using three different classification models and evaluation metrics, while varying the way the data is pre-processed. The results show that both SMOTE and ADASYN improve the performance of the classifiers in most cases. It is also found that SVM in conjunction with SMOTE performs better than with ADASYN as the degree of class imbalance increases. Furthermore, both SMOTE and ADASYN increase the relative performance of the Random forest as the degree of class imbalance grows. However, no pre-processing method consistently outperforms the other in its contribution to better performance as the degree of class imbalance varies. Machine learning supervised learning classification class imbalance over-sampling SMOTE ADASYN Sensitivity F-measure Matthews correlation coefficient Probability Theory and Statistics Sannolikhetsteori och statistik

Search results