1. A Comparative Study of Ensemble Active Learning. Alabdulrahman, Rabaa. January 2014.
Data Stream mining is an important emerging topic in the data mining and machine learning domain. In a Data Stream setting, the data arrive continuously and often at a fast pace. Examples include credit card transaction records, surveillance video streams, network event logs, and telecommunication records. Such data bring new challenges to the data mining research community. Specifically, a number of researchers have developed techniques to build accurate classification models against such Data Streams. Ensemble Learning, where a number of so-called base classifiers are combined to build a model, has shown some promise. However, a number of challenges remain. Often, the class labels of the arriving data are incorrect or missing. Furthermore, Data Stream algorithms may benefit from an online learning paradigm, where a small amount of newly arriving data is used to learn incrementally. To this end, the use of Active Learning, where the user is in the loop, has been proposed as a way to extend Ensemble Learning. Here, the hypothesis is that Active Learning would improve performance in terms of accuracy, ensemble size, and the time it takes to build the model.
This thesis tests the validity of this hypothesis. Namely, we explore whether augmenting Ensemble Learning with an Active Learning component benefits the Data Stream Learning process. Our analysis indicates that this hypothesis does not necessarily hold for the datasets under consideration. That is, the accuracies of Active Ensemble Learning are not statistically significantly higher than those of normal Ensemble Learning; Active Learning may even increase the error rate. Further, Active Ensemble Learning actually increases the time taken to build the model. However, our results indicate that Active Ensemble Learning builds accurate models with much smaller ensemble sizes than the traditional Ensemble Learning algorithms. Further, the models we build are constructed from small, incrementally growing training sets, which may be very beneficial in a real-time Data Stream setting.
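To make the active-learning component concrete, the following is a minimal sketch of a stream-based uncertainty-sampling loop with a label budget, assuming a scikit-learn-style incremental classifier. The data, margin threshold, and budget are illustrative assumptions, and a full Active Ensemble Learning system would maintain several such base classifiers rather than one.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))          # simulated stream of instances
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hidden true labels

clf = SGDClassifier(loss="log_loss", random_state=0)
clf.partial_fit(X[:20], y[:20], classes=[0, 1])  # small labelled seed set

budget, queried = 100, 0
for i in range(20, len(X)):
    x = X[i:i + 1]
    proba = clf.predict_proba(x)[0]
    margin = abs(proba[0] - proba[1])     # small margin = uncertain prediction
    if margin < 0.2 and queried < budget:
        queried += 1                      # "ask the user" for the true label
        clf.partial_fit(x, y[i:i + 1])    # incremental update with the new label

print(f"queried {queried} labels; accuracy: {clf.score(X, y):.3f}")
```

Querying only low-margin instances keeps the labelling cost bounded while still updating the model incrementally as the stream evolves.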
2. Distributed boosting algorithms. Thompson, Simon Giles. January 1999.
No description available.
3. A Novel Ensemble Machine Learning for Robust Microarray Data Classification. Peng, Yonghong. January 2006.
Microarray data analysis and classification has been convincingly shown to provide an effective methodology for the diagnosis of diseases and cancers. Although much research has been performed on applying machine learning techniques to microarray data classification in recent years, it has been shown that conventional machine learning techniques have intrinsic drawbacks in achieving accurate and robust classifications. This paper presents a novel ensemble machine learning approach for the development of robust microarray data classification. Unlike conventional ensemble learning techniques, the presented approach first generates a pool of candidate base classifiers via gene sub-sampling and then selects a subset of appropriate base classifiers, based on classifier clustering, to construct the classification committee. Experimental results have demonstrated that the classifiers constructed by the proposed method outperform not only the classifiers generated by conventional machine learning but also those generated by two widely used ensemble learning methods (bagging and boosting).
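The two-stage idea (a gene-sub-sampled pool, then clustering-based selection) can be sketched as follows. This is a toy illustration under assumed choices, not the paper's exact algorithm: decision trees as base classifiers, k-means over validation-set prediction vectors, and validation accuracy as the within-cluster selection criterion are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 500))                 # toy "microarray": 120 samples, 500 genes
y = (X[:, :5].sum(axis=1) > 0).astype(int)      # labels driven by a few genes
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

# Step 1: generate a pool of candidate base classifiers, each trained
# on a random subset of genes.
pool, subsets = [], []
for _ in range(30):
    genes = rng.choice(X.shape[1], size=50, replace=False)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:, genes], y_tr)
    pool.append(clf)
    subsets.append(genes)

# Step 2: cluster classifiers by their validation predictions, then keep the
# most accurate member of each cluster to form the classification committee.
preds = np.array([c.predict(X_val[:, g]) for c, g in zip(pool, subsets)])
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(preds)
committee = []
for k in range(5):
    idx = np.where(labels == k)[0]
    best = idx[np.argmax([(preds[i] == y_val).mean() for i in idx])]
    committee.append(best)

# Majority vote of the committee.
vote = (preds[committee].mean(axis=0) > 0.5).astype(int)
print("committee accuracy:", (vote == y_val).mean())
```

Selecting one representative per cluster discards near-duplicate classifiers, which is the intuition behind building the committee from clusters rather than from the whole pool.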
4. Weakly Selective Training induces Specialization within Populations of Sensory Neurons. Hillmann, Julia. 11 January 2016.
No description available.
5. Inferring Gene Regulatory Networks from Expression Data using Ensemble Methods. Slawek, Janusz. 01 May 2014.
High-throughput technologies for measuring gene expression have made inferring genome-wide Gene Regulatory Networks an active field of research. Reverse-engineering of systems of transcriptional regulations has become an important challenge in molecular and computational biology. Because such systems model dependencies between genes, they are important in understanding cell behavior, and can potentially turn observed expression data into new biological knowledge and practical applications. In this dissertation we introduce a set of algorithms, which infer networks of transcriptional regulations from a variety of expression profiles with superior accuracy compared to the state-of-the-art techniques. The proposed methods make use of ensembles of trees, which have become popular in many scientific fields, including genetics and bioinformatics, although they were originally motivated from the perspective of classification, regression, and feature selection theory. In this study we exploit their relative variable importance measure as an indication of the presence or absence of a regulatory interaction between genes. We further analyze their predictions on a set of universally recognized benchmark expression data sets, and achieve favorable results in comparison with the state-of-the-art algorithms.
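The general recipe of using tree-ensemble variable importances as edge scores, popularized by methods such as GENIE3, can be sketched as below. The dissertation's own algorithms differ in detail, so treat this as an assumed baseline rather than its method; the toy expression matrix and forest settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n_samples, n_genes = 200, 8
expr = rng.normal(size=(n_samples, n_genes))      # toy expression matrix
expr[:, 3] = 0.9 * expr[:, 0] - 0.7 * expr[:, 1]  # gene 3 regulated by genes 0 and 1

scores = np.zeros((n_genes, n_genes))             # scores[i, j]: evidence that i regulates j
for target in range(n_genes):
    regulators = [g for g in range(n_genes) if g != target]
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(expr[:, regulators], expr[:, target])
    # Relative variable importance of each candidate regulator for this target.
    scores[regulators, target] = rf.feature_importances_

# Rank putative regulatory edges by importance score.
edges = sorted(((scores[i, j], i, j) for i in range(n_genes)
                for j in range(n_genes) if i != j), reverse=True)
print("top edges (score, regulator, target):", edges[:3])
```

Each target gene defines one regression problem, and the importance a regulator earns across trees serves as the confidence that a directed regulatory edge exists.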
6. Semi-Supervised Hybrid Windowing Ensembles for Learning from Evolving Streams. Floyd, Sean Louis Alan. 03 June 2019.
In this thesis, learning refers to the intelligent computational extraction of knowledge from data. Supervised learning tasks require data to be annotated with labels, whereas for unsupervised learning, data are not labelled. Semi-supervised learning deals with data sets that are partially labelled. A major issue with supervised and semi-supervised learning of data streams is late-arriving or missing class labels. Assuming that correctly labelled data will always be available and timely is often infeasible, and, as such, supervised methods are not directly applicable in the real world. Therefore, real-world problems usually require the use of semi-supervised or unsupervised learning techniques. For instance, when considering a spam detection task, it is not reasonable to assume that all spam will be identified (correctly labelled) prior to learning. Additionally, in semi-supervised learning, "the instances having the highest [predictive] confidence are not necessarily the most useful ones" [41]. We investigate how self-training performs without its selective heuristic in a streaming setting.
This leads us to our contributions. We extend an existing concept drift detector to operate without any labelled data, by using a sliding window of our ensemble's prediction confidence, instead of a boolean indicating whether the ensemble's predictions are correct. We also extend selective self-training, a semi-supervised learning method, by using all predictions, and not only those with high predictive confidence. Finally, we introduce a novel windowing type for ensembles, as sliding windows are very time consuming and regular tumbling windows are not a suitable replacement. Our windowing technique can be considered a hybrid of the two: we train each sub-classifier in the ensemble with tumbling windows, but delay training in such a way that only one sub-classifier can update its model per iteration.
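A minimal sketch of this hybrid windowing scheme is given below, under assumed choices: Gaussian naive Bayes members, incremental partial_fit updates, and a fixed window size are all illustrative, and whether an update extends or replaces a member's model is a design choice the thesis explores in more depth.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 5))
y = (X[:, 0] > 0).astype(int)

n_members, window = 4, 200
ensemble = [GaussianNB() for _ in range(n_members)]
for m in ensemble:                      # warm-start every member on the first window
    m.partial_fit(X[:window], y[:window], classes=[0, 1])

turn = 0
for start in range(window, len(X) - window, window):
    Xw, yw = X[start:start + window], y[start:start + window]
    # Hybrid windowing: each tumbling window updates exactly one member,
    # so members stay staggered across time rather than all updating at once.
    ensemble[turn].partial_fit(Xw, yw)
    turn = (turn + 1) % n_members

# Ensemble prediction: average of the members' class probabilities.
proba = np.mean([m.predict_proba(X[-window:]) for m in ensemble], axis=0)
acc = (proba.argmax(axis=1) == y[-window:]).mean()
print(f"accuracy on final window: {acc:.3f}")
```

Because only one member trains per window, the per-iteration cost matches a tumbling window, while the staggered members collectively retain a memory comparable to a sliding window.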
We found, through statistical significance tests, that our framework is (roughly 160 times) faster than current state-of-the-art techniques, and achieves comparable predictive accuracy. That being said, more research is needed to further reduce the quantity of labelled data used for training, while also increasing the framework's predictive accuracy.
7. Penalised regression for high-dimensional data: an empirical investigation and improvements via ensemble learning. Wang, Fan. January 2019.
In a wide range of applications, datasets are generated for which the number of variables p exceeds the sample size n. Penalised likelihood methods are widely used to tackle regression problems in these high-dimensional settings. In this thesis, we carry out an extensive empirical comparison of the performance of popular penalised regression methods in high-dimensional settings and propose new methodology that uses ensemble learning to enhance the performance of these methods.

The relative efficacy of different penalised regression methods in finite-sample settings remains incompletely understood. Through a large-scale simulation study, consisting of more than 1,800 data-generating scenarios, we systematically consider the influence of various factors (for example, sample size and sparsity) on method performance. We focus on three related goals (prediction, variable selection and variable ranking) and consider six widely used methods. The results are supported by a semi-synthetic data example. Our empirical results complement existing theory and provide a resource to compare performance across a range of settings and metrics.

We then propose a new ensemble learning approach for improving the performance of penalised regression methods, called STructural RANDomised Selection (STRANDS). The approach, which builds and improves upon the Random Lasso method, consists of two steps. In both steps, we reduce dimensionality by repeated subsampling of variables. We apply a penalised regression method to each subsampled dataset and average the results. In the first step, subsampling is informed by variable correlation structure, and in the second step, by variable importance measures from the first step. STRANDS can be used with any sparse penalised regression approach as the "base learner". In simulations, we show that STRANDS typically improves upon its base learner, and demonstrate that taking account of the correlation structure in the first step can help to improve the efficiency with which the model space may be explored.

We propose another ensemble learning method to improve the prediction performance of Ridge Regression in sparse settings. Specifically, we combine Bayesian Ridge Regression with a probabilistic forward selection procedure, where inclusion of a variable at each stage is probabilistically determined by a Bayes factor. We compare the prediction performance of the proposed method to penalised regression methods using simulated data.
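The core subsample-fit-average loop that STRANDS builds on (inherited from Random Lasso) can be sketched as follows. This one-step version omits STRANDS's correlation-informed and importance-informed subsampling, and the Lasso penalty level, subset size, and number of subsamples are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 100, 500                                  # high-dimensional setting: p >> n
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                                   # sparse true signal
y = X @ beta + rng.normal(size=n)

B, q = 100, 50                                   # number of subsamples, variables per subsample
coef_sum = np.zeros(p)
counts = np.zeros(p)
for _ in range(B):
    vars_ = rng.choice(p, size=q, replace=False)  # random subset of variables
    fit = Lasso(alpha=0.1).fit(X[:, vars_], y)    # base learner on the reduced problem
    coef_sum[vars_] += fit.coef_
    counts[vars_] += 1

# Average each coefficient over the subsamples in which its variable appeared.
avg_coef = np.divide(coef_sum, counts, out=np.zeros(p), where=counts > 0)
print("top variables by |averaged coefficient|:", np.argsort(-np.abs(avg_coef))[:5])
```

Subsampling variables makes each fit a low-dimensional problem the base learner handles well, and averaging stabilises the resulting coefficient estimates and rankings.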
8. Ensemble learning methods for scoring models development. Nožička, Michal. January 2018.
Credit scoring is a very important process in the banking industry, during which each potential or current client is assigned a credit score that expresses the client's probability of default, i.e. failing to meet his or her obligations on time or in full amount. This is a cornerstone of credit risk management in the banking industry. Traditionally, statistical models (such as the logistic regression model) are used for credit scoring in practice. Despite the many advantages of such an approach, recent research shows many alternatives that are in some ways superior to those traditional models. This master thesis focuses on introducing ensemble learning models (in particular, those constructed by using bagging, boosting and stacking algorithms) with various base models (in particular logistic regression, random forest, support vector machines and artificial neural networks) as possible alternatives and challengers to the traditional statistical models used for credit scoring, and compares their advantages and disadvantages. The accuracy and predictive power of those scoring models is examined using standard measures of accuracy and predictive power in the credit scoring field (in particular the GINI coefficient and the LIFT coefficient) on a real-world dataset, and the obtained results are presented. The main result of this comparative study is that...
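The two evaluation measures named above have standard computations: GINI = 2*AUC - 1, and LIFT at a decile is the bad rate among the riskiest decile divided by the overall bad rate. A sketch on synthetic data is below, with two of the mentioned model families standing in for the thesis's scoring models; the dataset and model settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy portfolio: roughly 10% of clients default (class 1).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    score = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]  # predicted default probability
    gini = 2 * roc_auc_score(y_te, score) - 1                # GINI = 2*AUC - 1
    top = np.argsort(-score)[: len(score) // 10]             # riskiest decile
    lift = y_te[top].mean() / y_te.mean()                    # LIFT at the top decile
    print(f"{name}: GINI={gini:.3f}, LIFT@10%={lift:.2f}")
```

Both measures depend only on the ranking of clients by score, which is why they are the standard yardsticks for comparing heterogeneous scoring models.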
9. A probabilistic perspective on ensemble diversity. Zanda, Manuela. January 2010.
We study diversity in classifier ensembles from a broader perspective than the 0/1 loss function, the main reason being that the bias-variance decomposition of the 0/1 loss function is not unique, and therefore the relationship between ensemble accuracy and diversity is still unclear. In the parallel field of regression ensembles, where the loss function of interest is the mean squared error, this decomposition not only exists, but it has been shown that diversity can be managed via the Negative Correlation (NC) framework. In the field of probabilistic modelling, the expected value of the negative log-likelihood loss function is given by its conditional entropy; this result suggests that interaction information might provide some insight into the trade-off between accuracy and diversity. Our objective is to improve our understanding of classifier diversity by focusing on two different loss functions: the mean squared error and the negative log-likelihood.

In a study of mean squared error functions, we reformulate the Tumer & Ghosh model for the classification error as a regression problem, and we show how the NC learning framework can be deployed to manage diversity in classification problems. In an empirical study of classifiers that minimise the negative log-likelihood loss function, we discuss model diversity, as opposed to error diversity, in ensembles of Naive Bayes classifiers. We observe that diversity in low-variance classifiers has to be structurally inferred.

We apply interaction information to the problem of monitoring diversity in classifier ensembles. We present empirical evidence that interaction information can capture the trade-off between accuracy and diversity, and that diversity occurs at different levels of interactions between base classifiers. We use interaction information properties to build ensembles of structurally diverse averaged Augmented Naive Bayes classifiers. Our empirical study shows that this novel ensemble approach is computationally more efficient than an accuracy-based approach, and at the same time it does not negatively affect the ensemble classification performance.
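For readers unfamiliar with interaction information, a small sketch of how it can be estimated from classifier outputs is given below. The simulated classifiers and McGill's sign convention, under which redundancy between members shows up as more strongly negative values, are assumptions for illustration, not the thesis's experimental setup.

```python
import numpy as np

def entropy(*cols):
    """Empirical joint entropy (in bits) of one or more discrete columns."""
    joint = np.stack(cols, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def interaction_information(a, b, y):
    """McGill's interaction information I(A;B;Y) = I(A;B|Y) - I(A;B)."""
    return (-(entropy(a) + entropy(b) + entropy(y))
            + entropy(a, b) + entropy(a, y) + entropy(b, y)
            - entropy(a, b, y))

rng = np.random.default_rng(5)
y = rng.integers(0, 2, size=5000)               # true labels
a = np.where(rng.random(5000) < 0.8, y, 1 - y)  # base classifier A, 80% accurate
b = np.where(rng.random(5000) < 0.8, y, 1 - y)  # classifier B, independent errors
b_clone = a.copy()                              # a redundant copy of A

print("independent members:", interaction_information(a, b, y))
print("redundant members:  ", interaction_information(a, b_clone, y))
```

Running this, the pair with independent errors scores closer to zero than the redundant pair, illustrating how the quantity can monitor diversity between base classifiers with respect to the label.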
10. Systematic ensemble learning and extensions for regression. Aldave, Roberto. January 2015.
The objective is to provide methods to improve the performance, or prediction accuracy, of the standard stacking approach, which is an ensemble method composed of simple, heterogeneous base models, through the integration of the diversity generation, combination and/or selection stages for regression problems. In Chapter 1, we propose to combine a set of level-1 learners into a level-2 learner, or ensemble. We also propose to inject a diversity generation mechanism into the initial cross-validation partition, from which new cross-validation partitions are generated and subsequent ensembles are trained. Then, we propose an algorithm to select the best partition, or the corresponding ensemble. In Chapter 2, we formulate the partition selection as a Pareto-based multi-criteria optimization problem, and give an algorithm that makes the partition selection iterative with the aim of further improving the ensemble prediction accuracy. In Chapter 3, we propose to generate multiple populations, or partitions, by injecting a diversity mechanism into the original dataset. Then, an algorithm is proposed to select the best partition among all partitions generated by the multiple populations. All methods designed and implemented in this thesis achieve encouraging and favorable results across different datasets against both state-of-the-art models and ensembles for regression.
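A much-simplified sketch of the Chapter 1 idea (vary the cross-validation partition used to train a stacked ensemble, then keep the best one) is below. The single-criterion selection shown here stands in for the thesis's Pareto-based multi-criteria and iterative procedures, and the base learners and seed range are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
base = [("ridge", Ridge()), ("tree", DecisionTreeRegressor(random_state=0)),
        ("knn", KNeighborsRegressor())]           # simple, heterogeneous level-1 learners

best_score, best_seed = -np.inf, None
for seed in range(10):                            # diversity: vary the CV partition itself
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    stack = StackingRegressor(estimators=base, final_estimator=Ridge(), cv=cv)
    score = cross_val_score(stack, X, y, cv=5, scoring="r2").mean()
    if score > best_score:
        best_score, best_seed = score, seed       # keep the best-performing partition

print(f"selected partition seed={best_seed}, CV R^2={best_score:.3f}")
```

Each partition yields a different set of out-of-fold level-1 predictions for the level-2 learner, so searching over partitions is itself a source of ensemble diversity.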