1

An examination of indexes for determining the number of clusters in binary data sets

Weingessel, Andreas, Dimitriadou, Evgenia, Dolnicar, Sara January 1999 (has links) (PDF)
An examination of 14 indexes for determining the number of clusters is conducted on artificial binary data sets generated according to various design factors. To provide a variety of clustering solutions, the data sets are analyzed by different non-hierarchical clustering methods. The purpose of the paper is to present the performance and ability of each index to detect the proper number of clusters in a binary data set under various conditions and difficulty levels. (author's abstract) / Series: Working Papers SFB "Adaptive Information Systems and Modelling in Economics and Management Science"
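
A minimal sketch of the evaluation loop such an index study implies: cluster artificial binary data for a range of candidate k and report the k favoured by an internal validity index. The silhouette index and scikit-learn are assumptions for illustration; they stand in for the paper's 14 indexes and clustering methods.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Plant 3 binary prototypes and flip 10% of bits as noise.
prototypes = rng.integers(0, 2, size=(3, 12))
labels_true = rng.integers(0, 3, size=300)
X = (prototypes[labels_true] ^ (rng.random((300, 12)) < 0.1)).astype(float)

# Score each candidate number of clusters with an internal index.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"index suggests k = {best_k}")  # ideally recovers the 3 planted clusters
```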
2

Odhadování přesnosti klasifikačních metod na základě vlastností dat / Estimating performance of classifiers from dataset properties

Todt, Michal January 2018 (has links)
The following thesis explores the impact of a dataset's distributional properties on classification performance. We use Gaussian copulas to generate 1,000 artificial datasets and train classifiers on them: Generalized Linear Models, Distributed Random Forest, Extremely Randomized Trees and Gradient Boosting Machines, via the H2O.ai machine learning platform accessed from R. Classification performance on these datasets is evaluated and empirical observations on the influence of the data properties are presented. Secondly, we use the real Australian credit dataset and predict which classifier is likely to work best on it. The predicted performance of each method is based on penalizing the differences between the Australian dataset and the artificial datasets on which the method performed comparatively better; this approach, however, failed to predict the best classifier correctly.
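
As a hedged illustration of the copula step described above (the thesis itself works through H2O.ai from R; the marginal distributions and correlation matrix below are invented), a Gaussian copula couples arbitrary marginals with a chosen dependence structure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1000

# Target correlation structure between three features.
corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])

z = rng.multivariate_normal(mean=np.zeros(3), cov=corr, size=n)  # correlated normals
u = stats.norm.cdf(z)                                            # uniforms with the same dependence

# Arbitrary marginals: exponential, log-normal, and a thresholded binary column.
x0 = stats.expon.ppf(u[:, 0], scale=2.0)
x1 = stats.lognorm.ppf(u[:, 1], s=0.5)
y = (u[:, 2] > 0.7).astype(int)          # 30% positive class, usable as a label

X = np.column_stack([x0, x1])
print(X.shape, y.mean())
```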
3

Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics

Ding, Zejin 07 May 2011 (has links)
In this dissertation, the problem of learning from highly imbalanced data is studied. Learning from imbalanced data is of great importance and poses a challenge in many real applications. Dealing with a minority class normally requires new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation. We propose a new ensemble learning framework—Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reverse data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalanced learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle—active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting that the DECIDL framework is very robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem—protein methylation prediction. Extensive computational results show that the DECIDL method performs very well on this imbalanced data mining task. Importantly, the experimental results confirm our new contributions on this particular data learning problem.
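
For context, a minimal sketch of under-bagging, one of the baseline ensemble methods the dissertation compares against; this is not DECIDL itself, and the decision-tree base learner is an assumption. Each committee member trains on all minority examples plus an equally sized random sample of the majority class, and the committee votes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def under_bagging_fit(X, y, n_estimators=15, seed=0):
    """Train an under-bagging committee; assumes y == 1 marks the minority class."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        # Balance each bootstrap by undersampling the majority class.
        sampled = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sampled])
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def under_bagging_predict(models, X):
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)  # majority vote of the committee
```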
4

Data measures that characterise classification problems

Van der Walt, Christiaan Maarten 29 August 2008 (has links)
We have a wide range of classifiers today that are employed in numerous applications, from credit scoring to speech processing, with great technical and commercial success. No classifier, however, exists that will outperform all other classifiers on all classification tasks, and the process of classifier selection is still mainly one of trial and error. The optimal classifier for a classification task is determined by the characteristics of the data set employed; understanding the relationship between data characteristics and the performance of classifiers is therefore crucial to the process of classifier selection. Empirical and theoretical approaches have been employed in the literature to define this relationship. None of these approaches have, however, been very successful in accurately predicting or explaining classifier performance on real-world data. We use theoretical properties of classifiers to identify data characteristics that influence classifier performance; these data properties guide us in the development of measures that describe the relationship between data characteristics and classifier performance. We employ these data measures on real-world and artificial data to construct a meta-classification system. The purpose of this meta-classifier is two-fold: (1) to predict the classification performance of real-world classification tasks, and (2) to explain these predictions in order to gain insight into the properties of real-world data. We show that these data measures can be employed successfully to predict the classification performance of real-world data sets; these predictions are accurate in some instances, but there is still unpredictable behaviour in others. We illustrate that these data measures can give valuable insight into the properties and data structures of real-world data; these insights are extremely valuable for high-dimensional classification problems. / Dissertation (MEng)--University of Pretoria, 2008. / Electrical, Electronic and Computer Engineering / unrestricted
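
A hedged sketch of what a set of data measures might look like in code; the measure names and definitions below are invented for illustration and are simpler than the theoretically derived measures the dissertation develops:

```python
import numpy as np

def data_measures(X, y):
    """Compute a few illustrative dataset characteristics for a meta-classifier."""
    n, d = X.shape
    classes, counts = np.unique(y, return_counts=True)
    corr = np.corrcoef(X, rowvar=False)
    off_diag = corr[~np.eye(d, dtype=bool)]
    return {
        "samples_per_dimension": n / d,                 # sparsity of the sample in feature space
        "class_imbalance": counts.min() / counts.max(), # 1.0 means perfectly balanced
        "mean_abs_feature_correlation": float(np.mean(np.abs(off_diag))),
        "n_classes": len(classes),
    }
```

In a meta-classification setting, such a vector of measures would become the feature vector for one "meta-example" per dataset, with observed classifier performance as its target.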
5

Maximum-likelihood kernel density estimation in high-dimensional feature spaces / C.M. van der Walt

Van der Walt, Christiaan Maarten January 2014 (has links)
With the advent of the internet and advances in computing power, the collection of very large high-dimensional datasets has become feasible; understanding and modelling high-dimensional data has thus become a crucial activity, especially in the field of pattern recognition. Since non-parametric density estimators are data-driven and do not require or impose a pre-defined probability density function on data, they are very powerful tools for probabilistic data modelling and analysis. Conventional non-parametric density estimation methods, however, originated from the field of statistics and were not originally intended to perform density estimation in high-dimensional feature spaces, as is often encountered in real-world pattern recognition tasks. We therefore address the fundamental problem of non-parametric density estimation in high-dimensional feature spaces in this study. Recent advances in maximum-likelihood (ML) kernel density estimation have shown that kernel density estimators hold much promise for estimating non-parametric probability density functions in high-dimensional feature spaces. We therefore derive two new iterative kernel bandwidth estimators from the maximum-likelihood (ML) leave-one-out objective function and also introduce a new non-iterative kernel bandwidth estimator (based on the theoretical bounds of the ML bandwidths) for the purpose of bandwidth initialisation. We name the iterative kernel bandwidth estimators the minimum leave-one-out entropy (MLE) and global MLE estimators, and name the non-iterative kernel bandwidth estimator the MLE rule-of-thumb estimator. We compare the performance of the MLE rule-of-thumb estimator and conventional kernel density estimators on artificial data with data properties that are varied in a controlled fashion, and on a number of representative real-world pattern recognition tasks, to gain a better understanding of the behaviour of these estimators in high-dimensional spaces and to determine whether these estimators are suitable for initialising the bandwidths of iterative ML bandwidth estimators in high dimensions. We find that there are several regularities in the relative performance of conventional kernel density estimators across different tasks and dimensionalities, and that the Silverman rule-of-thumb bandwidth estimator performs reliably across most tasks and dimensionalities of the pattern recognition datasets considered, even in high-dimensional feature spaces. Based on this empirical evidence and the intuitive theoretical motivation that the Silverman estimator optimises the asymptotic mean integrated squared error (assuming a Gaussian reference distribution), we select this estimator to initialise the bandwidths of the iterative ML kernel bandwidth estimators compared in our simulation studies. We then perform a comparative simulation study of the newly introduced iterative MLE estimators and other state-of-the-art iterative ML estimators on a number of artificial and real-world high-dimensional pattern recognition tasks. We illustrate with artificial data (guided by theoretical motivations) under what conditions certain estimators should be preferred, and we empirically confirm on real-world data that no estimator performs optimally on all tasks and that the optimal estimator depends on the properties of the underlying density function being estimated.
We also observe an interesting case of the bias-variance trade-off where ML estimators with fewer parameters than the MLE estimator perform exceptionally well on a wide variety of tasks; however, for the cases where these estimators do not perform well, the MLE estimator generally performs well. The newly introduced MLE kernel bandwidth estimators prove to be a useful contribution to the field of pattern recognition, since they perform optimally on a number of real-world pattern recognition tasks investigated and provide researchers and practitioners with two alternative estimators to employ for the task of kernel density estimation. / PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014
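
A hedged numpy sketch of two ingredients named in the abstract: the multivariate Silverman rule-of-thumb bandwidth (used here for initialisation) and the leave-one-out log-likelihood objective from which ML bandwidth estimators are derived. This is a generic single-bandwidth Gaussian KDE with a crude grid search, not the thesis's MLE or global MLE estimators:

```python
import numpy as np

def silverman_bandwidth(X):
    """Multivariate Silverman rule of thumb for an isotropic Gaussian kernel."""
    n, d = X.shape
    sigma = np.mean(np.std(X, axis=0, ddof=1))  # average marginal spread
    return sigma * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))

def loo_log_likelihood(X, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    n, d = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * h ** 2)) / ((2 * np.pi) ** (d / 2) * h ** d)
    np.fill_diagonal(K, 0.0)                    # leave each point out of its own estimate
    dens = K.sum(axis=1) / (n - 1)
    return np.sum(np.log(dens + 1e-300))

X = np.random.default_rng(1).normal(size=(200, 5))
h0 = silverman_bandwidth(X)                     # rule-of-thumb initialisation
hs = h0 * np.linspace(0.5, 2.0, 31)             # search around the initialisation
h_ml = hs[np.argmax([loo_log_likelihood(X, h) for h in hs])]
print(h0, h_ml)
```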
6

Artificial data for image classification in industrial applications

Yonan, Yonan, Baaz, August January 2022 (has links)
Machine learning and AI are growing rapidly and are being implemented more often than before due to their high accuracy and performance. One of the biggest challenges in machine learning is data collection. The training data is the most important part of any machine learning project, since it determines how the trained model will behave. In the case of object classification and detection, capturing a large number of images per object is not always possible and can be a very time-consuming and tedious process. This thesis explores options, specific to image classification, that help reduce the need to capture many images per object while keeping the same classification accuracy. Experiments have been performed with the goal of achieving high classification accuracy with a limited dataset. One method explored is to create artificial training images using a game engine. Ways to expand a small dataset, such as data augmentation and regularization methods, are also employed.
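
A minimal sketch of the data-augmentation side of such an approach, assuming torchvision as the library (the abstract does not name one); the game-engine generation of artificial images is outside the scope of this snippet:

```python
from torchvision import transforms

# An augmentation pipeline that multiplies the effective size of a small
# image dataset by applying random transformations on the fly.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

# Typical usage: pass the pipeline to a dataset loader, e.g.
# dataset = torchvision.datasets.ImageFolder("train/", transform=augment)
# so every epoch sees differently transformed versions of each image.
```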
7

Reconstruction of trees from 3D point clouds

Stålberg, Martin January 2017 (has links)
The geometrical structure of a tree can consist of thousands, even millions, of branches, twigs and leaves in complex arrangements. The structure contains a lot of useful information and can be used, for example, to assess a tree's health or to calculate parameters such as total wood volume or branch size distribution. Because of this complexity, capturing the structure of an entire tree used to be nearly impossible, but the increased availability and quality of digital cameras and Light Detection and Ranging (LIDAR) instruments in particular is making it increasingly possible. A set of digital images of a tree, or a point cloud of a tree from a LIDAR scan, contains a lot of data, but the information about the tree structure has to be extracted from this data through analysis. This work presents a method for reconstructing 3D models of trees from point clouds. The model is constructed from cylindrical segments which are added one by one. Bayesian inference is used to determine how to optimize the parameters of model segment candidates and whether or not to accept them as part of the model. A Hough transform for finding cylinders in point clouds is presented and used as a heuristic to guide the proposals of model segment candidates. Previous related works have mainly focused on high-density point clouds of sparse trees, whereas the objective of this work was to analyze low-resolution point clouds of dense almond trees. The method is evaluated on artificial and real datasets; it works rather well on high-quality data but performs poorly on low-resolution data with gaps and occlusions.
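
A hedged sketch of one building block such a method needs: scoring a candidate cylinder segment against a point cloud through the residuals of nearby points to the cylinder surface. The Gaussian residual model and all parameter names are assumptions; the thesis's Bayesian acceptance rule and cylinder Hough transform are omitted:

```python
import numpy as np

def cylinder_log_likelihood(points, base, axis, radius, length, noise_std=0.01):
    """Log-likelihood of points lying on a finite cylinder (base point, axis direction)."""
    axis = axis / np.linalg.norm(axis)
    rel = points - base
    t = rel @ axis                                   # coordinate along the axis
    radial = np.linalg.norm(rel - np.outer(t, axis), axis=1)
    near = (t >= 0) & (t <= length)                  # only points beside the segment
    residual = radial[near] - radius                 # distance to the cylinder surface
    return (-0.5 * np.sum((residual / noise_std) ** 2)
            - near.sum() * np.log(noise_std * np.sqrt(2.0 * np.pi)))

# In a reconstruction loop, a proposed segment would be accepted only if it
# raises the model's overall posterior score, e.g. the likelihood gain minus
# a complexity penalty per added segment.
```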
