351

Data complexity in supervised learning: A far-reaching implication

Macià Antolínez, Núria 06 October 2011 (has links)
This thesis takes a close view of data complexity and its role in shaping the behaviour of machine learning techniques in supervised learning, and explores the generation of synthetic data sets through complexity estimates. The work has been built upon four principles which have naturally followed one another. (1) A critique of the current methodologies used by the machine learning community to evaluate the performance of new learners sparks (2) an interest in alternative estimates based on the analysis of data complexity and its study. However, both the early stage of the complexity measures and the limited availability of real-world problems for testing inspire (3) the generation of synthetic problems, which becomes the backbone of this thesis, and (4) the proposal of artificial benchmarks resembling real-world problems.
The ultimate goal of this research is, in the long run, to provide practitioners (1) with guidelines for choosing the most suitable learner for a given problem and (2) with a collection of benchmarks to either assess the performance of learners or test their limitations.
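The abstract does not name the specific complexity estimators studied; a representative example from the data-complexity literature this line of work builds on is the maximum Fisher's discriminant ratio (the F1 measure of Ho and Basu). A minimal Python sketch, with the function name and the epsilon guard being our illustrative choices:

```python
import numpy as np

def fisher_ratio(X, y):
    """Maximum Fisher's discriminant ratio (F1) for a binary problem:
    how well the single best feature separates the two classes,
    computed as max over features of (mu0 - mu1)^2 / (var0 + var1).
    Higher values indicate an easier (more linearly separable) problem."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0)
    return float(np.max(num / np.maximum(den, 1e-12)))
```

Measures of this kind score problem difficulty from the geometry of the data alone, which is what makes them usable both for characterizing learner behaviour and for driving synthetic problem generation.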
352

Learning without labels and nonnegative tensor factorization

Balasubramanian, Krishnakumar 08 April 2010 (has links)
Supervised learning tasks, such as building a classifier or estimating the error rate of a predictor, are typically performed with labeled data. In most cases, obtaining labeled data is costly because it requires manual labeling. Unlabeled data, on the other hand, is available in abundance. In this thesis, we discuss methods for performing supervised learning tasks with no labeled data. We prove the consistency of the proposed methods and demonstrate their applicability with synthetic and real-world experiments. In some cases, small quantities of labeled data may be easily available and supplemented with large quantities of unlabeled data (semi-supervised learning). We derive the asymptotic efficiency of generative models for semi-supervised learning and quantify the effect of labeled and unlabeled data on the quality of the estimate. Another independent track of the thesis is efficient computational methods for nonnegative tensor factorization (NTF). NTF provides the user with rich modeling capabilities, but it comes with an added computational cost. We provide a fast algorithm for performing NTF using a modified active set method called the block principal pivoting method and demonstrate its applicability to social network analysis and text mining.
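As an illustration of the alternating structure that a fast nonnegative least-squares solver accelerates in NTF, here is a minimal sketch of a rank-r nonnegative CP decomposition of a 3-way tensor. SciPy's plain `nnls` stands in for the thesis' block principal pivoting inner solver, so this shows the shape of the computation rather than the fast algorithm itself:

```python
import numpy as np
from scipy.optimize import nnls

def khatri_rao(A, B):
    # Column-wise Khatri-Rao product: row (i*B.shape[0] + j) holds A[i] * B[j].
    r = A.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, r)

def ntf_als(X, rank, n_iter=50, seed=0):
    """Nonnegative CP factorization X[i,j,k] ~ sum_r A[i,r]*B[j,r]*C[k,r]
    by alternating nonnegative least squares, one NNLS solve per row.
    (Plain NNLS is a stand-in for the block principal pivoting solver.)"""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.random((n, rank)) for n in (I, J, K))
    X0 = X.reshape(I, J * K)                      # mode-0 unfolding
    X1 = np.moveaxis(X, 1, 0).reshape(J, I * K)   # mode-1 unfolding
    X2 = np.moveaxis(X, 2, 0).reshape(K, I * J)   # mode-2 unfolding
    for _ in range(n_iter):
        for M, unf, P, Q in ((A, X0, B, C), (B, X1, A, C), (C, X2, A, B)):
            KR = khatri_rao(P, Q)                 # design matrix shared by all rows
            for i in range(unf.shape[0]):
                M[i], _ = nnls(KR, unf[i])
    return A, B, C
```

For text-mining-scale tensors one would solve all rows of a factor jointly and exploit structure; the per-row loop here only keeps the sketch short.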
353

Semi-Supervised Classification Using Gaussian Processes

Patel, Amrish 01 1900 (has links)
Gaussian Processes (GPs) are promising Bayesian methods for classification and regression problems. They have also been used for semi-supervised classification tasks. In this thesis, we propose new algorithms for solving the semi-supervised binary classification problem using GP regression (GPR) models. The algorithms are closely related to semi-supervised classification based on support vector regression (SVR) and maximum margin clustering. The proposed algorithms are simple and easy to implement, and the hyper-parameters are estimated without resorting to the expensive cross-validation technique. The algorithm based on the sparse GPR model gives a sparse solution directly, unlike the SVR-based algorithm; the use of a sparse GPR model helps make the proposed algorithm scalable. The results of experiments on synthetic and real-world datasets demonstrate the efficacy of the proposed sparse GP-based algorithm for semi-supervised classification.
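The thesis' specific algorithms are not reproduced here, but the basic building block of treating ±1 class labels as regression targets for a GP can be sketched with scikit-learn; the self-training step that pseudo-labels confidently scored unlabeled points is our illustrative addition, not the thesis' method:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
# Two 1-D Gaussian blobs: a handful of labeled points plus many unlabeled ones.
X_lab = np.r_[rng.normal(-2, 0.5, 5), rng.normal(2, 0.5, 5)].reshape(-1, 1)
y_lab = np.r_[-np.ones(5), np.ones(5)]            # class labels encoded as +/-1 targets
X_unl = np.r_[rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)].reshape(-1, 1)

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.05)
gpr.fit(X_lab, y_lab)

# Pseudo-label unlabeled points whose posterior mean is confidently away
# from the decision boundary (|mean| > 2*std), then refit on the union.
mean, std = gpr.predict(X_unl, return_std=True)
conf = np.abs(mean) > 2 * std
gpr.fit(np.r_[X_lab, X_unl[conf]], np.r_[y_lab, np.sign(mean[conf])])
print((gpr.predict(X_unl) > 0).mean())            # fraction assigned to class +1
```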
354

Large-scale semi-supervised learning for natural language processing

Bergsma, Shane A Unknown Date
No description available.
355

Development of Partially Supervised Kernel-based Proximity Clustering Frameworks and Their Applications

Graves, Daniel Unknown Date
No description available.
356

Adaptive Graph-Based Algorithms for Conditional Anomaly Detection and Semi-Supervised Learning

Valko, Michal 01 August 2011 (has links) (PDF)
We develop graph-based methods for semi-supervised learning based on label propagation on a data similarity graph. When data are abundant or arrive in a stream, problems of computation and data storage arise for any graph-based method. We propose a fast approximate online algorithm that solves for the harmonic solution on an approximate graph. We show, both empirically and theoretically, that good behavior can be achieved by collapsing nearby points into a set of local representative points that minimize distortion. Moreover, we regularize the harmonic solution to achieve better stability properties. We also present graph-based methods for detecting conditional anomalies and apply them to the identification of unusual clinical actions in hospitals. Our hypothesis is that patient-management actions that are unusual with respect to past patients may be due to errors, and that it is worthwhile to raise an alert when such a condition is encountered. Conditional anomaly detection extends the standard unconditional anomaly detection framework, but it also faces new problems, known as fringe and isolated points. We devise novel nonparametric graph-based methods to tackle these problems, relying on graph connectivity analysis and the soft harmonic solution. Finally, we conduct an extensive human evaluation study of our conditional anomaly methods with 15 experts in critical care.
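The exact harmonic solution that the online algorithm approximates has a closed form in terms of the graph Laplacian: the unlabeled scores satisfy L_uu f_u = W_ul y_l. A minimal NumPy sketch for binary labels on a dense similarity graph, where the scalar `gamma` is an illustrative stand-in for the thesis' stability-motivated regularization:

```python
import numpy as np

def harmonic_solution(W, y_lab, labeled_idx, gamma=0.0):
    """Harmonic label propagation on a similarity graph: solves
    (L_uu + gamma*I) f_u = W_ul y_l, where L = D - W is the graph
    Laplacian. gamma = 0 gives the exact harmonic solution; gamma > 0
    is a soft/regularized variant with better stability."""
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L = np.diag(W.sum(axis=1)) - W
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_u = np.linalg.solve(L_uu + gamma * np.eye(len(unlabeled_idx)), W_ul @ y_lab)
    return unlabeled_idx, f_u
```

The linear solve is what becomes infeasible on abundant or streaming data, which motivates both the representative-point graph compression and the approximate online solver described above.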
357

A General System for Supervised Biomedical Image Segmentation

Chen, Cheng 15 March 2013 (has links)
Image segmentation is important, with applications to several problems in biology and medicine. While extensively researched, current segmentation methods generally perform adequately in the applications for which they were designed, but often require extensive modifications or calibrations before being used in a different application. We describe a system that, with few modifications, can be used in a variety of image segmentation problems. The system is based on a supervised learning strategy that utilizes intensity neighborhoods to assign each pixel in a test image its correct class based on training data. In summary, we make several contributions: (1) a general framework for such a system is proposed, where rotations and variations of intensity neighborhoods across scales are modeled, and a multi-scale classification framework is utilized to segment unknown images; (2) a fast algorithm for training data selection and pixel classification is presented, where a majority-voting-based criterion is proposed for selecting a small subset from the raw training set; combined with a 1-nearest neighbor (1-NN) classifier, this algorithm provides decent classification accuracy within reasonable computational complexity; (3) a general deformable model for optimization of segmented regions is proposed, which takes the decision values from the preceding pixel classification process as input and optimizes the segmented regions in a partial differential equation (PDE) framework. We show that the performance of this system in several different biomedical applications, such as tissue segmentation tasks in magnetic resonance and histopathology microscopy images, as well as nuclei segmentation from fluorescence microscopy images, is similar to or better than that of several algorithms specifically designed for each of these applications. In addition, we describe another general segmentation system for biomedical applications where a strong prior on shape is available (e.g. cells, nuclei). The idea is based on template matching and supervised learning, and we show examples of segmenting cells and nuclei from microscopy images. The method uses examples selected by a user to build a statistical model which captures the texture and shape variations of the nuclear structures in a given data set to be segmented. Segmentation of subsequent, unlabeled images is then performed by finding the model instance that best matches (in the normalized cross-correlation sense) a local neighborhood in the input image. We demonstrate the application of our method to segmenting cells and nuclei from a variety of imaging modalities and quantitatively compare our results to several other methods. Quantitative results using both simulated and real image data show that, while certain methods may work well for certain imaging modalities, our software is able to obtain high accuracy across the several imaging modalities studied. Results also demonstrate that, relative to several existing methods, the proposed template-based method is more robust: it better handles variations in illumination and in texture from different imaging modalities, provides smoother and more accurate segmentation borders, and better handles cluttered cells and nuclei.
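Stripped of the rotation/scale modelling, multi-scale classification, training-set reduction, and the PDE refinement, the core of the first system is patch-wise 1-NN pixel classification. A minimal scikit-learn sketch, where the array names and patch radius are our illustrative choices and no attempt is made at the speed of the actual system:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def patches(img, r):
    """Flattened (2r+1)x(2r+1) intensity neighborhood of every pixel,
    with edges handled by reflection padding."""
    p = np.pad(img, r, mode='reflect')
    h, w = img.shape
    feats = [p[i:i + 2 * r + 1, j:j + 2 * r + 1].ravel()
             for i in range(h) for j in range(w)]
    return np.asarray(feats)

def segment(train_img, train_mask, test_img, r=2):
    """Assign each test pixel the class of its nearest training
    neighborhood (hypothetical arrays; train_mask holds per-pixel labels)."""
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(patches(train_img, r), train_mask.ravel())
    return clf.predict(patches(test_img, r)).reshape(test_img.shape)
```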
358

Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease

Duan, Haoyang 15 May 2014 (has links)
From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs) from the Ontario Heart Genomics Study (OHGS). First, the thesis explains the k-Nearest Neighbour (k-NN) and Random Forest learning algorithms, and includes a complete proof that k-NN is universally consistent in finite-dimensional normed vector spaces. Second, the thesis introduces two dimensionality reduction techniques: Random Projections and a new method termed Mass Transportation Distance (MTD) Feature Selection. The thesis then compares the performance of Random Projections with k-NN against that of MTD Feature Selection with Random Forest for predicting coronary artery disease. Results demonstrate that MTD Feature Selection with Random Forest is superior to Random Projections with k-NN: Random Forest obtains an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS dataset when 3335 SNPs are selected by MTD Feature Selection for classification. This area is considerably better than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.
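MTD Feature Selection is the thesis' own method and is not reproduced here, but the filter-then-classify pipeline it plugs into can be sketched; mutual information is substituted purely to illustrate the shape of the computation, and the data below are random placeholders rather than the OHGS SNPs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5000)).astype(float)  # placeholder genotype matrix
y = rng.integers(0, 2, size=200)                        # placeholder disease status

pipe = make_pipeline(
    SelectKBest(mutual_info_classif, k=500),            # thesis selects 3335 SNPs via MTD
    RandomForestClassifier(n_estimators=500, random_state=0),
)
# Selection happens inside each CV fold, avoiding selection bias in the AUC estimate.
print(cross_val_score(pipe, X, y, scoring='roc_auc', cv=5).mean())
```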
359

Weakly Supervised Learning for Structured Output Prediction

Kumar, M. Pawan 12 December 2013 (has links) (PDF)
We consider the problem of learning the parameters of a structured output prediction model, that is, learning to predict the elements of a complex interdependent output space that correspond to a given input. Unlike many existing approaches, we focus on the weakly supervised setting, where most (or all) of the training samples have only been partially annotated. Given such a weakly supervised dataset, our goal is to estimate accurate parameters of the model by minimizing the regularized empirical risk, where the risk is measured by a user-specified loss function. This task has previously been addressed by the well-known latent support vector machine (latent SVM) framework. We argue that, while latent SVM offers a computationally efficient solution to loss-based weakly supervised learning, it suffers from three drawbacks: (i) the optimization problem corresponding to latent SVM is a difference-of-convex program, which is non-convex and hence susceptible to bad local minima; (ii) the prediction rule of latent SVM relies only on the most likely value of the latent variables, not on the uncertainty in the latent variable values; and (iii) the loss function used to measure the risk is restricted to be independent of the true (unknown) value of the latent variables. We address these drawbacks with three novel contributions. First, inspired by human learning, we design an automatic self-paced learning algorithm for latent SVM, which builds on the intuition that the learner should be presented with the training samples in a meaningful order that facilitates learning: starting from easy samples and gradually moving to harder ones. Our algorithm simultaneously selects the easy samples and updates the parameters at each iteration by solving a biconvex optimization problem. Second, we propose a new family of latent variable models (LVMs) called max-margin min-entropy (M3E) models, which includes latent SVM as a special case. Given an input, an M3E model predicts the output with the smallest corresponding Rényi entropy of generalized distribution, which relies not only on the probability of the output but also on the uncertainty of the latent variable values. Third, we propose a novel framework for learning with general loss functions that may depend on the latent variables. Specifically, our framework simultaneously estimates two distributions: (i) a conditional distribution to model the uncertainty of the latent variables for a given input-output pair; and (ii) a delta distribution to predict the output and the latent variables for a given input. During learning, we encourage agreement between the two distributions by minimizing a loss-based dissimilarity coefficient. We demonstrate the efficacy of our contributions on standard machine learning applications using publicly available datasets.
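The first contribution, self-paced learning, alternates between selecting currently easy samples and refitting on them while the easiness threshold relaxes. In this minimal sketch a fully supervised linear SVM stands in for latent SVM, and the threshold schedule and constants are illustrative, not the thesis' settings:

```python
import numpy as np
from sklearn.svm import LinearSVC

def self_paced_fit(X, y, lam0=2.0, growth=1.3, n_rounds=6):
    """Self-paced learning loop: keep the samples whose hinge loss under
    the current model is below 1/lam ("easy"), refit on them, then shrink
    lam so that harder samples are gradually admitted. y must be in {-1, +1}."""
    clf = LinearSVC().fit(X, y)
    lam = lam0
    for _ in range(n_rounds):
        loss = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
        easy = loss < 1.0 / lam
        # Refit only if the "easy" subset still contains both classes.
        if easy.sum() >= 2 and np.unique(y[easy]).size == 2:
            clf = LinearSVC().fit(X[easy], y[easy])
        lam /= growth                               # relax the threshold
    return clf
```

In the latent SVM setting each refit is one half of the biconvex problem, with the latent variables imputed in the other half; the supervised stand-in above keeps only the sample-selection schedule.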
360

Automatic generation of expert systems with certainty factors from datasets

Κόβας, Κωνσταντίνος 11 August 2011 (has links)
The main objective of this thesis is to present a method for the automatic generation of expert systems, extracting knowledge from datasets and representing it in the form of production rules. We use a supervised machine learning method resembling classification rule mining, although classification is not our only goal: important operational characteristics of expert systems, such as explanation of conclusions and dynamic updating of the knowledge base, are also taken into account. Our approach is implemented within an existing tool, initially developed by us to compare methods for combining uncertain conclusions about the same event, based on the uncertainty model of Certainty Factors. That tool could generate expert systems (in the CLIPS language) that use the above methods. The main aim of this thesis is to enhance the above-mentioned tool so that it generates expert systems in a more automatic, efficient and functional fashion. More specifically, the architecture has been modified to support output variables classified into more than two classes (multiclass classification).
An extension of the system made it possible to generate classification rules for additional variables (apart from the output variable) for which the final user of the expert system cannot provide values. This gives the ability to design more complex rule hierarchies, which are represented in an easy-to-understand tree form. Furthermore, the certainty factors model has been revised, and an additional method of computing the certainty factors of classification rules is offered, following the definitions in MYCIN's model. Experimental results showed improved performance, especially for the prediction of minority classes in imbalanced datasets. Feature ranking and subset selection techniques help to achieve the generation task in a more automatic and efficient way. Other enhancements include the ability to produce expert systems that dynamically update the certainty factors in their rules using new data for the problem, the generation of rules and functions for interaction with the end user, and a graphical interface for the produced expert system.
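The MYCIN parallel-combination rule referenced above, which merges two certainty factors for the same hypothesis, is compact enough to state directly; a small Python sketch:

```python
def combine_cf(cf1, cf2):
    """MYCIN-style combination of two certainty factors in [-1, 1]
    asserted for the same hypothesis by different rules."""
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 * (1 - cf1)                     # both confirming
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 * (1 + cf1)                     # both disconfirming
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))   # mixed evidence

print(combine_cf(0.6, 0.4))    # 0.76: two supporting rules reinforce
print(combine_cf(0.6, -0.4))   # ~0.33: conflicting evidence attenuates
```

Because combination is incremental, certainty factors derived from new data can be folded into existing conclusions, which fits the tool's support for dynamically updating the rules' certainty factors.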
