Optimal Active Learning: experimental factors and membership query learning
Yu-hui Yeh (date unknown)
The field of Machine Learning is concerned with the development of algorithms, models and techniques that solve challenging computational problems by learning from data representative of the problem (e.g. given a set of medical images previously classified by a human expert, build a model to classify unseen images as either benign or malignant). Many important real-world problems have been formulated as supervised learning problems. The assumption is that a data set is available containing the correct output (e.g. class label or target value) for each given data point. In many application domains, obtaining the correct outputs (labels) for data points is a costly and time-consuming task. This has provided the motivation for the development of Machine Learning techniques that attempt to minimize the number of labeled data points required while maintaining good generalization performance on a given problem. Active Learning is one such class of techniques and is the focus of this thesis. Active Learning algorithms select or generate unlabeled data points to be labeled and use these points for learning. If successful, an Active Learning algorithm should be able to produce learning performance (e.g. test-set error) comparable to that of an equivalent supervised learner while using fewer labeled data points. Theoretical, algorithmic and experimental Active Learning research has been conducted and a number of successful applications have been demonstrated. However, the scope of many of the experimental studies on Active Learning has been relatively small and there are very few large-scale experimental evaluations of Active Learning techniques. A significant amount of performance variability exists across Active Learning experimental results in the literature.
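The pool-based setup described above can be sketched in a few lines. The following is a minimal illustration only, not the thesis's Optimal Active Learning algorithm: a Gaussian Process regressor (RBF kernel, implemented directly in NumPy) repeatedly queries the label of the pool point with the highest predictive variance. The toy target function, kernel length-scale, and noise level are all assumed for the example.

```python
import numpy as np

def rbf(a, b, ls=0.5):
    """Squared-exponential (RBF) kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X_lab, y_lab, X_pool, noise=1e-2):
    """GP posterior mean and variance at X_pool given labeled data."""
    K = rbf(X_lab, X_lab) + noise * np.eye(len(X_lab))
    Ks = rbf(X_lab, X_pool)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_lab))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(X_pool, X_pool) - v.T @ v)
    return mean, var

f = lambda x: np.sin(3 * x)          # toy regression target (assumed)
pool = np.linspace(-2, 2, 200)       # unlabeled pool of candidate inputs
labeled_idx = [0, 199]               # seed with the two endpoints

# Pool-based active learning loop: at each step, label the pool
# point where the GP's predictive variance is largest.
for _ in range(10):
    X_lab = pool[labeled_idx]
    y_lab = f(X_lab)                 # "oracle" provides the label
    mask = np.ones(len(pool), bool)
    mask[labeled_idx] = False
    _, var = gp_posterior(X_lab, y_lab, pool[mask])
    query = np.flatnonzero(mask)[np.argmax(var)]
    labeled_idx.append(int(query))
```

Because GP predictive variance does not depend on the observed targets, this selection rule tends to space queries across the input domain, which is one simple way a learner can outperform random labeling at the same label budget.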
Furthermore, the implementation details and effects of experimental factors have not been closely examined in empirical Active Learning research, creating some doubt over the strength and generality of conclusions that can be drawn from such results. The Active Learning model/system used in this thesis is the Optimal Active Learning algorithm framework with Gaussian Processes for regression problems (however, most of the research questions are of general interest in many other Active Learning scenarios). Experimental and implementation details of the Active Learning system used are described in detail, using a number of regression problems and datasets of different types. It is shown that the experimental results of the system are subject to significant variability across problem datasets. The hypothesis that experimental factors can account for this variability is then investigated. The results show the impact of the sampling and sizes of the datasets used when generating experimental results. Furthermore, preliminary experimental results expose performance variability across various real-world regression problems. The results suggest that these experimental factors can, to a large extent, account for the variability observed in experimental results. A novel resampling technique for Optimal Active Learning, called '3-Sets Cross-Validation', is proposed as a practical solution to reduce experimental performance variability. Further results confirm the usefulness of the technique. The thesis then proposes an extension to the Optimal Active Learning framework to perform learning via membership queries, through a novel algorithm named MQOAL. The MQOAL algorithm employs the Metropolis-Hastings Markov chain Monte Carlo (MCMC) method to sample data points for query selection.
Experimental results show that MQOAL provides comparable performance to the pool-based OAL learner, using a very generic, simple MCMC technique, and is robust to experimental factors related to the MCMC implementation. The possibility of making queries in batches is also explored experimentally, with results showing that while some performance degradation does occur, it is minimal for learning in small batch sizes, which is likely to be valuable in some real-world problem domains.
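The abstract does not give MQOAL's internals, but the core ingredient it names, Metropolis-Hastings sampling of candidate query points, can be sketched generically. In the assumed setup below, a random-walk proposal draws query locations in proportion to an unnormalized utility (here a toy quadratic log-utility peaked at an assumed "high-value" input); in a membership-query setting, the utility would instead come from the learner's selection criterion.

```python
import numpy as np

def metropolis_hastings(log_utility, x0, n_steps=2000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings sampler (generic sketch).

    Draws points with probability proportional to exp(log_utility(x)),
    so regions of high utility are visited more often. Only ratios of
    the target are used, so the utility need not be normalized.
    """
    rng = np.random.default_rng(seed)
    x, lu = x0, log_utility(x0)
    samples = []
    for _ in range(n_steps):
        cand = x + step * rng.standard_normal()   # symmetric proposal
        lu_cand = log_utility(cand)
        # Accept with probability min(1, target(cand) / target(x)).
        if np.log(rng.random()) < lu_cand - lu:
            x, lu = cand, lu_cand
        samples.append(x)
    return np.array(samples)

# Toy utility peaked at x = 1.5 (assumed stand-in for a query criterion).
log_u = lambda x: -(x - 1.5) ** 2
draws = metropolis_hastings(log_u, x0=0.0)
```

This matches the abstract's characterization of the technique as very generic and simple: the sampler needs only pointwise evaluations of the (unnormalized) utility, and the main implementation factors are the proposal step size and chain length.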