With the exponential growth in the amount and variety of data sources, access to large collections of data has become easier and cheaper. However, such data is generally unlabelled, and labels are often difficult, expensive, and time-consuming to obtain. Two learning paradigms have been used by the machine learning community to reduce the need for labels in training data: semi-supervised learning (SSL) and active learning (AL). AL is a reliable way to build up training sets efficiently with minimal supervision. By querying the class (label) of the most interesting samples, based upon previously seen data and some selection criteria, AL can produce a nearly optimal hypothesis while requiring the minimum possible quantity of labelled data. SSL, on the other hand, takes advantage of both labelled and unlabelled data to address the challenge of learning from a small number of labelled samples and a large amount of unlabelled data. In this thesis, we borrow the concept of SSL by allowing AL algorithms to make use of abundant unlabelled data, so that both labelled and unlabelled data are used in their querying criteria.

Another common tradition in the AL community is to assume that data samples are already gathered in a pool, so that AL has the luxury of exhaustively searching that pool for the samples worth labelling. In this thesis, we go beyond that by applying AL to data streams. In a stream, data may grow indefinitely, making its storage prior to processing impractical. Due to its dynamic nature, the underlying distribution of the data stream may change over time, resulting in the so-called concept drift, or classes may emerge and fade, a phenomenon known as concept evolution. Another challenge associated with AL in general is sampling bias, where the sampled training set does not reflect the underlying data distribution. In the presence of concept drift, sampling bias is more likely to occur, as the training set needs to represent the underlying distribution of the evolving data.

Given these challenges, the research questions that the thesis addresses are: can AL improve learning given that data comes in streams? Is it possible to harness AL to handle changes in streams (i.e., concept drift and concept evolution) by querying selected samples? How can sampling bias be attenuated while maintaining the advantages of AL? Finally, applying AL to sequential data streams (such as time series) requires new approaches, especially in the presence of concept drift and concept evolution. Hence, the question is how to handle concept drift and concept evolution in sequential data online, and can AL be useful in such a case?

In this thesis, we develop a set of stream-based AL algorithms to answer these questions in line with the aforementioned challenges. The core idea of these algorithms is to query the samples that yield the largest reduction of an expected loss function that measures the learning performance. Two types of AL are proposed: decision-theoretic AL, whose losses involve the prediction error, and information-theoretic AL, whose losses involve the model parameters. Although our work focuses on classification problems, AL algorithms for other problems, such as regression and parameter estimation, can be derived from the proposed ones. Several experiments have been performed to evaluate the performance of the proposed algorithms. The results show that our algorithms outperform other state-of-the-art algorithms.
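To make the querying criterion concrete, the sketch below illustrates stream-based AL driven by an expected-loss-reduction score. It is a minimal, hypothetical example, not the thesis's actual decision- or information-theoretic algorithms: it assumes a simple online logistic model and a binary 0/1 loss, for which the expected error reduction from labelling a sample reduces to the predictive uncertainty min(p, 1-p). The names (`OnlineLogistic`, `expected_error_reduction`) and the budget and threshold values are illustrative assumptions.

```python
# Hypothetical sketch: stream-based active learning that queries a label
# only when the expected reduction in prediction error is large enough.
import numpy as np

rng = np.random.default_rng(0)

class OnlineLogistic:
    """Binary logistic regression trained by single-sample SGD."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def prob(self, x):
        # P(y = 1 | x) under the current parameters.
        return 1.0 / (1.0 + np.exp(-self.w @ x))

    def update(self, x, y):
        # One gradient step on the log loss for a labelled sample.
        self.w += self.lr * (y - self.prob(x)) * x

def expected_error_reduction(model, x):
    """Decision-theoretic score: for a binary 0/1 loss, the expected
    drop in error from knowing x's label is the predictive
    uncertainty min(p, 1 - p)."""
    p = model.prob(x)
    return min(p, 1.0 - p)

def stream(n):
    # Simulated unbounded stream: two overlapping Gaussian classes.
    for _ in range(n):
        y = rng.integers(0, 2)
        yield rng.normal(loc=2.0 * y - 1.0, scale=1.0, size=2), y

model = OnlineLogistic(dim=2)
budget, spent, threshold = 50, 0, 0.3   # illustrative labelling budget

for x, y in stream(2000):
    # Query the label only for samples whose expected loss reduction
    # exceeds the threshold; all other samples pass by unlabelled.
    if spent < budget and expected_error_reduction(model, x) > threshold:
        model.update(x, y)
        spent += 1
```

Under these assumptions, the model queries aggressively while uncertain and stops once its decision boundary stabilises, which is the general behaviour the expected-loss-reduction criterion aims for; the thesis's algorithms replace this score with richer decision- and information-theoretic losses.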
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:725478
Date | January 2017 |
Creators | Mohamad, Saad |
Publisher | Bournemouth University |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://eprints.bournemouth.ac.uk/29901/ |