
Pattern recognition using labelled and unlabelled data

This thesis presents the results of a three-year investigation into combining labelled and unlabelled data for data classification. In many fields the quantity of data available has increased exponentially over the last few years, partly through improved methods of automatic data capture and partly through improved electronic communication, particularly via the internet. These vast quantities of data require some form of processing to transform them into information, which is often costly and usually requires human (often expert) intervention. Our rationale for this investigation is that we wish to augment the information provided by human experts with data which has not been processed by them. The method we investigate is classification using both processed (labelled) and unprocessed (unlabelled) data, in order to reduce the need for human intervention.

In Chapter 2 of the thesis we review several aspects of this problem as it features in the current literature. We discuss:

• classification versus clustering
• error estimation: training, testing and validation data
• generalisation
• existing methods for combining labelled and unlabelled data
• combining classifiers
• artificial neural networks
• the sufficient level of labelled samples for a classification task
• active selection

These topics are revisited in the subsequent chapters, in which we present our new work.

We begin to introduce our novel work in Chapter 3, where we discuss five major approaches to combining labelled and unlabelled data to augment the classifier. The first, the base classifier, is trained only on the labelled data. Subsequent methods to improve this classifier include the following (the static labelling and majority clustering approaches are sketched in code below):

• static labelling: using the labelled data to create the classifier and then using this classifier to classify all the unlabelled data; the final training dataset is the union of the originally labelled and newly labelled datasets
• dynamic labelling: incrementally retraining the classifier on a sample-by-sample basis
• majority clustering: the majority vote of the labelled samples in a cluster (found without using labels) determines the classification of new data
• semi-supervised clustering: the labels are actively used in the clustering process

We investigate a particular semi-supervised method which we call refined clustering: we perform clustering and then refine the clusters based on the conflict levels of the labelled data in each cluster. We discuss how this method reduces error through its effect on both bias and variance, and we investigate methods of selecting which data points should be labelled to form the initial labelled dataset.

In the next chapter we discuss bagging, a method for combining classifiers, and apply it to Kohonen's Self-Organising Maps (SOMs). Bagging is typically performed with supervised classifiers, but the SOM is an unsupervised topology-preserving mapping, which raises issues that do not normally arise with bagging. We discuss several refinements to the algorithm which enable us to use the method with SOMs with confidence. Finally, we discuss supervised and semi-supervised versions of the SOM in the context of bagging.

In the next chapter we consider the problem of estimating what fraction of a dataset must be labelled before we can have confidence in a classifier trained on that labelled data.
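To make the Chapter 3 approaches concrete, the following is a minimal sketch of static labelling as described above. The function name and the choice of a k-nearest-neighbour base classifier are illustrative assumptions, not the classifier used in the thesis.

```python
# Minimal sketch of "static labelling": train on the labelled data,
# label the whole unlabelled pool once, then retrain on the union.
# The kNN base classifier is an illustrative choice, not the thesis's.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def static_labelling(X_lab, y_lab, X_unlab, n_neighbors=3):
    base = KNeighborsClassifier(n_neighbors=n_neighbors)
    base.fit(X_lab, y_lab)                   # base classifier: labelled data only
    y_new = base.predict(X_unlab)            # classify all the unlabelled data
    X_all = np.vstack([X_lab, X_unlab])      # union of originally labelled ...
    y_all = np.concatenate([y_lab, y_new])   # ... and newly labelled datasets
    final = KNeighborsClassifier(n_neighbors=n_neighbors)
    final.fit(X_all, y_all)                  # final classifier on the union
    return final
```

Dynamic labelling differs only in the loop structure: each unlabelled sample is classified and absorbed one at a time, with the classifier retrained after every addition.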
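Majority clustering can be sketched just as briefly. Here k-means stands in for whatever clustering algorithm is used; the helper names are hypothetical.

```python
# Minimal sketch of "majority clustering": cluster all data without
# using labels, then let the labelled samples in each cluster vote.
# k-means is an illustrative stand-in for the clustering step.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def majority_clustering(X_lab, y_lab, X_unlab, n_clusters):
    X_all = np.vstack([X_lab, X_unlab])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_all)  # no labels used
    votes = {}
    for cluster, label in zip(km.predict(X_lab), y_lab):
        votes.setdefault(cluster, Counter())[label] += 1
    # Each cluster takes the majority label of its labelled members;
    # clusters with no labelled members are left unlabelled (None).
    cluster_label = {c: votes[c].most_common(1)[0][0] if c in votes else None
                     for c in range(n_clusters)}
    def classify(X_new):
        return [cluster_label[c] for c in km.predict(X_new)]
    return classify
```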
We use sets of data points as a basis for each class in turn, which allows us to minimise the reconstruction error optimally for the members of that class without having this effect on the members of other classes. We put these concepts into the framework of a negative-feedback artificial neural network and show how separating the projection and reconstruction stages enables us to cluster datasets and, perhaps more importantly, to visualise their structure.

In the final chapter of new work, we discuss active and interactive selection of data points for labelling. We thus explicitly accept the use of a human (though not a human expert) in the classification process, but try to optimise this input by automatically presenting the data so that the task is straightforward for the human. We make specific use of the kernel matrices which have been so important in the development of Support Vector Machines (SVMs) and Kernel Principal Component Analysis (KPCA), in a way which has not previously been envisaged. In particular, we recover for Kernel PCA the sparseness property which exists for SVMs but is missing from standard KPCA.
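For context on that final point, the following is the textbook kernel PCA construction on a precomputed kernel matrix. It is standard (dense) KPCA, not the thesis's sparse method; it is the starting point whose expansion a sparse variant would prune.

```python
# Standard (dense) kernel PCA on a precomputed kernel matrix K.
# Every training point appears in the expansion of each component,
# which is exactly the lack of sparseness noted above.
import numpy as np

def kernel_pca(K, n_components):
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    # Centre the kernel matrix in feature space.
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(Kc)           # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]  # keep the top components
    # Normalise so each feature-space eigenvector has unit length
    # (guarding against near-zero eigenvalues).
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas   # projections of the training points
```

A sparse variant would represent each component using only a subset of the training points, much as the SVM does with its support vectors.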

Identifier oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:404644
Date January 2004
Creators Petrakieva, Lina
Publisher University of the West of Scotland
Source Sets Ethos UK
Detected Language English
Type Electronic Thesis or Dissertation
