1. Pattern recognition using labelled and unlabelled data
Petrakieva, Lina (January 2004)
This thesis presents the results of a three-year investigation into combining labelled and unlabelled data for data classification. In many fields the quantity of data available to workers has increased exponentially over the last few years, partly due to improved methods of automatic data capture and partly due to improved electronic communication, particularly via the internet. These vast quantities of data require some form of processing in order to transform the data into information. This is often a costly business requiring human (often expert) intervention. Our rationale for this investigation is that we wish to augment the information provided by human experts with data which has not been processed by human experts. The actual method we investigate is classification using both processed (labelled) and unprocessed (unlabelled) data in order to reduce the requirement for human intervention.

In Chapter 2 of the thesis we review several aspects of this problem as it features in the current literature. We discuss:
• classification versus clustering
• error estimation: training, testing and validation data
• generalisation
• existing methods for combining labelled and unlabelled data
• combining classifiers
• artificial neural networks
• the sufficient level of labelled samples for a classification task
• active selection
These topics are revisited in the subsequent chapters in which we present our new work.

We begin to introduce our novel work in Chapter 3, where we discuss five major approaches to combining labelled and unlabelled data to augment the classifier. The first, base classifier is trained only on the labelled data. Subsequent methods to improve this classifier include:
• static labelling: using the labelled data to create the classifier and then using this classifier to classify all the unlabelled data; the final training dataset is composed of the union of the originally labelled and newly labelled datasets (a minimal sketch of this approach appears after the abstract)
• dynamic labelling: incrementally retraining the classifier on a sample-by-sample basis
• majority clustering: the majority vote from the labelled samples in a cluster (found without using labels) determines the classification of new data
• semi-supervised clustering: the labels are actively used in the clustering process
We investigate a particular semi-supervised method which we call refined clustering: we perform clustering and then refine the clusters based on the conflict levels of the labelled data in each cluster. We discuss how the method achieves a reduction in error through its effect on both bias and variance. We also investigate methods of selecting which data points are to be labelled for the initial labelled dataset.

In the next chapter, we discuss bagging, a method for combining classifiers, and use the method on Kohonen's Self-Organising Maps (SOMs). Bagging is typically performed with supervised classifiers, but the SOM is an unsupervised topology-preserving mapping, which raises issues that do not normally arise with bagging. We discuss several refinements to the algorithm which enable us to confidently use the method with SOMs. Finally we discuss supervised and semi-supervised versions of the SOM in the context of bagging.

In the next chapter, we consider the problem of estimating what fraction of a dataset must be labelled before we can have confidence in the classifier trained on this labelled dataset.
We use sets of data points as a basis for each class in turn, which allows us to minimise reconstruction error optimally for the members of that class but does not have this effect on members of other classes. We put these concepts into the framework of a negative feedback artificial neural network and show how separating the projection and reconstruction stages enables us to cluster datasets and, perhaps more importantly, to visualise their structure.

In the final chapter of new work, we discuss active and interactive selection of data points for labelling. We are thus explicitly accepting the use of a human (but not a human expert) in the classification process, but are trying to optimise this input by automatically presenting the data so that the task is straightforward for the human. We specifically use the kernel matrices which have been so important in the development of Support Vector Machines (SVMs) and Kernel Principal Component Analysis (KPCA) in a way which has not previously been envisaged. We recover for Kernel PCA the sparseness property which exists for SVMs but is missing in standard KPCA.
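The sketch below illustrates the static labelling approach described in Chapter 3 above: train a base classifier on the labelled data only, label the unlabelled pool in one pass, then retrain on the union of both sets. The choice of a k-nearest-neighbours base classifier, the variable names and the toy data are illustrative assumptions, not taken from the thesis.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def static_labelling(X_labelled, y_labelled, X_unlabelled):
    """Train on labelled data, label the unlabelled pool once, retrain on the union."""
    base = KNeighborsClassifier(n_neighbors=3)
    base.fit(X_labelled, y_labelled)              # classifier built from labelled data only
    y_pseudo = base.predict(X_unlabelled)         # label every unlabelled sample in one pass
    X_all = np.concatenate([X_labelled, X_unlabelled])
    y_all = np.concatenate([y_labelled, y_pseudo])
    final = KNeighborsClassifier(n_neighbors=3)
    final.fit(X_all, y_all)                       # final classifier trained on the enlarged set
    return final

# Toy usage with two Gaussian blobs: only six points carry labels.
rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(0, 1, (3, 2)), rng.normal(5, 1, (3, 2))])
y_lab = np.array([0, 0, 0, 1, 1, 1])
X_unlab = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
clf = static_labelling(X_lab, y_lab, X_unlab)
print(clf.predict([[0.2, -0.1], [4.8, 5.3]]))     # expected: [0 1]
```

The dynamic labelling variant differs only in retraining after each newly labelled sample rather than once at the end.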
2. Towards the identification of intent for error correction
Johnson, Ian (January 2003)
No description available.
3. Low density parity check coding
Guo, Feng (January 2005)
No description available.
4. Combining structure and appearance in digital documents using XML and PDF
Hardy, Matthew R. B. (January 2003)
No description available.
5. Data input for scientific visualization
Mathers, Christian (January 2004)
No description available.
6. Automated generation of personal data reports from relational databases
Cawley, Benjamin Matthew (January 2007)
This thesis presents a novel approach for extracting personal data and automatically generating Personal Data Reports (PDRs) from relational databases. Such PDRs can be used, among other purposes, for compliance with Subject Access Requests (SARs) under Data Protection Acts (DPAs). The proposed approach combines the use of graphs and SQL for the construction of PDRs. Its rationale is based on the fact that some relations in a database, which we denote as RDS relations, hold information about Data Subjects (DSs), and that relations linked around RDSs contain additional information about the particular DS. Three methods with different usability characteristics are introduced: 1) the GDS Based Method and 2) the Schema Browsing Method, which generate SAR PDRs, and 3) the T Based Method, which generates General Purpose PDRs. The novelty of these methodologies is that they do not require the user to have any prior knowledge of either the database schema or any query language. The work described in this thesis addresses a gap in the knowledge for DPA compliance, as current data protection systems do not provide facilities for generating personal data reports. The performance results of the ODS approach are presented together with precision and recall measures of the T Based Method. An optimization algorithm, based on heuristics and hash tables, that reuses already-found data is employed and its effectiveness verified. We conclude that the ODS and schema browsing methods provide an effective solution and that the automated T Based approach is an effective alternative for generating general purpose data reports, giving an average f-score of 76.5%.
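As a rough illustration of the general idea of relations "linked around" a Data Subject relation, the sketch below walks a hypothetical foreign-key graph and emits one SQL query per linked relation. The table names, columns and the simple one-hop traversal are assumptions made for illustration; the thesis's GDS, Schema Browsing and T Based methods are not reproduced here.

```python
# Hypothetical schema graph: relation -> (foreign-key column, referenced relation, referenced key)
SCHEMA_GRAPH = {
    "orders":   [("customer_id", "customer", "id")],
    "payments": [("customer_id", "customer", "id")],
    "tickets":  [("customer_id", "customer", "id")],
}

def personal_data_queries(ds_relation: str, ds_key: str, ds_value: int) -> list[str]:
    """Build one SELECT per relation linked to the Data Subject relation by a foreign key."""
    # In a real system these would be parameterised queries, not string-formatted SQL.
    queries = [f"SELECT * FROM {ds_relation} WHERE {ds_key} = {ds_value};"]
    for relation, links in SCHEMA_GRAPH.items():
        for fk_col, ref_relation, ref_key in links:
            if ref_relation == ds_relation and ref_key == ds_key:
                queries.append(f"SELECT * FROM {relation} WHERE {fk_col} = {ds_value};")
    return queries

if __name__ == "__main__":
    for q in personal_data_queries("customer", "id", 42):
        print(q)   # the union of the results would form the subject's PDR
```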
7. Convergence improvement of iterative decoders
Papagiannis, Evangelos (January 2006)
Iterative decoding techniques shook the waters of error correction and of the communications field in general. Their remarkable compromise between complexity and performance offered much more freedom in code design and made highly complex codes, until recently considered undecodable, part of almost any communication system. Nevertheless, iterative decoding is a sub-optimum decoding method and, as such, it has attracted huge research interest. But the iterative decoder still hides many of its secrets, as it has not yet been possible to fully describe its behaviour and its cost function. This work presents the convergence problem of iterative decoding from various angles and explores methods for reducing any sub-optimalities in its operation. The decoding algorithms for both LDPC and turbo codes were investigated and aspects that contribute to convergence problems were identified. A new algorithm was proposed, capable of providing considerable coding gain in any iterative scheme. Moreover, it was shown that for some codes the proposed algorithm is sufficient to eliminate any sub-optimality and perform maximum likelihood decoding. Its performance and efficiency were compared to those of other convergence improvement schemes. Various conditions that can be considered critical to the outcome of the iterative decoder were also investigated, and the decoding algorithm of LDPC codes was followed analytically to verify the experimental results.
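To make the convergence issue concrete, the sketch below implements a textbook hard-decision bit-flipping decoder over a small parity-check matrix: the loop can terminate without satisfying all checks, which is the kind of sub-optimality discussed above. It is a standard illustration under assumed toy data, not the convergence-improvement algorithm proposed in the thesis, and the example matrix is a small Hamming-style code rather than a real LDPC code.

```python
import numpy as np

def bit_flip_decode(H, r, max_iter=50):
    """Iteratively flip the bits involved in the most unsatisfied parity checks."""
    x = r.copy()
    for _ in range(max_iter):
        syndrome = H.dot(x) % 2                  # one entry per parity check
        if not syndrome.any():                   # all checks satisfied: converged
            return x, True
        unsat = H.T.dot(syndrome)                # per-bit count of failing checks
        x[unsat == unsat.max()] ^= 1             # flip the most "suspicious" bits
    return x, False                              # failed to converge within max_iter

# Toy example: (7,4) Hamming-style parity-check matrix with a single bit error.
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
received = np.zeros(7, dtype=int)                # all-zero codeword is always valid
received[2] ^= 1                                 # introduce one bit error
decoded, converged = bit_flip_decode(H, received)
print(decoded, converged)                        # recovers the all-zero codeword, True
```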
8. An ontology-based approach to web site design and development
Lei, Yuangui (January 2005)
No description available.
9. Accelerating data retrieval steps in XML documents
Shen, Yun (January 2005)
The aim of this research is to accelerate the data retrieval steps in a collection of XML (eXtensible Markup Language) documents, a key task of current XML research. The following three inter-connected issues relating to state-of-the-art XML research are thus studied: semantically clustering XML documents, efficiently querying XML documents with an index structure, and self-adaptively labelling dynamic XML documents. Together these form a basic but self-contained foundation of a native XML database system. This research is carried out by following a divide-and-conquer strategy.

The issue of dividing a collection of XML documents into sub-clusters, in which semantically similar XML documents are grouped together, is addressed first. To achieve this purpose, a semantic component model is proposed to model the implicit semantics of an XML document. This model enables us to devise a set of heuristic algorithms to compute the degree of similarity among XML documents. In particular, the newly proposed semantic component model and the heuristic algorithms address the inaccuracy of traditional edit-distance-based clustering mechanisms.

After similar XML documents are grouped into sub-collections, the problem of querying XML documents with an index structure is carefully studied. A novel geometric sequence model is proposed to transform XML documents into numbered geometric sequences and XPath queries into geometric query sequences. The problem of evaluating an XPath query in an XML document is theoretically proved to be equivalent to the problem of finding the subsequence matchings of a geometric query sequence in a numbered geometric document sequence. This geometric sequence model then enables us to devise two new stack-based algorithms to perform both top-down and bottom-up XPath evaluation in XML documents. In particular, the algorithms treat an XPath query as a whole unit, avoiding resource-consuming join operations and generating all the answers without semantic errors or false alarms.

Finally, the issue of supporting update functions in XML documents is tackled. A new Bayesian allocation model is introduced for the index structure generated in the geometric sequence model. Based on a k-ary tree data structure and the level traversal mechanism, the correctness and efficiency of the Bayesian allocation model in supporting dynamic XML documents are theoretically proved. In particular, the Bayesian allocation model is general and can be applied to most current index structures.
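The sketch below is a much-simplified illustration of evaluating a descendant-only path query over a numbered representation of an XML document. It uses plain pre/post-order interval numbering and nested-loop containment checks; the thesis's geometric sequences and stack-based algorithms are not reproduced, and the example document and tag names are invented for illustration.

```python
import xml.etree.ElementTree as ET

def number_elements(root):
    """Return (tag, start, end) triples using pre/post-order interval numbering."""
    numbered = []
    counter = 0

    def visit(elem):
        nonlocal counter
        start = counter
        counter += 1
        for child in elem:
            visit(child)
        end = counter
        counter += 1
        numbered.append((elem.tag, start, end))

    visit(root)
    return numbered

def descendant_path_matches(xml_text, path_tags):
    """Evaluate a //a//b//c style path; return the intervals of elements matching the last step."""
    numbered = number_elements(ET.fromstring(xml_text))
    frontier = [(s, e) for tag, s, e in numbered if tag == path_tags[0]]
    for step in path_tags[1:]:
        step_elems = [(s, e) for tag, s, e in numbered if tag == step]
        # keep elements whose interval is strictly contained in some previous-step interval,
        # i.e. elements that are descendants of a match for the previous step
        frontier = [(s, e) for s, e in step_elems
                    if any(ps < s and e < pe for ps, pe in frontier)]
    return frontier

doc = "<lib><book><title>XML</title></book><journal><title>DB</title></journal></lib>"
print(descendant_path_matches(doc, ["book", "title"]))     # matches only the book's title
print(descendant_path_matches(doc, ["journal", "title"]))  # matches only the journal's title
```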
10. Data quality and data cleaning in database applications
Li, Lin (January 2012)
Today, data plays an important role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, investigations show that many such applications fail to work successfully. There are many possible reasons for failure, such as poor system infrastructure design or poor query performance, but nothing is more certain to yield failure than a lack of concern for the issue of data quality. High-quality data is a key to today's business success. The quality of any large real-world data set depends on a number of factors, among which the source of the data is often the crucial one. It has now been recognized that an inordinate proportion of data in most data sources is dirty. Obviously, a database application with a high proportion of dirty data is not reliable for the purpose of data mining or deriving business intelligence, and the quality of decisions made on the basis of such business intelligence is also unreliable. In order to ensure high quality of data, enterprises need a process, methodologies and resources to monitor and analyze the quality of their data, and methodologies for preventing and/or detecting and repairing dirty data.

This thesis focuses on the improvement of data quality in database applications with the help of current data cleaning methods. It provides a systematic and comparative description of the research issues related to the improvement of the quality of data, and addresses a number of research issues related to data cleaning. In the first part of the thesis, the related literature on data cleaning and data quality is reviewed and discussed. Building on this review, a rule-based taxonomy of dirty data is proposed in the second part of the thesis. The proposed taxonomy not only summarizes the most common dirty data types but also forms the basis on which the proposed method for solving the Dirty Data Selection (DDS) problem during the data cleaning process was developed. This helps us to design the DDS process in the proposed data cleaning framework described in the third part of the thesis. This framework retains the most appealing characteristics of existing data cleaning approaches, and improves the efficiency and effectiveness of data cleaning as well as the degree of automation during the data cleaning process.

Finally, a set of approximate string matching algorithms is studied and experimental work undertaken. Approximate string matching, which has been well studied for many years, is an important part of many data cleaning approaches. The experimental work in the thesis confirmed the statement that there is no clear best technique: the characteristics of the data, such as the size of a dataset, the error rate in a dataset, the type of strings in a dataset and even the type of typo in a string, have a significant effect on the performance of the selected techniques. In addition, the characteristics of the data also affect the selection of suitable threshold values for the selected matching algorithms. The findings from these experimental results provide a fundamental improvement in the design of the 'algorithm selection mechanism' in the data cleaning framework, which enhances the performance of the data cleaning system in database applications.
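As a small illustration of the kind of approximate string matching technique studied in the final part, the sketch below computes the Levenshtein (edit) distance and applies a normalised similarity threshold to decide whether two dirty values probably refer to the same string. The 0.8 threshold and the example strings are illustrative assumptions, echoing the thesis's point that suitable thresholds depend on the characteristics of the data.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution (free if equal)
        prev = curr
    return prev[-1]

def is_probable_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Normalise the distance by the longer string and compare against a threshold."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest >= threshold

print(levenshtein("Plymuth", "Plymouth"))         # 1: one missing character
print(is_probable_match("Plymuth", "Plymouth"))   # True: similarity 0.875 >= 0.8
```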