1

Induction of Classifiers from Multi-labeled Examples: an Information-retrieval Point of View

Sarinnapakorn, Kanoksri. 21 December 2007
An important task of information retrieval is to induce classifiers capable of categorizing text documents. The fact that the same document can simultaneously belong to two or more categories is referred to as multi-label classification (or categorization). Domains of this kind have been encountered in diverse fields, even outside information retrieval. This dissertation discusses one challenging aspect of text categorization: the documents (i.e., training examples) are characterized by an extremely large number of features. As a result, many existing machine learning techniques are prohibitively expensive in such domains. This dissertation seeks to reduce these costs significantly. The proposed scheme consists of two steps. The first runs a so-called baseline induction algorithm (BIA) separately on different versions of the data, each time inducing a different subclassifier; more specifically, BIA is always run on the same training documents, which are each time described by a different subset of the features. The second step then combines the subclassifiers by a fusion algorithm: when a document is to be classified, each subclassifier outputs a set of class labels accompanied by its confidence in these labels; these outputs are then combined into a single multi-label recommendation. The dissertation investigates a few alternative fusion techniques, including an original one inspired by the Dempster-Shafer Theory. The main contribution is a mechanism for assigning the mass function to individual labels from subclassifiers. The system's behavior is illustrated on two real-world data sets. In each of them, the examples are described by thousands of features, and each example is labeled with a subset of classes. Experimental evidence indicates that the method scales up well and achieves impressive computational savings in exchange for only a modest loss in classification performance. The proposed fusion method is also shown to be more accurate than other, more traditional fusion mechanisms. For a very large multi-label data set, the proposed mechanism not only speeds up the total induction time but also makes it possible to run the task on a small computer. The fact that subclassifiers can be constructed independently, and more conveniently, from small subsets of features provides an avenue for parallel processing that might offer a further increase in computational efficiency.
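
As a rough illustration of the two-step scheme outlined in this abstract, the sketch below trains one subclassifier per feature subset and fuses their per-label confidences. It is not the dissertation's implementation: a one-vs-rest logistic regression stands in for the baseline induction algorithm, plain averaging replaces the Dempster-Shafer-inspired mass-function fusion, and all names and parameters are illustrative assumptions.

# Minimal sketch of the two-step scheme (illustrative only; the thesis's
# actual BIA and Dempster-Shafer fusion are replaced by simple stand-ins).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def train_subclassifiers(X, Y, n_subsets, seed=0):
    """Step 1: run the baseline inducer (here one-vs-rest logistic regression)
    on the same documents, each time described by a different feature subset.
    X: (n_docs, n_features) array; Y: (n_docs, n_labels) binary indicator matrix."""
    rng = np.random.default_rng(seed)
    feature_subsets = np.array_split(rng.permutation(X.shape[1]), n_subsets)
    subclassifiers = []
    for cols in feature_subsets:
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
        clf.fit(X[:, cols], Y)
        subclassifiers.append((cols, clf))
    return subclassifiers

def fuse_predictions(subclassifiers, X_new, threshold=0.5):
    """Step 2: each subclassifier outputs per-label confidences; a simple
    average (standing in for the mass-function mechanism) combines them
    into one multi-label recommendation."""
    confidences = np.mean(
        [clf.predict_proba(X_new[:, cols]) for cols, clf in subclassifiers],
        axis=0,
    )
    return (confidences >= threshold).astype(int)

Because each subclassifier sees only its own feature subset, the loop in step 1 could be run on separate machines, which is the parallelization avenue the abstract mentions.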
2

Classification of HTML Documents

Xie, Wei. University of Ballarat. January 2006
Text Classification is the task of mapping a document into one or more classes based on the presence or absence of words (or features) in the document. It has been studied intensively, and different classification techniques and algorithms have been developed. This thesis focuses on the classification of online documents, which has become more critical with the development of the World Wide Web. The WWW vastly increases the availability of on-line documents in digital format and has highlighted the need to classify them. Against this background, we have noted the emergence of “automatic Web Classification”. Such approaches mainly concentrate on classifying HTML-like documents into classes or categories, not only by using methods inherited from the traditional Text Classification process, but also by utilizing the extra information provided only by Web pages. Our work is based on the fact that Web documents contain not only ordinary features (words) but also extra information, such as meta-data and hyperlinks, that can be used to aid the classification process. The aim of this research is to study various ways of using this extra information, in particular the hyperlink information provided by HTML documents (Web pages). The merit of the approach developed in this thesis is its simplicity compared with existing approaches. We present different approaches to using hyperlink information to improve the effectiveness of web classification. Unlike other work in this area, we use only the mappings between linked documents and their own class or classes. In this case, we only need to add a few features, called linked-class features, to the datasets and then apply classifiers to them for classification. In the numerical experiments we adopted two well-known Text Classification algorithms, Support Vector Machines and BoosTexter. The results obtained show that classification accuracy can be improved by using mixtures of ordinary and linked-class features. Moreover, out-links usually work better than in-links in classification. We also analyse and discuss the reasons behind this improvement. / Master of Computing
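
A minimal sketch, under assumed inputs, of the linked-class idea described above: each document's ordinary word features are extended with one indicator per class, set when an out-linked document is known to belong to that class. The helper name, the toy data, and the use of scikit-learn's LinearSVC (standing in for the thesis's SVM experiments) are all illustrative assumptions, not the thesis's code.

# Illustrative sketch: append linked-class indicator features to word features.
import numpy as np
from sklearn.svm import LinearSVC

def add_linked_class_features(X_words, out_links, link_classes, n_classes):
    """X_words:      (n_docs, n_words) ordinary word features.
    out_links:       out_links[i] lists indices of documents that document i links to.
    link_classes:    dict mapping a linked document's index to its known class.
    Returns the matrix [X_words | linked-class indicators]."""
    linked = np.zeros((X_words.shape[0], n_classes))
    for i, targets in enumerate(out_links):
        for t in targets:
            if t in link_classes:
                linked[i, link_classes[t]] = 1.0
    return np.hstack([X_words, linked])

# Toy usage (all values hypothetical):
X_words = np.random.rand(4, 10)          # 4 documents, 10 word features
out_links = [[1], [2], [], [0, 1]]       # hyperlink structure between the documents
link_classes = {0: 0, 1: 1, 2: 0}        # known classes of the linked pages
y = np.array([0, 1, 0, 1])               # class labels for the toy documents
X = add_linked_class_features(X_words, out_links, link_classes, n_classes=2)
clf = LinearSVC().fit(X, y)

The same construction could be applied to in-links by swapping the link lists; the abstract reports that out-links usually helped more.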
