Most existing semi-supervised document clustering approaches are model-based and can be treated as parametric models, which assume that the underlying clusters follow a pre-defined distribution. In our semi-supervised document clustering, each cluster is instead represented by a non-parametric probability distribution. Two approaches are designed for incorporating pairwise constraints into document clustering. The first, the term-to-term relationship approach (TR), uses pairwise constraints to capture term-to-term dependence relationships. The second, the linear combination approach (LC), combines the clustering objective function linearly with the user-provided constraints. Extensive experimental results show that our proposed framework is effective. / This thesis presents a new framework for automatically partitioning text documents while taking into account constraints given by users. Semi-supervised document clustering is developed based on pairwise constraints. Unlike traditional semi-supervised document clustering approaches, which assume that the user prepares pairwise constraints beforehand, we develop a novel framework for automatically discovering pairwise constraints that reveal the user's grouping preference. An active learning approach for choosing informative document pairs is designed by measuring the amount of information that can be obtained by revealing judgments on document pairs. For this purpose, three models, namely the uncertainty model, the generation error model, and the term-to-term relationship model, are designed to measure the informativeness of document pairs from different perspectives. A dependent active learning approach is developed by extending the active learning approach to avoid selecting redundant document pairs. Two models are investigated for estimating the likelihood that a document pair is redundant with respect to previously selected document pairs, namely the KL divergence model and the symmetric model.
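The abstract names a KL divergence model and a symmetric model but does not define them. As a rough illustration only, the sketch below computes the textbook Kullback-Leibler divergence between the smoothed term distributions of two documents, together with one common symmetrized form. The function names, the additive smoothing parameter `alpha`, and the toy documents are assumptions made for this example; they are not taken from the thesis and may differ from its actual redundancy models.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, alpha=1.0):
    """Additively smoothed term distribution over a fixed vocabulary.

    `alpha` is an assumed smoothing constant; it keeps every probability
    positive so the logarithms below are always defined.
    """
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}


def kl_divergence(p, q):
    """Standard KL divergence D(p || q); asymmetric in its arguments."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def symmetric_kl(p, q):
    """One common symmetrization: the mean of D(p || q) and D(q || p)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical toy documents, already tokenized.
doc_a = "cluster document term cluster".split()
doc_b = "constraint pair document".split()
vocabulary = sorted(set(doc_a) | set(doc_b))

p = term_distribution(doc_a, vocabulary)
q = term_distribution(doc_b, vocabulary)

print("D(p || q)    =", kl_divergence(p, q))
print("D(q || p)    =", kl_divergence(q, p))
print("symmetric KL =", symmetric_kl(p, q))
```

A small divergence means the two documents use terms similarly. How such a score would be turned into a likelihood that a candidate pair is redundant is specific to the thesis and is not reproduced here.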
/ Huang, Ruizhang. / Adviser: Wai Lam. / Source: Dissertation Abstracts International, Volume: 70-06, Section: B, page: 3600. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2008. / Includes bibliographical references (leaves 117-123). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307.
Identifier | oai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_344288 |
Date | January 2008 |
Contributors | Huang, Ruizhang., Chinese University of Hong Kong Graduate School. Division of Systems Engineering and Engineering Management. |
Source Sets | The Chinese University of Hong Kong |
Language | English, Chinese |
Detected Language | English |
Type | Text, theses |
Format | electronic resource, microform, microfiche, 1 online resource (xi, 123 leaves : ill.) |
Rights | Use of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/) |