In many text mining applications, knowledge bases incorporating expert knowledge are beneficial for intelligent decision making. Refining an existing knowledge base from a source domain to a different target domain solving the same task would greatly reduce the effort required for preparing labeled training data when constructing a new knowledge base. We investigate a new framework for refining a kind of logic knowledge base known as Markov Logic Networks (MLN). One characteristic of this adaptation problem is that, since the data distributions of the two domains are different, there should be a different tailor-made MLN for each domain. On the other hand, the two knowledge bases should share a certain amount of similarity due to the common goal. We investigate the refinement in two situations, namely, using unlabeled target domain data, and using a limited amount of labeled target domain data. / When manual annotation of a limited amount of target domain data is possible, we explore how to actively select the data for annotation and develop two active learning approaches. The first approach is a pool-based active learning approach that takes into account the differences between the source and the target domains. A theoretical analysis of the sampling bound of the approach is conducted to demonstrate that informative data can be actively selected. The second approach is an error-driven approach designed to provide estimated labels for the target domain so that the quality of the logic formulae captured for the target domain can be improved. An error analysis of the cluster-based active learning approach is presented. We have conducted extensive experiments on two different text mining tasks, namely, pronoun resolution and segmentation of citation records, showing consistent improvements in both situations: using unlabeled target domain data and using a limited amount of labeled target domain data.
/ When there is no manual label given for the target domain data, we refine an existing MLN via two components. The first component is logic formula weight adaptation, which jointly maximizes the likelihood of the observations of the target domain unlabeled data while considering the differences between the two domains. Two approaches are designed to capture the differences between the two domains. One approach is to analyze the distribution divergence between the two domains, and the other is to incorporate a penalized degree of difference. The second component is logic formula refinement, where logic formulae specific to the target domain are discovered to further capture the characteristics of the target domain. / Chan, Ki Cecia. / Adviser: Wai Lam. / Source: Dissertation Abstracts International, Volume: 73-02, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 120-128). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
Identifer | oai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_344655 |
Date | January 2010 |
Contributors | Chan, Ki Cecia., Chinese University of Hong Kong Graduate School. Division of Systems Engineering and Engineering Management. |
Source Sets | The Chinese University of Hong Kong |
Language | English, Chinese |
Detected Language | English |
Type | Text, theses |
Format | electronic resource, microform, microfiche, 1 online resource (xiii, 130 leaves : ill.) |
Rights | Use of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/) |