Return to search

Parameter free document stream classification. / CUHK electronic theses & dissertations collection

Extensive experiments are conducted to evaluate the effectiveness PFreeBT and PNLH by using a stream of two-year news stories and three benchmarks. The results showed that the patterns of the bursty features and the bursty topics which are identified by PFreeBT match our expectations, whereas PNLH demonstrates significant improvements over all of the existing heuristics. These favorable results indicated that both PFreeBT and PNLH are highly effective and feasible. / For the problem of bursty topics identification, PFreeBT adopts an approach, in which we term it as feature-pivot clustering approach. Given a document stream, PFreeBT first identifies a set of bursty features from there. The identification process is based on computing the probability distributions. According to the patterns of the bursty features and two newly defined concepts (equivalent and map-to), a set of bursty topics can be extracted. / For the problem of constructing a reliable classifier, we formulate it as a partially supervised classification problem. In this classification problem, only a few training examples are labeled as positive (P). All other training examples (U) are remained unlabeled. Here, U is mixed with the negative examples (N) and some other positive examples (P'). Existing techniques that tackle this problem all focus on finding N from U. None of them attempts to extract P' from U. In fact, it is difficult to succeed as the topics in U are diverse and the features in there are sparse. In this dissertation, PNLH is proposed for extracting a high quality of P' and N from U. / In this dissertation, two heuristics, PFreeBT and PNLH, are proposed to tackle the aforementioned problems. PFreeBT aims at identifying the bursty topics in a document stream, whereas PNLH aims at constructing a reliable classifier for a given bursty topic. It is worth noting that both heuristics are parameter free. Users do not need to provide any parameter explicitly. All of the required variables can be computed base on the given document stream automatically. / In this information overwhelming century, information becomes ever more pervasive. A new class of data-intensive application arises where data is modeled best as an open-ended stream. We call such kind of data as data stream. Document stream is a variation of data stream, which consists of a sequence of chronological ordered documents. A fundamental problem of mining document streams is to extract meaningful structure from there, so as to help us to organize the contents systematically. In this dissertation, we focus on such a problem. Specifically, this dissertation studies two problems: to identify the bursty topics in a document stream and to construct a classifiers for the bursty topics. A bursty topic is one of the topics resides in the document stream, such that a large number of documents would be related to it during a bounded time interval. / Fung Pui Cheong Gabriel. / "August 2006." / Adviser: Jeffrey Xu Yu. / Source: Dissertation Abstracts International, Volume: 68-03, Section: B, page: 1720. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2006. / Includes bibliographical references (p. 122-130). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307.

Identiferoai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_343915
Date January 2006
ContributorsFung, Pui Cheong Gabriel., Chinese University of Hong Kong Graduate School. Division of Systems Engineering and Engineering Management.
Source SetsThe Chinese University of Hong Kong
LanguageEnglish, Chinese
Detected LanguageEnglish
TypeText, theses
Formatelectronic resource, microform, microfiche, 1 online resource (xvi, 130 p. : ill.)
RightsUse of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Page generated in 0.0788 seconds