Return to search

Incremental document clustering for web page classification.

by Wong, Wai-Chiu. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. / Includes bibliographical references (leaves 89-94). / Abstracts in English and Chinese. / Abstract --- p.ii / Acknowledgments --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Document Clustering --- p.2 / Chapter 1.2 --- DC-tree --- p.4 / Chapter 1.3 --- Feature Extraction --- p.5 / Chapter 1.4 --- Outline of the Thesis --- p.5 / Chapter 2 --- Related Work --- p.8 / Chapter 2.1 --- Clustering Algorithms --- p.8 / Chapter 2.1.1 --- Partitional Clustering Algorithms --- p.8 / Chapter 2.1.2 --- Hierarchical Clustering Algorithms --- p.10 / Chapter 2.2 --- Document Classification by Examples --- p.11 / Chapter 2.2.1 --- k-NN algorithm - Expert Network (ExpNet) --- p.11 / Chapter 2.2.2 --- Learning Linear Text Classifier --- p.12 / Chapter 2.2.3 --- Generalized Instance Set (GIS) algorithm --- p.12 / Chapter 2.3 --- Document Clustering --- p.13 / Chapter 2.3.1 --- B+-tree-based Document Clustering --- p.13 / Chapter 2.3.2 --- Suffix Tree Clustering --- p.14 / Chapter 2.3.3 --- Association Rule Hypergraph Partitioning Algorithm --- p.15 / Chapter 2.3.4 --- Principal Component Divisive Partitioning --- p.17 / Chapter 2.4 --- Projections for Efficient Document Clustering --- p.18 / Chapter 3 --- Background --- p.21 / Chapter 3.1 --- Document Preprocessing --- p.21 / Chapter 3.1.1 --- Elimination of Stopwords --- p.22 / Chapter 3.1.2 --- Stemming Technique --- p.22 / Chapter 3.2 --- Problem Modeling --- p.23 / Chapter 3.2.1 --- Basic Concepts --- p.23 / Chapter 3.2.2 --- Vector Model --- p.24 / Chapter 3.3 --- Feature Selection Scheme --- p.25 / Chapter 3.4 --- Similarity Model --- p.27 / Chapter 3.5 --- Evaluation Techniques --- p.29 / Chapter 4 --- Feature Extraction and Weighting --- p.31 / Chapter 4.1 --- Statistical Analysis of the Words in the Web Domain --- p.31 / Chapter 4.2 --- Zipf's Law --- p.33 / Chapter 4.3 --- Traditional Methods --- p.36 / Chapter 4.4 --- The Proposed Method --- p.38 / Chapter 4.5 --- Experimental Results --- p.40 / Chapter 4.5.1 --- Synthetic Data Generation --- p.40 / Chapter 4.5.2 --- Real Data Source --- p.41 / Chapter 4.5.3 --- Coverage --- p.41 / Chapter 4.5.4 --- Clustering Quality --- p.43 / Chapter 4.5.5 --- Binary Weight vs Numerical Weight --- p.45 / Chapter 5 --- Web Document Clustering Using DC-tree --- p.48 / Chapter 5.1 --- Document Representation --- p.48 / Chapter 5.2 --- Document Cluster (DC) --- p.49 / Chapter 5.3 --- DC-tree --- p.52 / Chapter 5.3.1 --- Tree Definition --- p.52 / Chapter 5.3.2 --- Insertion --- p.54 / Chapter 5.3.3 --- Node Splitting --- p.55 / Chapter 5.3.4 --- Deletion and Node Merging --- p.56 / Chapter 5.4 --- The Overall Strategy --- p.57 / Chapter 5.4.1 --- Preprocessing --- p.57 / Chapter 5.4.2 --- Building DC-tree --- p.59 / Chapter 5.4.3 --- Identifying the Interesting Clusters --- p.60 / Chapter 5.5 --- Experimental Results --- p.61 / Chapter 5.5.1 --- Alternative Similarity Measurement : Synthetic Data --- p.61 / Chapter 5.5.2 --- DC-tree Characteristics : Synthetic Data --- p.63 / Chapter 5.5.3 --- Compare DC-tree and B+-tree: Synthetic Data --- p.64 / Chapter 5.5.4 --- Compare DC-tree and B+-tree: Real Data --- p.66 / Chapter 5.5.5 --- Varying the Number of Features : Synthetic Data --- p.67 / Chapter 5.5.6 --- Non-Correlated Topic Web Page Collection: Real Data --- p.69 / Chapter 5.5.7 --- Correlated Topic Web Page Collection: Real Data --- p.71 / Chapter 5.5.8 --- Incremental updates on Real Data Set --- p.72 / Chapter 5.5.9 --- Comparison with the other clustering algorithms --- p.73 / Chapter 6 --- Conclusion --- p.75 / Appendix --- p.77 / Chapter A --- Stopword List --- p.77 / Chapter B --- Porter's Stemming Algorithm --- p.81 / Chapter C --- Insertion Algorithm --- p.83 / Chapter D --- Node Splitting Algorithm --- p.85 / Chapter E --- Features Extracted in Experiment 4.53 --- p.87 / Bibliography --- p.88

Identiferoai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_323115
Date January 2000
ContributorsWong, Wai-Chiu., Chinese University of Hong Kong Graduate School. Division of Computer Science and Engineering.
Source SetsThe Chinese University of Hong Kong
LanguageEnglish, Chinese
Detected LanguageEnglish
TypeText, bibliography
Formatprint, xii, 94 leaves : ill. ; 30 cm.
RightsUse of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Page generated in 0.0027 seconds