51 |
"Aplicação de técnicas de data mining em logs de servidores web"Chiara, Ramon 09 May 2003 (has links)
Com o advento da Internet, as empresas puderam mostrar-se para o mundo. A possibilidade de colocar um negócio na World Wide Web (WWW) criou um novo tipo de dado que as empresas podem utilizar para melhorar ainda mais seu conhecimento sobre o mercado: a sequência de cliques que um usuário efetua em um site. Esse dado pode ser armazenado em uma espécie de Data Warehouse para ser analisado com técnicas de descoberta de conhecimento em bases de dados. Assim, há a necessidade de se realizar pesquisas para mostrar como retirar conhecimento a partir dessas sequências de cliques. Neste trabalho são discutidas e analisadas algumas das técnicas utilizadas para atingir esse objetivo. é proposta uma ferramenta onde os dados dessas sequências de cliques são mapeadas para o formato atributo-valor utilizado pelo Sistema Discover, um sistema sendo desenvolvindo em nosso Laboratório para o planejamento e execução de experimentos relacionados aos algoritmos de aprendizado utilizados durante a fase de Mineração de Dados do processo de descoberta de conhecimento em bases de dados. Ainda, é proposta a utilização do sistema de Programação Lógica Indutiva chamado Progol para extrair conhecimento relacional das sessões de sequências de cliques que caracterizam a interação de usuários com as páginas visitadas no site. Experimentos iniciais com a utilização de uma sequência de cliques real foram realizados usando Progol e algumas das facilidades já implementadas pelo Sistema Discover.
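The record includes no code, but the preprocessing it describes can be sketched. The following Python fragment is a minimal illustration, not the Discover System's actual attribute-value schema (which the abstract does not specify): it groups Common Log Format requests into per-visitor click sequences and flattens each session into a fixed-length attribute-value row; the field layout and the "?" padding are assumptions.

```python
import re
from collections import defaultdict

# Minimal sketch: group Common Log Format lines into per-host click
# sequences, then flatten each session into a fixed-length attribute-value
# row. Field names and the padding scheme are illustrative assumptions,
# not the Discover System's real format.
LOG_RE = re.compile(r'(?P<host>\S+) \S+ \S+ \[.*?\] "GET (?P<url>\S+)')

def sessions_from_log(lines):
    clicks = defaultdict(list)
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            clicks[m.group("host")].append(m.group("url"))
    return clicks

def to_attribute_value(clicks, max_len=5):
    # One row per session: host, click_1 .. click_max_len, padded with '?'
    rows = []
    for host, urls in clicks.items():
        row = urls[:max_len] + ["?"] * (max_len - len(urls[:max_len]))
        rows.append([host] + row)
    return rows

log = [
    '1.2.3.4 - - [09/May/2003:10:00:00] "GET /index.html HTTP/1.0" 200 512',
    '1.2.3.4 - - [09/May/2003:10:00:05] "GET /products.html HTTP/1.0" 200 912',
]
print(to_attribute_value(sessions_from_log(log)))
```

A propositional learner would take each row as one example, while the raw session sequences would feed a relational learner such as Progol, as the abstract proposes.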
|
52 |
Discovering temporal patterns for interval-based events. January 2000
Kam, Po-shan. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 89-97). Abstracts in English and Chinese.
Contents:
1. Introduction: Data Mining; Temporal Data Management; Temporal Reasoning and Temporal Semantics; Temporal Data Mining; Motivation; Approach (Focus and Objectives; Experimental Setup); Outline and Contributions.
2. Relevant Work: Data Mining (Association Rules; Classification; Clustering); Sequential Patterns (Frequent Patterns; Interesting Patterns; Granularity); Temporal Databases; Temporal Reasoning (Natural Language Expression; Temporal Logic Approach); Temporal Data Mining (Framework; Temporal Association Rules; Attribute-Oriented Induction; Time Series Analysis).
3. Discovering Temporal Patterns for Interval-Based Events: Temporal Database; Allen's Taxonomy of Temporal Relationships; Mining Temporal Patterns AppSeq and LinkSeq (A1 and A2 Temporal Patterns; the Second Temporal Pattern, LinkSeq); Overview of the Framework (Mining Temporal Pattern I, AppSeq; Mining Temporal Pattern II, LinkSeq); Summary.
4. Mining Temporal Pattern I, AppSeq: Problem Statement; Mining A1 Temporal Patterns (Candidate Generation; Large k-Items Generation); Mining A2 Temporal Patterns (Candidate Generation; Generating Large 2k-Items); Modified AppOne and AppTwo; Performance Study (Experimental Setup; Experimental Results; Medical Data); Summary.
5. Mining Temporal Pattern II, LinkSeq: Problem Statement; First Method for Mining LinkSeq, LinkApp; Second Method, LinkTwo; Alternative Method, LinkTree (Sequence Tree Design; Construction of the seq-tree; Mining LinkSeq Using the seq-tree); Performance Study; Discussions; Summary.
6. Conclusion and Future Work.
Bibliography.
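The mining framework in Chapter 3 rests on Allen's taxonomy of temporal relationships between intervals (Section 3.2). As a hedged sketch of that building block only (the thesis's AppSeq and LinkSeq mining algorithms are not reproduced), the Python function below classifies a pair of closed intervals into one of Allen's thirteen relations:

```python
# Minimal sketch of Allen's taxonomy (Section 3.2): classify two closed
# intervals (s1, e1) and (s2, e2), with s < e, into one of the thirteen
# relations. Naming follows Allen (1983).
def allen_relation(s1, e1, s2, e2):
    if e1 < s2:   return "before"
    if e2 < s1:   return "after"
    if e1 == s2:  return "meets"
    if e2 == s1:  return "met-by"
    if s1 == s2 and e1 == e2: return "equal"
    if s1 == s2:  return "starts" if e1 < e2 else "started-by"
    if e1 == e2:  return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:   return "during"
    if s1 < s2 and e2 < e1:   return "contains"
    # Remaining cases are proper overlaps in one of two directions.
    return "overlaps" if s1 < s2 else "overlapped-by"

print(allen_relation(1, 5, 3, 8))   # overlaps
print(allen_relation(2, 4, 4, 9))   # meets
```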
|
53 |
Analysing the temporal association among financial news using concept space model. January 2001
Law Yee-shan, Carol. Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 81-89). Abstracts in English and Chinese.
Contents:
1. Introduction: Research Contributions; Organization of the Thesis.
2. Literature Review: Temporal Data Association (Association Rule Mining; Sequential Patterns Mining); Information Retrieval Techniques (Vector Space Model; Probabilistic Model).
3. An Overview of the Proposed Approach: The Test Bed; General Concept Term Identification; Anchor Document Selection; Specific Concept Term Identification; Establishment of Associations.
4. General Concept Term Identification: Document Pre-processing; Stopwording and Stemming; Word-Phrase Formation; Automatic Indexing of Words and Sentences; Relevance Weighting (Term Frequency and Document Frequency Computation; Uncommon Data Removal; Combined Weight Computation; Cluster Analysis); Hopfield Network Classification.
5. Anchor Document Selection: What Is an Anchor Document?; Selection Criteria for an Anchor Document.
6. Discovery of News Association: Specific Concept Term Identification; Establishment of Associations (Anchor Document Representation; Similarity Measurement; Formation of a Link of News).
7. Experimental Results and Analysis: Objective of Experiments; Background of Subjects; Design of Experiments (Experimental Data; Methodology: Anchor Document Selection, Specific Concept Term Identification, News Association); Results and Analysis (Anchor Document Selection; Specific Concept Term Identification; News Association).
8. Conclusions and Future Work.
Appendix A; Appendix B; Bibliography.
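Chapters 2 and 6 rely on the vector space model and a document similarity measure to link an anchor news story to related stories. A minimal sketch of that standard machinery follows, with toy tokenized documents as assumed input; the thesis's concept-space construction and Hopfield-network classification are not reproduced.

```python
import math
from collections import Counter

# Minimal vector-space-model sketch: TF-IDF weighting and cosine
# similarity between an anchor story and candidate stories. Toy documents
# and the token lists are assumptions for illustration only.
def tfidf_vectors(docs):
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["rate", "cut", "bank"], ["bank", "merger"], ["rate", "cut", "fed"]]
vecs = tfidf_vectors(docs)
anchor = vecs[0]
# Similarity of the anchor story to each candidate story
print([round(cosine(anchor, v), 3) for v in vecs[1:]])
```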
|
54 |
Mining multi-level association rules using data cubes and mining N-most interesting itemsets. January 2000
Kwong, Wang-Wai Renfrew. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 102-105). Abstracts in English and Chinese.
Contents:
1. Introduction: Data Mining Tasks (Characterization; Discrimination; Classification; Clustering; Prediction; Description; Association Rule Mining); Motivation (for Mining Multi-level Association Rules Using Data Cubes; for Mining N-most Interesting Itemsets); Outline of the Thesis.
2. Survey on Previous Work: Data Warehousing (Data Cube); Data Mining (Association Rules; Multi-level Association Rules; Multi-Dimensional Association Rules Using Data Cubes; the Apriori Algorithm).
3. Mining Multi-level Association Rules Using Data Cubes: Use of the Multi-level Concept (the Concept; Criteria for Its Use; Use in Association Rules); Use of the Data Cube (Data Cube; Mining Multi-level Association Rules Using Data Cubes; Definition); Method (Algorithm; Example); Experiment (Simulation of the Data Cube by an Array; by a B+ Tree); Discussion.
4. Mining the N-most Interesting Itemsets: Criteria; Definition; Property; Method (Algorithm; Example); Experiment (Synthetic Data; Real Data); Discussion.
5. Conclusion.
Bibliography; Appendix A (Programs for Mining the N-most Interesting Itemsets: Programs; Data Structures; Global Variables; Functions; Result Format); Appendix B (Programs for Mining Multi-level Association Rules Using Data Cubes: Programs; Data Structure; Variables; Functions).
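The N-most interesting itemsets of Chapter 4 replace a fixed support threshold with a rank cut-off: at each itemset size, keep the N itemsets with the highest support. A brute-force sketch of that idea over toy transactions (the thesis's pruning properties are not reproduced, so this is illustrative only):

```python
from itertools import combinations

# Minimal sketch of "N-most interesting itemsets": for each size k up to
# max_k, count every k-itemset's support and keep the n highest-support
# ones instead of applying a fixed minimum-support threshold.
def n_most_interesting(transactions, n=2, max_k=2):
    result = {}
    for k in range(1, max_k + 1):
        counts = {}
        for t in transactions:
            for iset in combinations(sorted(t), k):
                counts[iset] = counts.get(iset, 0) + 1
        result[k] = sorted(counts.items(), key=lambda kv: -kv[1])[:n]
    return result

txns = [{"milk", "bread"}, {"milk", "bread", "beer"},
        {"bread", "beer"}, {"milk", "beer"}]
for k, top in n_most_interesting(txns).items():
    print(k, top)
```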
|
55 |
A study of two problems in data mining: projective clustering and multiple tables association rules mining. January 2002
Ng Ka Ka. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 114-120). Abstracts in English and Chinese.
Contents:
Part I. Projective Clustering
1. Introduction to Projective Clustering.
2. Related Work: CLARANS (Graph Abstraction; Bounded Optimized Random Search); OptiGrid, a Grid Partitioning Approach with Density Estimation (Empty Space Phenomenon; Density Estimation Function; Upper Bound Property); CLIQUE and ENCLUS, Subspace Clustering (Monotonicity Property of Subspaces); PROCLUS, Projective Clustering; ORCLUS, Generalized Projective Clustering (Singular Value Decomposition, SVD); An "Optimal" Projective Clustering.
3. EPC, Efficient Projective Clustering: Motivation; Notations and Definitions (Density Estimation Function; 1-d Histogram; 1-d Dense Region; Signature Q); the Overall Framework; Major Steps (Histogram Generation; Adaptive Discovery of Dense Regions; Counting Occurrences of Signatures; Finding the Most Frequent Signatures; Refining the Top 3m Signatures); Time and Space Complexity.
4. EPCH, an Extension and Generalization of EPC: Motivation for the Extension; Distinguishing Clusters by Their Projections in Different Subspaces; EPCH as a Generalization of EPC with Higher-Dimensional Histograms (Multidimensional Histogram Construction and Dense-Region Detection; Compressing Data Objects to Signatures; Merging Similar Signature Entries; Associating Membership Degree; Choice of the Histogram Dimensionality d); Implementation of EPC2; Time and Space Complexity of EPCH.
5. Experimental Results: Clustering Quality Measurement; Synthetic Data Generation; Experimental Setup; Comparison between EPC and PROCLUS; Comparison between EPCH and ORCLUS (Dimensionality of the Original Space and the Associated Subspace; Projections Not Parallel to the Original Axes; Data Objects Belonging to More Than One Cluster under Fuzzy Clustering); Scalability of EPC; Scalability of EPC2.
6. Conclusion.
Part II. Multiple Tables Association Rules Mining
7. Introduction to Multiple Tables Association Rule Mining: Problem Statement.
8. Related Work: Apriori, a Bottom-up Approach to Candidate Generation; VIPER, Vertical Mining with Various Optimization Techniques (Vertical TID Representation and Mining; FORC); Frequent Itemset Counting across Multiple Tables.
9. The Proposed Method: Notations; Converting Dimension Tables to an Internal Representation; Discovering Frequent Itemsets without Joining; Overall Steps; Binding Multiple Dimension Tables; a Prefix Tree for FT; Maintaining Frequent Itemsets in FI-trees; Frequency Counting.
10. Experiments: Synthetic Data Generation; Experimental Findings.
11. Conclusion and Future Works.
Bibliography.
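Judging from the chapter outline, EPC builds a 1-d histogram per dimension, detects dense regions, and compresses each object to a signature of the dense regions it falls in, with the most frequent signatures indicating projected clusters. The Python sketch below is a loose reconstruction under those assumptions; the bin count and density threshold are invented parameters, and EPC's adaptive dense-region discovery and refinement steps are omitted.

```python
import numpy as np

# Loose reconstruction of the EPC idea: per-dimension histograms, dense
# bins above a uniform-baseline threshold, and per-point signatures of
# dense bins. All parameters here are illustrative assumptions.
def epc_signatures(X, bins=10, density_factor=2.0, top=3):
    n, d = X.shape
    threshold = density_factor * n / bins          # uniform-baseline test
    edges, dense = [], []
    for j in range(d):
        h, e = np.histogram(X[:, j], bins=bins)
        edges.append(e)
        dense.append(set(np.where(h > threshold)[0]))
    sig_counts = {}
    for x in X:
        sig = []
        for j in range(d):
            b = min(np.searchsorted(edges[j], x[j], side="right") - 1, bins - 1)
            if b in dense[j]:
                sig.append((j, b))                 # (dimension, dense bin)
        sig = tuple(sig)
        sig_counts[sig] = sig_counts.get(sig, 0) + 1
    return sorted(sig_counts.items(), key=lambda kv: -kv[1])[:top]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 0, 0], [0.1, 1, 1], (100, 3)),   # dense in dim 0
               rng.normal([0, 5, 0], [1, 0.1, 1], (100, 3))])  # dense in dim 1
print(epc_signatures(X))
```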
|
56 |
Mining association rules with weighted items. January 1998
Cai, Chun Hing. Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 109-114). Abstract also in Chinese.
Contents:
1. Introduction: Main Categories in Data Mining; Motivation; Problem Definition; Experimental Setup; Outline of the Thesis.
2. Literature Survey on Data Mining: Statistical Approach (Statistical Modeling; Hypothesis Testing; Robustness and Outliers; Sampling; Correlation; Quality Control); Artificial Intelligence Approach (Bayesian Networks; Decision Trees; Rough Sets); Database-Oriented Approach (Characteristic and Classification Rules; Association Rules).
3. Background: the Iterative Procedure Apriori Gen (Binary Association Rules; Apriori Gen; Closure Properties); Introduction of Weights (Motivation); Summary.
4. Mining Weighted Binary Association Rules: Introduction; Weighted Binary Association Rules (Motivation behind Weights and Counts; K-support Bounds; Algorithm for Mining Weighted Association Rules); Mining Normalized Weighted Association Rules (an Alternative Approach for the Normalized Weighted Case; Algorithm); Performance Study (Synthetic Database; Real Database); Discussion; Summary.
5. Mining Fuzzy Weighted Association Rules: Introduction to Fuzzy Rules; Weighted Fuzzy Association Rules (Problem Definition; Introduction of Weights; K-bound; Algorithm for Mining Fuzzy Association Rules for Weighted Items); Performance Evaluation (Performance of the Algorithm; Comparison of the Unweighted and Weighted Cases); Notes on Implementation Details; Summary.
6. Mining Weighted Association Rules with Sampling: Introduction; Sampling Procedures (Sampling Technique; Algorithm); Performance Study; Discussion; Summary.
7. Database Maintenance with a Quality Control Method: Introduction (Motivation for Using the Quality Control Method); Quality Control Method (Motivation for Using Mil. Std. 105D; the Military Standard 105D Procedure [12]); Mapping Database Maintenance to Quality Control (Algorithm for Database Maintenance); Performance Evaluation; Discussion; Summary.
8. Conclusion and Future Work: Summary of the Thesis; Conclusions; Future Work.
Bibliography; Appendices (A: Generating a Random Number; B: Hypergeometric Distribution; C: Quality Control Tables; D: Rules Extracted from the Database).
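In the weighted setting of Chapter 4, each item carries a weight reflecting its importance, and an itemset's interestingness combines those weights with its support. The sketch below assumes weighted support is the sum of the itemset's item weights scaled by its support fraction; the thesis's k-support-bound pruning is not reproduced, so counting is brute force.

```python
from itertools import combinations

# Hedged sketch of weighted association mining: score each itemset by
# (sum of its item weights) * (support fraction) and keep those clearing
# wminsup. The scoring form and all data are illustrative assumptions.
def weighted_large_itemsets(txns, weights, wminsup, max_k=2):
    n = len(txns)
    large = []
    for k in range(1, max_k + 1):
        counts = {}
        for t in txns:
            for iset in combinations(sorted(t), k):
                counts[iset] = counts.get(iset, 0) + 1
        for iset, c in counts.items():
            wsup = sum(weights[i] for i in iset) * c / n
            if wsup >= wminsup:
                large.append((iset, round(wsup, 3)))
    return large

txns = [{"tv", "vcr"}, {"tv"}, {"vcr", "cable"}, {"tv", "vcr", "cable"}]
weights = {"tv": 0.9, "vcr": 0.6, "cable": 0.3}   # assumed item weights
print(weighted_large_itemsets(txns, weights, wminsup=0.4))
```

Note how weighting breaks the usual downward-closure property (a superset can outscore its subsets), which is why the thesis needs the k-support bounds in place of plain Apriori pruning.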
|
57 |
A new approach to clustering large databases in data mining. January 2004
Lau Hei Yuet. Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 74-76). Abstracts in English and Chinese.
Contents:
1. Introduction: Cluster Analysis; Dissimilarity Measures (Continuous Data; Categorical and Nominal Data; Mixed Data; Missing Data); Outline of the Thesis.
2. Clustering Algorithms: the k-means Algorithm Family (the Algorithms; Choosing the Number of Clusters with the MaxMin Algorithm; Starting Configuration with the MaxMin Algorithm); Clustering Using Unidimensional Scaling (Unidimensional Scaling; Procedures; Guttman's Updating Algorithm; Pliner's Smoothing Algorithm; Starting Configuration; Choosing the Number of Clusters); Cluster Validation (Continuous Data; Nominal Data; Resampling Method); Conclusion.
3. Experimental Results: Simulated Data 1; Simulated Data 2; Iris Data; Wine Data; Mushroom Data; Conclusion.
4. Large Databases: Sliding Windows Algorithm; Two-stage Algorithm; Three-stage Algorithm; Experimental Results; Conclusion.
Appendix A (Algorithms: MaxMin Algorithm; Sliding Windows Algorithm; Two-stage Algorithm, Stages One and Two); Bibliography.
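Chapter 2 uses a MaxMin algorithm both to choose the number of clusters and to pick a starting configuration for k-means. A minimal sketch of the usual MaxMin seeding rule follows, under the assumption that the thesis's variant resembles it; the thesis's stopping rule for choosing k is not reproduced, so k is fixed by the caller here.

```python
import numpy as np

# Minimal MaxMin seeding sketch: the first centre is an arbitrary point,
# and each subsequent centre is the point whose minimum distance to the
# centres chosen so far is largest.
def maxmin_centres(X, k, first=0):
    centres = [X[first]]
    d = np.linalg.norm(X - centres[0], axis=1)     # min-distance to centres
    for _ in range(1, k):
        i = int(np.argmax(d))                      # farthest point so far
        centres.append(X[i])
        d = np.minimum(d, np.linalg.norm(X - X[i], axis=1))
    return np.array(centres)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, (50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
print(maxmin_centres(X, 3))   # one seed lands near each of the 3 clumps
```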
|
58 |
Induction of classification rules and decision trees using genetic algorithms. January 2005
Ng Sai-Cheong. Thesis submitted in December 2004. Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. Includes bibliographical references (leaves 172-178). Abstracts in English and Chinese.
Contents:
1. Introduction: Data Mining; Problem Specifications and Motivations; Contributions of the Thesis; Thesis Roadmap.
2. Related Work: Supervised Classification Techniques (Classification Rules; Decision Trees); Evolutionary Algorithms (Genetic Algorithms; Genetic Programming; Evolution Strategies; Evolutionary Programming); Applications of Evolutionary Algorithms to the Induction of Classification Rules (SCION; GABIL; LOGENPRO); Applications of Evolutionary Algorithms to the Construction of Decision Trees (Binary Tree Genetic Algorithm; OC1-GA; OC1-ES; GATree; Induction of Linear Decision Trees Using Strongly Typed GP); Spatial Data Structures and Their Applications.
3. Induction of Classification Rules Using Genetic Algorithms: Introduction; Rule Learning Using Genetic Algorithms (Population Initialization; Fitness Evaluation of Chromosomes; Token Competition; Chromosome Elimination; Rule Migration; Crossover; Mutation; Calculating the Number of Correctly Classified Training Samples in a Rule Set); Performance Evaluation (Comparison of the GA-based CPRLS with Various Supervised Classification Algorithms; Comparison of the GA-based and RS-based CPRLS; Effects of Token Competition; Effects of Rule Migration); Chapter Summary.
4. Genetic Algorithm-based Quadratic Decision Trees: Introduction; Construction of Quadratic Decision Trees; Evolving the Optimal Quadratic Hypersurface Using Genetic Algorithms (Population Initialization; Fitness Evaluation; Selection; Crossover; Mutation); Performance Evaluation (Comparison of the GA-based QDT with Various Supervised Classification Algorithms; Comparison of the GA-based and RS-based QDT; Effects of Changing the Parameters of the GA-based QDT); Chapter Summary.
5. Induction of Linear and Quadratic Decision Trees Using Spatial Data Structures: Introduction; Construction of k-D Trees; Construction of Generalized Quadtrees; Induction of Oblique Decision Trees Using Spatial Data Structures; Induction of Quadratic Decision Trees Using Spatial Data Structures; Performance Evaluation (Comparison with Various Supervised Classification Algorithms; Effects of Changing the Minimum Number of Training Samples at Each Node of a k-D Tree; of a Generalized Quadtree; Effects of Changing the Size of the Datasets); Chapter Summary.
6. Conclusions: Contributions; Future Work.
Appendix A (Implementation of the Data Mining Algorithms Specified in the Thesis); Bibliography.
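One distinctive step in the rule-learning loop of Chapter 3 is token competition (Section 3.2.3), which keeps the rule set compact: each training sample is a token that only the fittest covering rule can seize, and rules that seize nothing are eliminated. A minimal sketch follows, with rules represented as plain predicates; that representation, and the toy data, are assumptions, since the thesis's chromosome encoding is not reproduced.

```python
# Minimal sketch of token competition: rank rules by fitness, let each
# rule seize the not-yet-taken samples it covers, and drop rules that
# seize no tokens (their coverage is redundant with fitter rules).
def token_competition(rules, samples):
    # rules: list of (fitness, predicate); samples: list of dicts
    taken = [False] * len(samples)
    survivors = []
    for fitness, pred in sorted(rules, key=lambda r: -r[0]):
        seized = 0
        for i, s in enumerate(samples):
            if not taken[i] and pred(s):
                taken[i] = True
                seized += 1
        if seized > 0:
            survivors.append((fitness, pred, seized))
    return survivors

samples = [{"x": v} for v in (1, 2, 5, 6, 9)]
rules = [(0.9, lambda s: s["x"] > 4),    # strong rule: seizes 5, 6, 9
         (0.5, lambda s: s["x"] > 8),    # redundant: its tokens are gone
         (0.4, lambda s: s["x"] < 3)]    # seizes 1, 2
print([(f, n) for f, _, n in token_competition(rules, samples)])
```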
|
59 |
Mining a shared concept space for domain adaptation in text mining. January 2011 (CUHK electronic theses and dissertations collection)
In many text mining applications involving a high-dimensional feature space, it is difficult to collect sufficient training data for different domains. One strategy for tackling this problem is to adapt a model trained on one domain with labeled data to another domain with only unlabeled data, a strategy known as domain adaptation. Existing domain adaptation approaches have two major limitations. First, they split the framework into two separate steps: the first step minimizes the domain gap, and the second trains the predictive model on the reweighted instances or transformed feature representation; such a transformed representation may encode less information, hurting predictive performance. Second, they are restricted to first-order statistics in a Reproducing Kernel Hilbert Space (RKHS) when measuring the distribution difference between the source domain and the target domain. This thesis develops solutions to both limitations.
We develop a novel model that learns a low-rank shared concept space with respect to two criteria simultaneously: the empirical loss in the source domain, and the embedded distribution gap between the source domain and the target domain. In addition, we can transfer predictive power from the extracted common features to the characteristic features in the target domain via the feature graph Laplacian, and we can kernelize the proposed method in the RKHS so as to generalize the model with powerful kernel functions. We theoretically analyze the expected error, evaluated by common convex loss functions in the target domain under the empirical risk minimization framework, showing that the error bound is controlled by the expected loss in the source domain and the embedded distribution gap.
We then propose an improved symmetric Stein's loss (SSL) function that combines the mean and covariance discrepancies into a unified Bregman matrix divergence, of which the Jensen-Shannon divergence between normal distributions is a particular case. Based on this second-order distribution gap measure, we present another new domain adaptation method called Location and Scatter Matching. The goal is to find a feature representation that reduces the embedded distribution gap, measured by SSL, between the source domain and the target domain while ensuring that the new representation encodes sufficient discriminative information with respect to the labels. A standard machine learning algorithm, such as the Support Vector Machine (SVM), can then be trained in the new feature subspace across domains.
We conduct a series of experiments on real-world datasets to compare our proposed approaches with other competitive methods. The results show significant improvement over existing domain adaptation approaches.
Chen, Bo. Adviser: Wai Lam. Source: Dissertation Abstracts International, Volume 73-04, Section B. Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. Includes bibliographical references (leaves 87-95). Abstract also in Chinese.
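The exact SSL formula is not given in this record, so the sketch below is only an assumed second-order gap in the spirit described: it symmetrizes Stein's loss, tr(S1 S2^-1) - log det(S1 S2^-1) - d, over the two covariance matrices and adds a Mahalanobis-style term for the mean discrepancy. Every detail of this form is an assumption for illustration, not the thesis's divergence.

```python
import numpy as np

# Hedged sketch of a symmetric Stein-type gap between two domains, each
# summarised as a Gaussian N(m, S). The combination below (symmetrized
# Stein's loss on covariances plus a Mahalanobis mean term) is an assumed
# form, not the SSL of the thesis.
def stein_loss(S1, S2):
    d = S1.shape[0]
    M = S1 @ np.linalg.inv(S2)
    return np.trace(M) - np.linalg.slogdet(M)[1] - d   # 0 when S1 == S2

def symmetric_stein_gap(m1, S1, m2, S2):
    cov_term = 0.5 * (stein_loss(S1, S2) + stein_loss(S2, S1))
    diff = m1 - m2
    P = 0.5 * np.linalg.inv(S1) + 0.5 * np.linalg.inv(S2)
    mean_term = 0.5 * diff @ P @ diff
    return cov_term + mean_term

rng = np.random.default_rng(2)
Xs = rng.normal(0.0, 1.0, (500, 3))          # "source" features
Xt = rng.normal(0.5, 1.5, (500, 3))          # shifted "target" features
gap = symmetric_stein_gap(Xs.mean(0), np.cov(Xs.T), Xt.mean(0), np.cov(Xt.T))
print(round(float(gap), 3))                  # positive; 0 for equal domains
```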
|
60 |
Efficient and effective outlier detection. January 2003
Chiu Lai Mei. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 142-149). Abstracts in English and Chinese.
Contents:
1. Introduction: Outlier Analysis; Problem Statement (the Binary Property of Outliers; Overlapping Clusters with Different Densities; Large Datasets; High-Dimensional Datasets); Contributions.
2. Related Work in Outlier Detection: Outlier Detection (Clustering-Based Methods; Distance-Based Methods; Density-Based Methods; Deviation-Based Methods); a Breakthrough Outlier Notion, the Degree of Outlier-ness (LOF, the Local Outlier Factor: Definitions; Properties; Algorithm; Time Complexity; LOF of High-Dimensional Data).
3. LOF', a Formula with Intuitive Meaning: Definition of LOF'; Properties; Time Complexity.
4. LOF'' for Detecting Small Groups of Outliers: Definition of LOF''; Properties; Time Complexity.
5. GridLOF for Pruning Reasonable Portions from Datasets: the GridLOF Algorithm; Determining Values of the Input Parameters (Number of Intervals w; Threshold Value σ); Advantages; Time Complexity.
6. SOF, Efficient Outlier Detection for High-Dimensional Data: Motivation; Notations and Definitions; SOF, the Subspace Outlier Factor (Formal Definition of SOF; Properties of SOF); the SOF-Algorithm, Overall Framework; Identifying Associated Subspaces of Clusters in the SOF-Algorithm (Technical Details in Phase I); Technical Details in Phases II and III (Identifying Outliers; Subspace Quantization; the X-Tree Index Structure; Computing GSOF and SOF; Assigning SO Values; Multi-threaded Programming); Time Complexity; Strengths of the SOF-Algorithm.
7. Experiments on LOF', LOF'' and GridLOF: Datasets Used; LOF'; LOF''; GridLOF.
8. Empirical Results of SOF: Synthetic Data Generation; Experimental Setup; Performance Measures (Quality Measurement; Scalability of the SOF-Algorithm; Effect of Parameters on the SOF-Algorithm).
9. Conclusion.
Bibliography; Publication.
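The thesis's LOF' and LOF'' variants build on the standard Local Outlier Factor reviewed in Chapter 2.2: a point's LOF is the average ratio of its k nearest neighbours' local reachability density to its own, so values well above 1 flag outliers. A brute-force sketch of that baseline follows; the LOF'/LOF'' variants and the SOF subspace method are not reproduced.

```python
import numpy as np

# Brute-force sketch of standard LOF: O(n^2) distances, k-distance,
# reachability distance reach(p, o) = max(kdist(o), d(p, o)), local
# reachability density lrd, and the LOF score itself.
def lof(X, k=3):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                        # exclude self
    knn = np.argsort(D, axis=1)[:, :k]                 # k nearest neighbours
    kdist = D[np.arange(n), knn[:, -1]]                # k-distance per point
    lrd = np.empty(n)
    for p in range(n):
        reach = np.maximum(kdist[knn[p]], D[p, knn[p]])
        lrd[p] = k / reach.sum()
    return np.array([lrd[knn[p]].mean() / lrd[p] for p in range(n)])

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), [[4.0, 4.0]]])  # one far outlier
scores = lof(X)
print(round(float(scores[-1]), 2), round(float(scores[:30].mean()), 2))
```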
|