Global ETD Search

291	Entity discovery by exploiting contextual structures. / CUHK electronic theses & dissertations collection January 2011 (has links) In text mining, being able to recognize and extract named entities, e.g. Locations, Persons, Organizations, is very useful in many applications. This is usually referred to as named entity recognition (NER). This thesis presents a cascaded framework for extracting named entities from text documents. We automatically derive features on a set of documents from different feature templates. To avoid the high computational cost incurred by a single-phase approach, we divide the named entity extraction task into a segmentation task and a classification task, reducing the computational cost by an order of magnitude. / To handle cascaded errors that often occur in a sequence of tasks, we investigate and develop three models: the maximum-entropy margin-based (MEMB) model, the isomeric conditional random field (ICRF) model, and the online cascaded reranking (OCR) model. The MEMB model makes use of the concept of margin in maximizing log-likelihood. Parameters are trained so that they maximize the "margin" between the decision boundary and the nearest training data points. The ICRF model makes use of the concept of joint training. Instead of training each model independently, we design the segmentation and classification models so that they can be efficiently trained together under a soft constraint. The OCR model is developed by using an online training method to maximize a margin without considering any probability measures, which greatly reduces the training time. It reranks all of the possible outputs from a previous stage based on a total output score. The output with the highest total score is the final output. 
/ We report experimental evaluations on the GENIA Corpus available from the BioNLP/NLPBA (2004) shared task and the Reuters Corpus available from the CoNLL-2003 shared tasks, which demonstrate the state-of-the-art performance achieved by the proposed models. / Chan, Shing Kit. / Advisers: Wai Lam; Kai Pui Lam. / Source: Dissertation Abstracts International, Volume: 73-06, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 126-133). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. Data mining--Mathematical models Entity-relationship modeling
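The reranking step mentioned in the abstract of entry 291 (candidate outputs from an earlier stage are rescored by a total score, and the highest-scoring one is kept) can be pictured with the rough sketch below. It is only an illustration under stated assumptions: the `Candidate` type, the idea that the total is a plain sum of a segmentation score and a classification score, and all numbers are hypothetical and are not taken from the thesis.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """One candidate analysis of a sentence (hypothetical structure)."""
    entities: Tuple[Tuple[str, str], ...]  # (text span, entity label) pairs
    segmentation_score: float              # assumed score from the segmentation stage
    classification_score: float            # assumed score from the classification stage


def total_score(candidate: Candidate) -> float:
    # Assumption: the total output score is the sum of the two stage scores.
    return candidate.segmentation_score + candidate.classification_score


def rerank(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest total score."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=total_score)


# Made-up candidates for a single sentence; scores are illustrative only.
candidates = [
    Candidate((("Hong Kong University", "Organization"),), 1.2, 0.3),
    Candidate((("Hong Kong", "Location"),), 1.0, 0.9),
    Candidate((("Kong", "Person"),), 0.4, 1.1),
]

best = rerank(candidates)
print(best.entities, total_score(best))
```

In this toy input the first candidate has the best segmentation score, yet the second one wins once both scores are combined, which is the kind of correction a reranking stage is meant to allow.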
292	Privacy preserving data publishing. / CUHK electronic theses & dissertations collection January 2008 (has links) The advance of information technologies has enabled various organizations (e.g., census agencies, hospitals) to collect large volumes of sensitive personal data (e.g., census data, medical records). Due to the great research value of such data, it is often released for public benefit purposes, which, however, poses a risk to individual privacy. A typical solution to this problem is to anonymize the data before releasing it to the public. In particular, the anonymization should be conducted in a careful manner, such that the published data not only prevents an adversary from inferring sensitive information, but also remains useful for data analysis. / This thesis presents an extensive study on the anonymization techniques for privacy preserving data publishing. We explore various aspects of the problem (e.g., definitions of privacy, modeling of the adversary, methodologies of anonymization), and devise novel solutions that address several important issues overlooked by previous work. Experiments with real-world data confirm the effectiveness and efficiency of our techniques. / Xiao, Xiaokui. / Adviser: Yufei Tao. / Source: Dissertation Abstracts International, Volume: 70-06, Section: B, page: 3618. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2008. / Includes bibliographical references (leaves 307-314). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307. Data mining Data protection Database security
293	Entropy-based subspace clustering for mining numerical data. January 1999 (has links) by Cheng, Chun-hung. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1999. / Includes bibliographical references (leaves 72-76). / Abstracts in English and Chinese. / Abstract --- p.ii / Acknowledgments --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Six Tasks of Data Mining --- p.1 / Chapter 1.1.1 --- Classification --- p.2 / Chapter 1.1.2 --- Estimation --- p.2 / Chapter 1.1.3 --- Prediction --- p.2 / Chapter 1.1.4 --- Market Basket Analysis --- p.3 / Chapter 1.1.5 --- Clustering --- p.3 / Chapter 1.1.6 --- Description --- p.3 / Chapter 1.2 --- Problem Description --- p.4 / Chapter 1.3 --- Motivation --- p.5 / Chapter 1.4 --- Terminology --- p.7 / Chapter 1.5 --- Outline of the Thesis --- p.7 / Chapter 2 --- Survey on Previous Work --- p.8 / Chapter 2.1 --- Data Mining --- p.8 / Chapter 2.1.1 --- Association Rules and its Variations --- p.9 / Chapter 2.1.2 --- Rules Containing Numerical Attributes --- p.15 / Chapter 2.2 --- Clustering --- p.17 / Chapter 2.2.1 --- The CLIQUE Algorithm --- p.20 / Chapter 3 --- Entropy and Subspace Clustering --- p.24 / Chapter 3.1 --- Criteria of Subspace Clustering --- p.24 / Chapter 3.1.1 --- Criterion of High Density --- p.25 / Chapter 3.1.2 --- Correlation of Dimensions --- p.25 / Chapter 3.2 --- Entropy in a Numerical Database --- p.27 / Chapter 3.2.1 --- Calculation of Entropy --- p.27 / Chapter 3.3 --- Entropy and the Clustering Criteria --- p.29 / Chapter 3.3.1 --- Entropy and the Coverage Criterion --- p.29 / Chapter 3.3.2 --- Entropy and the Density Criterion --- p.31 / Chapter 3.3.3 --- Entropy and Dimensional Correlation --- p.33 / Chapter 4 --- The ENCLUS Algorithms --- p.35 / Chapter 4.1 --- Framework of the Algorithms --- p.35 / Chapter 4.2 --- Closure Properties --- p.37 / Chapter 4.3 --- Complexity Analysis --- p.39 / Chapter 4.4 --- Mining Significant Subspaces --- p.40 / Chapter 4.5 --- Mining 
Interesting Subspaces --- p.42 / Chapter 4.6 --- Example --- p.44 / Chapter 5 --- Experiments --- p.49 / Chapter 5.1 --- Synthetic Data --- p.49 / Chapter 5.1.1 --- Data Generation - Hyper-rectangular Data --- p.49 / Chapter 5.1.2 --- Data Generation - Linearly Dependent Data --- p.50 / Chapter 5.1.3 --- Effect of Changing the Thresholds --- p.51 / Chapter 5.1.4 --- Effectiveness of the Pruning Strategies --- p.53 / Chapter 5.1.5 --- Scalability Test --- p.53 / Chapter 5.1.6 --- Accuracy --- p.55 / Chapter 5.2 --- Real-life Data --- p.55 / Chapter 5.2.1 --- Census Data --- p.55 / Chapter 5.2.2 --- Stock Data --- p.56 / Chapter 5.3 --- Comparison with CLIQUE --- p.58 / Chapter 5.3.1 --- Subspaces with Uniform Projections --- p.60 / Chapter 5.4 --- Problems with Hyper-rectangular Data --- p.62 / Chapter 6 --- Miscellaneous Enhancements --- p.64 / Chapter 6.1 --- Extra Pruning --- p.64 / Chapter 6.2 --- Multi-resolution Approach --- p.65 / Chapter 6.3 --- Multi-threshold Approach --- p.68 / Chapter 7 --- Conclusion --- p.70 / Bibliography --- p.71 / Appendix --- p.77 / Chapter A --- Differential Entropy vs Discrete Entropy --- p.77 / Chapter A.1 --- Relation of Differential Entropy to Discrete Entropy --- p.78 / Chapter B --- Mining Quantitative Association Rules --- p.80 / Chapter B.1 --- Approaches --- p.81 / Chapter B.2 --- Performance --- p.82 / Chapter B.3 --- Final Remarks --- p.83 Data mining Cluster analysis--Data processing
294	A new approach to circular unidimensional scaling. January 2002 (has links) Li Chi Yin. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. / Includes bibliographical references (leaves 78-80). / Abstracts in English and Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Multidimensional Scaling (MDS) --- p.1 / Chapter 1.2 --- Unidimensional Scaling (UDS) --- p.15 / Chapter 1.3 --- Circular Unidimensional Scaling (CDS) --- p.17 / Chapter 1.4 --- The Goodness of fit of models --- p.24 / Chapter 1.5 --- The admissible transformations of the MDS configuration --- p.26 / Chapter 2 --- "Computational Methods on MDS, UDS and CDS" --- p.29 / Chapter 2.1 --- Classical Scaling --- p.29 / Chapter 2.2 --- Guttman's updating algorithm and Pliner's smoothing algorithm --- p.36 / Chapter 2.3 --- Circular Unidimensional Scaling/Circumplex Model --- p.43 / Chapter 3 --- A new algorithm for CDS --- p.45 / Chapter 3.1 --- Method of choosing a good starting value in Guttman's updating algorithm and Pliner's smoothing algorithm --- p.46 / Chapter 3.2 --- A new approach for circular unidimensional scaling --- p.54 / Chapter 3.3 --- Examples --- p.62 / Chapter 3.3.1 --- Comparison of the new approach to existing method --- p.62 / Chapter 3.3.2 --- Illustrations of application to political data --- p.64 / Chapter 4 --- Conclusion and Extensions --- p.67 / Chapter A --- Figures and Tables --- p.70 / Chapter B --- References --- p.78 Scaling (Social sciences) Multidimensional scaling Data mining
295	Algorithmic aspects of social network mining. / 社会网络挖掘的算法问题研究 / CUHK electronic theses & dissertations collection / She hui wang luo wa jue de suan fa wen ti yan jiu January 2013 (has links) Li, Ronghua = 社会网络挖掘的算法问题研究 / 李荣华. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2013. / Includes bibliographical references (leaves 157-171). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. / Li, Ronghua = She hui wang luo wa jue de suan fa wen ti yan jiu / Li Ronghua. Data mining--Mathematical models Social networks
296	Student Modeling within a Computer Tutor for Mathematics: Using Bayesian Networks and Tabling Methods Wang, Yutao 15 September 2015 (has links) "Intelligent tutoring systems rely on student modeling to understand student behavior. The result of student modeling can provide assessment of student knowledge, estimation of the student's current affective states (i.e., boredom, confusion, concentration, frustration, etc.), prediction of student performance, and suggestion of the next tutoring steps. There are three focuses of this dissertation. The first focus is on better predicting student performance by adding more information, such as student identity and information about how much assistance students needed. The second focus is to analyze different performance and feature sets for modeling student short-term knowledge and longer-term knowledge. The third focus is on improving the affect detectors by adding more features. In this dissertation I make contributions to the field of data mining as well as educational research. I demonstrate novel Bayesian networks for student modeling, and also compare them with each other. This work contributes to educational research by broadening the task of analyzing student knowledge to student knowledge retention, which is a much more important and interesting question for researchers to look at. Additionally, I showed a set of new useful features as well as how to effectively use these features in real models. For instance, in Chapter 5, I showed that the number of different days a student has worked on a skill is a more predictive feature for knowledge retention. These features themselves are not a contribution to data mining so much as they are to education research more broadly, and they can be used by other educational researchers or tutoring systems." Student Modeling Bayesian Networks Educational Data Mining
297	Mining and Managing Neighbor-Based Patterns in Data Streams Yang, Di 09 January 2012 (has links) The current data-intensive world is continuously producing huge volumes of live streaming data through various kinds of electronic devices, such as sensor networks, smart phones, GPS and RFID systems. To understand these data sources and thus better leverage them to serve human society, the demands for mining complex patterns from these high speed data streams have significantly increased in a broad range of application domains, such as financial analysis, social network analysis, credit fraud detection, and moving object monitoring. In this dissertation, we present a framework to tackle the mining and management problem for the family of neighbor-based patterns in data streams, which covers a broad range of popular pattern types, including clusters, outliers, k-nearest neighbors and others. First, we study the problem of efficiently executing single neighbor-based pattern mining queries. We propose a general optimization principle for incremental pattern maintenance in data streams, called "Predicted Views". This general optimization principle exploits the "predictability" of sliding window semantics to eliminate both the computational and storage effort needed for handling the expiration of stream objects, which usually constitutes the most expensive operations for incremental pattern maintenance. Second, the problem of multiple query optimization for neighbor-based pattern mining queries is analyzed, which aims to efficiently execute a heavy workload of neighbor-based pattern mining queries using shared execution strategies. We present an integrated pattern maintenance strategy to represent and incrementally maintain the patterns identified by queries with different query parameters within a single compact structure. Our solution realizes fully shared execution of multiple queries with arbitrary parameter settings. 
Third, the problem of summarization and matching for neighbor-based patterns is examined. To solve this problem, we first propose a summarization format for each pattern type. Then, we present computation strategies, which efficiently summarize the neighbor-based patterns either during or after the online pattern extraction process. Lastly, to compare patterns extracted on different time horizons of the stream, we design an efficient matching mechanism to identify similar patterns in the stream history for any given pattern of interest to an analyst. Our comprehensive experimental studies, using both synthetic and real data from the domains of stock trades and moving object monitoring, demonstrate the superiority of our proposed strategies over alternative methods in both effectiveness and efficiency. Algorithm Streaming Data Query Processing Data Mining
298	Butterfly: A Model of Provenance Tang, Yaobin 13 March 2009 (has links) Semantically rich metadata is foreseen to be pervasive in tomorrow's cyber world. People are more willing to store metadata in the hope that such extra information will enable a wide range of novel business intelligence applications. Provenance is metadata which describes the derivation history of data. It is considered to have great potential for helping with the reasoning, analysis, validation, monitoring, integration and reuse of data. Although there are a few application-specific systems equipped with some degree of provenance tracking functionality, few formal models of provenance exist. A general purpose, formal model of provenance is desirable not only to widely promote the storage and inventive usage of provenance, but also to prepare for the emergence of so-called provenance management systems. In this thesis, I propose Butterfly, a general purpose provenance model, which offers the capability to model, store, and query provenance. It consists of a semantic model for describing provenance, and an extensible algebraic query model for querying provenance. An initial implementation of the provenance model is also briefly discussed. Query Model Provenance Metadata Data mining
299	Hypothesis-Driven Specialization-based Analysis of Gene Expression Association Rules Thakkar, Dharmesh 08 May 2007 (has links) During the development of many diseases such as cancer and diabetes, the pattern of gene expression within certain cells changes. A vital part of understanding these diseases will come from understanding the factors governing gene expression. This thesis work focused on mining association rules in the context of gene expression. We designed and developed a tool that enables domain experts to interactively analyze association rules that describe relationships in genetic data. Association rules in their native form deal with sets of items and associations among them. But domain experts hypothesize that additional factors like relative ordering and spacing of these items are important aspects governing gene expression. We proposed hypothesis-based specializations of association rules to identify biologically significant relationships. Our approach also alleviates the limitations inherent in the conventional association rule mining that uses a support-confidence framework by providing filtering and reordering of association rules according to other measures of interestingness in addition to support and confidence. Our tool supports visualization of genetic data in the context of a rule, which facilitates rule analysis and rule specialization. The improvement in different measures of interestingness (e.g., confidence, lift, and p-value) enabled by our approach is used to evaluate the significance of the specialized rules. bioinformatics gene expression association rules data mining
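The interestingness measures named in the abstract of entry 299 (support, confidence, and lift) have standard textbook definitions, which the sketch below computes for a single rule over a toy set of transactions. This is not the thesis's tool: the function name `rule_measures`, the item names, and the data are hypothetical, and the p-value mentioned in the abstract is left out.

```python
from typing import Dict, FrozenSet, Iterable, List


def rule_measures(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> Dict[str, float]:
    """Compute support, confidence and lift for the rule antecedent -> consequent."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(transactions)

    # Count transactions containing the antecedent, the consequent, and both.
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)

    support = count_ac / n                                # P(A and C)
    confidence = count_ac / count_a if count_a else 0.0   # P(C | A)
    lift = confidence / (count_c / n) if count_c else 0.0 # P(C | A) / P(C)
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical item sets; in a gene-expression setting each set might hold
# the features observed for one gene, but these names are made up.
toy_transactions = [
    frozenset({"motif_A", "motif_B"}),
    frozenset({"motif_A", "motif_B", "motif_C"}),
    frozenset({"motif_A", "motif_C"}),
    frozenset({"motif_B"}),
]

measures = rule_measures(toy_transactions, {"motif_A"}, {"motif_B"})
print(measures)
```

A lift below 1, as in this toy case, means the antecedent makes the consequent slightly less likely than its base rate, which is why rules are often filtered on more than support and confidence alone.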
300	Reaching More Students: A Web-based Intelligent Tutoring System with support for Offline Access Kehrer, Paul H 26 April 2012 (has links) ASSISTments is a web-based intelligent tutoring system that can provide students with immediate feedback when they are doing math homework. Until now, ASSISTments required internet access in order to do nightly homework. Without ASSISTments, students do their work on paper and are not told if they are correct or given help for wrong answers until the next morning at best. We've developed a component that supports 'offline-mode', enabling students without internet access at home to still receive immediate feedback on their responses. Students with laptops download their assignments at school, and then run ASSISTments at home in offline mode, utilizing the browser's application cache and Web Storage API. To evaluate the benefit of having the offline feature, we ran a randomized controlled study that tests the effect of immediate feedback on student learning. Intuition would suggest that providing a student with tutoring and feedback immediately after they submit an answer would lead to better understanding of the material than having them wait until the next day. The results of the study confirmed our hypothesis, and validated the need for 'offline mode.' educational data mining intelligent tutoring systems
