Global ETD Search

81	OLAP on sequence data Chui, Chun-kit, 崔俊傑 January 2010 (has links) published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy OLAP technology. Data mining.
82	Design and analysis of efficient algorithms for finding frequent items in a data stream Zhang, Wen, 张问 January 2011 (has links) published_or_final_version / Computer Science / Master / Master of Philosophy Data mining. Algorithms.
83	Automatic identification of hot topics and user clusters from online discussion forums Lai, Yiu-ming., 黎耀明. January 2011 (has links) With the advancement of Internet technology and changes in modes of communication, much first-hand news is discussed in Internet forums well before it is reported in traditional mass media. These forums also provide an effective channel for illegal activities such as the dissemination of copyrighted movies, threatening messages and online gambling. Law enforcement agencies are looking for solutions to monitor these discussion forums for possible criminal activities and to download suspected postings as evidence for investigation. The volume of postings is huge: for 10 popular forums in Hong Kong, we found that there are 300,000 new messages every day. In this thesis, we propose an automatic system that tackles this problem. Our proposed system continuously downloads postings from selected discussion forums and employs data mining techniques to identify hot topics and cluster authors into different groups using word-based user profiles. Using these data, we try to locate useful trends and detect crime. We then discuss the results, including the advantages and limitations of the different approaches, and conclude with ways to solve these problems and directions for future research. / published_or_final_version / Computer Science / Master / Master of Philosophy Data mining. Cluster analysis.
84	Sparse representation and fast processing of massive data Li, Mingfei., 李明飞. January 2012 (has links) Many computational problems involve massive data. A reasonable solution to those problems should be able to store and process the data in an effective manner. In this thesis, we study sparse representation of data streams and metric spaces, which allows for fast and private computation of heavy hitters from distributed streams, and approximate distance queries between points in a metric space. Specifically, we consider application scenarios where an untrusted aggregator wishes to continually monitor the heavy hitters across a set of distributed streams. Since each stream can contain sensitive data, such as the purchase history of customers, we wish to guarantee the privacy of each stream, while allowing the untrusted aggregator to accurately detect the heavy hitters and their approximate frequencies. Our protocols are scalable in settings where the volume of streaming data is large, since we guarantee low memory usage and processing overhead by each data source, and low communication overhead between the data sources and the aggregator. We also study fault-tolerant spanners in doubling metrics. A subgraph H for a metric space X is called a k-vertex-fault-tolerant t-spanner ((k, t)-VFTS or simply k-VFTS), if for any subset S ⊆ X with |S| ≤ k, it holds that d_{H \ S}(x, y) ≤ t · d(x, y), for any pair of x, y ∈ X \ S. For any doubling metric, we give a basic construction of k-VFTS with stretch arbitrarily close to 1 that has optimal O(kn) edges. We also consider bounded hop-diameter, which is studied in the context of fault-tolerance for the first time even for Euclidean spanners. We provide a construction of k-VFTS with bounded hop-diameter: for m ≥ 2n, we can reduce the hop-diameter of the above k-VFTS to O(α(m, n)) by adding O(km) edges, where α is a functional inverse of the Ackermann function.
In addition, we construct a fault-tolerant single-sink spanner with bounded maximum degree, and use it to reduce the maximum degree of our basic k-VFTS. As a result, we get a k-VFTS with O(k^2n) edges and maximum degree O(k^2). / published_or_final_version / Computer Science / Master / Master of Philosophy Data mining. Sparse matrices.
85	Budget-limited data disambiguation Yang, Xuan, 楊譞 January 2013 (has links) The problem of data ambiguity exists in a wide range of applications. In this thesis, we study “cost-aware” methods to alleviate the data ambiguity problems in uncertain databases and social-tagging data. In database applications, ambiguous (or uncertain) data may originate from data integration and measurement error of devices. These ambiguous data are maintained by uncertain databases. In many situations, it is possible to “clean”, or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement error, but context information (e.g., what a user is doing) can be used to reduce the imprecision of the location value. In practice, a cleaning activity often involves a cost, may fail and may not remove all ambiguities. Moreover, the statistical information about how likely database entities can be cleaned may not be precisely known. We model the above aspects with the uncertain database cleaning problem, which requires us to make sensible decisions in selecting entities to clean in order to maximize the amount of ambiguous information removed under a limited budget. To solve this problem, we propose the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness. Social tagging data capture web users' textual annotations, called tags, for resources (e.g., webpages and photos). Since tags are given by casual users, they often contain noise (e.g., misspelled words) and may not be able to cover all the aspects of each resource. In this thesis, we design a metric to systematically measure the tagging quality of each resource based on the tags it has received.
We propose an incentive-based tagging framework in order to improve the tagging quality. The main idea is to award users some incentive for giving (relevant) tags to resources. The challenge is, how should we allocate incentives to a large set of resources, so as to maximize the improvement of their tagging quality under a limited budget? To solve this problem, we propose a few efficient incentive allocation strategies. Experiments show that our best strategy provides resources with a close-to-optimal gain in tagging quality. To summarize, we study the problem of budget-limited data disambiguation for uncertain databases and social tagging data: given a set of objects (entities from uncertain databases or web resources), how can we make sensible decisions about which object to “disambiguate” (to perform a cleaning activity on the entity or ask a user to tag the resource), in order to maximize the amount of ambiguous information reduced under a limited budget? / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy Data mining - Mathematical models
86	Discovering meta-paths in large knowledge bases Meng, Changping, 蒙昌平 January 2014 (has links) A knowledge base, such as Yago or DBpedia, can be modeled as a large graph with nodes and edges annotated with class and relationship labels. Recent work has studied how to make use of these rich information sources. In particular, meta-paths, which represent sequences of node classes and edge types between two nodes in a knowledge base, have been proposed for such tasks as information retrieval, decision making, and product recommendation. Current methods assume meta-paths are found by domain experts. However, in a large and complex knowledge base, retrieving meta-paths manually can be tedious and difficult. We thus study how to discover meta-paths automatically. Specifically, users are asked to provide example pairs of nodes that exhibit high proximity. We then investigate how to generate meta-paths that can best explain the relationship between these node pairs. Since this problem is computationally intractable, we propose a greedy algorithm to select the most relevant meta-paths. We also present a data structure to enable efficient execution of this algorithm. We further incorporate hierarchical relationships among node classes in our solutions. Finally, we propose an effective similarity join algorithm in order to generate more node pairs using these meta-paths. Extensive experiments on real knowledge bases show that our approach captures important meta-paths in an efficient and scalable manner. / published_or_final_version / Computer Science / Master / Master of Philosophy Data mining Knowledge management
87	Finding frequent itemsets over bursty data streams Lin, Hong, Bill., 林弘. January 2005 (has links) published_or_final_version / abstract / Computer Science / Master / Master of Philosophy Data mining. Algorithms.
88	Techniques in data stream mining Tong, Suk-man, Ivy., 湯淑敏. January 2005 (has links) published_or_final_version / abstract / Computer Science / Master / Master of Philosophy Database management. Data mining.
89	Emerging substrings for sequence classification Chan, Wing-yan, Sarah, 陳詠欣 January 2003 (has links) published_or_final_version / abstract / toc / Computer Science and Information Systems / Master / Master of Philosophy Data mining. Computer algorithms.
90	Efficient mining of association rules using conjectural information 魯建江, Loo, Kin-kong. January 2001 (has links) published_or_final_version / Computer Science and Information Systems / Master / Master of Philosophy Data mining. Computer algorithms.
