Global ETD Search

Budget-limited data disambiguation

The problem of data ambiguity exists in a wide range of applications. In this thesis, we study “cost-aware” methods to alleviate data ambiguity problems in uncertain databases and social tagging data.

In database applications, ambiguous (or uncertain) data may originate from data integration and from the measurement error of devices. These ambiguous data are maintained by uncertain databases. In many situations, it is possible to “clean”, or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement error, but context information (e.g., what the user is doing) can be used to reduce the imprecision of the location value. In practice, a cleaning activity often involves a cost, may fail, and may not remove all ambiguities. Moreover, the statistical information about how likely it is that database entities can be cleaned may not be precisely known. We model these aspects with the uncertain database cleaning problem, which requires us to make sensible decisions in selecting entities to clean in order to maximize the amount of ambiguous information removed under a limited budget. To solve this problem, we propose the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness.
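The general explore-then-exploit idea can be illustrated with a minimal Python sketch. It is not the EE algorithm of the thesis: the grouping of entities, the `Group` fields, the `explore_fraction` parameter, the smoothed success-rate estimate, and the `attempt` callback are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical group of entities that share one unknown success rate."""
    cost: float      # cost of one cleaning attempt
    gain: float      # ambiguity removed by one successful attempt
    remaining: int   # entities in the group that are still ambiguous
    tries: int = 0
    wins: int = 0

    def score(self) -> float:
        # Estimated ambiguity removed per unit cost. The +1/+2 smoothing
        # keeps an untried group from being scored as exactly zero.
        rate = (self.wins + 1) / (self.tries + 2)
        return rate * self.gain / self.cost


def record(group: Group, succeeded: bool) -> float:
    """Update the group's statistics and return the ambiguity removed."""
    group.tries += 1
    if succeeded:
        group.wins += 1
        group.remaining -= 1
        return group.gain
    return 0.0


def explore_exploit(groups, budget, explore_fraction, attempt):
    """groups: dict name -> Group; attempt(name) -> bool (did cleaning succeed?).

    Returns (ambiguity_removed, budget_spent).
    """
    removed, spent = 0.0, 0.0
    explore_budget = budget * explore_fraction

    # Explore: round-robin over the groups to learn their success rates.
    progressed = True
    while progressed:
        progressed = False
        for name, g in groups.items():
            if g.remaining > 0 and spent + g.cost <= explore_budget:
                spent += g.cost
                removed += record(g, attempt(name))
                progressed = True

    # Exploit: spend what is left on the group that currently looks best.
    while True:
        affordable = [n for n, g in groups.items()
                      if g.remaining > 0 and spent + g.cost <= budget]
        if not affordable:
            break
        best = max(affordable, key=lambda n: groups[n].score())
        spent += groups[best].cost
        removed += record(groups[best], attempt(best))

    return removed, spent
```

In this sketch, `explore_fraction` plays the role of a tunable parameter: a larger value gives better estimates of the success rates but leaves less budget to invest in the most promising group.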

Social tagging data capture web users' textual annotations, called tags, for resources (e.g., webpages and photos). Since tags are given by casual users, they often contain noise (e.g., misspelled words) and may not cover all the aspects of each resource. In this thesis, we design a metric to systematically measure the tagging quality of each resource based on the tags it has received. We propose an incentive-based tagging framework in order to improve the tagging quality. The main idea is to award users some incentive for giving (relevant) tags to resources. The challenge is: how should we allocate incentives to a large set of resources so as to maximize the improvement of their tagging quality under a limited budget? To solve this problem, we propose a few efficient incentive allocation strategies. Experiments show that our best strategy provides resources with a close-to-optimal gain in tagging quality.
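As an illustration of budget-limited incentive allocation, the following Python sketch greedily gives each incentive unit to the resource with the largest estimated marginal gain in quality. It is not one of the strategies proposed in the thesis: the `quality` curve is a stand-in with diminishing returns rather than the thesis's metric, and the assumption that one incentive unit buys exactly one relevant tag is made only for simplicity.

```python
import heapq


def quality(tag_count: int) -> float:
    # Stand-in quality curve with diminishing returns (an assumption,
    # not the tagging-quality metric defined in the thesis).
    return tag_count / (tag_count + 1.0)


def marginal_gain(tag_count: int) -> float:
    """Quality improvement from receiving one more tag."""
    return quality(tag_count + 1) - quality(tag_count)


def allocate_incentives(current_tags, budget):
    """current_tags: dict resource -> number of tags already received.
    budget: number of incentive units (assumed: one unit buys one tag).

    Returns dict resource -> incentive units allocated.
    """
    allocation = {resource: 0 for resource in current_tags}
    # heapq is a min-heap, so gains are negated to pop the largest first.
    heap = [(-marginal_gain(n), resource) for resource, n in current_tags.items()]
    heapq.heapify(heap)

    for _ in range(budget):
        if not heap:
            break
        _, resource = heapq.heappop(heap)
        allocation[resource] += 1
        n = current_tags[resource] + allocation[resource]
        heapq.heappush(heap, (-marginal_gain(n), resource))

    return allocation
```

With a curve like this, sparsely tagged resources receive incentives first, because an extra tag improves their estimated quality the most.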

To summarize, we study the problem of budget-limited data disambiguation for uncertain databases and social tagging data: given a set of objects (entities from uncertain databases or web resources), how can we make sensible decisions about which objects to “disambiguate” (that is, to perform a cleaning activity on an entity or to ask a user to tag a resource), in order to maximize the amount of ambiguous information removed under a limited budget?

published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy

Data mining - Mathematical models

Identifier	oai:union.ndltd.org:HKU/oai:hub.hku.hk:10722/196458
Date	January 2013
Creators	Yang, Xuan, 楊譞
Contributors	Cheung, DWL, Cheng, CK
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Source Sets	Hong Kong University Theses
Language	English
Detected Language	English
Type	PG_Thesis
Rights	Creative Commons: Attribution 3.0 Hong Kong License, The author retains all proprietary rights, (such as patent rights) and the right to use in future works.
Relation	HKU Theses Online (HKUTO)
