Global ETD Search

Return to search

Relational Data Curation by Deduplication, Anonymization, and Diversification

Enterprises acquire large amounts of data from a variety of sources with the goal of extracting valuable insights and enabling informed analysis. Unfortunately, organizations continue to be hindered by poor data quality as they wrangle with their data to extract value since most real datasets are rarely error-free. Poor data quality is a pervasive problem that spans across all industries causing unreliable data analysis, and costing billions of dollars. The large body of datasets, the pace of data acquisition, and the heterogeneity of data sources pose challenges towards achieving high-quality data. These challenges are further exacerbated with data privacy and data diversity requirements. In this thesis, we study and propose solutions to address data duplication, managing the trade-off between data cleaning and data privacy, and computing diverse data instances.

In the first part of this thesis, we address the data duplication problem. We propose a duplication detection framework, which combines word-embeddings with constraints among attributes to improve the accuracy of deduplication. We propose a set of constraint-based statistical features to capture the semantic relationship among attributes. We showed that our techniques achieve comparative accuracy on real datasets. In the second part of this thesis, we study the problem of data privacy and data cleaning, and we present a Privacy-Aware data Cleaning-As-a-Service (PACAS) framework to protect privacy during the cleaning process. Our evaluation shows that PACAS safeguards semantically related sensitive values, and provides lower repair errors compared to existing privacy-aware cleaning techniques. In the third part of this thesis, we study the problem of finding a diverse anonymized data instance where diversity is measured via a set of diversity constraints, and propose an algorithm to seek a k-anonymous relation with value suppression as well as satisfying given diversity constraints. We conduct extensive experiments using real and synthetic data showing the effectiveness of our techniques, and improvement over existing baselines. / Thesis / Doctor of Philosophy (PhD)

http://hdl.handle.net/11375/26009

data quality

data cleaning

data privacy

Identifer	oai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/26009
Date	January 2020
Creators	Huang, Yu
Contributors	Chiang, Fei, Computing and Software
Source Sets	McMaster University
Language	English
Detected Language	English
Type	Thesis

Page generated in 0.0016 seconds

Relational Data Curation by Deduplication, Anonymization, and Diversification

Description

Links & Downloads

Tags

Additional Fields