Unsupervised Bayesian Data Cleaning Techniques for Structured Data

Abstract: Recent efforts in data cleaning have focused mostly on problems such as data deduplication, record matching, and data standardization; few address fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum-cost repair of tuples that violate static constraints such as conditional functional dependencies (CFDs), which must be provided by domain experts or learned from a clean sample of the database. In this thesis, I provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned directly from the noisy database. I thus avoid the need for a domain expert or master data. I also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. A Map-Reduce architecture to perform this computation in a distributed manner is also shown. I evaluate these methods over both synthetic and real data. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2014

http://hdl.handle.net/2286/R.I.25942

Computer science

Consistent Query Answering

Databases

Data Cleaning

Information Retrieval

Probabilistic Databases

Identifier	oai:union.ndltd.org:asu.edu/item:25942
Date	January 2014
Contributors	De, Sushovan (Author), Kambhampati, Subbarao (Advisor), Chen, Yi (Committee member), Candan, K. Selçuk (Committee member), Liu, Huan (Committee member), Arizona State University (Publisher)
Source Sets	Arizona State University
Language	English
Detected Language	English
Type	Doctoral Dissertation
Format	99 pages
Rights	http://rightsstatements.org/vocab/InC/1.0/, All Rights Reserved
