abstract: Recent efforts in data cleaning have focused mostly on problems like data deduplication, record matching, and data standardization; few of these focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this thesis, I provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. I thus avoid the necessity for a domain expert or master data. I also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. A Map-Reduce architecture to perform this computation in a distributed manner is also shown. I evaluate these methods over both synthetic and real data. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2014
Identifer | oai:union.ndltd.org:asu.edu/item:25942 |
Date | January 2014 |
Contributors | De, Sushovan (Author), Kambhampati, Subbarao (Advisor), Chen, Yi (Committee member), Candan, K. Selçuk (Committee member), Liu, Huan (Committee member), Arizona State University (Publisher) |
Source Sets | Arizona State University |
Language | English |
Detected Language | English |
Type | Doctoral Dissertation |
Format | 99 pages |
Rights | http://rightsstatements.org/vocab/InC/1.0/, All Rights Reserved |
Page generated in 0.0022 seconds