Learning-Based Fusion for Data Deduplication: A Robust and Automated Solution

This thesis presents two deduplication techniques that overcome two critical, long-standing weaknesses of rule-based deduplication: (1) it requires significant manual tuning of the individual rules, including the selection of appropriate thresholds; and (2) its accuracy degrades when data values are missing, which significantly reduces the efficacy of the expert-defined deduplication rules.
The first technique is a novel rule-level match-score fusion algorithm that employs kernel-machine-based learning to discover the decision threshold for the overall system automatically. The second is a novel clue-level match-score fusion algorithm that addresses both problems (1) and (2); it provides robustness against missing or incomplete record data by selecting a best-fit support vector machine. Empirical evidence shows that, together, these two solutions eliminate both problems, providing accurate and robust results for rule-based deduplication.
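The idea of fusing match scores with a learned decision boundary, and of choosing a model that fits the data actually present, can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not the thesis's method: the three scores (name, address, phone), the training values, and the function names are hypothetical; a plain linear margin classifier trained with fixed-step hinge-loss updates stands in for the kernel machine; and "best fit" is interpreted as "the model trained on exactly the scores that are available".

```python
def train_margin_classifier(vectors, labels, lr=0.1, max_epochs=1000):
    """Fit w, b so that label * (w.x + b) >= 1 on every training pair.

    Stand-in for a kernel machine: fixed-step hinge-loss updates (a margin
    perceptron), which stop after finitely many updates when the training
    data are linearly separable.
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(vectors, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:
                # Move the boundary toward classifying this pair with margin.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                updated = True
        if not updated:
            break
    return w, b


def train_per_pattern(vectors, labels, patterns):
    """Train one classifier per availability pattern (tuple of booleans)."""
    models = {}
    for pattern in patterns:
        # Keep only the scores that this pattern marks as available.
        projected = [[s for s, keep in zip(x, pattern) if keep] for x in vectors]
        models[pattern] = train_margin_classifier(projected, labels)
    return models


def is_duplicate(scores, models):
    """Fuse the available scores using the model trained for their pattern.

    A missing score is represented as None. Raises KeyError if no model was
    trained for the observed pattern.
    """
    pattern = tuple(s is not None for s in scores)
    w, b = models[pattern]
    present = [s for s in scores if s is not None]
    return sum(wi * si for wi, si in zip(w, present)) + b > 0.0


# Hypothetical match scores per record pair: (name, address, phone), in [0, 1].
TRAIN_SCORES = [
    (0.9, 0.8, 1.0), (0.8, 0.9, 0.7), (0.95, 0.6, 0.9), (0.7, 0.9, 0.8),  # duplicates
    (0.2, 0.3, 0.0), (0.4, 0.1, 0.2), (0.1, 0.5, 0.3), (0.3, 0.2, 0.1),   # distinct
]
TRAIN_LABELS = [1, 1, 1, 1, -1, -1, -1, -1]

# One model for complete pairs, one for pairs whose phone score is missing.
MODELS = train_per_pattern(
    TRAIN_SCORES, TRAIN_LABELS, [(True, True, True), (True, True, False)]
)

print(is_duplicate((0.9, 0.8, 1.0), MODELS))   # all three scores present
print(is_duplicate((0.2, 0.3, None), MODELS))  # phone score missing
```

In this sketch no per-rule threshold is set by hand: the decision boundary is learned from labeled pairs, and a pair with a missing score is routed to a model that never relied on that score.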

support vector machine

svm

Computer Sciences

Identifier	oai:union.ndltd.org:UTAHS/oai:digitalcommons.usu.edu:etd-1783
Date	01 December 2010
Creators	Dinerstein, Jared
Publisher	DigitalCommons@USU
Source Sets	Utah State University
Detected Language	English
Type	text
Format	application/pdf
Source	All Graduate Theses and Dissertations
Rights	Copyright for this work is held by the author. Transmission or reproduction of materials protected by copyright beyond that allowed by fair use requires the written permission of the copyright owners. Works not in the public domain cannot be commercially exploited without permission of the copyright owner. Responsibility for any use rests exclusively with the user. For more information contact Andrew Wesolek (andrew.wesolek@usu.edu).
