Global ETD Search

Return to search

Publish-Time Data Integration for Open Data Platforms

Platforms for publication and collaborative management of data, such as Data.gov or Google Fusion Tables, are a new trend on the web. They manage very large corpora of datasets, but often lack an integrated schema, ontology, or even just common publication standards. This results in inconsistent names for attributes of the same meaning, which constrains the discovery of relationships between datasets as well as their reusability. Existing data integration techniques focus on reuse-time, i.e., they are applied when a user wants to combine a specific set of datasets or integrate them with an existing database. In contrast, this paper investigates a novel method of data integration at publish-time, where the publisher is provided with suggestions on how to integrate the new dataset with the corpus as a whole, without resorting to a manually created mediated schema or ontology for the platform. We propose data-driven algorithms that propose alternative attribute names for a newly published dataset based on attribute- and instance statistics maintained on the corpus. We evaluate the proposed algorithms using real-world corpora based on the Open Data Platform opendata.socrata.com and relational data extracted from Wikipedia. We report on the system's response time, and on the results of an extensive crowdsourcing-based evaluation of the quality of the generated attribute names alternatives.

info:eu-repo/classification/ddc/004

ddc:004

Identifer	oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:80645
Date	16 September 2022
Creators	Eberius, Julian, Damme, Patrick, Braunschweig, Katrin, Thiele, Maik, Lehner, Wolfgang
Publisher	ACM
Source Sets	Hochschulschriftenserver (HSSS) der SLUB Dresden
Language	English
Detected Language	English
Type	info:eu-repo/semantics/acceptedVersion, doc-type:conferenceObject, info:eu-repo/semantics/conferenceObject, doc-type:Text
Rights	info:eu-repo/semantics/openAccess
Relation	978-1-4503-2020-7, 1, 10.1145/2500410.2500413

Page generated in 0.0053 seconds

Publish-Time Data Integration for Open Data Platforms

Description

Links & Downloads

Tags

Additional Fields