
DeExcelerator: A Framework for Extracting Relational Data From Partially Structured Documents

Of the structured data published on the web, whether as datasets on Open Data Platforms such as data.gov or as HTML tables on the general web, only a small part is in relational form. Instead, the data is intermingled with formatting, layout and textual metadata, i.e., it is contained in partially structured documents. The data must therefore be transformed into a true relational form, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source of partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:75120
Date: 09 June 2021
Creators: Eberius, Julian; Werner, Christopher; Thiele, Maik; Braunschweig, Katrin; Dannecker, Lars; Lehner, Wolfgang
Publisher: ACM
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: English
Detected Language: English
Type: info:eu-repo/semantics/acceptedVersion, doc-type:conferenceObject, info:eu-repo/semantics/conferenceObject, doc-type:Text
Rights: info:eu-repo/semantics/openAccess
Relation: 978-1-4503-2263-8, 10.1145/2505515.2508210
