Global ETD Search

1	Cardinality estimation in ETL processes
Lehner, Wolfgang; Thiele, Maik; Kiefer, Tim
22 April 2022
Cardinality estimation in ETL processes is particularly difficult. Aside from the well-known SQL operators, which are also used in ETL processes, there are a variety of operators without exact counterparts in the relational world. In addition to those, we find operators that support very specific data integration aspects. For such operators, there are no well-examined statistical approaches for cardinality estimation. Therefore, we propose a black-box approach and estimate the cardinality using a set of statistical models for each operator. We discuss different model granularities and develop an adaptive cardinality estimation framework for ETL processes. We map the abstract model operators to specific statistical learning approaches (regression, decision trees, support vector machines, etc.) and evaluate our cardinality estimations in an extensive experimental study.
ddc:004

2	DeExcelerator: A Framework for Extracting Relational Data From Partially Structured Documents
Eberius, Julian; Werner, Christopher; Thiele, Maik; Braunschweig, Katrin; Dannecker, Lars; Lehner, Wolfgang
09 June 2021
Of the structured data published on the web, for instance as datasets on Open Data Platforms such as data.gov, but also in the form of HTML tables on the general web, only a small part is in a relational form. Instead, the data is intermingled with formatting, layout and textual metadata, i.e., it is contained in partially structured documents. This makes transformation into a true relational form necessary, which is a precondition for most forms of data analysis and data integration. Studying data.gov as an example source for partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, which is a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.
ddc:004
