Return to search

Metadata-Driven Data Integration

Data has an undoubtable impact on society. Storing and processing large amounts of available data is currently one of the key success factors for an organization. Nonetheless, we are recently witnessing a change represented by huge and heterogeneous amounts of data. Indeed, 90% of the data in the world has been generated in the last two years. Thus, in order to carry on these data exploitation tasks, organizations must first perform data integration combining data from multiple sources to yield a unified view over them. Yet, the integration of massive and heterogeneous amounts of data requires revisiting the traditional integration assumptions to cope with the new requirements posed by such data-intensive settings.This PhD thesis aims to provide a novel framework for data integration in the context of data-intensive ecosystems, which entails dealing with vast amounts of heterogeneous data, from multiple sources and in their original format. To this end, we advocate for an integration process consisting of sequential activities governed by a semantic layer, implemented via a shared repository of metadata. From an stewardship perspective, this activities are the deployment of a data integration architecture, followed by the population of such shared metadata. From a data consumption perspective, the activities are virtual and materialized data integration, the former an exploratory task and the latter a consolidation one. Following the proposed framework, we focus on providing contributions to each of the four activities.We begin proposing a software reference architecture for semantic-aware data-intensive systems. Such architecture serves as a blueprint to deploy a stack of systems, its core being the metadata repository. Next, we propose a graph-based metadata model as formalism for metadata management. We focus on supporting schema and data source evolution, a predominant factor on the heterogeneous sources at hand. For virtual integration, we propose query rewriting algorithms that rely on the previously proposed metadata model. We additionally consider semantic heterogeneities in the data sources, which the proposed algorithms are capable of automatically resolving. Finally, the thesis focuses on the materialized integration activity, and to this end, proposes a method to select intermediate results to materialize in data-intensive flows. Overall, the results of this thesis serve as contribution to the field of data integration in contemporary data-intensive ecosystems. / Doctorat en Sciences de l'ingénieur et technologie / info:eu-repo/semantics/nonPublished

Identiferoai:union.ndltd.org:ulb.ac.be/oai:dipot.ulb.ac.be:2013/285547
Date16 May 2019
CreatorsNadal Francesch, Sergi
ContributorsVansummeren, Stijn, Abello, Alberto, Romero, Oscar, Zimanyi, Esteban, Sakr, Mahmoud, Fletcher, George, Mena, Eduardo, Wrembel, Robert
PublisherUniversite Libre de Bruxelles, Universitat Politècnica de Catalunya, Facultat d'Informàtica de Barcelona, Enginyeria de Serveis i Sistemes d'Informació - ERASMUS MUNDUS DOCTORAL DEGREE IN INFORMATION TECHNOLOGIES FOR BUSINESS INTELLIGENCE, Université libre de Bruxelles, Ecole polytechnique de Bruxelles – Informatique, Bruxelles
Source SetsUniversité libre de Bruxelles
LanguageEnglish
Detected LanguageEnglish
Typeinfo:eu-repo/semantics/doctoralThesis, info:ulb-repo/semantics/doctoralThesis, info:ulb-repo/semantics/openurl/vlink-dissertation
Format1 v. (186 p.), 3 full-text file(s): application/pdf | application/pdf | application/pdf
Rights3 full-text file(s): info:eu-repo/semantics/openAccess | info:eu-repo/semantics/restrictedAccess | info:eu-repo/semantics/closedAccess

Page generated in 0.0023 seconds