101 |
Embracing Incompleteness in Schema Mappings / Rodriguez-Gianolli, Patricia / 09 August 2013
Various forms of information integration have become ubiquitous in current Business Intelligence (BI) technologies. In many cases, the semantic relationship between heterogeneous data sources is specified using high-level declarative rules called schema mappings. For decades, Skolem functions have been regarded as an important tool in schema mappings because they permit a precise representation of incomplete information. The powerful mapping language of second-order tuple generating dependencies (SO tgds) permits arbitrary Skolem functions and has been proven to be the right class for modeling many integration problems, such as composition and correlation of mappings. This language is strictly more powerful than the languages used in many integration systems, including source-to-target tgds and nested tgds (commonly known as GLAV and nested GLAV mappings), which are both first-order (FO) languages. An important class of GLAV mappings is Local-As-View (LAV) tgds, which have found important applications in data integration. These FO mapping languages are known to have more desirable programmatic and computational properties. In this thesis, we present a number of techniques for translating some SO tgds into equivalent, more manageable FO schema mappings. Our results rely on understanding and controlling the presence of incompleteness in mappings. We show that the composition of LAV mappings is not only FO, but can always be expressed as a LAV mapping. As a byproduct, we show that the problem of recovery checking for LAV mappings becomes tractable, in contrast to the case of GLAV mappings, for which it is known to be undecidable. We introduce two approaches for transforming SO tgds into equivalent nested GLAV mappings. Our approaches consider the presence of source constraints and provide sufficient conditions under which the rich Skolem functions in SO tgds are well-behaved and have an FO semantics. We experimentally show that these conditions are able to handle a very large number of real schema mappings. Last, we propose a first step toward embracing incompleteness in the context of BI applications. Specifically, we present elements of a formal framework for vivifying data with respect to a business model. We view the task of discovering data-to-business interpretations as one of removing incompleteness from these mappings.
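To make the contrast between first-order and second-order mappings concrete, the following is a small illustration rather than an example taken from the thesis. The relation names Emp, Mgr, and SelfMgr and the function symbol f are assumptions chosen for readability.

```latex
% A GLAV (source-to-target) tgd: the unknown manager of e is an
% existentially quantified first-order variable m.
\forall e \, \bigl( \mathrm{Emp}(e) \rightarrow \exists m \; \mathrm{Mgr}(e, m) \bigr)

% An SO tgd: the unknown manager is named by a Skolem function f, which is
% shared by two implications and may appear in an equality.
\exists f \, \Bigl(
    \forall e \, \bigl( \mathrm{Emp}(e) \rightarrow \mathrm{Mgr}(e, f(e)) \bigr)
    \;\wedge\;
    \forall e \, \bigl( \mathrm{Emp}(e) \wedge e = f(e) \rightarrow \mathrm{SelfMgr}(e) \bigr)
\Bigr)
```

In the first formula the incompleteness is local to a single rule. In the second, the same unknown value f(e) is referenced across rules and compared with e, which is the kind of controlled use of Skolem functions whose first-order expressibility the abstract describes as the object of study.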
|
102 |
Record Linkage for Web Data / Hassanzadeh, Oktie / 15 August 2013
Record linkage refers to the task of finding and linking records (in a single database or in a set of data sources) that refer to the same entity. Automating the record linkage process is a challenging problem, and has been the topic of extensive research for many years. However, the changing nature of the linkage process and the growing size of data sources create new challenges for this task.
This thesis studies the record linkage problem for Web data sources. Our hypothesis is that a generic and extensible set of linkage algorithms, combined within an easy-to-use framework that allows these algorithms to be tailored and composed, can be used to effectively link large collections of Web data from different domains.
To this end, we first present a framework for record linkage over relational data, motivated by the fact that many Web data sources are powered by relational database engines. This framework is based on declarative specification of the linkage requirements by the user and allows linking records in many real-world scenarios. We present algorithms for translation of these requirements to queries that can run over a relational data source, potentially using a semantic knowledge base to enhance the accuracy of link discovery.
Effective specification of requirements for linking records across multiple data sources requires understanding the schema of each source, identifying attributes that can be used for linkage, and their corresponding attributes in other sources. Schema or attribute matching is often done with the goal of aligning schemas, so attributes are matched if they play semantically related roles in their schemas. In contrast, we seek to find attributes that can be used to link records between data sources, which we refer to as linkage points. In this thesis, we define the notion of linkage points and present the first linkage point discovery algorithms.
We then address the novel problem of how to publish Web data in a way that facilitates record linkage. We hypothesize that careful use of existing, curated Web sources (their data and structure) can guide the creation of conceptual models for semi-structured Web data that in turn facilitate record linkage with these curated sources. Our solution is an end-to-end framework for data transformation and publication, which includes novel algorithms for identification of entity types and their relationships out of semi-structured Web data. A highlight of this thesis is showcasing the application of the proposed algorithms and frameworks in real applications and publishing the results as high-quality data sources on the Web.
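As a rough illustration of the idea of linkage points described above, the sketch below scores attribute pairs from two sources by the overlap of their value sets. It is not the thesis's discovery algorithm: the function names, the Jaccard measure, the 0.5 threshold, and the sample records are all assumptions made for the example.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_linkage_points(source_a, source_b, threshold=0.5):
    """Return (attr_a, attr_b, score) triples whose value sets overlap enough.

    Each source is a list of dict records. Values are converted to strings,
    stripped, and lowercased so trivially different spellings still match.
    """
    def value_sets(records):
        sets = {}
        for record in records:
            for attr, value in record.items():
                sets.setdefault(attr, set()).add(str(value).strip().lower())
        return sets

    sets_a, sets_b = value_sets(source_a), value_sets(source_b)
    scored = []
    for (attr_a, vals_a), (attr_b, vals_b) in product(sets_a.items(), sets_b.items()):
        score = jaccard(vals_a, vals_b)
        if score >= threshold:
            scored.append((attr_a, attr_b, score))
    # Highest-overlap attribute pairs first.
    return sorted(scored, key=lambda t: t[2], reverse=True)


# Hypothetical sample data from two sources with different schemas.
films = [
    {"title": "Alien", "year": 1979},
    {"title": "Heat", "year": 1995},
]
reviews = [
    {"movie": "alien ", "stars": 5},
    {"movie": "Heat", "stars": 4},
    {"movie": "Brazil", "stars": 5},
]
points = candidate_linkage_points(films, reviews)
```

Here `title` and `movie` would be proposed as a linkage point even though their names differ, because the pairing is driven by shared values rather than by schema-level similarity.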
|
104 |
Integration of vector datasets / Hope, Susannah Jayne / January 2008
As the spatial information industry moves from an era of data collection to one of data maintenance, new integration methods to consolidate or to update datasets are required. These must reduce the discrepancies that are becoming increasingly apparent when spatial datasets are overlaid. It is essential that any such methods consider the quality characteristics of, firstly, the data being integrated and, secondly, the resultant data. This thesis develops techniques that give due consideration to data quality during the integration process.
|
105 |
A framework to support developers in the integration and application of linked and open data / Heuss, Timm / January 2016
In recent years, the number of freely available Linked and Open Data datasets has multiplied into the tens of thousands. The number of applications taking advantage of them, however, has not. Thus, large portions of potentially valuable data remain unexploited and inaccessible to lay users, and the upfront investment in releasing data in the first place is hard to justify. The lack of applications needs to be addressed in order not to undermine the efforts put into Linked and Open Data. Existing research provides strong indications that the dearth of applications is due to a lack of pragmatic, working architectures that support these applications and guide developers. In this thesis, a new architecture for the integration and application of Linked and Open Data is presented. Fundamental design decisions are backed up by two studies. Firstly, characteristic properties are identified from real-world Linked and Open Data samples. A key finding is that large amounts of structured data display tabular structures, do not use clear licensing, and involve multiple different file formats. Secondly, following on from that study, storage choices are compared in relevant query scenarios. The comparison includes the de facto standard storage choice in this domain, triple stores, as well as relational and NoSQL approaches. Results show significant performance deficiencies of some technologies in certain scenarios. Consequently, when integrating Linked and Open Data in scenarios with application-specific entities, the first choice of storage is a relational database. Combining these findings with related best practices from existing research, a prototype framework is implemented using Java 8 and Hibernate. As a proof of concept, it is employed in an existing Linked and Open Data integration project. It is thereby shown that a best-practice architectural component can be introduced successfully, while the development effort needed to implement specific program code is reduced. Thus, the present work provides an important foundation for the development of semantic applications based on Linked and Open Data and potentially leads to a broader adoption of such applications.
|
106 |
Indexing and querying dataspaces / Mergen, Sérgio Luis Sardi / January 2011
Over the Web, distributed and heterogeneous sources with structured and related content form rich repositories of information commonly referred to as dataspaces. To provide access to this heterogeneous data, information integration systems have traditionally relied on the availability of a mediated schema, along with mappings between this schema and the schemas of the sources. On dataspaces, where sources are plentiful, autonomous and extremely volatile, a system based on the existence of a pre-defined mediated schema and mapping information presents several drawbacks. Notably, the cost of keeping the mappings up to date as new sources are found or existing sources change can be prohibitively high. We propose a novel querying architecture that requires neither a mediated schema nor source mappings, and that is based mainly on indexing mechanisms and on-the-fly rewriting algorithms. Our indexes are designed for data that is represented as relations, and are able to capture the structure of the sources, their instances and the connections between them. In the absence of a mediated schema, the user formulates structured queries based on what she expects to find. These queries are rewritten using a best-effort approach: the proposed rewriting algorithms compare a user query against the source schemas and produce a set of rewritings based on the matches found. Based on this architecture, two different querying approaches are tested. Experiments show that the indexing and rewriting algorithms are scalable, i.e., able to handle a very large number of structured Web sources, and that they support simple yet expressive queries that exploit the inherent structure of the data.
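The best-effort rewriting idea can be sketched in a few lines. This is only an assumption-laden toy, not the thesis's index structures or algorithms: the source names, the exact (case-insensitive) attribute matching, and the functions `build_attribute_index` and `rewrite` are all hypothetical.

```python
from collections import defaultdict


def build_attribute_index(sources):
    """Inverted index: lowercased attribute name -> set of (source, relation)."""
    index = defaultdict(set)
    for source, relations in sources.items():
        for relation, attributes in relations.items():
            for attribute in attributes:
                index[attribute.lower()].add((source, relation))
    return index


def rewrite(query_attributes, index):
    """Best-effort rewriting: one query per matching relation, best coverage first."""
    coverage = defaultdict(list)
    for attribute in query_attributes:
        for target in index.get(attribute.lower(), ()):
            coverage[target].append(attribute)
    # Relations covering more query attributes come first; ties are broken
    # by (source, relation) name so the output is deterministic.
    ranked = sorted(coverage.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        f"SELECT {', '.join(attrs)} FROM {source}.{relation}"
        for (source, relation), attrs in ranked
    ]


# Hypothetical source schemas; no mediated schema or mappings are defined.
sources = {
    "siteA": {"book": ["title", "author", "price"]},
    "siteB": {"publication": ["Title", "year"], "person": ["name"]},
}
index = build_attribute_index(sources)
rewritings = rewrite(["title", "author"], index)
```

The user asks for what she expects to find (`title`, `author`), and each source relation that matches part of the request yields a partial rewriting, with the fully matching relation ranked first.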
|
107 |
Um ambiente para processamento de consultas federadas em linked data Mashups / An environment for federated query processing in linked data Mashups / Magalhães, Regis Pires / January 2012
MAGALHÃES, Regis Pires. Um ambiente para processamento de consultas federadas em linked data Mashups. 2012. 117 f. Dissertation (Master's in Computer Science), Universidade Federal do Ceará, Fortaleza-CE, 2012.
Semantic Web technologies such as the RDF model, URIs, and the SPARQL query language can reduce the complexity of data integration by making use of properly established and described links between sources. However, the difficulty of formulating distributed queries has been an obstacle to harnessing the potential of these technologies, owing to the autonomy, distribution, and heterogeneous vocabularies of the data sources. This scenario demands effective mechanisms for integrating data on Linked Data. Linked Data Mashups allow users to query and integrate structured and linked data on the Web. This work proposes two architectures for Linked Data Mashups: one based on the use of mediators and the other based on the use of Linked Data Mashup Services (LIDMS). A module for efficient execution of federated query plans on Linked Data has been developed and is a component common to both proposed architectures. The feasibility of the execution module has been demonstrated through experiments. Furthermore, a Web environment for executing LIDMS has also been defined and implemented as a contribution of this work.
|
108 |
Pay-as-you-go instance-level integration / Maskat, Ruhaila / January 2016
With the growing demand for information in various domains, sharing of information from heterogeneous data sources is now a necessity. Data integration approaches promise to combine data from these different sources and present the user with a single, unified view of the data. However, although these approaches offer high-quality services for managing and integrating data, they come at a high cost, because a great amount of manual effort is needed to form relationships across data sources when setting up the data integration system. A newer variant of data integration, known as dataspaces, aims to spread the large manual effort spent at the start of the data integration system over the rest of the system's phases. This is achieved by soliciting feedback from the user on a chosen artefact of a dataspace, either explicitly or implicitly. This practice is known as pay-as-you-go: a user continuously pays the data integration system, by providing feedback, to gain improvements in the quality of data integration. This PhD thesis addresses two challenges in data integration by using pay-as-you-go approaches. The first is to identify instances relevant to a user's information need, which calls for semantic mappings to be closely considered. Our contribution is a technique that ranks mappings with the help of implicit user feedback (i.e., terms found in query logs). Our evaluation shows that our technique does not require large query logs to produce stable rankings, and that the generated ranking responds satisfactorily to the proportion of terms inclined towards a particular data source, which we describe as skew. The second challenge that we address is the identification of duplicate instances from disparate data sources. We contribute a strategy that uses explicitly obtained user feedback to drive an evolutionary search algorithm to find suitable parameters for an underlying clustering algorithm. Our experiments show that optimising the algorithm's parameters and introducing attribute weights produces fitter clusters than clustering alone. However, our strategy for improving integration quality can be quite expensive. Therefore, we propose a pruning technique that selects the informative records from a dataset. Our experiment shows that, on most of the datasets, our pruner produces comparably fit clusters with more feedback received.
|
110 |
A Bayesian Synthesis Approach to Data Fusion Using Augmented Data-Dependent Priors / January 2017
The process of combining data is one in which information from disjoint datasets sharing at least a number of common variables is merged. This process is commonly referred to as data fusion, with the main objective of creating a new dataset permitting more flexible analyses than the separate analysis of each individual dataset. Many data fusion methods have been proposed in the literature, although most utilize the frequentist framework. This dissertation investigates a new approach called Bayesian Synthesis in which information obtained from one dataset acts as priors for the next analysis. This process continues sequentially until a single posterior distribution is created using all available data. These informative augmented data-dependent priors provide an extra source of information that may aid in the accuracy of estimation. To examine the performance of the proposed Bayesian Synthesis approach, results from simulated data with known population values were first examined under a variety of conditions. Next, these results were compared to those from the traditional maximum likelihood approach to data fusion, as well as the data fusion approach analyzed via Bayes. Parameter recovery under the proposed Bayesian Synthesis approach was evaluated using four criteria reflecting raw bias, relative bias, accuracy, and efficiency. Subsequently, empirical analyses with real data were conducted. For this purpose, real data were fused from five longitudinal studies of mathematics ability that varied in their assessment of ability and in the timing of measurement occasions. Results from the Bayesian Synthesis and data fusion approaches with combined data using Bayesian and maximum likelihood estimation methods were reported. The results illustrate that Bayesian Synthesis with data-driven priors is a highly effective approach, provided that the sample sizes for the fused data are large enough to provide unbiased estimates. Bayesian Synthesis provides another beneficial approach to data fusion that can effectively be used to enhance the validity of conclusions obtained from the merging of data from different studies. / Dissertation/Thesis / Doctoral Dissertation Psychology 2017
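The sequential step, in which the posterior from one dataset serves as the prior for the next, can be illustrated with the simplest conjugate case: a normal mean with known observation variance. This is a minimal sketch under that assumption and does not reproduce the dissertation's models; the function `update_normal_mean`, the starting prior, and all data values are made up for the example.

```python
from statistics import fmean


def update_normal_mean(prior_mean, prior_var, data, noise_var):
    """Conjugate update for a normal mean with known observation variance.

    Returns the posterior mean and posterior variance of the mean.
    """
    n = len(data)
    post_precision = 1.0 / prior_var + n / noise_var
    post_mean = (prior_mean / prior_var + n * fmean(data) / noise_var) / post_precision
    return post_mean, 1.0 / post_precision


# Hypothetical studies measuring the same quantity (values are invented).
studies = [[4.8, 5.1, 5.3], [5.0, 4.7], [5.4, 5.2, 5.1, 4.9]]
noise_var = 0.25  # assumed known observation variance

# Sequential synthesis: each posterior becomes the prior for the next study.
mean, var = 0.0, 100.0  # weakly informative starting prior (an assumption)
for study in studies:
    mean, var = update_normal_mean(mean, var, study, noise_var)

# Reference: a single update on the pooled data from the same starting prior.
pooled = [y for study in studies for y in study]
pooled_mean, pooled_var = update_normal_mean(0.0, 100.0, pooled, noise_var)
```

In this idealized setting the sequential result coincides with the pooled one up to floating-point error, which is what makes chaining posteriors coherent. The studies described in the abstract differ in their measures and in the timing of measurement occasions, which this toy model does not attempt to capture.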
|