  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Inferring information about correspondences between data sources for dataspaces

Guo, Chenjuan January 2011
Traditional data integration offers high-quality services for managing and querying interrelated but heterogeneous data sources, but at a high cost. This is because a significant amount of manual effort is required to specify precise relationships between the data sources in order to set up a data integration system. The recently proposed vision of dataspaces aims to reduce the upfront effort required to set up the system. One way to approach this aim is to infer schematic correspondences between the data sources, thus enabling the development of automated means for bootstrapping dataspaces. In this thesis, we discuss a two-step research programme to automatically infer schematic correspondences between data sources. In the first step, we investigate the effectiveness of existing schema matching approaches for inferring schematic correspondences and contribute a benchmark, called MatchBench, to achieve this aim. In the second step, we contribute an evolutionary search method to identify the set of entity-level relationships (ELRs) between data sources that qualify as entity-level schematic correspondences. Specifically, we model the requirements using a vector space model. For each resulting ELR we further identify a set of attribute-level relationships (ALRs) that qualify as attribute-level schematic correspondences. We demonstrate the effectiveness of the contributed inference technique using both MatchBench scenarios and real-world scenarios.
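As a rough illustration of how a vector space model can surface candidate entity-level relationships, the sketch below represents each entity as a bag of attribute-name tokens and compares entities by cosine similarity. This is not the thesis's inference technique: the abstract does not say what its vectors encode. The schemas and the names `tokens`, `entity_vector`, `cosine` and `best_matches` are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical schemas: entity name -> attribute names.
SOURCE_A = {"customer": ["customerName", "customerAddress"],
            "order": ["orderDate", "orderTotal"]}
SOURCE_B = {"client": ["client_name", "client_address"],
            "purchase": ["order_date", "total"]}

def tokens(name):
    """Split 'customerName' or 'client_name' into lowercase word tokens."""
    return [part.lower() for part in re.findall(r"[A-Za-z][a-z]*|\d+", name)]

def entity_vector(attributes):
    """Bag-of-tokens vector over all attribute names of one entity."""
    return Counter(tok for attr in attributes for tok in tokens(attr))

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def best_matches(source_a, source_b):
    """For each entity in source_a, pick the most similar entity in source_b."""
    vectors_b = {name: entity_vector(attrs) for name, attrs in source_b.items()}
    matches = {}
    for name_a, attrs_a in source_a.items():
        vec_a = entity_vector(attrs_a)
        matches[name_a] = max(vectors_b,
                              key=lambda name_b: cosine(vec_a, vectors_b[name_b]))
    return matches
```

With these toy schemas, `best_matches(SOURCE_A, SOURCE_B)` pairs `customer` with `client` and `order` with `purchase`. A name-only signal like this is exactly the kind of matcher output that would still need further evidence before being treated as a schematic correspondence.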
2

Indexing and querying dataspaces

Mergen, Sérgio Luis Sardi January 2011
Over the Web, distributed and heterogeneous sources with structured and related content form rich repositories of information commonly referred to as dataspaces. To provide access to this heterogeneous data, information integration systems have traditionally relied on the availability of a mediated schema, along with mappings between this schema and the schemas of the sources. On dataspaces, where sources are plentiful, autonomous and extremely volatile, a system based on the existence of a pre-defined mediated schema and mapping information presents several drawbacks. Notably, the cost of keeping the mappings up to date as new sources are found or existing sources change can be prohibitively high. We propose a novel querying architecture that requires neither a mediated schema nor source mappings, and which is based mainly on indexing mechanisms and on-the-fly rewriting algorithms. Our indexes are designed for data that is represented as relations, and are able to capture the structure of the sources, their instances and the connections between them. In the absence of a mediated schema, the user formulates structured queries based on what she expects to find. These queries are rewritten using a best-effort approach: the proposed rewriting algorithms compare a user query against the source schemas and produce a set of rewritings based on the matches found. Based on this architecture, two different querying approaches are tested. Experiments show that the indexing and rewriting algorithms are scalable, i.e., able to handle a very large number of structured Web sources, and that they support simple yet expressive queries that exploit the inherent structure of the data.
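The idea of rewriting a query against indexed source schemas, without a mediated schema, can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture from the thesis: the index here covers attribute names only (the abstract's indexes also capture instances and connections), matching is exact and case-insensitive, and the source names and the functions `build_index` and `rewrite` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sources: source name -> relation name -> attribute names.
SOURCES = {
    "siteA": {"book": ["title", "author", "price"]},
    "siteB": {"publication": ["title", "year"]},
    "siteC": {"movie": ["director", "year"]},
}

def build_index(sources):
    """Map each attribute name to the (source, relation) pairs that contain it."""
    index = defaultdict(set)
    for source, relations in sources.items():
        for relation, attrs in relations.items():
            for attr in attrs:
                index[attr.lower()].add((source, relation))
    return index

def rewrite(query_attrs, index):
    """Best-effort rewriting: list every relation that matches at least one
    requested attribute, ordered by how many attributes it covers."""
    hits = defaultdict(set)
    for attr in query_attrs:
        for target in index.get(attr.lower(), ()):
            hits[target].add(attr.lower())
    candidates = [(source, relation, sorted(matched))
                  for (source, relation), matched in hits.items()]
    return sorted(candidates, key=lambda c: (-len(c[2]), c[0], c[1]))

index = build_index(SOURCES)
rewritings = rewrite(["title", "author"], index)
# rewritings == [("siteA", "book", ["author", "title"]),
#                ("siteB", "publication", ["title"])]
```

Because the index is keyed by attribute name, adding or dropping a source only touches that source's index entries, which is the maintenance property that motivates avoiding hand-written mappings.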
3

Pay-as-you-go instance-level integration

Maskat, Ruhaila January 2016
With the growing demand for information in various domains, sharing of information from heterogeneous data sources is now a necessity. Data integration approaches promise to combine data from these different sources and present to the user a single, unified view of these data. However, although these approaches offer high-quality services for managing and integrating data, they come at a high cost. This is because a great amount of manual effort to form relationships across data sources is needed to set up the data integration system. A newer variant of data integration, known as dataspaces, aims to spread the large manual effort spent at the start of the data integration system across the rest of the system's phases. This is achieved by soliciting from the user their feedback on a chosen artefact of a dataspace, either explicitly or implicitly. This practice is known as pay-as-you-go, where a user continuously pays the data integration system, by providing feedback, to gain improvements in the quality of data integration. This PhD addresses two challenges in data integration by using pay-as-you-go approaches. The first is to identify instances relevant to a user's information need, calling for semantic mappings to be closely considered. Our contribution is a technique that ranks mappings with the help of implicit user feedback (i.e., terms found in query logs). Our evaluation shows that, to produce stable rankings, our technique does not require large query logs, and that our generated ranking responds satisfactorily to the number of terms inclined towards a particular data source, which we describe as skew. The second challenge that we address is the identification of duplicate instances from disparate data sources. We contribute a strategy that uses explicitly obtained user feedback to drive an evolutionary search algorithm to find suitable parameters for an underlying clustering algorithm. Our experiments show that optimising the algorithm's parameters and introducing attribute weights produces fitter clusters than clustering alone. However, our strategy for improving integration quality can be quite expensive. Therefore, we propose a pruning technique to select the informative records from a dataset. Our experiment shows that, on most of the datasets, our pruner produces comparably fit clusters with more feedback received.
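To make the idea of ranking mappings from query-log terms concrete, here is a minimal sketch. It is not the ranking technique contributed by the thesis, whose scoring is not described in the abstract. It assumes, purely for illustration, that each mapping is summarised by a set of lowercase terms occurring in the instances it returns, and it scores a mapping by how often those terms appear in the log. The mapping names, the log and `rank_mappings` are hypothetical.

```python
from collections import Counter

# Hypothetical mappings: name -> lowercase terms found in the instances it returns.
MAPPINGS = {
    "m_autos": {"car", "dealer", "price"},
    "m_wildlife": {"habitat", "jaguar"},
    "m_music": {"guitar"},
}

# Hypothetical query log acting as implicit feedback.
QUERY_LOG = ["jaguar car price", "car dealer", "jaguar habitat"]

def rank_mappings(mappings, query_log):
    """Order mappings by the total log frequency of the terms they cover."""
    term_counts = Counter(term.lower() for query in query_log
                          for term in query.split())
    scores = {name: sum(term_counts[term] for term in vocabulary)
              for name, vocabulary in mappings.items()}
    # Ties are broken by name so the ranking is deterministic.
    return sorted(scores, key=lambda name: (-scores[name], name))

ranking = rank_mappings(MAPPINGS, QUERY_LOG)
# ranking == ["m_autos", "m_wildlife", "m_music"]
```

In this toy log the terms lean towards the autos source, so its mapping ranks first; a log dominated by one source's vocabulary is the kind of skew the abstract refers to.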
4

Crowdsourcing in pay-as-you-go data integration

Osorno Gutierrez, Fernando January 2016
In pay-as-you-go data integration, feedback can inform the regeneration of different aspects of a data integration system and, as a result, help to improve the system's quality. However, feedback can be expensive: the amount of feedback required to annotate all the possible integration artefacts is potentially large, while the budget may be limited. Also, feedback can be used in different ways. Feedback of different types and in different orders can have different effects on the quality of the integration, and some feedback types can give rise to more benefit than others. There is a need to develop techniques to collect feedback effectively. Previous efforts have explored the benefit of feedback in one aspect of the integration. However, these contributions have not considered the benefit of different feedback types in a single integration task. We have investigated the annotation of mapping results using crowdsourcing, and implemented techniques for reliability. The results indicate that precision estimates derived from crowdsourcing improve rapidly, suggesting that crowdsourcing can be used as a cost-effective source of feedback. We propose an approach to maximize the improvement of data integration systems given a budget for feedback. Our approach takes into account the annotation of schema matchings, mapping results and pairs of candidate record duplicates. We define a feedback plan, which indicates the type of feedback to collect, the amount of feedback to collect and the order in which different types of feedback are collected. We defined a fitness function and a genetic algorithm to search for the most cost-effective feedback plans. We implemented a framework to test the application of feedback plans and measure the improvement of different data integration systems. In the framework, we use a greedy algorithm for the selection of mappings. We designed quality measures to estimate the quality of a dataspace after the application of a feedback plan. For the evaluation of our approach, we propose a method to generate synthetic data scenarios. We evaluate our approach in scenarios with different characteristics. The results showed that the generated feedback plans achieved higher quality values than the randomly generated feedback plans in several scenarios.
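The search for a feedback plan under a budget can be sketched with a small genetic algorithm. This is an illustrative toy, not the fitness function or algorithm from the thesis: a plan here records only how many items of each feedback type to collect (the collection order, which the abstract says also matters, is omitted), and the per-item costs, the gain weights and the diminishing-returns fitness are all invented assumptions.

```python
import math
import random

FEEDBACK_TYPES = ("matching", "mapping_result", "duplicate_pair")
# Hypothetical per-item cost and benefit weight of each feedback type.
COST = {"matching": 1, "mapping_result": 2, "duplicate_pair": 3}
GAIN = {"matching": 1.0, "mapping_result": 2.5, "duplicate_pair": 3.0}

def cost(plan):
    """Total cost of a plan; plan[i] is the amount of FEEDBACK_TYPES[i]."""
    return sum(COST[t] * n for t, n in zip(FEEDBACK_TYPES, plan))

def fitness(plan, budget):
    """Toy fitness with diminishing returns; over-budget plans score -1."""
    if cost(plan) > budget:
        return -1.0
    return sum(GAIN[t] * math.log1p(n) for t, n in zip(FEEDBACK_TYPES, plan))

def mutate(plan, rng):
    """Add or remove one item of a randomly chosen feedback type."""
    i = rng.randrange(len(plan))
    changed = list(plan)
    changed[i] = max(0, changed[i] + rng.choice((-1, 1)))
    return tuple(changed)

def crossover(a, b, rng):
    """Uniform crossover: take each amount from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def search(budget, generations=50, size=20, seed=0):
    rng = random.Random(seed)

    def score(plan):
        return fitness(plan, budget)

    # Seed with the empty plan so at least one feasible plan always exists.
    population = [(0,) * len(FEEDBACK_TYPES)]
    population += [tuple(rng.randint(0, budget // COST[t]) for t in FEEDBACK_TYPES)
                   for _ in range(size - 1)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: size // 2]  # elitism: the better half survives
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(size - len(parents))]
        population = parents + children
    return max(population, key=score)

best_plan = search(budget=30)
```

Because the better half of each generation is carried over unchanged, the best feasible plan found so far is never lost, so `best_plan` always respects the budget. A more faithful version would encode the order of feedback types in the plan and replace the toy fitness with measured integration quality.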
