1 |
Identifying, Relating, Consisting and Querying Large Heterogeneous RDF Sources
VALDESTILHAS, ANDRE, 12 January 2021
The Linked Data concept relies on a collection of best practices for publishing and linking structured web-based data. The number of available datasets has grown significantly over the last decades. These datasets are interconnected and now form the well-known Web of Data, an extensive collection of concise and detailed interlinked datasets from multiple domains. Linking entries across heterogeneous data sources such as databases or knowledge bases has therefore become an increasing challenge. At the same time, connections between datasets play a leading role in significant activities such as cross-ontology question answering, large-scale inference, and data integration. In Linked Data, Linksets are well known as the means of establishing links between datasets. Because the datasets are heterogeneous, each has its own structure, which makes it hard to find relations among them, i.e., to identify how similar they are. Linked Data thus involves both Datasets and Linksets, and those Linksets need to be maintained.
This lack of information led us to the issue addressed in this thesis: how to identify and query datasets in a huge heterogeneous collection of RDF (Resource Description Framework) datasets. To address this issue, we need to ensure consistency and to know how the datasets are related and how similar they are.
As a result, to meet the need for identifying LOD (Linked Open Data) datasets, we created an approach called WIMU, a regularly updated database index of more than 660K datasets from LODStats and LOD Laundromat. WIMU is an efficient, low-cost and scalable web service that shows which dataset most likely defines a URI, along with various statistics about the indexed datasets. To integrate and query LOD datasets, we provide a hybrid SPARQL query processing engine that can retrieve results from 559 active SPARQL endpoints (with a total of 163.23 billion triples) and 668,166 datasets (with a total of 58.49 billion triples) from LODStats and LOD Laundromat. To ensure the consistency of the semantic web linked repositories where these LOD datasets are located, we created an approach for mitigating the identifier heterogeneity problem and implemented a prototype in which the user can evaluate existing links and suggest new links to be rated. We also provide a time-efficient algorithm for detecting erroneous links in large-scale link repositories without computing all closures required by the property axiom. To know how the datasets are related and how similar they are, we provide a string similarity algorithm called Most Frequent K Characters, which is based on two nested filters, (1) the First Frequency Filter and (2) the Hash Intersection Filter, that allow candidates to be discarded before the actual similarity value is calculated, giving a considerable performance gain. This makes it possible to build a LOD Dataset Relation Index, which provides information about how similar all the datasets in the LOD cloud are, including statistics about the current state of those datasets.
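The filter-then-compute idea behind Most Frequent K Characters can be illustrated with a minimal Python sketch. The function names, the similarity definition (sum of both counts over the characters shared by the two hashes), and the exact form of the two filter checks are illustrative assumptions, not the definitions used in the thesis.

```python
from collections import Counter


def mfk_hash(s, k):
    """Map each of the k most frequent characters of s to its count."""
    return dict(Counter(s).most_common(k))


def mfkc_similarity(s, t, k):
    """Assumed similarity: sum of both counts over characters shared by the two hashes."""
    hs, ht = mfk_hash(s, k), mfk_hash(t, k)
    return sum(hs[c] + ht[c] for c in hs.keys() & ht.keys())


def filtered_similarity(s, t, k, threshold):
    """Return the similarity if it reaches threshold (assumed > 0), otherwise None."""
    hs, ht = mfk_hash(s, k), mfk_hash(t, k)
    if not hs or not ht:
        return None  # an empty string shares no characters with anything

    # Filter 1 (assumed form of a frequency-based bound): at most k characters
    # are shared, and each contributes at most the two highest counts.
    if k * (max(hs.values()) + max(ht.values())) < threshold:
        return None

    # Filter 2 (assumed form of a hash-intersection check): with no shared
    # characters the similarity is 0, which is below any positive threshold.
    shared = hs.keys() & ht.keys()
    if not shared:
        return None

    # Only candidates that pass both filters pay for the full computation.
    similarity = sum(hs[c] + ht[c] for c in shared)
    return similarity if similarity >= threshold else None
```

With k = 2, the hash of "research" is {r: 2, e: 2}, and the only character it shares with the hash of "seeking" is e (count 2 in both), so the assumed similarity is 4. Both filters are cheap compared with scoring every candidate pair, which is where the performance gain described above would come from.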
The work in this thesis showed that to identify and query LOD datasets, we need to know how those datasets are related, while ensuring consistency. Our analysis demonstrated that most of the datasets are disconnected from the others and need to pass through a consistency and linking process before they can be integrated and queried simultaneously in large numbers. A considerable step towards fully queryable LOD datasets remains, and the information contained in this thesis is an essential step towards Identifying, Relating, and Querying datasets on the Web of Data.

1 Introduction and Motivation
1.1 The need for identifying and querying LOD datasets
1.2 The need for consistency of semantic web Linked repositories
1.3 The need for Relation and integration of LOD datasets
1.4 Research Questions and Contributions
1.5 Methodology and Contributions
1.6 General Use Cases
1.6.1 The Heloise project
1.7 Chapter overview
2 Preliminaries
2.1 Semantic Web
2.1.1 URIs and URLs
2.1.2 Linked Data
2.1.3 Resource Description Framework
2.1.4 Ontologies
2.2 RDF graph
2.3 Transitive property
2.4 Equivalence
2.5 Linkset
2.6 RDF graph partitioning
2.7 Basic Graph Pattern
2.8 RDF Dataset
2.9 SPARQL
2.10 Federated Queries
3 State of the Art
3.1 Identifying Datasets in Large Heterogeneous RDF Sources
3.2 Relating Large amount of RDF datasets
3.2.1 Obtaining Similar Resources using String Similarity
3.3 Consistency on Large amount of RDF sources
3.3.1 Heterogeneity in DBpedia Identifiers
3.3.2 Detection of Erroneous Links in Large-Scale RDF Datasets
3.4 Querying Large Heterogeneous RDF Datasets
4 Relation Among Large Amount of RDF Sources
4.1 Identifying Datasets in Large Heterogeneous RDF sources
4.1.1 The WIMU approach
4.1.2 The approach
4.1.3 Use cases
4.1.4 Evaluation: Statistics about the Datasets
4.2 Relating RDF sources
4.2.1 The ReLOD approach
4.2.2 The approach
4.2.3 Evaluation
4.3 Relating Similar Resources using String Similarity
4.3.1 The MFKC approach
4.3.2 Approach
4.3.3 Correctness and Completeness
4.3.4 Evaluation
5 Consistency in Large Amount of RDF Sources
5.1 Consistency in Heterogeneous DBpedia Identifiers
5.1.1 The DBpediaSameAs approach
5.1.2 Representation of the idea
5.1.3 The work-flow
5.1.4 Methodology
5.1.5 Evaluation
5.1.6 Normalization on DBpedia URIs
5.1.7 Rate the links
5.1.8 Results
5.1.9 Discussion
5.2 Consistency in Large-Scale RDF sources: Detection of Erroneous Links
5.2.1 The CEDAL approach
5.2.2 Method
5.2.3 Error Types and Quality Measure for Linkset Repositories
5.2.4 Evaluation
5.2.5 Experimental setup
5.3 Detecting Erroneous Link candidates in Educational Link Repositories
5.3.1 The CEDAL education approach
5.3.2 Research questions
5.3.3 Our contributions
5.3.4 Evaluation
6 Querying Large Amount of Heterogeneous RDF Datasets
6.1 Introduction
6.2 Definitions
6.3 The WimuQ
7.1 Identifying Datasets in Large Heterogeneous RDF Sources
7.2 Relating Large Amount of RDF Datasets
7.3 Obtaining Similar Resources Using String Similarity
7.4 Heterogeneity in DBpedia Identifiers
7.5 Detection of Erroneous Links in Large-Scale RDF Datasets
7.7 Querying Large Heterogeneous RDF Datasets
|
2 |
[pt] BUSCA POR PALAVRAS-CHAVE SOBRE GRAFOS RDF FEDERADOS EXPLORANDO SEUS ESQUEMAS / [en] KEYWORD SEARCH OVER FEDERATED RDF GRAPHS BY EXPLORING THEIR SCHEMAS
YENIER TORRES IZQUIERDO, 28 July 2017
The Resource Description Framework (RDF) was adopted as a W3C recommendation in 1999 and is today a standard for exchanging data on the Web. Indeed, a large amount of data has been converted to RDF, often as multiple datasets physically distributed over different locations. The SPARQL Protocol and RDF Query Language (SPARQL) was officially introduced in 2008 to retrieve RDF data and provide endpoints to query distributed sources. An alternative way to access RDF datasets is to use keyword-based queries, an area that has been extensively researched, with a recent focus on Web content. This dissertation describes a strategy to compile keyword-based queries into federated SPARQL queries over distributed RDF datasets, under the assumption that each RDF dataset has a schema and that the federation has a mediated schema. The compilation process of the federated SPARQL query is explained in detail, including how to compute the set of external joins between the generated local subqueries, how to combine, with the help of UNION clauses, the results of local queries that have no joins between them, and how to construct the TARGET clause according to the composition of the WHERE clause. Finally, the dissertation covers experiments with real-world data to validate the implementation.
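The combination step described above can be sketched in Python under some loud assumptions: subqueries that share a variable are treated as joined and placed in the same group, groups with nothing in common are combined with UNION, and each local subquery is sent to its endpoint with the SPARQL 1.1 `SERVICE` keyword. The class, the function names, and the use of a plain `SELECT` projection are hypothetical illustrations, not the dissertation's compiler.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class LocalSubquery:
    endpoint: str              # hypothetical SPARQL endpoint URL
    pattern: str               # graph pattern evaluated at that endpoint
    variables: FrozenSet[str]  # variable names used in the pattern


def group_by_shared_variables(subqueries: List[LocalSubquery]) -> List[List[LocalSubquery]]:
    """Put subqueries that (transitively) share a variable into the same group."""
    groups = []  # (variable set, members); variable sets stay pairwise disjoint
    for sq in subqueries:
        merged_vars, merged_members = set(sq.variables), [sq]
        untouched = []
        for group_vars, members in groups:
            if group_vars & sq.variables:
                merged_vars |= group_vars
                merged_members = members + merged_members
            else:
                untouched.append((group_vars, members))
        groups = untouched + [(merged_vars, merged_members)]
    return [members for _, members in groups]


def build_federated_query(select_vars: List[str], subqueries: List[LocalSubquery]) -> str:
    """Join subqueries inside a group; combine unrelated groups with UNION."""
    blocks = []
    for members in group_by_shared_variables(subqueries):
        services = "\n".join(
            f"    SERVICE <{sq.endpoint}> {{ {sq.pattern} }}" for sq in members
        )
        blocks.append("  {\n" + services + "\n  }")
    where = "\n  UNION\n".join(blocks)
    head = " ".join(f"?{v}" for v in select_vars)
    return f"SELECT {head} WHERE {{\n{where}\n}}"
```

For example, two subqueries that both mention `?city` end up in one group and are joined, while a third subquery that only mentions `?film` forms its own group and is attached with UNION.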
|
3 |
Traitement de requêtes SPARQL sur des données liées / SPARQL distributed query processing over linked data
Macina, Abdoul, 17 December 2018
Driven by the Semantic Web standards, an increasing number of RDF data sources are published and connected over the Web by data providers, leading to a large distributed linked data network. However, exploiting the wealth of these data sources is very challenging for data consumers, considering the distribution of the data, the growth of its volume, and the autonomy of the data sources. In the Linked Data context, federation engines allow querying these distributed data sources by relying on Distributed Query Processing (DQP) techniques. Nevertheless, a naive implementation of the DQP approach may generate a tremendous number of remote requests towards data sources and numerous intermediate results, leading to long query processing times and costly network communications. Furthermore, the distributed query semantics is often overlooked. Query expressiveness, data partitioning, and data replication are other challenges to be taken into account. To address these challenges, we first proposed in this thesis a SPARQL and RDF compliant Distributed Query Processing semantics which preserves the expressiveness of the SPARQL language. Afterwards, we presented several strategies for a federated query engine that transparently addresses distributed data sources, while managing data partitioning, query result completeness, data replication, and query processing performance. We implemented and evaluated our approach and optimization strategies in a federated query engine to prove their effectiveness.
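The cost of a naive DQP implementation can be made concrete with a small Python sketch that only counts remote requests. The in-memory sources, the function names, and the predicate-based pruning are assumptions for illustration; the pruning is one commonly used idea and is not claimed to be among the strategies proposed in the thesis.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None marks a variable


def naive_plan(patterns: List[Pattern], sources: Dict[str, Set[Triple]]):
    """Send every triple pattern to every source: len(patterns) * len(sources) requests."""
    return [(name, pattern) for pattern in patterns for name in sources]


def predicate_filtered_plan(patterns: List[Pattern], sources: Dict[str, Set[Triple]]):
    """Send a pattern only to sources known to contain its predicate, if it is bound."""
    # Assumption: a per-source predicate index is available. It is computed from
    # the in-memory data here; a real engine would need to obtain it remotely.
    predicates = {name: {p for _, p, _ in triples} for name, triples in sources.items()}
    return [
        (name, pattern)
        for pattern in patterns
        for name in sources
        if pattern[1] is None or pattern[1] in predicates[name]
    ]
```

With three sources and two patterns whose predicates each occur in only some of the sources, the naive plan issues six requests, while the filtered plan issues only as many as there are matching (source, pattern) pairs.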
|