Identifying, Relating, Consisting and Querying Large Heterogeneous RDF Sources

The Linked Data concept relies on a collection of best practices for publishing and linking structured web-based data. The number of available datasets has grown significantly over the last decades. These datasets are interconnected and now form the well-known Web of Data, an extensive collection of concise and detailed interlinked datasets from multiple domains. Linking entries across heterogeneous data sources such as databases or knowledge bases has thus become an increasing challenge. At the same time, connections between datasets play a leading role in significant activities such as cross-ontology question answering, large-scale inference, and data integration. In Linked Data, Linksets are the well-known way of providing links between datasets. Because the datasets are heterogeneous, and this heterogeneity is reflected in their structure, it is hard to find relations among them, i.e., to identify how similar they are. In this way, we can say that Linked Data involves Datasets and Linksets, and that those Linksets need to be maintained.

This lack of information led us to the issue addressed in this thesis: how to identify and query datasets from a huge heterogeneous collection of RDF (Resource Description Framework) datasets. To address this issue, we need to assure consistency and to know how the datasets are related and how similar they are.

As results, to meet the need for identifying LOD (Linked Open Data) datasets, we created an approach called WIMU: a regularly updated database index of more than 660K datasets from LODStats and LOD Laundromat, offered as an efficient, low-cost and scalable web service that shows which dataset most likely defines a URI, along with various statistics on the indexed datasets. To integrate and query LOD datasets, we provide a hybrid SPARQL query processing engine that can retrieve results from 559 active SPARQL endpoints (with a total of 163.23 billion triples) and 668,166 datasets (with a total of 58.49 billion triples) from LODStats and LOD Laundromat. To assure the consistency of the semantic web Linked repositories where these LOD datasets are located, we created an approach for mitigating the identifier heterogeneity problem and implemented a prototype in which users can evaluate existing links and suggest new links to be rated. We also provide a time-efficient algorithm for detecting erroneous links in large-scale link repositories without computing all closures required by the property axiom. To know how the datasets are related and how similar they are, we provide a string similarity algorithm called Most Frequent K Characters, which is based on two nested filters, (1) the First Frequency Filter and (2) the Hash Intersection Filter, that allow candidates to be discarded before the actual similarity value is calculated, giving a considerable performance gain. This makes it possible to build a LOD Dataset Relation Index, which provides information about how similar all the datasets in the LOD cloud are, including statistics about the current state of those datasets.
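The idea of discarding candidates before computing a similarity value can be illustrated with a minimal Python sketch. It is not the implementation from the thesis: the function names are hypothetical, the scoring convention (summing the counts of shared characters from both strings) is an assumption, and the pre-check shown is only a simple stand-in for the First Frequency and Hash Intersection filters, whose exact definitions are not given here.

```python
from collections import Counter


def top_k_chars(text: str, k: int) -> dict[str, int]:
    """Map each of the k most frequent characters of `text` to its count."""
    return dict(Counter(text).most_common(k))


def mfkc_similarity(a: str, b: str, k: int = 2) -> int:
    """Sum the counts of characters shared by both top-k profiles.

    Assumption: a shared character contributes its count from both strings.
    """
    profile_a, profile_b = top_k_chars(a, k), top_k_chars(b, k)
    shared = profile_a.keys() & profile_b.keys()
    return sum(profile_a[c] + profile_b[c] for c in shared)


def similar_candidates(
    query: str, candidates: list[str], k: int, threshold: int
) -> list[str]:
    """Keep candidates scoring at least `threshold` (assumed to be >= 1)."""
    query_chars = set(top_k_chars(query, k))
    kept = []
    for candidate in candidates:
        # Hypothetical cheap pre-check: with no shared top-k character the
        # score is 0, so the candidate is dropped before it is scored.
        if query_chars.isdisjoint(top_k_chars(candidate, k)):
            continue
        if mfkc_similarity(query, candidate, k) >= threshold:
            kept.append(candidate)
    return kept
```

Calling `similar_candidates("research", ["seeking", "xyz", "rare"], k=2, threshold=3)` skips `"xyz"` without scoring it, because its most frequent characters do not overlap with those of the query.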

The work in this thesis showed that to identify and query LOD datasets, we need to know how those datasets are related, while assuring consistency. Our analysis demonstrated that most of the datasets are disconnected from the others and need to pass through a consistency and linking process before they can be integrated, providing a way to query a large number of datasets simultaneously. There is still a considerable step to take towards fully queryable LOD datasets; the information contained in this thesis is an essential step towards Identifying, Relating, and Querying datasets on the Web of Data.

1 Introduction and Motivation
1.1 The need for identifying and querying LOD datasets
1.2 The need for consistency of semantic web Linked repositories
1.3 The need for Relation and integration of LOD datasets
1.4 Research Questions and Contributions
1.5 Methodology and Contributions
1.6 General Use Cases
1.6.1 The Heloise project
1.7 Chapter overview
2 Preliminaries
2.1 Semantic Web
2.1.1 URIs and URLs
2.1.2 Linked Data
2.1.3 Resource Description Framework
2.1.4 Ontologies
2.2 RDF graph
2.3 Transitive property
2.4 Equivalence
2.5 Linkset
2.6 RDF graph partitioning
2.7 Basic Graph Pattern
2.8 RDF Dataset
2.9 SPARQL
2.10 Federated Queries
3 State of the Art
3.1 Identifying Datasets in Large Heterogeneous RDF Sources
3.2 Relating Large amount of RDF datasets
3.2.1 Obtaining Similar Resources using String Similarity
3.3 Consistency on Large amount of RDF sources
3.3.1 Heterogeneity in DBpedia Identifiers
3.3.2 Detection of Erroneous Links in Large-Scale RDF Datasets
3.4 Querying Large Heterogeneous RDF Datasets
4 Relation among Large Amount of RDF Sources
4.1 Identifying Datasets in Large Heterogeneous RDF sources
4.1.1 The WIMU approach
4.1.2 The approach
4.1.3 Use cases
4.1.4 Evaluation: Statistics about the Datasets
4.2 Relating RDF sources
4.2.1 The ReLOD approach
4.2.2 The approach
4.2.3 Evaluation
4.3 Relating Similar Resources using String Similarity
4.3.1 The MFKC approach
4.3.2 Approach
4.3.3 Correctness and Completeness
4.3.4 Evaluation
5 Consistency in Large Amount of RDF Sources
5.1 Consistency in Heterogeneous DBpedia Identifiers
5.1.1 The DBpediaSameAs approach
5.1.2 Representation of the idea
5.1.3 The work-flow
5.1.4 Methodology
5.1.5 Evaluation
5.1.6 Normalization on DBpedia URIs
5.1.7 Rate the links
5.1.8 Results
5.1.9 Discussion
5.2 Consistency in Large-Scale RDF sources: Detection of Erroneous Links
5.2.1 The CEDAL approach
5.2.2 Method
5.2.3 Error Types and Quality Measure for Linkset Repositories
5.2.4 Evaluation
5.2.5 Experimental setup
5.3 Detecting Erroneous Link candidates in Educational Link Repositories
5.3.1 The CEDAL education approach
5.3.2 Research questions
5.3.3 Our contributions
5.3.4 Evaluation
6 Querying Large Amount of Heterogeneous RDF Datasets
6.1 Introduction
6.2 Definitions
6.3 The WimuQ
7.1 Identifying Datasets in Large Heterogeneous RDF Sources
7.2 Relating Large Amount of RDF Datasets
7.3 Obtaining Similar Resources Using String Similarity
7.4 Heterogeneity in DBpedia Identifiers
7.5 Detection of Erroneous Links in Large-Scale RDF Datasets
7.7 Querying Large Heterogeneous RDF Datasets

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:73293
Date: 12 January 2021
Creators: VALDESTILHAS, ANDRE
Contributors: Universität Leipzig
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: English
Detected Language: English
Type: info:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text
Rights: info:eu-repo/semantics/openAccess
