The Semantic Web, a Web of Data, is an extension of the World Wide Web (WWW), a Web of Documents. A large amount of such data is freely available as Linked Open Data (LOD) for many areas of knowledge, forming the LOD Cloud. While this data conforms to the Resource Description Framework (RDF) and can thus be processed by machines, users who want to access it need to master a formal query language and learn a specific vocabulary. Semantic Question Answering (SQA) systems remove those access barriers by letting the user ask natural language questions that the systems translate into formal queries. Thus, the research area of SQA plays an important role in the acceptance and benefit of the Semantic Web.
The original contributions of this thesis to SQA are as follows. First, we survey the current state of the art of SQA. We complement existing surveys by systematically identifying SQA publications in the chosen timeframe, from the end of 2010 to July 2015: out of 1960 candidates, we manually select 72 publications describing 62 different systems using predefined inclusion and exclusion criteria. The survey identifies common challenges, structured solutions, and recommendations on research opportunities for future systems.
From that point on, we focus on multidimensional numerical data, which is immensely valuable as it influences decisions in health care, policy and finance, among others. With the growth of the open data movement, more and more of it is becoming freely available. A large amount of such data is included in the LOD cloud using the RDF Data Cube (RDC) vocabulary. However, consuming multidimensional numerical data requires experts and specialized tools.
Traditional SQA systems cannot process RDCs because their meta-structure is opaque to applications that expect facts to be encoded in single triples. This motivates our second contribution, the design and implementation of the first SQA algorithm on RDF Data Cubes. We kick-start this new research subfield by creating a user question corpus and a benchmark over multiple data sets. The evaluation of our system on the benchmark, which is included in the public Question Answering over Linked Data (QALD) challenge of 2016, shows the feasibility of the approach, but also highlights challenges, which we discuss in detail as a starting point for future work in the field.
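To illustrate why single-triple lookups fail on data cubes, the following minimal sketch contrasts the two encodings. It is not the algorithm of the thesis. Triples are modelled as plain Python tuples, `qb:Observation` and `qb:dataSet` are terms of the RDF Data Cube vocabulary, and all `ex:` names, the values, and both helper functions are hypothetical and chosen only for illustration.

```python
# Triples are modelled as plain (subject, predicate, object) tuples.
# All ex: names and all values are made up for illustration.

# A fact as traditional SQA systems expect it: a single triple.
single_fact = [("ex:CityA", "ex:population", 1000)]

# A numerical fact in an RDF Data Cube: an observation node that links
# to its data set, its dimension values and its measure value.
cube_fact = [
    ("ex:obs1", "rdf:type", "qb:Observation"),
    ("ex:obs1", "qb:dataSet", "ex:city-budget"),
    ("ex:obs1", "ex:refYear", 2014),
    ("ex:obs1", "ex:department", "ex:Education"),
    ("ex:obs1", "ex:amount", 500),
]


def lookup_single(triples, subject, predicate):
    """Answer a question by matching exactly one triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def lookup_observation(triples, measure, dimensions):
    """Find observations whose dimension values all match and
    read the measure value from the same observation node."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o
    return [
        props[measure]
        for props in by_subject.values()
        if props.get("rdf:type") == "qb:Observation"
        and measure in props
        and all(props.get(d) == v for d, v in dimensions.items())
    ]


# The single-triple strategy works on the flat encoding ...
flat_answer = lookup_single(single_fact, "ex:CityA", "ex:population")
# ... but finds nothing in the cube, where no triple links
# ex:Education directly to an amount.
naive_answer = lookup_single(cube_fact, "ex:Education", "ex:amount")
# The cube has to be queried through the observation instead.
cube_answer = lookup_observation(
    cube_fact, "ex:amount",
    {"ex:refYear": 2014, "ex:department": "ex:Education"},
)
```

In this sketch, the value of interest is only reachable through the observation node, so a system has to constrain several dimension properties at once before it can read the measure.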
The benchmark is based on our final contribution, the addition of 955 financial government spending data sets to the LOD cloud by transforming data sets of the OpenSpending project to RDF Data Cubes. Open spending data has the power to reduce corruption by increasing accountability and to strengthen democracy because voters can make better informed decisions. An informed and trusting public also strengthens the government itself because it is more likely to commit to large projects. OpenSpending.org is an open platform that provides public finance data from governments around the world. The transformation result, called LinkedSpending, consists of more than five million planned and carried out financial transactions in 955 data sets from all over the world as Linked Open Data and is freely available and openly licensed.
1 Introduction
1.1 Motivation
1.2 Research Questions and Contributions
1.3 Thesis Structure
2 Preliminaries
2.1 Semantic Web
2.1.1 URIs and URLs
2.1.2 Linked Data
2.1.3 Resource Description Framework
2.1.4 Ontologies
2.2 Question Answering
2.2.1 History
2.2.2 Definitions
2.2.3 Evaluation
2.2.4 SPARQL
2.2.5 Controlled Vocabulary
2.2.6 Faceted Search
2.2.7 Keyword Search
2.3 Data Cubes
3 Related Work
3.1 Semantic Question Answering
3.1.1 Surveys
3.1.2 Evaluation Campaigns
3.1.3 System Frameworks
3.2 Question Answering on RDF Data Cubes
3.3 RDF Data Cube Data Sets
4 Systematic Survey of Semantic Question Answering
4.1 Methodology
4.1.1 Inclusion Criteria
4.1.2 Exclusion Criteria
4.1.3 Result
4.2 Systems
4.2.1 Implementation
4.2.2 Examples
4.2.3 Answer Presentation
4.3 Challenges
4.3.1 Lexical Gap
4.3.2 Ambiguity
4.3.3 Multilingualism
4.3.4 Complex Queries
4.3.5 Distributed Knowledge
4.3.6 Procedural, Temporal and Spatial Questions
4.3.7 Templates
5 Question Answering on RDF Data Cubes
5.1 Question Corpus
5.2 Corpus Analysis
5.3 Data Cube Operations
5.4 Algorithm
5.4.1 Preprocessing
5.4.2 Matching
5.4.3 Combining Matches to Constraints
5.4.4 Execution
6 LinkedSpending
6.1 Choice of Source Data
6.1.1 Government Spending
6.1.2 OpenSpending
6.2 OpenSpending Source Data
6.3 Conversion of OpenSpending to RDF
6.4 Publishing
6.5 Overview over the Data Sets
6.6 Data Set Quality Analysis
6.6.1 Intrinsic Dimensions
6.6.2 Representational Dimensions
6.7 Evaluation
6.7.1 Experimental Setup and Benchmark
6.7.2 Discussion
7 Conclusion
7.1 Research Question Summary
7.2 SQA Survey
7.2.1 Lexical Gap
7.2.2 Ambiguity
7.2.3 Multilingualism
7.2.4 Complex Operators
7.2.5 Distributed Knowledge
7.2.6 Procedural, Temporal and Spatial Data
7.2.7 Templates
7.2.8 Future Research
7.3 CubeQA
7.4 LinkedSpending
7.4.1 Shortcomings
7.4.2 Future Work
Bibliography
Appendix A The CubeQA Question Corpus
Appendix B The QALD-6 Task 3 Benchmark Questions
B.1 Training Data
B.2 Testing Data
Identifier | oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:74242 |
Date | 26 March 2021 |
Creators | Höffner, Konrad |
Contributors | Universität Leipzig |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/acceptedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text |
Rights | info:eu-repo/semantics/openAccess |