Since Tim Berners-Lee proposed hypertext to his employer CERN on March 12, 1989, the World Wide Web has grown to more than one billion Web pages and is still growing.
With the later Semantic Web vision, Berners-Lee et al. suggested an extension of the existing (Document) Web to allow better reuse, sharing and understanding of data.
Both the Document Web and the Web of Data (which is the current implementation of the Semantic Web) grow continuously. This is a mixed blessing, as the two forms of the Web grow concurrently and most commonly contain different pieces of information. Modern information systems must thus bridge a Semantic Gap to allow holistic and unified access to a particular piece of information, independent of how the data is represented.
One way to bridge the gap between the two forms of the Web is the extraction of structured data, i.e., RDF, from the growing amount of unstructured and semi-structured information (e.g., tables and XML) on the Document Web. Note that unstructured data refers to any type of textual information, such as news, blogs or tweets.
While extracting structured data from unstructured data allows the development of powerful information systems, it requires high-quality and scalable knowledge extraction frameworks to yield useful results. The dire need for such approaches has led to the development of a multitude of annotation frameworks and tools. However, most of these approaches are not evaluated on the same datasets or with the same measures. The resulting Evaluation Gap needs to be tackled by a concise evaluation framework that fosters fine-grained and uniform evaluations of annotation tools and frameworks over any knowledge base.
Moreover, with the constant growth of data and the ongoing decentralization of knowledge, intuitive ways for non-experts to access the generated data are required. Humans have adapted their search behavior to current Web data through access paradigms such as keyword search in order to retrieve high-quality results. Hence, most Web users expect only Web documents in return. However, humans think and most commonly express their information needs in natural language rather than in keyword phrases. Answering complex information needs often requires combining knowledge from various, differently structured data sources. Thus, we observe an Information Gap between natural-language questions and current keyword-based search paradigms, which in addition do not make use of the available structured and unstructured data sources. Question Answering (QA) systems provide an easy and efficient way to bridge this gap by allowing users to query data via natural language, thus reducing (1) a possible loss of precision and (2) a potential loss of time incurred when reformulating the search intention into a machine-readable form. Furthermore, QA systems can answer natural-language queries with concise results instead of links to verbose Web documents. Additionally, they allow and encourage access to, and the combination of, knowledge from heterogeneous knowledge bases (KBs) within one answer.
Consequently, three main research gaps are considered and addressed in this work:
First, addressing the Semantic Gap between the unstructured Document Web and the Web of Data requires the development of scalable and accurate approaches for the extraction of structured data in RDF. This research challenge is addressed by several approaches within this thesis. This thesis presents CETUS, an approach for recognizing entity types to populate RDF KBs. Furthermore, our knowledge-base-agnostic disambiguation framework AGDISTIS can efficiently detect the correct URIs for a given set of named entities. Additionally, we introduce REX, a Web-scale framework for RDF extraction from semi-structured (i.e., templated) websites, which makes use of the semantics of the reference knowledge base to check the extracted data.
The ongoing research on closing the Semantic Gap has already yielded a large number of annotation tools and frameworks. However, these approaches are still hard to compare, since the published evaluation results are calculated on diverse datasets and evaluated with different measures. On the other hand, the issue of comparability of results is not to be regarded as intrinsic to the annotation task. Indeed, it is now well established that scientists spend between 60% and 80% of their time preparing data for experiments. That data preparation is so tedious in the annotation domain is mostly due to the different formats of the gold standards as well as the different data representations across reference datasets.
We tackle the resulting Evaluation Gap in two ways: First, we introduce a collection of three novel datasets, dubbed N3, to leverage the possibility of optimizing named entity recognition (NER) and named entity disambiguation (NED) algorithms via Linked Data and to ensure maximal interoperability, overcoming the need for corpus-specific parsers. Second, we present GERBIL, an evaluation framework for semantic entity annotation. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools and frameworks on multiple datasets.
The decentralized architecture behind the Web has led to pieces of information being distributed across data sources with varying structure. Moreover, the increasing demand for natural-language interfaces, as evidenced by current mobile applications, requires systems to deeply understand the underlying user information need. In conclusion, a natural-language interface for asking questions requires a hybrid approach to data usage, i.e., simultaneously performing a search on full-texts and semantic knowledge bases.
To close the Information Gap, this thesis presents HAWK, a novel entity search approach developed for hybrid QA based on combining structured RDF and unstructured full-text data sources.
Identifier | oai:union.ndltd.org:DRESDEN/oai:qucosa.de:bsz:15-qucosa-225097 |
Date | 22 May 2017 |
Creators | Usbeck, Ricardo |
Contributors | Universität Leipzig, Fakultät für Mathematik und Informatik, Prof. Dr. Klaus-Peter Fähnrich, Prof. Dr. Axel-Cyrille Ngonga Ngomo, Dr. Andreas Both, Prof. Dr. Klaus-Peter Fähnrich, Prof. Dr. Philipp Cimiano |
Publisher | Universitätsbibliothek Leipzig |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | doc-type:doctoralThesis |
Format | application/pdf |