Global ETD Search

71	Question Answering auf dem Lehrbuch 'Health Information Systems' mit Hilfe von unüberwachtem Training eines Pretrained Transformers Keller, Paul 27 November 2023 (has links) Die Extraktion von Wissen aus Büchern ist essentiell und komplex. Besonders in der Medizininformatik ist ein einfacher und vollständiger Zugang zu Wissen wichtig. In dieser Arbeit wurde ein vortrainiertes Sprachmodell verwendet, um den Inhalt des Buches Health Information Systems von Winter u. a. (2023) effizienter und einfacher zugänglich zu machen. Während des Trainings wurde die Qualität des Modells zu verschiedenen Zeitpunkten evaluiert. Dazu beantwortete das Modell Prüfungsfragen aus dem Buch und aus Modulen der Universität Leipzig, die inhaltlich auf dem Buch aufbauen. Abschließend wurde ein Vergleich zwischen den Trainingszeitpunkten, dem nicht weiter trainierten Modell und dem Stand der Technik Modell GPT4 durchgeführt. Mit einem MakroF1-Wert von 0,7 erreichte das Modell GPT4 die höchste Korrektheit bei der Beantwortung der Klausurfragen. Diese Leistung konnte von den anderen Modellen nicht erreicht werden. Allerdings stieg die Leistung von einem anfänglichen MakroF1-Wert von 0,13 durch kontinuierliches Training auf 0,33. Die Ergebnisse zeigen eine deutliche Leistungssteigerung durch diesen Ansatz und bieten eine Grundlage für zukünftige Erweiterungen. Damit ist die Machbarkeit der Beantwortung von Fragen zu Informationssystemen im Gesundheitswesen und der Lösung einer Beispielklausur mit Hilfe von weiter trainierten Sprachmodellen gezeigt, eine praktische Anwendung erreichen diese Modelle jedoch nicht, da sowohl die Leistung unter dem aktuellen Stand der Technik liegt als auch die hier vorgestellten Modelle einen Großteil der gestellten Fragen nicht vollständig korrekt beantworten können.:1 Einleitung 1.1 Gegenstand 1.2 Problemstellung 1.3 Motivation 1.4 Zielsetzung 1.5 Bezug zu ethischen Leitlinien der GMDS 1.6 Aufgabenstellung 1.7 Aufbau der Arbeit 2 Grundlagen 9 2.1 Sprachmodelle 2.1.1 Transformer-Modelle 2.1.2 Transformer-spezifische Architekturen 2.1.3 Eigenheiten von Transformer-Modellen 2.1.4 Eingaben von Transformer-Modellen 2.2 Neuronale Netze 2.2.1 Architektur 2.2.2 Funktionsweise 2.2.3 Training 2.3 Datenverarbeitung 2.3.1 Glossar der Daten 3 Stand der Forschung 3.1 Continual Pretraining 3.2 Aktuelle Modelle und deren Nutzbarkeit 3.3 Forschung und Probleme von Modellen 4 Lösungsansatz 4.1 Auswahl von Sprachmodellen 4.2 Datenkuration 4.2.1 Extraktion des Textes 4.2.2 Unverständliche Formate 4.2.3 Textpassagen ohne Wissen oder Kontext 4.2.4 Optionale Textentfernungen 4.2.5 Bleibende Texte 4.2.6 Formatierung von Text 4.2.7 Potentielle Extraktion von Fragen 4.3 Unüberwachtes Weitertrainieren 4.3.1 Ausführen der Training-Programme 4.4 Klausurfragen 4.5 Modellevaluation 5 Ausführung der Lösung 5.1 Herunterladen des Modells 5.2 Training des Modells 5.2.1 Konfiguration des Modells 5.2.2 Konfiguration der Trainingsdaten 5.2.3 Konfiguration des Trainings 5.2.4 Konfiguration des DeepSpeed Trainings 5.2.5 Verwendete Bibliotheken zum Training 5.2.6 Training auf einem GPU Computing Cluster 5.2.7 Probleme während des Trainings 5.3 Generierung von Antworten 5.3.1 Erstellung des Evaluierungsdatensatzes 5.4 Bewertung der generierten Antworten 5.5 Evaluation der Modelle 5.5.1 Kriterium: Korrektheit 5.5.2 Kriterium: Erklärbarkeit 5.5.3 Kriterium: Fragenverständnis 5.5.4 Kriterium: Robustheit 6 Ergebnisse 6.1 Analyse Korrektheit 6.1.1 Vergleich totaler Zahlen 6.1.2 Stärken und Schwächen der Modelle 6.1.3 Verbesserungen durch Training 6.1.4 Vergleich MakroF1 6.1.5 Zusammenfassung 6.2 Analyse Erklärbarkeit 6.3 Analyse Fragenverständnis 6.4 Analyse Robustheit 6.5 Zusammenfassung 7 Diskussion 7.1 Grenzen der Modelle 7.2 Probleme bei Kernfragen 7.3 Bewertung der Fragen mit Prüfungspunkten 7.4 Lösung des Problems 8 Ausblick 8.1 Modellvergrößerung 8.1.1 Training durch Quantisierung 8.2 Human Reinforcement Learning 8.3 Datensatzvergrößerung 8.4 Domänenspezifische Modelle 8.5 Adapter-basiertes Training 8.6 Textextraktion aus Kontext 8.7 Retrieval Augmented Generation 8.8 Zusammenfassung Zusammenfassung info:eu-repo/classification/ddc/004 ddc:004
72	Grounded and Consistent Question Answering Alberti, Christopher Brian January 2023 (has links) This thesis describes advancements in question answering along three general directions: model architecture extensions, explainable question answering, and data augmentation. Chapter 2 describes the first state-of-the-art model for the Natural Questions dataset based on pretrained transformers. Chapters 3 and 4 describe extensions to the model architecture designed to accommodate long textual inputs and multimodal text+image inputs, establishing new state-of-the-art results on the Natural Questions and on the VCR dataset. Chapter 5 shows that significant improvements can be obtained with data augmentation on the SQuAD and Natural Questions dataset, introducing roundtrip consistency as a simple heuristic to improve the quality of synthetic data. In Chapters 6 and 7 we explore explainable question answering, demonstrating the usefulness of a new concrete kind of structured explanations, QED, and proposing a semantic analysis of why-questions in the Natural Questions, as a way of better understanding the nature of real world explanations. Finally, in Chapters 8 and 9 we delve into more exploratory data augmentation techniques for question answering. We look respectively at how straight-through gradients can be utilized to optimize roundtrip consistency in a pipeline of models on the fly, and at how very recent large language models like PaLM can be used to generate synthetic question answering datasets for new languages given as few as five representative examples per language. Computer science Question-answering systems Computational linguistics
73	Transfer Learning and Attention Mechanisms in a Multimodal Setting Greco, Claudio 13 May 2022 (has links) Humans are able to develop a solid knowledge of the world around them: they can leverage information coming from different sources (e.g., language, vision), focus on the most relevant information from the input they receive in a given life situation, and exploit what they have learned before without forgetting it. In the field of Artificial Intelligence and Computational Linguistics, replicating these human abilities in artificial models is a major challenge. Recently, models based on pre-training and on attention mechanisms, namely pre-trained multimodal Transformers, have been developed. They seem to perform tasks surprisingly well compared to other computational models in multiple contexts. They simulate a human-like cognition in that they supposedly rely on previously acquired knowledge (transfer learning) and focus on the most important information (attention mechanisms) of the input. Nevertheless, we still do not know whether these models can deal with multimodal tasks that require merging different types of information simultaneously to be solved, as humans would do. This thesis attempts to fill this crucial gap in our knowledge of multimodal models by investigating the ability of pre-trained Transformers to encode multimodal information; and the ability of attention-based models to remember how to deal with previously-solved tasks. With regards to pre-trained Transformers, we focused on their ability to rely on pre-training and on attention while dealing with tasks requiring to merge information coming from language and vision. More precisely, we investigate if pre-trained multimodal Transformers are able to understand the internal structure of a dialogue (e.g., organization of the turns); to effectively solve complex spatial questions requiring to process different spatial elements (e.g., regions of the image, proximity between elements, etc.); and to make predictions based on complementary multimodal cues (e.g., guessing the most plausible action by leveraging the content of a sentence and of an image). The results of this thesis indicate that pre-trained Transformers outperform other models. Indeed, they are able to some extent to integrate complementary multimodal information; they manage to pinpoint both the relevant turns in a dialogue and the most important regions in an image. These results suggest that pre-training and attention play a key role in pre-trained Transformers’ encoding. Nevertheless, their way of processing information cannot be considered as human-like. Indeed, when compared to humans, they struggle (as non-pre-trained models do) to understand negative answers, to merge spatial information in difficult questions, and to predict actions based on complementary linguistic and visual cues. With regards to attention-based models, we found out that these kinds of models tend to forget what they have learned in previously-solved tasks. However, training these models on easy tasks before more complex ones seems to mitigate this catastrophic forgetting phenomenon. These results indicate that, at least in this context, attention-based models (and, supposedly, pre-trained Transformers too) are sensitive to tasks’ order. A better control of this variable may therefore help multimodal models learn sequentially and continuously as humans do. Settore INF/01 - Informatica
74	Automatic Question Answering and Knowledge Discovery from Electronic Health Records Wang, Ping 25 August 2021 (has links) Electronic Health Records (EHR) data contain comprehensive longitudinal patient information, which is usually stored in databases in the form of either multi-relational structured tables or unstructured texts, e.g., clinical notes. EHR provides a useful resource to assist doctors' decision making, however, they also present many unique challenges that limit the efficient use of the valuable information, such as large data volume, heterogeneous and dynamic information, medical term abbreviations, and noisy nature caused by misspelled words. This dissertation focuses on the development and evaluation of advanced machine learning algorithms to solve the following research questions: (1) How to seek answers from EHR for clinical activity related questions posed in human language without the assistance of database and natural language processing (NLP) domain experts, (2) How to discover underlying relationships of different events and entities in structured tabular EHRs, and (3) How to predict when a medical event will occur and estimate its probability based on previous medical information of patients. First, to automatically retrieve answers for natural language questions from the structured tables in EHR, we study the question-to-SQL generation task by generating the corresponding SQL query of the input question. We propose a translation-edit model driven by a language generation module and an editing module for the SQL query generation task. This model helps automatically translate clinical activity related questions to SQL queries, so that the doctors only need to provide their questions in natural language to get the answers they need. We also create a large-scale dataset for question answering on tabular EHR to simulate a more realistic setting. Our performance evaluation shows that the proposed model is effective in handling the unique challenges about clinical terminologies, such as abbreviations and misspelled words. Second, to automatically identify answers for natural language questions from unstructured clinical notes in EHR, we propose to achieve this goal by querying a knowledge base constructed based on fine-grained document-level expert annotations of clinical records for various NLP tasks. We first create a dataset for clinical knowledge base question answering with two sets: clinical knowledge base and question-answer pairs. An attention-based aspect-level reasoning model is developed and evaluated on the new dataset. Our experimental analysis shows that it is effective in identifying answers and also allows us to analyze the impact of different answer aspects in predicting correct answers. Third, we focus on discovering underlying relationships of different entities (e.g., patient, disease, medication, and treatment) in tabular EHR, which can be formulated as a link prediction problem in graph domain. We develop a self-supervised learning framework for better representation learning of entities across a large corpus and also consider local contextual information for the down-stream link prediction task. We demonstrate the effectiveness, interpretability, and scalability of the proposed model on the healthcare network built from tabular EHR. It is also successfully applied to solve link prediction problems in a variety of domains, such as e-commerce, social networks, and academic networks. Finally, to dynamically predict the occurrence of multiple correlated medical events, we formulate the problem as a temporal (multiple time-points) and multi-task learning problem using tensor representation. We propose an algorithm to jointly and dynamically predict several survival problems at each time point and optimize it with the Alternating Direction Methods of Multipliers (ADMM) algorithm. The model allows us to consider both the dependencies between different tasks and the correlations of each task at different time points. We evaluate the proposed model on two real-world applications and demonstrate its effectiveness and interpretability. / Doctor of Philosophy / Healthcare is an important part of our lives. Due to the recent advances of data collection and storing techniques, a large amount of medical information is generated and stored in Electronic Health Records (EHR). By comprehensively documenting the longitudinal medical history information about a large patient cohort, this EHR data forms a fundamental resource in assisting doctors' decision making including optimization of treatments for patients and selection of patients for clinical trials. However, EHR data also presents a number of unique challenges, such as (i) large-scale and dynamic data, (ii) heterogeneity of medical information, and (iii) medical term abbreviation. It is difficult for doctors to effectively utilize such complex data collected in a typical clinical practice. Therefore, it is imperative to develop advanced methods that are helpful for efficient use of EHR and further benefit doctors in their clinical decision making. This dissertation focuses on automatically retrieving useful medical information, analyzing complex relationships of medical entities, and detecting future medical outcomes from EHR data. In order to retrieve information from EHR efficiently, we develop deep learning based algorithms that can automatically answer various clinical questions on structured and unstructured EHR data. These algorithms can help us understand more about the challenges in retrieving information from different data types in EHR. We also build a clinical knowledge graph based on EHR and link the distributed medical information and further perform the link prediction task, which allows us to analyze the complex underlying relationships of various medical entities. In addition, we propose a temporal multi-task survival analysis method to dynamically predict multiple medical events at the same time and identify the most important factors leading to the future medical events. By handling these unique challenges in EHR and developing suitable approaches, we hope to improve the efficiency of information retrieval and predictive modeling in healthcare. Electronic Health Records Question Answering Knowledge Discovery Knowledge Graph Survival Analysis
75	Building a Trustworthy Question Answering System for Covid-19 Tracking Liu, Yiqing 02 September 2021 (has links) During the unprecedented global pandemic of Covid-19, the general public is suffering from inaccurate Covid-19 related information including outdated information and fake news. The most used media: TV, social media, newspaper, and radio are incompetent in providing certitude and flash updates that people are seeking. In order to cope with this challenge, several public data resources that are dedicated to providing Covid-19 information were born. They rallied with experts from different fields to provide authoritative and up-to-date pandemic updates. However, the general public cannot still make complete use of such resources since the learning curve is too steep, especially for the aged and under-aged users. To address this problem, in this Thesis, we propose a question answering system that can be interacted with using simple natural language-based sentences. While building this system, we investigate qualified public data resources and from the data content they are providing, and we collect a set of frequently asked questions for Covid-19 tracking. We further build a dedicated dataset named CovidQA for evaluating the performance of the question answering system with different models. Based on the new dataset, we assess multiple machine learning-based models that are built for retrieving relevant information from databases, and then propose two empirical models which utilize the pre-defined templates to generate SQL queries. In our experiments, we demonstrate both quantitative and qualitative results and provide a comprehensive comparison between different types of methods. The results show that the proposed template-based methods are simple but effective in building question answering systems for specific domain problems. / Master of Science / During the unprecedented global pandemic of Covid-19, the general public is suffering from inaccurate Covid-19 related information including outdated information and fake news. The most used media: TV, social media, newspaper, and radio are incompetent in providing certitude and flash updates that people are seeking. In order to cope with this challenge, several public data resources that are dedicated to providing Covid-19 information were born. They rallied with experts from different fields to provide authoritative and up-to-date pandemic updates. However, there is room for improvement in terms of user experience. To address this problem, in this Thesis, we propose a system that can be interacted with using natural questions. While building this system, we evaluate and choose six qualified public data providers as the data sources. We further build a testing dataset for evaluating the performance of the system. We assess two Artificial Intelligence-powered models for the system, and then propose two rule-based models for the researched problem. In our experiments, we provide a comprehensive comparison between different types of methods. The results show that the proposed rule-based methods are simple but effective in building such systems. Information Retrieval Question Answering Database Machine Learning Natural Language Processing Healthcare Covid-19 Dashboard
76	Leveraging Multimodal Perspectives to Learn Common Sense for Vision and Language Tasks Lin, Xiao 05 October 2017 (has links) Learning and reasoning with common sense is a challenging problem in Artificial Intelligence (AI). Humans have the remarkable ability to interpret images and text from different perspectives in multiple modalities, and to use large amounts of commonsense knowledge while performing visual or textual tasks. Inspired by that ability, we approach commonsense learning as leveraging perspectives from multiple modalities for images and text in the context of vision and language tasks. Given a target task (e.g., textual reasoning, matching images with captions), our system first represents input images and text in multiple modalities (e.g., vision, text, abstract scenes and facts). Those modalities provide different perspectives to interpret the input images and text. And then based on those perspectives, the system performs reasoning to make a joint prediction for the target task. Surprisingly, we show that interpreting textual assertions and scene descriptions in the modality of abstract scenes improves performance on various textual reasoning tasks, and interpreting images in the modality of Visual Question Answering improves performance on caption retrieval, which is a visual reasoning task. With grounding, imagination and question-answering approaches to interpret images and text in different modalities, we show that learning commonsense knowledge from multiple modalities effectively improves the performance of downstream vision and language tasks, improves interpretability of the model and is able to make more efficient use of training data. Complementary to the model aspect, we also study the data aspect of commonsense learning in vision and language. We study active learning for Visual Question Answering (VQA) where a model iteratively grows its knowledge through querying informative questions about images for answers. Drawing analogies from human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a new goal-driven scoring function for deep VQA models under the Bayesian Neural Network framework. Once trained with a large initial training set, a deep VQA model is able to efficiently query informative question-image pairs for answers to improve itself through active learning, saving human effort on commonsense annotations. / Ph. D. / Designing systems that learn and reason with common sense is a challenging problem in Artificial Intelligence (AI). Humans have the remarkable ability to interpret images and text from different perspectives in multiple modalities, and to use large amounts of commonsense knowledge while performing visual or textual tasks. Inspired by that ability, we approach commonsense learning as leveraging perspectives from multiple modalities for images and text in the context of vision and language tasks. Given a target task, our system first represents the input information (e.g., images and text) in multiple modalities (e.g., vision, text, abstract scenes and facts). Those modalities provide different perspectives to interpret the input information. Based on those perspectives, the system performs reasoning to make a joint prediction to solve the target task. Perhaps surprisingly, we show that imagining (generating) abstract scenes behind input textual scene descriptions improves performance on various textual reasoning tasks such as answering fill-in-the-blank and paraphrasing questions, and answering questions about images improves performance on retrieving image captions. Through the use of perspectives from multiple modalities, our system also makes use of training data more efficiently and has a reasoning process that is easy to understand. Complementary to the system design aspect, we also study the data aspect of commonsense learning in vision and language. We study active learning for Visual Question Answering (VQA). VQA is the task of answering open-ended natural language questions about images. In active learning for VQA, a model iteratively grows its knowledge through querying informative questions about images for answers. Inspired by human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a new goal-driven query selection function. We show that once initialized with a large training set, a VQA model is able to efficiently query informative question-image pairs for answers to improve itself through active learning, saving human effort on commonsense annotations. Common Sense Multimodal Visual Question Answering Image-Caption Ranking Vision and Language Active Learning
77	Identifying reputation collectors in community question answering (CQA) sites: Exploring the dark side of social media Roy, P.K., Singh, J.P., Baabdullah, A.M., Kizgin, Hatice, Rana, Nripendra P. 08 August 2019 (has links) Yes / This research aims to identify users who are posting as well as encouraging others to post low-quality and duplicate contents on community question answering sites. The good guys called Caretakers and the bad guys called Reputation Collectors are characterised by their behaviour, answering pattern and reputation points. The proposed system is developed and analysed over publicly available Stack Exchange data dump. A graph based methodology is employed to derive the characteristic of Reputation Collectors and Caretakers. Results reveal that Reputation Collectors are primary sources of low-quality answers as well as answers to duplicate questions posted on the site. The Caretakers answer limited questions of challenging nature and fetches maximum reputation against those questions whereas Reputation Collectors answers have so many low-quality and duplicate questions to gain the reputation point. We have developed algorithms to identify the Caretakers and Reputation Collectors of the site. Our analysis finds that 1.05% of Reputation Collectors post 18.88% of low quality answers. This study extends previous research by identifying the Reputation Collectors and 2 how they collect their reputation points. Community Question Answering CQA Reputation collectors Expert users Stack exchange Data analysis
78	Recommending best answer in a collaborative question answering system Chen, Lin January 2009 (has links) The World Wide Web has become a medium for people to share information. People use Web-based collaborative tools such as question answering (QA) portals, blogs/forums, email and instant messaging to acquire information and to form online-based communities. In an online QA portal, a user asks a question and other users can provide answers based on their knowledge, with the question usually being answered by many users. It can become overwhelming and/or time/resource consuming for a user to read all of the answers provided for a given question. Thus, there exists a need for a mechanism to rank the provided answers so users can focus on only reading good quality answers. The majority of online QA systems use user feedback to rank users’ answers and the user who asked the question can decide on the best answer. Other users who didn’t participate in answering the question can also vote to determine the best answer. However, ranking the best answer via this collaborative method is time consuming and requires an ongoing continuous involvement of users to provide the needed feedback. The objective of this research is to discover a way to recommend the best answer as part of a ranked list of answers for a posted question automatically, without the need for user feedback. The proposed approach combines both a non-content-based reputation method and a content-based method to solve the problem of recommending the best answer to the user who posted the question. The non-content method assigns a score to each user which reflects the users’ reputation level in using the QA portal system. Each user is assigned two types of non-content-based reputations cores: a local reputation score and a global reputation score. The local reputation score plays an important role in deciding the reputation level of a user for the category in which the question is asked. The global reputation score indicates the prestige of a user across all of the categories in the QA system. Due to the possibility of user cheating, such as awarding the best answer to a friend regardless of the answer quality, a content-based method for determining the quality of a given answer is proposed, alongside the non-content-based reputation method. Answers for a question from different users are compared with an ideal (or expert) answer using traditional Information Retrieval and Natural Language Processing techniques. Each answer provided for a question is assigned a content score according to how well it matched the ideal answer. To evaluate the performance of the proposed methods, each recommended best answer is compared with the best answer determined by one of the most popular link analysis methods, Hyperlink-Induced Topic Search (HITS). The proposed methods are able to yield high accuracy, as shown by correlation scores: Kendall correlation and Spearman correlation. The reputation method outperforms the HITS method in terms of recommending the best answer. The inclusion of the reputation score with the content score improves the overall performance, which is measured through the use of Top-n match scores. Yahoo! Answers
79	Odpovídání na otázky nad strukturovanými daty / Question Answering over Structured Data Birger, Mark January 2017 (has links) Tato práce se zabývá problematikou odpovídání na otázky nad strukturovanými daty. Ve většině případů jsou strukturovaná data reprezentována pomocí propojených grafů, avšak ukrytí koncové struktury dát je podstatné pro využití podobných systémů jako součástí rozhraní s přirozeným jazykem. Odpovídající systém byl navržen a vyvíjen v rámci této práce. V porovnání s tradičními odpovídajícími systémy, které jsou založené na lingvistické analýze nebo statistických metodách, náš systém zkoumá poskytnutý graf a ve výsledků generuje sémantické vazby na základě vstupních párů otázka-odpověd'. Vyvíjený systém je nezávislý na struktuře dát, ale pro účely vyhodnocení jsme využili soubor dát z Wikidata a DBpedia. Kvalita výsledného systému a zkoumaného přístupu byla vyhodnocena s využitím připraveného datasetu a standartních metrik.
80	Knowledge Extraction for Hybrid Question Answering Usbeck, Ricardo 18 May 2017 (has links) Since the proposal of hypertext by Tim Berners-Lee to his employer CERN on March 12, 1989 the World Wide Web has grown to more than one billion Web pages and still grows. With the later proposed Semantic Web vision,Berners-Lee et al. suggested an extension of the existing (Document) Web to allow better reuse, sharing and understanding of data. Both the Document Web and the Web of Data (which is the current implementation of the Semantic Web) grow continuously. This is a mixed blessing, as the two forms of the Web grow concurrently and most commonly contain different pieces of information. Modern information systems must thus bridge a Semantic Gap to allow a holistic and unified access to information about a particular information independent of the representation of the data. One way to bridge the gap between the two forms of the Web is the extraction of structured data, i.e., RDF, from the growing amount of unstructured and semi-structured information (e.g., tables and XML) on the Document Web. Note, that unstructured data stands for any type of textual information like news, blogs or tweets. While extracting structured data from unstructured data allows the development of powerful information system, it requires high-quality and scalable knowledge extraction frameworks to lead to useful results. The dire need for such approaches has led to the development of a multitude of annotation frameworks and tools. However, most of these approaches are not evaluated on the same datasets or using the same measures. The resulting Evaluation Gap needs to be tackled by a concise evaluation framework to foster fine-grained and uniform evaluations of annotation tools and frameworks over any knowledge bases. Moreover, with the constant growth of data and the ongoing decentralization of knowledge, intuitive ways for non-experts to access the generated data are required. Humans adapted their search behavior to current Web data by access paradigms such as keyword search so as to retrieve high-quality results. Hence, most Web users only expect Web documents in return. However, humans think and most commonly express their information needs in their natural language rather than using keyword phrases. Answering complex information needs often requires the combination of knowledge from various, differently structured data sources. Thus, we observe an Information Gap between natural-language questions and current keyword-based search paradigms, which in addition do not make use of the available structured and unstructured data sources. Question Answering (QA) systems provide an easy and efficient way to bridge this gap by allowing to query data via natural language, thus reducing (1) a possible loss of precision and (2) potential loss of time while reformulating the search intention to transform it into a machine-readable way. Furthermore, QA systems enable answering natural language queries with concise results instead of links to verbose Web documents. Additionally, they allow as well as encourage the access to and the combination of knowledge from heterogeneous knowledge bases (KBs) within one answer. Consequently, three main research gaps are considered and addressed in this work: First, addressing the Semantic Gap between the unstructured Document Web and the Semantic Gap requires the development of scalable and accurate approaches for the extraction of structured data in RDF. This research challenge is addressed by several approaches within this thesis. This thesis presents CETUS, an approach for recognizing entity types to populate RDF KBs. Furthermore, our knowledge base-agnostic disambiguation framework AGDISTIS can efficiently detect the correct URIs for a given set of named entities. Additionally, we introduce REX, a Web-scale framework for RDF extraction from semi-structured (i.e., templated) websites which makes use of the semantics of the reference knowledge based to check the extracted data. The ongoing research on closing the Semantic Gap has already yielded a large number of annotation tools and frameworks. However, these approaches are currently still hard to compare since the published evaluation results are calculated on diverse datasets and evaluated based on different measures. On the other hand, the issue of comparability of results is not to be regarded as being intrinsic to the annotation task. Indeed, it is now well established that scientists spend between 60% and 80% of their time preparing data for experiments. Data preparation being such a tedious problem in the annotation domain is mostly due to the different formats of the gold standards as well as the different data representations across reference datasets. We tackle the resulting Evaluation Gap in two ways: First, we introduce a collection of three novel datasets, dubbed N3, to leverage the possibility of optimizing NER and NED algorithms via Linked Data and to ensure a maximal interoperability to overcome the need for corpus-specific parsers. Second, we present GERBIL, an evaluation framework for semantic entity annotation. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools and frameworks on multiple datasets. The decentral architecture behind the Web has led to pieces of information being distributed across data sources with varying structure. Moreover, the increasing the demand for natural-language interfaces as depicted by current mobile applications requires systems to deeply understand the underlying user information need. In conclusion, the natural language interface for asking questions requires a hybrid approach to data usage, i.e., simultaneously performing a search on full-texts and semantic knowledge bases. To close the Information Gap, this thesis presents HAWK, a novel entity search approach developed for hybrid QA based on combining structured RDF and unstructured full-text data sources. info:eu-repo/classification/ddc/000 ddc:000

Search results