
Adding Semantics to Unstructured and Semi-structured Data on the Web

<p> Acquiring vast bodies of knowledge in machine-understandable form is one of the main challenges in artificial intelligence. Information extraction is the task of automatically extracting structured, machine-understandable information from unstructured or semi-structured data. Recent advances in information extraction and the massive scale of data on the Web present a unique opportunity for artificial intelligence systems to acquire knowledge automatically at large scale. However, to realize the full potential of automatically extracted information, it is essential to understand its semantics. </p><p> A key step in understanding the semantics of extracted information is entity linking: the task of mapping a phrase in text to its referent entity in a given knowledge base. In addition to identifying entities mentioned in text, an AI system can benefit significantly from the organization of entities in a taxonomy. While taxonomies are used in a variety of applications, including IBM&rsquo;s Jeopardy-winning Watson system, they demand significant effort to create: they are either manually curated or built using semi-supervised machine learning techniques.</p><p> This dissertation explores methods to automatically infer a taxonomy of entities, given the properties that are usually associated with them (e.g., as a City, Chicago is usually associated with properties like "population" and "area"). Our approach is based on the <i>Property Inheritance hypothesis</i>, which states that entities of a specific type in a taxonomy inherit properties from more general types. We apply this hypothesis to two distinct information extraction tasks, each of which is aimed at understanding the semantics of information mined from the Web. First, we describe two systems: (1) TABEL, a state-of-the-art system that performs entity linking on Web tables, and (2) SKEY, a system that extracts key phrases that summarize a document in a given corpus. 
We then apply topic models that encode our hypothesis in a probabilistic framework to automatically infer a taxonomy in each task.</p>

Identifier: oai:union.ndltd.org:PROQUEST/oai:pqdtoai.proquest.com:10117145
Date: 09 June 2016
Creators: Bhagavatula, Chandra Sekhar
Publisher: Northwestern University
Source Sets: ProQuest.com
Language: English
Detected Language: English
Type: thesis
