
Deriving Semantic Objects from the Structured Web (Inférer des Objects Sémantiques du Web Structuré)

This thesis focuses on the extraction and analysis of Web data objects, investigated from three points of view: temporal, structural, and semantic. We first survey strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study of Web feeds for this particular purpose. Next, in the context of Web pages dynamically generated by content management systems, we present two keyword-based techniques that perform article extraction from such pages. Keywords, either automatically acquired through a TF-IDF analysis or extracted from Web feeds, guide the process of object identification, either at the level of a single Web page (SIGFEED algorithm) or across different pages sharing the same template (FOREST algorithm). Finally, in the context of the deep Web, we present a generic framework that aims at discovering the semantic model of a Web object (here, a data record). The framework first uses FOREST to extract the objects. It then represents the implicit rdf:type similarities between the object attributes and the entity of the Web interface as relationships that, together with the instances extracted from the objects, form a labeled graph. This graph is further aligned to a generic ontology such as YAGO in order to discover the graph's unknown types and relations.
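As a rough illustration of the keyword-guided idea described in the abstract, the sketch below ranks the terms of one page by TF-IDF against other pages from the same site, then selects the text block that contains the most of those keywords. It is not the SIGFEED or FOREST algorithm from the thesis: the function names (`tfidf_keywords`, `best_block`), the smoothed IDF formula, the use of flat text blocks instead of DOM nodes, and the sample pages are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(target, corpus, k=3):
    """Return the k terms of `target` with the highest TF-IDF score.

    `corpus` holds other documents (here: other pages of the same site),
    so template terms shared by every page receive a low IDF.
    """
    docs = [set(tokenize(d)) for d in corpus]
    n = len(docs)
    tf = Counter(tokenize(target))
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)
        # Smoothed IDF (an assumption of this sketch) avoids division by zero.
        idf = math.log((1 + n) / (1 + df)) + 1.0
        scores[term] = count * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def best_block(blocks, keywords):
    """Return the block containing the most keyword occurrences."""
    kw = set(keywords)
    return max(blocks, key=lambda b: sum(1 for tok in tokenize(b) if tok in kw))


# Hypothetical page, split into text blocks (stand-ins for DOM nodes).
blocks = [
    "Home News Sports Contact",
    "Volcano eruption forces evacuation as volcano ash spreads near the village",
    "Copyright Example Site",
]
# Hypothetical sibling pages that share the same navigation and footer.
other_pages = [
    "Home News Sports Contact Election results announced Copyright Example Site",
    "Home News Sports Contact Team wins final match Copyright Example Site",
]

keywords = tfidf_keywords(" ".join(blocks), other_pages, k=3)
article = best_block(blocks, keywords)
```

In this toy setting the navigation and footer terms appear on every page and score low, so the selected block is the article text. A real extractor would operate on the page's DOM structure and would need to handle ties, stop words, and pages with several candidate regions.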

Identifier: oai:union.ndltd.org:CCSD/oai:tel.archives-ouvertes.fr:tel-00922459
Date: 29 October 2012
Creators: Oita, Marilena
Publisher: Telecom ParisTech
Source Sets: CCSD theses-EN-ligne, France
Language: English
Detected Language: English
Type: PhD thesis
