This thesis focuses on the extraction and analysis of Web data objects, investigated from different points of view: temporal, structural, semantic. We first survey different strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study on Web feeds for this particular purpose. Next, in the context of dynamically-generated Web pages by content management systems, we present two keyword-based techniques that perform article extraction from such pages. Keywords, either automatically acquired through a Tf−Idf analysis, or extracted from Web feeds, guide the process of object identification, either at the level of a single Web page (SIGFEED algorithm), or across different pages sharing the same template (FOREST algorithm). We finally present, in the context of the deep Web, a generic framework which aims at discovering the semantic model of a Web object (here, data record) by, first, using FOREST for the extraction of objects, and second, by representing the implicit rdf:type similarities between the object attributes and the entity of the Web interface as relationships that, together with the instances extracted from the objects, form a labeled graph. This graph is further aligned to a generic ontology like YAGO for the discovery of the graph's unknown types and relations.
Identifer | oai:union.ndltd.org:CCSD/oai:tel.archives-ouvertes.fr:tel-00922459 |
Date | 29 October 2012 |
Creators | Oita, Marilena |
Publisher | Telecom ParisTech |
Source Sets | CCSD theses-EN-ligne, France |
Language | English |
Detected Language | English |
Type | PhD thesis |
Page generated in 0.0019 seconds