Extracting structured data from Web query result pages

A rapidly increasing number of Web databases are now accessible only via their HTML form-based query interfaces. Comparing services or products across a number of Web sites in a specific domain is time-consuming and tedious, so there is a demand for value-added Web applications that integrate data from multiple sources. To facilitate the development of such applications, we need techniques that automate integrated access to a multitude of database-driven Web sites and integrate data from their underlying databases. This presents three challenges: query form extraction, query form matching and translation, and Web query result extraction. In this thesis, I focus on Web query result extraction, which aims to extract structured data encoded in semi-structured HTML pages and return the extracted data as relational tables. I begin by reviewing the existing approaches to Web query result extraction, categorizing them by their degree of automation as manual, semi-automatic or fully automatic. Within each category, every approach is described in terms of its technical features, followed by an analysis of its advantages and limitations. The literature review leads to my proposed approaches, which address the three sub-problems of Web data extraction: Web data record extraction, Web data alignment and Web data annotation. Each approach is presented in its own chapter, covering the methodology, experiments and related work. The last chapter concludes the thesis.

Identifier oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:709858
Date January 2016
Creators Weng, Daiyue
Publisher Queen's University Belfast
Source Sets Ethos UK
Detected Language English
Type Electronic Thesis or Dissertation
