
Data extraction & semantic annotation from web query result pages

Our unquenchable thirst for knowledge is one of the few things that really defines our humanity. Yet the Information Age, which we have created, has left us floating aimlessly in a vast ocean of unintelligible data. Hidden Web databases are one massive source of structured data. The contents of these databases are, however, often only accessible through a query submitted by a user. The data returned in these Query Result Pages is intended for human consumption and, as such, has nothing more than an implicit semantic structure which can be understood visually by a human reader, but not by a computer. This thesis presents an investigation into the processes of extraction and semantic understanding of data from Query Result Pages. The work is multi-faceted and includes, at the outset, the development of a vision-based data extraction tool. This work is followed by the development of a number of algorithms which make use of machine learning-based techniques, first to align the extracted data into semantically similar groups and then to assign a meaningful label to each group. Part of the work undertaken in fulfilment of this thesis has also addressed the lack of large, modern datasets containing a wide range of result pages representative of those typically found online today. In particular, a new crowdsourced dataset is presented. Finally, the work concludes by examining techniques from the complementary research field of Information Extraction. An initial, critical assessment of how these mature techniques could be applied to this research area is provided.

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:705642
Date: January 2016
Creators: Anderson, Neil David Alan
Publisher: Queen's University Belfast
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
