Our unquenchable thirst for knowledge is one of the few things that really defines our humanity. Yet the Information Age, which we have created, has left us floating aimlessly in a vast ocean of unintelligible data. Hidden Web databases are one massive source of structured data. The contents of these databases are, however, often only accessible through a query proposed by a user. The data returned in these Query Result Pages is intended for human consumption and, as such, has nothing more than an implicit semantic structure which can be understood visually by a human reader, but not by a computer. This thesis presents an investigation into the processes of extraction and semantic understanding of data from Query Result Pages. The work is multi-faceted and includes, at the outset, the development of a vision-based data extraction tool. This work is followed by the development of a number of algorithms which make use of machine learning-based techniques, first to align the extracted data into semantically similar groups and then to assign a meaningful label to each group. Part of the work undertaken in fulfilment of this thesis has also addressed the lack of large, modern datasets containing a wide range of result pages representative of those typically found online today. In particular, a new, innovative crowdsourced dataset is presented. Finally, the work concludes by examining techniques from the complementary research field of Information Extraction. An initial, critical assessment of how these mature techniques could be applied to this research area is provided.
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:705642 |
Date | January 2016 |
Creators | Anderson, Neil David Alan |
Publisher | Queen's University Belfast |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |