
Automated domain-aware form understanding with OPAL : with a case study in the UK real-estate domain

Web forms are the interfaces to the deep web, and automated form understanding is the key to unlocking its contents. It is a fundamental problem in many applications and research fields, such as deep web crawling, data integration, and information extraction. It is also essential for improving web usability and accessibility.
Form understanding is an inherently empirical problem. Existing form understanding approaches are restricted because they exploit limited, domain-independent feature sets, which leads to overly generic and monolithic algorithms. In response, we present OPAL (Ontology based web Pattern Analysis with Logic), a domain-aware form understanding approach that addresses these limitations through a novel multi-scope approach. OPAL achieves this through a domain-independent form labeling and a domain-dependent form interpretation. In form labeling, OPAL associates texts with fields as labels through three domain-independent scopes exploiting textual, structural, and visual information. In form interpretation, OPAL integrates the form labeling obtained with a layer of high-level domain knowledge to classify form fields and to repair the form model.
To ease the task of designing domain schemata, we develop the template language OPAL-TL to express domain types and their structural constraints. With OPAL-TL, we describe common design patterns as templates maintained in a library. Thus, adaptation to a new domain often requires only instantiating the templates with the corresponding domain types.
We conduct extensive experiments that cover both domain-independent, cross-domain testing with standard form understanding benchmarks and a domain-aware evaluation with two datasets randomly selected from the real estate and used car domains. OPAL outperforms previous work by a significant margin and pushes the state of the art to near-perfect accuracy (> 98%).
In an effort to integrate OPAL with an entire data extraction pipeline, we plan to extend OPAL with form probing and to exploit information obtained by other data extraction components, e.g., result page analysis.

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:565960
Date: January 2012
Creators: Guo, Xiaonan
Contributors: Gottlob, Georg
Publisher: University of Oxford
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
