Global ETD Search

1	AMBER : a domain-aware template based system for data extraction Cheng, Wang January 2015 (has links) The web is the greatest information source in human history, yet finding all offers for flats with gardens in London, Paris, and Berlin or all restaurants open after a screening of the latest blockbuster remain hard tasks – as that data is not easily amenable to processing. Extracting web data into databases for easier processing has been a resource-intensive process, requiring human supervision for every source from which to extract. This has been changing with approaches that replace human annotators with automated annotations. Such approaches could be successfully applied to restricted settings such as single attribute extraction or for domains with significant redundancy among sources. Multi-attribute objects are often presented on (i) Result pages, where multiple objects are presented on a single page as lists, tables or grids, with most important attributes and a summary description, (ii) Detail pages, where each page provides a detailed list of attributes and long description for a single entity, often in rich format. Both result and detail pages are having their own advantages. Extracting objects from result pages is orders of magnitude faster than from detail pages, and the links to detail pages are often only accessible through result pages. Detail pages have a complete list of attributes and full description of the entity. Early web data extraction approaches requires manual annotations for each web site to reach high accuracy, while a number of domain independent approaches only focus on unsupervised repeated structure segmentation. The former is limited in scaling and automation, while the latter is lacked in accuracy. Recent automated data extraction systems are often informed with an ontology and a set of object and attribute recognizers, however they have focused on extracting simple objects with few attributes from single-entity pages and avoided result pages. We present an automatic ontology-based multi-attribute object extraction system AMBER, which deals with both result and detail pages, achieves very high accuracy (>96%) with zero site-specific supervision, and is able to solve practical issues that arise in real-life data extraction tasks. AMBER is also applied as an important component of DIADEM, the first automatic full-site extraction system that is able to extract structured data from different domains without site-specific supervision, and has been tested through a large-scale evaluation (>10, 000) sites. On the result page side, AMBER achieves high accuracy through a novel domain- aware, path-based template discovery algorithm, and integrates annotations for all parts of the extraction, from identifying the primary list of objects, over segment- ing the individual objects, to aligning the attributes. Yet, AMBER is able to tolerate significant noise in the annotations, by combining these annotations with a novel algorithm for finding regular structures based on XPATH expressions that capture regular tree structures. On the detail page side, AMBER integrates boilerplate removal, dynamic lists identification and page dissimilarity calculation seamlessly to identify data region, then employs a set of fairly simple and cheaply computable features for attribute extraction. Besides, AMBER is the first approach that combines result page extraction and detail page extraction by integrating attributes extracted from result pages and the attributes found on corresponding detail pages. AMBER is able to identify attributes of objects with near perfect accuracy and to extract dozens of attributes with > 96% across several domains, even in presence of significant noise. It outperforms uninformed, automated approaches by a wide margin if given an ontology. Even in absence of an ontology, AMBER outperforms most previous systems on record segmentation. 006.3
2	adXtractor – Automated and Adaptive Generation of Wrappers for Information Retrieval Ademi, Muhamet January 2017 (has links) The aim of this project is to investigate the feasibility of retrieving unstructured automotive listings from structured web pages on the Internet. The research has two major purposes: (1) to investigate whether it is feasible to pair information extraction algorithms and compute wrappers (2) demonstrate the results of pairing these techniques and evaluate the measurements. We merge two training sets available on the web to construct reference sets which is the basis for the information extraction. The wrappers are computed by using information extraction techniques to identify data properties with a variety of techniques such as fuzzy string matching, regular expressions and document tree analysis. The results demonstrate that it is possible to pair these techniques successfully and retrieve the majority of the listings. Additionally, the findings also suggest that many platforms utilise lazy loading to populate image resources which the algorithm is unable to capture. In conclusion, the study demonstrated that it is possible to use information extraction to compute wrappers dynamically by identifying data properties. Furthermore, the study demonstrates the ability to open non-queryable domain data through a unified service. wrapper generation information extraction content of interest identification wrapper rules text extraction key value pair wrapper generate main content identification web scraping information extraction algorithms web extraction dom tree analysis dom analysis Engineering and Technology Teknik och teknologier
3	Získávání znalostí z veřejných semistrukturovaných dat na webu / Knowledge Discovery in Public Semistructured Data on the Web Kefurt, Pavel January 2016 (has links) The first part of the thesis deals with the methods and tools that can be used to retrieve data from websites and the tools used for data mining. The second part is devoted to practical demonstration of the entire process. Web Czech Dance Sport Federation, which is available on www.csts.cz , is used as the source web site.

1

Page generated in 0.1117 seconds