adXtractor – Automated and Adaptive Generation of Wrappers for Information Retrieval

The aim of this project is to investigate the feasibility of retrieving unstructured automotive listings from structured web pages on the Internet. The research has two major purposes: (1) to investigate whether it is feasible to pair information extraction algorithms with the computation of wrappers, and (2) to demonstrate the results of pairing these techniques and evaluate the measurements. We merge two training sets available on the web to construct reference sets, which form the basis for the information extraction. The wrappers are computed by using information extraction techniques to identify data properties, drawing on fuzzy string matching, regular expressions and document tree analysis. The results demonstrate that it is possible to pair these techniques successfully and retrieve the majority of the listings. The findings also suggest that many platforms use lazy loading to populate image resources, which the algorithm is unable to capture. In conclusion, the study demonstrates that it is possible to use information extraction to compute wrappers dynamically by identifying data properties. Furthermore, the study demonstrates the ability to open non-queryable domain data through a unified service.
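As a rough illustration of how fuzzy string matching against a reference set can be combined with regular expressions to identify data properties in a listing, a minimal Python sketch follows. It is not the thesis implementation: the reference entries, the function names (best_reference_match, extract_properties), the 0.8 threshold and the Swedish-krona price pattern are all assumptions chosen for the example, and document tree analysis is left out.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference set of makes and models (illustrative only).
REFERENCE_SET = ["Volvo V70", "Volkswagen Golf", "Toyota Corolla", "Ford Focus"]

# Assumed patterns: a four-digit model year and a price followed by "kr" or "SEK".
YEAR_RE = re.compile(r"\b(19[89]\d|20[0-2]\d)\b")
PRICE_RE = re.compile(r"(?<!\d)(\d{1,3}(?:[ .,]?\d{3})*)\s*(?:kr|SEK)\b", re.IGNORECASE)


def best_reference_match(text, references=REFERENCE_SET, threshold=0.8):
    """Return (entry, score) for the reference entry most similar to any
    window of consecutive words in text, or (None, score) below threshold."""
    tokens = text.split()
    best_entry, best_score = None, 0.0
    for ref in references:
        width = len(ref.split())
        # Slide a window with as many words as the reference entry has.
        for start in range(max(1, len(tokens) - width + 1)):
            window = " ".join(tokens[start:start + width])
            score = SequenceMatcher(None, ref.lower(), window.lower()).ratio()
            if score > best_score:
                best_entry, best_score = ref, score
    if best_score < threshold:
        return None, best_score
    return best_entry, best_score


def extract_properties(text):
    """Identify model, year and price in the text of one listing."""
    model, _ = best_reference_match(text)
    year = YEAR_RE.search(text)
    price = PRICE_RE.search(text)
    return {
        "model": model,
        "year": int(year.group(1)) if year else None,
        # Drop thousands separators before converting the price to an integer.
        "price": int(re.sub(r"\D", "", price.group(1))) if price else None,
    }


if __name__ == "__main__":
    print(extract_properties("Volvo V70 2.4 Momentum 2009, 89 000 kr"))
    print(extract_properties("Wolkswagen Golf GTI 2012"))
```

The second call shows why fuzzy matching is useful here: a misspelled make can still be mapped to its reference entry, while the regular expressions pick up properties with a predictable shape.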

wrapper generation

information extraction

content of interest identification

main content identification

web scraping

information extraction algorithms

web extraction

dom tree analysis

dom analysis

Engineering and Technology

Teknik och teknologier

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:mau-20071
Date	January 2017
Creators	Ademi, Muhamet
Publisher	Malmö högskola, Fakulteten för teknik och samhälle (TS), Malmö högskola/Teknik och samhälle
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
