Global ETD Search

Identifying Content Blocks on Web Pages using Recursive Neural Networks and DOM-tree Features / Identifiering av innehållsblock på hemsidor med rekursiva neurala nätverk och DOM-trädattribut

The internet is a source of abundant information spread across many web pages. Identifying and extracting information from the internet has long been an active area of study, motivated by both research and business intelligence. However, many existing systems and techniques rely on assumptions that limit their general applicability and degrade their performance as the web changes and evolves. This work explores the use of Recursive Neural Networks (RecNNs), together with the large number of features available in the DOM trees of web pages, as a technique for identifying information on the internet without strict assumptions about the structure or content of web pages. In addition, Sparse Group LASSO (SGL) is explored as a tool for performing feature selection in the context of web information extraction. The results show that a RecNN model outperforms a similarly structured feedforward baseline on the task of identifying cookie consent dialogs across various web pages. The results also suggest that SGL can be used as an effective tool for selecting DOM-tree features.
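To make the idea of a recursive network over a DOM tree concrete, the following is a minimal sketch, not the thesis's actual architecture. It assumes each DOM node carries a fixed-length numeric feature vector, that child states are combined by averaging, and that a single sigmoid output scores the subtree (for example, as a cookie consent dialog). The names `Node`, `encode`, and `score`, the sizes, and the random weights are all hypothetical and chosen only for illustration.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for a DOM node (hypothetical layout, not from the thesis)."""
    features: List[float]
    children: List["Node"] = field(default_factory=list)


def make_matrix(rows, cols, rng):
    """Small random weight matrix; a real model would learn these values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product using lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode(node, w_feat, w_child, hidden):
    """Compute a node's hidden state bottom-up from its features and children.

    Assumed composition: h = tanh(W_feat * x + W_child * mean(child states)).
    """
    child_states = [encode(c, w_feat, w_child, hidden) for c in node.children]
    if child_states:
        pooled = [sum(col) / len(child_states) for col in zip(*child_states)]
    else:
        pooled = [0.0] * hidden  # leaf nodes have no child information
    from_features = matvec(w_feat, node.features)
    from_children = matvec(w_child, pooled)
    return [math.tanh(a + b) for a, b in zip(from_features, from_children)]


def score(state, w_out):
    """Map a hidden state to a probability with a sigmoid output unit."""
    z = sum(w * h for w, h in zip(w_out, state))
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative sizes and weights (assumptions, not values from the thesis).
N_FEATURES, HIDDEN = 3, 4
rng = random.Random(0)
w_feat = make_matrix(HIDDEN, N_FEATURES, rng)
w_child = make_matrix(HIDDEN, HIDDEN, rng)
w_out = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]

# A tiny tree: one container element with two children, one of them nested.
tree = Node([1.0, 0.0, 0.2], [
    Node([0.0, 1.0, 0.5]),
    Node([0.0, 0.0, 0.9], [Node([1.0, 1.0, 0.1])]),
])

root_state = encode(tree, w_feat, w_child, HIDDEN)
probability = score(root_state, w_out)
```

Because the same weights are applied at every node, the sketch handles trees of any shape or depth, which is the property that lets such a model avoid fixed assumptions about page structure.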

http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166927

machine learning

recursive neural networks

web information extraction

Computer Sciences

Datavetenskap (datalogi)

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-166927
Date	January 2020
Creators	Riddarhaage, Teodor
Publisher	Linköpings universitet, Interaktiva och kognitiva system
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
