Identifying Content Blocks on Web Pages using Recursive Neural Networks and DOM-tree Features / Identifiering av innehållsblock på hemsidor med rekursiva neurala nätverk och DOM-trädattribut

The internet is a source of abundant information spread across different web pages. The identification and extraction of information from the internet has long been an active area of research, with applications in both research and business intelligence. However, many existing systems and techniques rely on assumptions that limit their general applicability and degrade their performance as the web changes and evolves. This work explores the use of Recursive Neural Networks (RecNNs), together with the large number of features present in the DOM trees of web pages, as a technique for identifying information on the internet without strict assumptions about the structure or content of web pages. In addition, Sparse Group LASSO (SGL) is explored as a tool for feature selection in the context of web information extraction. The results show that a RecNN model outperforms a similarly structured feedforward baseline on the task of identifying cookie consent dialogs across various web pages. The results also suggest that SGL can be used as an effective tool for selecting DOM-tree features.

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-166927
Date: January 2020
Creators: Riddarhaage, Teodor
Publisher: Linköpings universitet, Interaktiva och kognitiva system
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
