Comquest: an Adaptive Crawler for User Comments on the Web

This thesis introduces Comquest, an adaptive framework for the large-scale collection and integration of user comments from the Web. User comments are featured on many websites, and there is growing interest in mining and studying them in applications such as opinion mining and information diffusion. However, crawling user comments generally requires hard-coded solutions tied to specific websites, which are hard to scale and maintain. To achieve a generalizable and scalable solution, Comquest employs two website-agnostic approaches to comment crawling: Web API querying and HTML data extraction. When the target Web page is integrated with a third-party commenting system whose Web API is in Comquest’s knowledge base, Comquest retrieves comments by sending HTTP requests to the API’s URL with parameters extracted from the target Web page. This approach faces several challenges. First, extracting accurate parameter values to construct HTTP requests is difficult because they are buried deep within the HTML string of Web documents (if they exist at all). Second, the solution needs to generalize both vertically (within a website) and horizontally (across unseen websites). To tackle these challenges, the parameter extraction problem is treated as a variant of the multiclass Named Entity Recognition (NER) problem, where the entities represent the values of the parameters. Comquest leverages a sequence-labeling deep learning model to identify parameter values within HTML source code. When the commenting system is native to the website or unknown, Comquest detects and extracts user comments from fully rendered Web pages. However, comments are often hidden until triggered by a specific user interaction, such as clicking on a designated page element among many other clickable elements. Furthermore, comments are typically presented as structured, record-like Web data with high structural variation, making them difficult to detect and to separate from other record-like Web data on the same page. Comquest utilizes deep learning models and Web record extraction algorithms to automate the process of triggering, extracting, and classifying comments. Comquest has been implemented as a comprehensive system that consists of an administration Web portal, a task controller, and a crawler backend. It provides a useful tool for collecting comments that represent a wider range of opinions, stances, and sentiments from websites on a global scale. / Computer and Information Science
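
The Web API querying path described in the abstract can be illustrated with a minimal sketch. This is not Comquest's implementation: the tokenization, the BIO label names (`forum`, `thread`), and the endpoint `https://comments.example.com/api/list` are all hypothetical, and the tag sequence is written by hand where the thesis would use a trained sequence-labeling model. The sketch only shows how per-token labels could be decoded into parameter values and then assembled into a request URL.

```python
from urllib.parse import urlencode


def decode_bio(tokens, tags):
    """Collect tokens tagged B-<label>/I-<label> into {label: value}.

    Tokens in a span are concatenated without separators; if a label
    occurs in more than one span, the first span is kept.
    """
    spans = []
    current = None  # the span still being extended, if any
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(token)
        else:
            current = None  # an "O" tag or a mismatched "I-" closes the span
    params = {}
    for label, parts in spans:
        params.setdefault(label, "".join(parts))
    return params


def build_request_url(api_base, params):
    """Append the extracted parameters to the API URL as a query string."""
    return api_base + "?" + urlencode(sorted(params.items()))


# Hypothetical HTML fragment, already tokenized, with hand-written tags
# standing in for the output of a sequence-labeling model.
tokens = ["data-forum", '="', "example", "-", "news", '"',
          "data-thread", '="', "12345", '"']
tags = ["O", "O", "B-forum", "I-forum", "I-forum", "O",
        "O", "O", "B-thread", "O"]

params = decode_bio(tokens, tags)
url = build_request_url("https://comments.example.com/api/list", params)
print(url)
```

In a real crawler the resulting URL would be fetched over HTTP and the response parsed for comment records; that step is omitted here because it depends on the specific commenting system's API.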

Identifier: oai:union.ndltd.org:TEMPLE/oai:scholarshare.temple.edu:20.500.12613/10236
Date: 05 1900
Creators: Chen, Zhijia, 0009-0005-7866-4549
Contributors: Dragut, Eduard, Gao, Hongchang, Vucetic, Slobodan, Meng, Weiyi
Publisher: Temple University. Libraries
Source Sets: Temple University
Language: English
Detected Language: English
Type: Thesis/Dissertation, Text
Format: 109 pages
Rights: IN COPYRIGHT - This Rights Statement can be used for an Item that is in copyright. Using this statement implies that the organization making this Item available has determined that the Item is in copyright and either is the rights-holder, has obtained permission from the rights-holder(s) to make their Work(s) available, or makes the Item available under an exception or limitation to copyright (including Fair Use) that entitles it to make the Item available., http://rightsstatements.org/vocab/InC/1.0/
Relation: http://dx.doi.org/10.34944/dspace/10198, Theses and Dissertations
