
Comquest: an Adaptive Crawler for User Comments on the Web

Chen, Zhijia, 0009-0005-7866-4549 05 1900
This thesis introduces Comquest, an adaptive framework designed for the large-scale collection and integration of user comments from the Web. User comments are featured on many websites, and there is growing interest in mining and studying them in applications such as opinion mining and information diffusion. However, crawling user comments generally requires hard-coded solutions that are tethered to specific websites, which are hard to scale and maintain. To achieve a generalizable and scalable comment crawling solution, Comquest employs two website-agnostic approaches: Web API querying and HTML data extraction. When the target Web page is integrated with a third-party commenting system whose Web API is in Comquest’s knowledge base, Comquest retrieves comments by sending HTTP requests to the API’s URL with parameters extracted from the target Web page. This approach poses two challenges. First, extracting accurate parameter values to construct HTTP requests is difficult, since they are buried deep within the HTML string of Web documents (if they exist at all). Second, the solution needs to generalize both vertically (within a website) and horizontally (across unseen websites). To tackle these challenges, the parameter extraction problem is treated as a variant of the multiclass Named Entity Recognition (NER) problem, where the entities represent the values of the parameters. Comquest leverages a sequential labeling deep learning model to identify parameter values within HTML source code.
When the commenting system is native to the website or unknown, Comquest detects and extracts user comments from fully rendered Web pages. However, comments are often hidden until triggered by a specific user interaction, such as clicking on a designated page element among many other clickable elements. Furthermore, comments are typically presented as structured, record-like Web data with high structural variation, making them difficult to detect and to distinguish from other record-like data on the same page. Comquest utilizes deep learning models and Web record extraction algorithms to automate the process of triggering, extracting, and classifying comments. Comquest has been implemented as a comprehensive system that consists of an administration Web portal, a task controller, and a crawler backend. It provides a useful tool for collecting comments that represent a wider range of opinions, stances, and sentiments from websites on a global scale. / Computer and Information Science
