Jointly Mining News and User-Generated Content: Machine Learning, Information and Social Network Perspective

The number of published news articles is steadily increasing, and readers are shifting toward online platforms because of their convenience and the affordable cost of technology (Shearer, 2021). Users have become more engaged with online news articles. This engagement creates a rich corpus and a powerful means to understand public opinion, emerging events, and how they evolve. Therefore, many organizations invest in mining this large-scale user-generated content to improve their products, services, and, more importantly, their decision-making process. Studying users’ reactions to online news is essential for social scientists, policymakers, and journalists. This type of engagement has been studied before. In the statistical and machine learning community, many survey-based studies have tried to understand users’ behavior by characterizing and categorizing comments on online news. Some studies focus on mining user opinions from social media and online news comments. Other works look into bias in the news and its influence on user-generated content. At the same time, the social network community addresses the problem of mining large-scale online news from different angles. Some work focuses on constructing knowledge graphs from the text. Other work focuses on building high-level graphs, where nodes are users and posts or documents, and links represent the relationships between nodes. Another line of work operates at the word level of the text, extracting entities and topics by combining Natural Language Processing and graph techniques. From a Machine Learning perspective, all these studies face three main challenges: 1) jointly mining massive user-generated data, 2) drawing that data from multiple sources and platforms, and 3) handling the unpredictable quality of user-generated content.
To address these issues, we tackle the problem of jointly learning and mining valuable information from online news articles and user-generated content. We start by studying and understanding the relationship between users’ comments and articles in online news. To understand the level of relevance between articles and their comments, we labeled a few article-comment pairs in this work. We proposed BERTAC (Alshehri et al., 2021), a BERT-based model that jointly learns article-comment embeddings and infers the relevance class of a comment. However, we found that disagreement among annotators during the human (expert) labeling process produces noisy labels, which affect the performance of supervised learning algorithms. On the other hand, working only with high-agreement annotations introduces another challenge: the data imbalance problem (Alshehri et al., 2022). As in many machine learning problems, labeling a sufficient number of examples is costly and time-consuming. Therefore, we propose a framework for aligning comments and news articles under a constrained budget (Alshehri et al., 2023a). The proposed model accounts for data imbalance, where only a few examples are available from one class, and for the degree of annotator disagreement. Within the framework, we consider two solutions: 1) semi-automatic labeling based on human-AI collaboration and 2) synthetic data augmentation. Another critical aspect of mining news articles and user-generated content is understanding emerging events and their associated entities. However, this is challenging, especially with the massive growth of online articles and user-generated content across different platforms. Therefore, we proposed MultiLayerET (Alshehri et al., 2023b), a unified representation of online news articles and comments. This work highlights the relationship between entities and topics in news articles and user-generated content.
It projects entities and topics as a multi-layer graph, which gives a high-level understanding of the story behind a large corpus. We showed that such graphs enrich the textual representation and improve model performance in many downstream applications, such as media bias classification and fake news detection. / Computer and Information Science

Identifier oai:union.ndltd.org:TEMPLE/oai:scholarshare.temple.edu:20.500.12613/8576
Date January 2023
Creators Alshehri, Jumanah, 0000-0002-0077-7173
Contributors Obradovic, Zoran, Dragut, Eduard Constantin, Vucetic, Slobodan, Fink, Edward L.
Publisher Temple University. Libraries
Source Sets Temple University
Language English
Detected Language English
Type Thesis/Dissertation, Text
Format 106 pages
Rights IN COPYRIGHT - This Rights Statement can be used for an Item that is in copyright. Using this statement implies that the organization making this Item available has determined that the Item is in copyright and either is the rights-holder, has obtained permission from the rights-holder(s) to make their Work(s) available, or makes the Item available under an exception or limitation to copyright (including Fair Use) that entitles it to make the Item available., http://rightsstatements.org/vocab/InC/1.0/
Relation http://dx.doi.org/10.34944/dspace/8540, Theses and Dissertations