Online Analysis of High Volume Social Text Streams

Social media is one of the most disruptive developments of the past decade. This information revolution has had a
fundamental impact on our society. Information dissemination has never been cheaper, and users are increasingly
connected with each other. The line between content producers and consumers is blurred, leaving us with an
abundance of data produced in real time by users around the world on a multitude of topics.

In this thesis we study techniques to aid an analyst in uncovering insights from this new media form, which is modeled
as a high volume social text stream. The aim is to develop practical algorithms with a focus on the ability to scale,
amenability to reliable operation, usability, and ease of implementation.
Our work lies at the intersection of building large scale real world systems and developing the
theoretical foundation to support them.

We identify three key predicates to enable online methods for analysis of social data, namely:
- Persistent Chatter Discovery to explore topics discussed over a period of time,
- Cross-referencing Media Sources to initiate analysis using a document as the query, and
- Contributor Understanding to create aggregate expertise and topic summaries of authors contributing online.
The thesis defines each of the predicates in detail and
covers proposed techniques, their practical applicability, and detailed experimental results to establish accuracy and
scalability for each of the three predicates.

We present BlogScope, the core data aggregation and management platform, developed as part of the thesis to enable
implementation of the key predicates in a real world setting. The system provides a web based user interface for searching
social media conversations and analyzing the results in a multitude of ways.
BlogScope, and its modified versions, index tens to hundreds
of billions of text documents while providing interactive query times. Specifically, BlogScope has been crawling 50 million
active blogs with 3.25 billion blog posts. The same techniques have also been successfully tested on a Twitter data stream that adds
thousands of new Tweets every second, archiving over 30 billion documents.
The social graph part of our database consists of 26 million
Twitter user nodes with 17 billion follower edges. The BlogScope system has been used by over 10,000 unique visitors
a day, and the commercial version of the system is used by thousands of enterprise clients globally.

As social media continues to evolve at an exponential pace, much remains to be studied. The thesis concludes by
outlining some possible future research directions.

Identifier: oai:union.ndltd.org:TORONTO/oai:tspace.library.utoronto.ca:1807/43485
Date: 07 January 2014
Creators: Bansal, Nilesh
Contributors: Koudas, Nick
Source Sets: University of Toronto
Language: en_ca
Detected Language: English
Type: Thesis