Return to search

Fast Data Analysis Methods For Social Media Data

The advent of Web 2.0 technologies which supports the creation and publishing of various social media content in a collaborative and participatory way by all users in the form of user generated content and social networks has led to the creation of vast amounts of structured, semi-structured and unstructured data. The sudden rise of social media has led to their wide adoption by organisations of various sizes worldwide in order to take advantage of this new way of communication and engaging with their stakeholders in ways that was unimaginable before. Data generated from social media is highly unstructured, which makes it challenging for most organisations which are normally used for handling and analysing structured data from business transactions. The research reported in this dissertation was carried out to investigate fast and efficient methods available for retrieving, storing and analysing unstructured data form social media in order to make crucial and informed business decisions on time. Sentiment analysis was conducted on Twitter data called tweets. Twitter, which is one of the most widely adopted social network service provides an API (Application Programming Interface), for researchers and software developers to connect and collect public data sets of Twitter data from the Twitter database.

A Twitter application was created and used to collect streams of real-time public data via a Twitter source provided by Apache Flume and efficiently storing this data in Hadoop File System (HDFS). Apache Flume is a distributed, reliable, and available system which is used to efficiently collect, aggregate and move large amounts of log data from many different sources to a centralized data store such as HDFS. Apache Hadoop is an open source software library that runs on low-cost commodity hardware and has the ability to store, manage and analyse large amounts of both structured and unstructured data quickly, reliably, and flexibly at low-cost. A Lexicon based sentiment analysis approach was taken and the AFINN-111 lexicon was used for scoring. The Twitter data was analysed from the HDFS using a Java MapReduce implementation. MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The results demonstrate that it is fast, efficient and economical to use this approach to analyse unstructured data from social media in real time. / Dissertation (MSc)--University of Pretoria, 2019. / National Research Foundation (NRF) - Scarce skills / Computer Science / MSc / Unrestricted

Identiferoai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:up/oai:repository.up.ac.za:2263/72546
Date07 August 2018
CreatorsNhlabano, Valentine Velaphi
ContributorsLutu, Patricia Elizabeth Nalwoga, valezw@gmail.com
PublisherUniversity of Pretoria
Source SetsSouth African National ETD Portal
LanguageEnglish
Detected LanguageEnglish
TypeDissertation
Rights© 2019 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.

Page generated in 0.0158 seconds