Background: Twitter, Facebook, WordPress, etc. act as the major sources of information exchange in today's world. The tweets on Twitter are mainly based on the public opinion on a product, event or topic and thus contains large volumes of unprocessed data. Synthesis and Analysis of this data is very important and difficult due to the size of the dataset. Sentiment analysis is chosen as the apt method to analyse this data as this method does not go through all the tweets but rather relates to the sentiments of these tweets in terms of positive, negative and neutral opinions. Sentiment Analysis is normally performed in 3 ways namely Machine learning-based approach, Sentiment lexicon-based approach, and Hybrid approach. The Machine learning based approach uses machine learning algorithms and deep learning algorithms for analysing the data, whereas the sentiment lexicon-based approach uses lexicons in analysing the data and they contain vocabulary of positive and negative words. The Hybrid approach uses a combination of both Machine learning and sentiment lexicon approach for classification. Objectives: The primary objectives of this research are: To identify the algorithms and metrics for evaluating the performance of Machine Learning Classifiers. To compare the metrics from the identified algorithms depending on the size of the dataset that affects the performance of the best-suited algorithm for sentiment analysis. Method: The method chosen to address the research questions is Experiment. Through which the identified algorithms are evaluated with the selected metrics. Results: The identified machine learning algorithms are Naïve Bayes, Random Forest, XGBoost and the deep learning algorithm is CNN-LSTM. The algorithms are evaluated with respect to the metrics namely precision, accuracy, F1 score, recall and compared. CNN-LSTM model is best suited for sentiment analysis on twitter data with respect to the selected size of the dataset. Conclusion: Through the analysis of results, the aim of this research is achieved in identifying the best-suited algorithm for sentiment analysis on twitter data with respect to the selected dataset. CNN-LSTM model results in having the highest accuracy of 88% among the selected algorithms for the sentiment analysis of Twitter data with respect to the selected dataset.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:bth-18447 |
Date | January 2019 |
Creators | Manda, Kundan Reddy |
Publisher | Blekinge Tekniska Högskola, Institutionen för datavetenskap |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0015 seconds