In this study we gather customer reviews from Prisjakt, a Swedish price comparison site, with the goal of studying the relationship between review text and rating, a task known as sentiment analysis. The purpose of the study is to evaluate three different supervised machine learning models on a fine-grained dependent variable representing the review rating. For classification, a binary and a multinomial model are used, with the one-versus-one strategy implemented in the Support Vector Machine (SVM) with a linear kernel, and are evaluated with F1, accuracy, precision and recall scores. For regression, we use Support Vector Regression (SVR), approximating the fine-grained variable as continuous, and evaluate it using mean squared error (MSE). Furthermore, the three models are evaluated on a balanced and an unbalanced dataset in order to investigate the effects of class imbalance. The results show that the SVR performs better on unbalanced fine-grained data, with the best fine-grained model reaching an MSE of 4.12, compared to 6.84 for the balanced SVR. The binary SVM model reaches an accuracy of 86.37% and a weighted F1 macro of 86.36% on the unbalanced data, while the balanced binary SVM model reaches approximately 80% for both measures. The multinomial model shows the worst performance due to its inability to handle class imbalance, despite the implementation of class weights. Furthermore, results from feature engineering show that SVR benefits marginally from certain regex conversions, and that tf-idf weighting performs better on the balanced sets than on the unbalanced sets.
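The evaluation measures named in the abstract can be illustrated with a minimal, standard-library sketch. This is not the thesis's implementation: the function names and the toy labels and ratings are made up for illustration. The abstract also does not specify which averaging "weighted F1 macro" refers to, so both the macro average and the support-weighted average are shown.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by class frequency (support)."""
    scores = per_class_scores(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c][2] * support[c] for c in scores) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error between true ratings and continuous predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, deliberately unbalanced binary labels (4 "pos" vs 2 "neg").
labels_true = ["pos", "pos", "pos", "pos", "neg", "neg"]
labels_pred = ["pos", "pos", "pos", "neg", "neg", "pos"]

# Hypothetical ratings and regression outputs, treated as continuous.
ratings_true = [1, 3, 4, 5]
ratings_pred = [2.0, 3.0, 2.0, 4.0]

print("accuracy:   ", accuracy(labels_true, labels_pred))
print("macro F1:   ", macro_f1(labels_true, labels_pred))
print("weighted F1:", weighted_f1(labels_true, labels_pred))
print("MSE:        ", mse(ratings_true, ratings_pred))
```

On unbalanced data the two F1 averages diverge, because the macro average gives the minority class the same weight as the majority class. This is one reason a comparison between balanced and unbalanced sets depends on which average is reported.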
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-424266 |
Date | January 2020 |
Creators | Westin, Emil |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |