This thesis addresses text classification for genre identification using different feature sets, with a focus on syntax-based features. We built models with traditional machine learning algorithms, namely Naive Bayes, k-nearest neighbours, Support Vector Machine and Random Forest, to predict the literary genre of books. We trained the models on bag-of-words (BOW), bigram, syntax-based bigram and emotional feature sets, as well as on combinations of these. On the test set, the best feature set, BOW combined with bigrams based on syntactic relations between words, improved F1-score by 2% over the BOW baseline, which indicates that syntactic information has a positive impact on text classification.
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-454667 |
Date | January 2021 |
Creators | Brigadoi, Ivan |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |