Genre classification using syntactic features

This thesis addresses text classification for genre identification using different feature sets, with a focus on syntax-based features. We built our models with traditional machine learning algorithms, namely Naive Bayes, k-nearest neighbours, Support Vector Machine and Random Forest, in order to predict the literary genre of books. We trained the models on bag-of-words (BOW), bigram, syntax-based bigram and emotional features, as well as on combinations of these feature sets. On the test set, the best feature set, i.e. BOW combined with bigrams based on syntactic relations between words, improved the F1-score by 2% over the baseline that uses BOW features only, which indicates that syntactic information has a positive effect on text classification.
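As a rough illustration of the feature sets named in the abstract, the sketch below builds BOW counts, surface bigrams and syntax-based bigrams for one sentence and merges BOW with the syntax-based bigrams. It rests on an assumption that this record does not confirm: that a syntax-based bigram is a head-dependent word pair taken from a dependency parse. The function names, the feature prefixes and the hand-written `heads` list are hypothetical; in practice the heads would come from a parser.

```python
from collections import Counter


def bow_features(tokens):
    """Bag-of-words counts, one feature per token."""
    return Counter(f"bow={t}" for t in tokens)


def surface_bigrams(tokens):
    """Counts of adjacent word pairs in surface order."""
    return Counter(f"bi={a}_{b}" for a, b in zip(tokens, tokens[1:]))


def syntactic_bigrams(tokens, heads):
    """Counts of head-dependent pairs (assumed definition, see lead-in).

    heads[i] is the index of the syntactic head of tokens[i],
    or -1 when tokens[i] is the root of the sentence.
    """
    feats = Counter()
    for i, h in enumerate(heads):
        if h >= 0:
            feats[f"syn={tokens[h]}_{tokens[i]}"] += 1
    return feats


def combined_features(tokens, heads):
    """BOW combined with syntax-based bigrams."""
    return bow_features(tokens) + syntactic_bigrams(tokens, heads)


# Hand-annotated toy example; the head indices are illustrative only.
tokens = ["the", "detective", "opened", "the", "door"]
heads = [1, 2, -1, 4, 2]  # "opened" is the root
features = combined_features(tokens, heads)
```

The resulting counts could be passed to any of the classifiers listed above after vectorisation. Note that the pair "opened" and "door" appears as a syntax-based bigram even though the two words are not adjacent, which a surface bigram would miss.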

Identifier oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-454667
Date January 2021
Creators Brigadoi, Ivan
Publisher Uppsala universitet, Institutionen för lingvistik och filologi
Source Sets DiVA Archive at Upsalla University
Language English
Detected Language English
Type Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format application/pdf
Rights info:eu-repo/semantics/openAccess