In this thesis, we investigate the usefulness of a group of features for genre classification of news texts. We choose a diverse feature set that covers both the content and the style of the texts. The features are divided into two groups: semantic and stylistic. The semantic features include genre-exclusive words, emotional words and synonyms. The stylistic features include character-level and document-level features. To evaluate the effects of these features, we use three traditional machine learning classifiers and one neural network model: Support Vector Machine, Complement Naive Bayes, k-Nearest Neighbor, and a Convolutional Neural Network. The results are evaluated by F1 score, precision and recall (both micro- and macro-averaged). We compare the performance of the different models to find the optimal feature set for this news genre classification task and, at the same time, to identify the most suitable classifier. We show that genre-exclusive words and synonyms are beneficial to the classification task, in that they are the most informative features in the training process. Emotional words have a negative effect on the results. The best result, 0.97 in macro-averaged F1 score, precision and recall, is achieved by the Complement Naive Bayes model on the feature set that combines the preprocessed dataset with its synonym sets generated based on context. We discuss the experimental results and the best-performing models, answer the research questions, and provide suggestions for future studies.
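The difference between the micro- and macro-averaged scores mentioned above can be illustrated with a minimal sketch. It is not the thesis code: the genre labels and predictions are invented toy data, and the helper names (`per_class_counts`, `macro_f1`, `micro_f1`) are hypothetical. Macro-averaging computes F1 per genre and takes the unweighted mean, so small genres count as much as large ones; micro-averaging pools the counts over all genres first.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p wrongly
            fn[t] += 1  # missed the true label t
    return tp, fp, fn

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)

def micro_f1(y_true, y_pred):
    """F1 computed on counts pooled over all labels."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    return f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))

# Toy data (assumed for illustration only, not from the thesis dataset).
y_true = ["news", "news", "news", "news", "opinion", "sports"]
y_pred = ["news", "news", "news", "opinion", "opinion", "news"]

print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
print(f"micro F1: {micro_f1(y_true, y_pred):.3f}")
```

In this toy case the rare "sports" genre is never predicted correctly, which pulls the macro average well below the micro average; this is why reporting both can be informative when genres are unevenly represented.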
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-477206 |
Date | January 2022 |
Creators | Pei, Ziming |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |