Text classification is a broad research field with ready-to-use solutions for the supervised training of text classifiers. Classifying short texts places demands on the learning system that general text classification does not. This thesis explores that challenge by experimenting with how to design the classification system and which text features give the best results. The experimental study compared a hierarchical design with a flat design, along with different aspects of the text features. The method consisted of training and testing on a dataset of 3.2 million samples in total. The test results were evaluated with the quality measures precision, recall, F1-score and ROC analysis, the last modified to handle multi-class classification. The experimental study found that a 2-level hierarchical classifier gave better results than a flat classifier in 11 out of 13 cases; that integer-represented terms outperformed TF-IDF-weighted terms for bag-of-words (BOW) features; that lowercase conversion improved the classification results; and that bigram and trigram BOW features achieved better results than unigram BOW features. These results were applied in a case study together with Thingmap, which maps natural-language queries to users. The case study showed an improvement over earlier solutions in Thingmap's system.
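The feature variants compared in the abstract (lowercase conversion, n-gram bag-of-words, integer counts versus TF-IDF weights) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the example queries, whitespace tokenization, and the unsmoothed `log(N / df)` form of the inverse document frequency are all assumptions made here for illustration.

```python
import math
from collections import Counter


def ngrams(text, n_max=2, lowercase=True):
    """Return all word n-grams of length 1..n_max from a short text."""
    if lowercase:
        text = text.lower()
    tokens = text.split()  # assumption: plain whitespace tokenization
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def count_features(docs, n_max=2, lowercase=True):
    """Integer-represented BOW features: raw term counts per document."""
    return [Counter(ngrams(d, n_max, lowercase)) for d in docs]


def tfidf_features(docs, n_max=2, lowercase=True):
    """TF-IDF-weighted BOW features, using idf = log(N / df) (one common variant)."""
    counts = count_features(docs, n_max, lowercase)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]


# Hypothetical short queries, not taken from the thesis dataset.
docs = ["Find a plumber", "find a good plumber nearby"]
counts = count_features(docs)
weights = tfidf_features(docs)
```

With lowercasing enabled, "Find" and "find" collapse into one term, and with `n_max=2` the bigram "find a" becomes a feature alongside the unigrams. Under this idf variant, a term that occurs in every document receives weight zero, whereas its integer count is kept.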
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-323214 |
Date | January 2017 |
Creators | Sernheim, Mikael |
Publisher | Uppsala universitet, Institutionen för informationsteknologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Relation | UPTEC IT, 1401-5749 ; 17005 |