This thesis examines to what extent a small number of document features that are very easy to extract algorithmically can be used to classify electronic documents by genre. A set of experiments is carried out using only 11 such simple features to classify 84 documents from electronic academic journals into three manually identified genres: table of contents, article, and review. The 11 features are divided into three sets, containing metrics of words and sentences, punctuation marks, and URL links, respectively. Performance with these feature sets is measured as classification accuracy, using a k-NN classifier with four values of k (1, 3, 5, 7) and both leave-one-out and 10-fold cross-validation. The best results are achieved with all three feature sets (i.e. all 11 features) and k=3, giving an overall accuracy of 96% (81 of the 84 documents correctly classified) regardless of the cross-validation method. These results are significantly better than a reference baseline in which every instance is assigned to the most populated class, which yields an accuracy of 49%. While not disappointing in any way, the results are viewed by the author as possibly reflecting a somewhat easy classification task. He therefore concludes by advocating further research into how much very simple features can contribute to accurate automatic genre classification, preferably using experimental settings better suited to shed light on this matter. / Thesis level: D
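The evaluation described in the abstract (k-NN, leave-one-out cross-validation, and a most-populated-class baseline) can be illustrated with a minimal sketch. This is not the thesis's code or data: the function names, the three toy features, and the ten toy documents are all hypothetical, Euclidean distance without feature scaling is an assumption, and ties in the vote are resolved in favour of the nearest neighbour.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Label `query` by majority vote among its k nearest training items.

    `train` is a list of (feature_vector, label) pairs. Distance is
    Euclidean on raw features (an assumption; no scaling is applied).
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    # On a tied vote, Counter keeps first-seen order, i.e. the nearest label.
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Classify each item using all other items as training data."""
    correct = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) == label:
            correct += 1
    return correct / len(data)


def majority_baseline_accuracy(data):
    """Accuracy obtained by always guessing the most populated class."""
    counts = Counter(label for _, label in data)
    return counts.most_common(1)[0][1] / len(data)


# Hypothetical toy documents, NOT the thesis data. Features (invented):
# (mean sentence length, punctuation marks per 100 words, URL link count)
toy = [
    ((5.0, 2.0, 40.0), "table of contents"),
    ((6.0, 3.0, 35.0), "table of contents"),
    ((4.0, 2.5, 45.0), "table of contents"),
    ((25.0, 12.0, 3.0), "article"),
    ((27.0, 13.0, 2.0), "article"),
    ((24.0, 11.0, 4.0), "article"),
    ((26.0, 12.5, 1.0), "article"),
    ((15.0, 20.0, 0.0), "review"),
    ((16.0, 21.0, 1.0), "review"),
    ((14.0, 19.0, 0.0), "review"),
]

if __name__ == "__main__":
    for k in (1, 3, 5, 7):
        print(k, leave_one_out_accuracy(toy, k))
    print("baseline", majority_baseline_accuracy(toy))
```

The baseline function mirrors the reference case in the abstract, where every document is assigned to the largest genre; comparing it with the leave-one-out figure shows how much the features add beyond class frequencies alone.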
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:hb-18835 |
Date | January 2008 |
Creators | Nolgren, Markus |
Publisher | Högskolan i Borås, Institutionen Biblioteks- och informationsvetenskap / Bibliotekshögskolan, University of Borås/Swedish School of Library and Information Science (SSLIS) |
Source Sets | DiVA Archive at Uppsala University |
Language | Swedish |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Relation | Magisteruppsats i biblioteks- och informationsvetenskap vid institutionen Biblioteks- och informationsvetenskap, 1654-0247 ; 2008:36 |