1.
Filtrering av e-post : Binär klassifikation med naiv Bayesiansk teknik / Filtering e-mail : Binary classification with naïve Bayesian technique
Bünger, Sara; Nilsson, Stefan. January 2007
In this thesis we compare how different strategies for choosing attribute values affect junk mail filtering. We used two variants of a naïve Bayesian junk mail filter. The first variant classified an e-mail by comparing it to a feature vector containing all attribute values found in junk mails in the part of the e-mail collection used for training the filter. The second variant compared an e-mail to a feature vector consisting of the attributes that were found in ten or more junk mails in the same training part of the collection. The e-mail collection consisted of 300 e-mails, of which 210 were junk mails and 90 were legitimate e-mails. We measured the results using SP, SR and F1, and cross-validated the two strategies in order to compare them. The results showed that the first strategy achieved higher average F1 values than the second. Despite this, we believe that the second strategy is the better one: instead of comparing the e-mail to a feature vector containing all attribute values found in junk mails, the results will be better if the filter compares the e-mail to a feature vector that contains a limited number of attribute values. / Thesis level: D
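The two attribute-selection strategies and the F1 measure described in the abstract can be sketched roughly as follows. This is only an illustration, not the authors' implementation: whitespace tokenisation, the function names `select_attributes` and `precision_recall_f1`, and the reading of SP and SR as precision and recall on the junk class are all assumptions, since the abstract does not spell them out.

```python
from collections import Counter


def select_attributes(junk_mails, min_junk_count=1):
    """Return tokens that occur in at least `min_junk_count` junk mails.

    min_junk_count=1 keeps every attribute seen in junk mail (first strategy);
    min_junk_count=10 keeps only attributes seen in ten or more junk mails
    (second strategy). Tokenisation by whitespace is an assumption.
    """
    doc_freq = Counter()
    for mail in junk_mails:
        # Count each token once per mail (document frequency).
        doc_freq.update(set(mail.lower().split()))
    return {token for token, n in doc_freq.items() if n >= min_junk_count}


def precision_recall_f1(true_labels, predicted_labels, positive="junk"):
    """Precision, recall and F1 for the `positive` class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Calling `select_attributes(junk_mails, 1)` and `select_attributes(junk_mails, 10)` on the same training split would give the two feature vocabularies being compared.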
2.
Automatisk genreklassifikation : en experimentell studie / Automatic genre classification : an experimental study
Nolgren, Markus. January 2008
This thesis examines to what extent a few document features that are very easy to extract algorithmically can be used to classify electronic documents according to genre. A set of experiments is therefore carried out, using only 11 such simple features in an attempt to classify 84 documents belonging to electronic academic journals into three manually identified genres: table of contents, article, and review. The 11 features are also divided into three sets, containing metrics of words and sentences; punctuation marks; and URL links, respectively. The performance when using these sets of features is then measured with regard to classification accuracy, using a k-NN classifier, four different values of k (1, 3, 5, 7), and both leave-one-out and 10-fold cross-validation. The best results are achieved when using all three feature sets (i.e. all 11 features) and k=3, with an overall accuracy of 96% (81 of the 84 documents correctly classified), regardless of the cross-validation method. These results are significantly better than those of a reference baseline, conceived as the case where all instances would be guessed as belonging to the most populated class, with a corresponding accuracy of 49%. While not considered disappointing in any way, the results are viewed by the author as perhaps reflecting a somewhat easy classification task. He therefore concludes by advocating further research on the capability of very simple features to contribute to accurate automatic genre classification, preferably using experimental settings that are better suited to shed light on this matter. / Thesis level: D
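The evaluation setup named in the abstract (a k-NN classifier scored with leave-one-out cross-validation and compared against a most-populated-class baseline) can be sketched as below. This is a minimal illustration under assumptions: Euclidean distance, unweighted majority voting, and the function names are not taken from the thesis, which does not state these details in its abstract.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict a label for `query` by majority vote among its k nearest
    training items. `train` is a list of (feature_vector, label) pairs.
    Euclidean distance is an assumption; vote ties go to the label of the
    nearer neighbour because the sort is stable and Counter keeps
    first-seen order among equal counts.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Hold out each item in turn, classify it using all the others,
    and return the fraction classified correctly."""
    correct = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) == label:
            correct += 1
    return correct / len(data)


def majority_baseline_accuracy(data):
    """Accuracy obtained by always guessing the most populated class."""
    counts = Counter(label for _, label in data)
    return counts.most_common(1)[0][1] / len(data)
```

With 84 documents, leave-one-out means 84 rounds in which each document is classified from the remaining 83; the baseline gives the figure any classifier has to beat.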