Classification of scientific bibliographic data is an important and increasingly more time-consuming task in a “publish or perish” paradigm where the number of scientific publications is steadily growing. Apart from being a resource-intensive endeavor, manual classification has also been shown to be often performed with a quite high degree of inconsistency. Since many bibliographic databases contain a large number of already classified records supervised machine learning for automated classification might be a solution for handling the increasing volumes of published scientific articles. In this study automated classification of bibliographic data, based on two different machine learning methods; Naive Bayes and Support Vector Machine (SVM), were evaluated. The data used in the study were collected from the Swedish research database SwePub and the features used for training the classifiers were based on abstracts and titles in the bibliographic records. The accuracy achieved ranged between a lowest score of 0.54 and a highest score of 0.84. The classifiers based on Support Vector Machine did consistently receive higher scores than the classifiers based on Naive Bayes. Classification performed at the second level in the hierarchical classification system used clearly resulted in lower scores than classification performed at the first level. Using abstracts as the basis for feature extraction yielded overall better results than using titles, the differences were however very small.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:lnu-75167 |
Date | January 2018 |
Creators | Nordström, Jesper |
Publisher | Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM) |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0021 seconds