Return to search

Automated classification of bibliographic data using SVM and Naive Bayes

Classification of scientific bibliographic data is an important and increasingly more time-consuming task in a “publish or perish” paradigm where the number of scientific publications is steadily growing. Apart from being a resource-intensive endeavor, manual classification has also been shown to be often performed with a quite high degree of inconsistency. Since many bibliographic databases contain a large number of already classified records supervised machine learning for automated classification might be a solution for handling the increasing volumes of published scientific articles. In this study automated classification of bibliographic data, based on two different machine learning methods; Naive Bayes and Support Vector Machine (SVM), were evaluated. The data used in the study were collected from the Swedish research database SwePub and the features used for training the classifiers were based on abstracts and titles in the bibliographic records. The accuracy achieved ranged between a lowest score of 0.54 and a highest score of 0.84. The classifiers based on Support Vector Machine did consistently receive higher scores than the classifiers based on Naive Bayes. Classification performed at the second level in the hierarchical classification system used clearly resulted in lower scores than classification performed at the first level. Using abstracts as the basis for feature extraction yielded overall better results than using titles, the differences were however very small.

Identiferoai:union.ndltd.org:UPSALLA1/oai:DiVA.org:lnu-75167
Date January 2018
CreatorsNordström, Jesper
PublisherLinnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM)
Source SetsDiVA Archive at Upsalla University
LanguageEnglish
Detected LanguageEnglish
TypeStudent thesis, info:eu-repo/semantics/bachelorThesis, text
Formatapplication/pdf
Rightsinfo:eu-repo/semantics/openAccess

Page generated in 0.0021 seconds