Global ETD Search


Automated classification of bibliographic data using SVM and Naive Bayes

Classification of scientific bibliographic data is an important and increasingly time-consuming task in a “publish or perish” paradigm where the number of scientific publications is steadily growing. Apart from being resource-intensive, manual classification has also been shown to often be performed with a fairly high degree of inconsistency. Since many bibliographic databases contain a large number of already classified records, supervised machine learning for automated classification might be a solution for handling the increasing volumes of published scientific articles. In this study, automated classification of bibliographic data based on two machine learning methods, Naive Bayes and Support Vector Machine (SVM), was evaluated. The data used in the study were collected from the Swedish research database SwePub, and the features used for training the classifiers were based on abstracts and titles in the bibliographic records. The accuracy achieved ranged from 0.54 to 0.84. The classifiers based on Support Vector Machine consistently received higher scores than the classifiers based on Naive Bayes. Classification performed at the second level of the hierarchical classification system clearly resulted in lower scores than classification performed at the first level. Using abstracts as the basis for feature extraction yielded better results overall than using titles; the differences were, however, very small.
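To make the setup in the abstract concrete, the following is a minimal sketch of one of the two methods, a multinomial Naive Bayes classifier trained on record titles and scored by accuracy. It is not the thesis's implementation: the toy records, the subject labels, the tokenizer, the add-one smoothing and all function and class names are assumptions chosen for illustration, and the SVM classifier is left out.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen during training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus summed log likelihoods of the tokens.
            score = math.log(n_docs / self.total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of records whose predicted label matches the true label."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Hypothetical toy records (title, first-level subject); not SwePub data.
TRAIN = [
    ("Deep learning for image recognition", "Natural Sciences"),
    ("Algorithms for graph search", "Natural Sciences"),
    ("Quantum effects in particle physics", "Natural Sciences"),
    ("Clinical trial of a diabetes drug", "Medical and Health Sciences"),
    ("Patient outcomes after heart surgery", "Medical and Health Sciences"),
    ("Nursing care for cancer patients", "Medical and Health Sciences"),
]
TEST = [
    ("Graph algorithms for image search", "Natural Sciences"),
    ("Drug trial outcomes in cancer patients", "Medical and Health Sciences"),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])
test_accuracy = accuracy(model, [t for t, _ in TEST], [y for _, y in TEST])
print(f"accuracy: {test_accuracy:.2f}")
```

Swapping the title strings for abstract text would give the second feature source mentioned in the abstract. A score on toy data of this size says nothing about the accuracy range reported in the study.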

http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-75167

automated classification

machine learning

Naive Bayes

Support Vector Machine

SVM

bibliographic data

SwePub

Computer and Information Sciences

Data- och informationsvetenskap

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:lnu-75167
Date	January 2018
Creators	Nordström, Jesper
Publisher	Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM)
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
