Klasifikace žánrů pomocí strojového učení / Genres classification by means of machine learning

In this thesis, we compare the bag-of-words approach with doc2vec document embeddings on the task of classifying book genres. We create 3 datasets with different text lengths by extracting short snippets from books in the Project Gutenberg repository. Each dataset comprises more than 200,000 documents and 14 different genres. For 3200-character documents, we achieve an F1-score of 0.862 when stacking models trained on both bag-of-words and doc2vec representations. We also explore the relationships between documents, genres and words using similarity metrics on their vector representations and report typical words for each genre. As part of the thesis, we also present an online web app for book genre classification.
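
As a rough illustration of the bag-of-words representation and the vector similarity mentioned in the abstract, the following minimal Python sketch counts tokens in a text and compares two count vectors with cosine similarity. It is not the thesis's implementation: the tokenizer, the function names (`bag_of_words`, `cosine_similarity`, `nearest_genre`) and the two toy snippets are assumptions made for this example, and doc2vec embeddings and model stacking are not shown.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into simple word tokens, and count them."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy snippets invented for illustration only (not taken from the thesis datasets).
snippets = {
    "mystery": "the detective examined the body and questioned the suspect",
    "romance": "she smiled and her heart raced as he took her hand",
}
vectors = {genre: bag_of_words(text) for genre, text in snippets.items()}


def nearest_genre(text):
    """Return the genre whose toy snippet is most similar to the given text."""
    query = bag_of_words(text)
    return max(vectors, key=lambda genre: cosine_similarity(query, vectors[genre]))
```

Calling `nearest_genre` on a new snippet returns the label of the closest toy vector. A trained classifier over such count vectors, as compared in the thesis, would replace this nearest-neighbour lookup.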

http://www.nusl.cz/ntk/nusl-388111

Identifier	oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:388111
Date	January 2018
Creators	Bílek, Jan
Contributors	Neruda, Roman, Vomlelová, Marta
Source Sets	Czech ETDs
Language	English
Detected Language	English
Type	info:eu-repo/semantics/masterThesis
Rights	info:eu-repo/semantics/restrictedAccess
