Global ETD Search

Exploring the Relationship Between Vocabulary Scaling and Algorithmic Performance in Text Classification for Large Datasets

Text analysis is a significant branch of natural language processing, and includes many different sub-fields such as topic modeling, document classification, and sentiment analysis. Unsurprisingly, those who do text analysis are concerned with the runtime of their algorithms. Some of these algorithms have runtimes that depend jointly on the size of the corpus being analyzed, as well as the size of that corpus's vocabulary. Trivially, a user may reduce the amount of data they feed into their model to speed it up, but we assume that users will be hesitant to do this as more data tends to lead to better model quality. On the other hand, when the runtime also depends on the vocabulary of the corpus, a user may instead modify the vocabulary to attain a faster runtime. Because elements of the vocabulary also add to model quality, this puts users into the position of needing to modify the corpus vocabulary in order to reduce the runtime of their algorithm while maintaining model quality. To this end, we look at the relationship between model quality and runtime for text analysis by looking at the effect that current techniques in vocabulary reduction have on algorithmic runtime and comparing that with their effect on model quality. Despite the fact that this is an important relationship to investigate, it appears little work has been done in this area. We find that most preprocessing methods do not have much of an effect on more modern algorithms, but proper rare word filtering gives the best results in the form of significant runtime reductions together with slight improvements in accuracy and a vocabulary size that scales efficiently as we increase the size of the data.
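As a rough illustration of the rare word filtering mentioned in the abstract, the sketch below drops every token that occurs in fewer than a chosen number of documents. The abstract does not state which filtering criterion or threshold the thesis uses, so the document-frequency rule, the function name `filter_rare_words`, and the parameter `min_doc_freq` are all assumptions made for this example.

```python
from collections import Counter


def filter_rare_words(docs, min_doc_freq=2):
    """Remove tokens that appear in fewer than min_doc_freq documents.

    docs is a list of token lists. The document-frequency threshold is
    an assumed criterion, not necessarily the one used in the thesis.
    """
    # Count, for each token, the number of documents that contain it.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    # Keep only tokens that meet the threshold.
    vocab = {word for word, count in doc_freq.items() if count >= min_doc_freq}

    # Rewrite each document using the reduced vocabulary.
    filtered = [[word for word in tokens if word in vocab] for tokens in docs]
    return filtered, vocab


# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "a cat chased the dog".split(),
]
filtered, vocab = filter_rare_words(docs, min_doc_freq=2)
print(sorted(vocab))
print(filtered)
```

With a threshold like this, words seen in only one document are removed before the classifier is trained, which shrinks the vocabulary without reducing the number of documents.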

document classification

text preprocessing

vocabulary reduction

nlp

Physical Sciences and Mathematics

Identifier	oai:union.ndltd.org:BGMYU2/oai:scholarsarchive.byu.edu:etd-10062
Date	05 December 2019
Creators	Fearn, Wilson Murray
Publisher	BYU ScholarsArchive
Source Sets	Brigham Young University
Detected Language	English
Type	text
Format	application/pdf
Source	Theses and Dissertations
Rights	https://lib.byu.edu/about/copyright/
