
The impact of corpus choice in domain specific knowledge representation

Recent advances in the machine learning community, driven by larger datasets and novel algorithmic approaches to deep reinforcement learning, reward the use of large datasets. In this thesis, we examine whether dataset size has a significant impact on recall quality in a very specific knowledge domain. We compare a large corpus extracted from Wikipedia to smaller ones from Stack Overflow and evaluate how well each represents niche computer science knowledge. We show that a smaller dataset with high-quality data points greatly outperforms a larger one, even though the smaller is a subset of the larger. This implies that corpus choice is highly relevant for NLP applications aimed at complex and specific knowledge representations.

Identifier oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-220679
Date January 2017
Creators Lewenhaupt, Adam, Brismar, Emil
Publisher KTH, Skolan för industriell teknik och management (ITM)
Source Sets DiVA Archive at Upsalla University
Language English
Detected Language English
Type Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format application/pdf
Rights info:eu-repo/semantics/openAccess
Relation TRITA-ITM-EX ; 2018:26
