Recent advances in the machine learning community, driven by larger datasets and novel algorithmic approaches to deep reinforcement learning, reward the use of large datasets. In this thesis, we examine whether dataset size has a significant impact on recall quality in a very specific knowledge domain. We compare a large corpus extracted from Wikipedia to smaller ones from Stack Overflow and evaluate how well each represents niche computer science knowledge. We show that a smaller dataset with high-quality data points greatly outperforms a larger one, even though the smaller dataset is a subset of the larger. This implies that corpus choice is highly relevant for NLP applications aimed at complex and specific knowledge representations.
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-220679 |
Date | January 2017 |
Creators | Lewenhaupt, Adam, Brismar, Emil |
Publisher | KTH, Skolan för industriell teknik och management (ITM) |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Relation | TRITA-ITM-EX ; 2018:26 |