The impact of corpus choice in domain specific knowledge representation

Recent advances in the machine learning community, driven by larger datasets and novel algorithmic approaches to deep reinforcement learning, reward the use of large datasets. In this thesis, we examine whether dataset size has a significant impact on recall quality in a very specific knowledge domain. We compare a large corpus extracted from Wikipedia to smaller ones from Stack Overflow and evaluate how well each represents niche computer science knowledge. We show that a smaller dataset with high-quality data points greatly outperforms a larger one, even though the smaller is a subset of the latter. This implies that corpus choice is highly relevant for NLP applications aimed at complex and specific knowledge representations.

http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-220679

word2vec

Media Engineering

Mediateknik

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-220679
Date	January 2017
Creators	Lewenhaupt, Adam, Brismar, Emil
Publisher	KTH, Skolan för industriell teknik och management (ITM)
Source Sets	DiVA Archive at Upsalla University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	TRITA-ITM-EX ; 2018:26
