Cluster Analysis with Meaning : Detecting Texts that Convey the Same Message / Klusteranalys med mening : Detektering av texter som uttrycker samma sak

Textual duplicates can be hard to detect because they differ in wording while conveying the same meaning. At Etteplan, a technical documentation company, many writers accidentally rewrite existing instructions that explain procedures. These "duplicates" clutter the database and represent duplicated work, and the condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst and to estimate how many duplicates there are. The corpus is small but written in a controlled natural language called Simplified Technical English. The method charts the problem using document embeddings from doc2vec, clustering with HDBSCAN*, and validation with the Density-Based Clustering Validation (DBCV) index. A survey was sent out to determine a threshold value at which documents stop being duplicates, and a theoretical duplicate count was then calculated using this value.
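The final step of the abstract, counting duplicates once a threshold is known, can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes the threshold is applied to cosine similarity between document embeddings, and the function names, toy vectors, and the threshold of 0.95 are all hypothetical. In practice the vectors would come from a trained doc2vec model rather than being written by hand.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate_pairs(embeddings, threshold):
    """Return index pairs (i, j), i < j, whose similarity is >= threshold.

    Assumption: "duplicate" means cosine similarity at or above the threshold.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(embeddings), 2)
        if cosine_similarity(a, b) >= threshold
    ]


# Made-up vectors standing in for doc2vec document embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]

# Hypothetical threshold; the thesis derives its own value from a survey.
pairs = find_duplicate_pairs(toy_embeddings, threshold=0.95)
print(len(pairs), pairs)
```

Comparing every pair is quadratic in the number of documents, which is acceptable for a small corpus like the one described here; a larger corpus would likely need clustering or an approximate nearest-neighbour index first.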

http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-153873

Simplified Technical English

Computer Science (Datavetenskap (datalogi))

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-153873
Date	January 2018
Creators	Öhrström, Fredrik
Publisher	Linköpings universitet, Interaktiva och kognitiva system
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
