Spelling suggestions: "subject:"duplicate"" "subject:"duplicated""
11 |
Cluster Analysis with Meaning : Detecting Texts that Convey the Same Message / Klusteranalys med mening : Detektering av texter som uttrycker samma sakÖhrström, Fredrik January 2018 (has links)
Textual duplicates can be hard to detect as they differ in words but have similar semantic meaning. At Etteplan, a technical documentation company, they have many writers that accidentally re-write existing instructions explaining procedures. These "duplicates" clutter the database. This is not desired because it is duplicate work. The condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst, and also how to calculate how many duplicates there are. The corpus is small, but written in a controlled natural language called Simplified Technical English. The method uses document embeddings from doc2vec and clustering by use of HDBSCAN* and validation using Density-Based Clustering Validation index (DBCV), to chart the problems. A survey was sent out to try to determine a threshold value of when documents stop being duplicates, and then using this value, a theoretical duplicate count was calculated.
|
Page generated in 0.0632 seconds