The Document Similarity Network: A Novel Technique for Visualizing Relationships in Text Corpora

With the abundance of written information available online, it is useful to be able to automatically synthesize and extract meaningful information from text corpora. We present a unique method for visualizing relationships between documents in a text corpus. By using Latent Dirichlet Allocation to extract topics from the corpus, we create a graph whose nodes represent individual documents and whose edge weights indicate the distance between topic distributions in documents. These edge lengths are then scaled using multidimensional scaling techniques, such that more similar documents are clustered together. Applying this method to several datasets, we demonstrate that these graphs are useful in visually representing high-dimensional document clustering in topic-space.
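The graph-construction step described in the abstract can be illustrated with a minimal sketch. It assumes each document already has a topic distribution (as LDA would produce) and uses the Hellinger distance as the edge weight; the abstract does not say which distance the thesis uses, so that choice, the function names `hellinger` and `similarity_edges`, and the sample distributions are all hypothetical. The multidimensional scaling step that lays out the graph is not shown.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0 for identical distributions and 1 for distributions
    with no overlapping support. Assumed metric, not taken from the thesis.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def similarity_edges(doc_topics):
    """Map every unordered pair of document ids to its topic-space distance."""
    return {
        (a, b): hellinger(doc_topics[a], doc_topics[b])
        for a, b in combinations(sorted(doc_topics), 2)
    }


# Hypothetical document-topic distributions standing in for LDA output.
doc_topics = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.75, 0.20, 0.05],
    "doc_c": [0.05, 0.10, 0.85],
}

edges = similarity_edges(doc_topics)

# List edges from most similar pair to least similar pair.
for (a, b), distance in sorted(edges.items(), key=lambda kv: kv[1]):
    print(f"{a} -- {b}: {distance:.3f}")
```

In this sketch, documents with similar topic mixtures get short edges, which is the property a later scaling step would rely on to place them close together.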

Other Computer Sciences

Other Mathematics

Identifier	oai:union.ndltd.org:CLAREMONT/oai:scholarship.claremont.edu:hmc_theses-1105
Date	01 January 2017
Creators	Baker, Dylan
Publisher	Scholarship @ Claremont
Source Sets	Claremont Colleges
Detected Language	English
Type	text
Format	application/pdf
Source	HMC Senior Theses
Rights	© 2017 Dylan K. Baker, default