Return to search

“Embed, embed! There’s knocking at the gate.”

The detection of intertextual references in text corpora is a digital humanities topic that has gained a lot of attention in recent years. While intertextuality – from a literary studies perspective – describes the phenomenon of one text being present in another text, the computational problem at hand is the task of text similarity detection, and more concretely, semantic similarity detection. In this notebook, we introduce the Vectorian as a framework to build queries through word embeddings such as fastText and GloVe. We evaluate the influence of computing document similarity through alignments such as Waterman-Smith-Beyer and two variants of Word Mover’s Distance. We also investigate the performance of state-of-art sentence embeddings like Siamese BERT networks for the task - both as document embeddings and as contextual token embeddings. Overall, we find that Waterman-Smith-Beyer with fastText offers highly competitive performance. The notebook can also be used to upload new data for performing custom search queries.

Identiferoai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:91774
Date30 May 2024
CreatorsBurghardt, Manuel, Liebl, Bernhard
PublisherMelusina Press, Zeitschrift für digitale Geisteswissenschaften
Source SetsHochschulschriftenserver (HSSS) der SLUB Dresden
LanguageEnglish
Detected LanguageEnglish
Typeinfo:eu-repo/semantics/publishedVersion, doc-type:article, info:eu-repo/semantics/article, doc-type:Text
Rightsinfo:eu-repo/semantics/openAccess
Relation2510-1366, 978-2-919815-25-8

Page generated in 0.0023 seconds