In this article, we describe a two-step processing pipeline for identifying text reuse of Shakespeare’s
Hamlet in a corpus of postmodern fiction by comparing n-grams from both sources. A key feature of
our approach lies in a pre-filtering step, in which we select target sentences in the fiction corpus that
are potential candidates for Hamlet text reuse. Without pre-filtering, the number of text reuse pairs
that are not actual quotes would be extremely high. In a second filtering step, we compare potential
text reuse pairs by their vector representation using a neural network trained in an unsupervised
manner. We found that using vector similarity produces a problematically large number of false positives.
Because the vector representations are learned with an unsupervised training approach, they capture
aspects of similarity that are unfavorable for our use case.
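
As a rough illustration of the two comparisons named in the abstract, the following minimal sketch computes shared word n-grams between a Hamlet line and candidate sentences, and a cosine similarity between two vectors. Everything in it is an assumption made for illustration: the function names, the tokenizer, the choice of n = 3, and the example sentences are hypothetical, and the pre-filtering criteria and the neural network used by the authors are not specified here and are not reproduced.

```python
import re
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(source, target, n=3):
    """Return the n-grams that occur in both the source and the target text."""
    return ngrams(tokenize(source), n) & ngrams(tokenize(target), n)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical example data, not taken from the paper's corpus.
hamlet_line = "To be, or not to be, that is the question"
candidates = [
    "He asked himself whether to be or not to be a writer.",
    "The rain fell on the empty street.",
]

for sentence in candidates:
    overlap = shared_ngrams(hamlet_line, sentence)
    print(len(overlap), sentence)
```

In a pipeline of this kind, sentence pairs with at least one shared n-gram would be passed on to the second step, where `cosine` would be applied to the vectors produced by whatever sentence encoder is in use. The false positives reported in the abstract arise in that second step, which this sketch does not attempt to model.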
Identifier | oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:92167 |
Date | 20 June 2024 |
Creators | Bryan, Maximilian, Burghardt, Manuel, Molz, Johannes |
Publisher | CEUR-WS.org |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/publishedVersion, doc-type:conferenceObject, info:eu-repo/semantics/conferenceObject, doc-type:Text |
Rights | info:eu-repo/semantics/openAccess |