Cited-works lists in Humanities dissertations are typically the result of five years of work. However,
despite the long-standing tradition of reference mining, no research has systematically tapped the
bibliographic data of existing electronic thesis collections. One of the main reasons for this is the
difficulty of creating a tagged gold standard for theses that run to around 300 pages. In this short paper,
we propose a page-based random forest (RF) prediction approach which uses a new corpus of Literary
Studies dissertations from Germany. Moreover, we explain the handcrafted but computationally
informed feature-selection process. The evaluation demonstrates that this method achieves an F1 score
of 0.88 on this new dataset. In addition, it has the advantage of being derived from an interpretable
model, where feature relevance for prediction is clear, and incorporates a simplified annotation process.
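
As a reading aid for the reported F1 score, the following minimal sketch shows how F1 could be computed when each page of a thesis carries one binary label. The function name `page_f1`, the label convention (1 = page belongs to the cited-works list, 0 = any other page), and the toy data are assumptions made for illustration. The abstract does not state which F1 variant or averaging scheme was used, so this is not necessarily the authors' evaluation procedure.

```python
def page_f1(gold, predicted):
    """F1 for the positive class over per-page binary labels.

    Assumed convention (not from the abstract): 1 marks a page that
    belongs to the cited-works list, 0 marks any other page.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Count true positives, false positives and false negatives page by page.
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        # No correctly predicted positive page: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 10-page document: the last four pages are the cited-works list.
gold = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical prediction that starts and ends the list one page too early.
pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
score = page_f1(gold, pred)
print(f"page-level F1: {score:.2f}")
```

Labelling whole pages rather than individual references keeps the annotation unit coarse, which fits the simplified annotation process mentioned above.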
Identifier | oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:92321 |
Date | 26 June 2024 |
Creators | Gutiérrez De la Torre, Silvia E., Niekler, Andreas, Equihua, Julián, Burghardt, Manuel |
Publisher | CEUR-WS.org |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/publishedVersion, doc-type:conferenceObject, info:eu-repo/semantics/conferenceObject, doc-type:Text |
Rights | info:eu-repo/semantics/openAccess |