
Into the bibliography jungle: using random forests to predict dissertations’ reference section

Cited-works lists in Humanities dissertations are typically the result of five years of work. However,
despite the long-standing tradition of reference mining, no research has systematically tapped the
bibliographic data of existing electronic thesis collections. One of the main reasons for this is the
difficulty of creating a tagged gold standard for theses that run to around 300 pages. In this short paper,
we propose a page-based random forest (RF) prediction approach which uses a new corpus of Literary
Studies Dissertations from Germany. Moreover, we will explain the handcrafted but computationally
informed feature-selection process. The evaluation demonstrates that this method achieves an F1 score
of 0.88 on this new dataset. In addition, it has the advantage of being derived from an interpretable
model, where feature relevance for prediction is clear, and incorporates a simplified annotation process.
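
As an illustration of how a page-level F1 score like the one reported above can be computed, here is a minimal sketch. It assumes each page carries a binary label (1 = page belongs to the reference section, 0 = any other page); the function name `page_f1` and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Sequence


def page_f1(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """F1 for the positive class (1 = page is part of the reference section)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same pages")

    # Count true positives, false positives, and false negatives page by page.
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)

    # Without any true positive, precision and recall are zero or undefined.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels for a toy 10-page thesis (assumption, for illustration only).
gold = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
score = page_f1(gold, predicted)
print(f"page-level F1: {score:.2f}")
```

In this toy case the classifier starts the reference section one page too early and ends it one page too early, so precision and recall are both 3/4.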

Identifier: oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:92321
Date: 26 June 2024
Creators: Gutiérrez De la Torre, Silvia E., Niekler, Andreas, Equihua, Julián, Burghardt, Manuel
Publisher: CEUR-WS.org
Source Sets: Hochschulschriftenserver (HSSS) der SLUB Dresden
Language: English
Detected Language: English
Type: info:eu-repo/semantics/publishedVersion, doc-type:conferenceObject, info:eu-repo/semantics/conferenceObject, doc-type:Text
Rights: info:eu-repo/semantics/openAccess
