Global ETD Search

Into the bibliography jungle: using random forests to predict dissertations’ reference section

Cited-works lists in Humanities dissertations are typically the result of five years of work. However,
despite the long-standing tradition of reference mining, no research has systematically tapped the
bibliographic data of existing electronic thesis collections. One of the main reasons for this is the
difficulty of creating a tagged gold standard for theses that are around 300 pages long. In this short
paper, we propose a page-based random forest (RF) prediction approach that uses a new corpus of Literary
Studies dissertations from Germany. We also explain the handcrafted but computationally informed
feature-selection process. The evaluation demonstrates that this method achieves an F1 score of 0.88
on this new dataset. In addition, it has the advantage of being derived from an interpretable model,
in which the relevance of each feature for prediction is clear, and it incorporates a simplified
annotation process.
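
To make the page-based framing concrete, the following is a minimal sketch of how a single page might be turned into handcrafted features and how a page-level F1 score is computed. The feature names, the heading list, and the year pattern are illustrative assumptions and are not the feature set described in the paper. In practice, such feature vectors would be passed to a random forest implementation (for example scikit-learn's RandomForestClassifier); that step is omitted here so the sketch stays dependency-free.

```python
import re

# Assumption: a four-digit year between 1500 and 2099 hints at a citation line.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

# Assumption: typical headings that open a reference section (hypothetical list).
HEADINGS = {"literaturverzeichnis", "bibliographie", "bibliography", "references"}


def page_features(page_text):
    """Return hypothetical handcrafted features for one page of text."""
    lines = [ln for ln in page_text.splitlines() if ln.strip()]
    n = len(lines) or 1  # avoid division by zero on empty pages
    return {
        # share of non-empty lines that contain something that looks like a year
        "year_line_ratio": sum(1 for ln in lines if YEAR.search(ln)) / n,
        # average number of commas per non-empty line
        "commas_per_line": sum(ln.count(",") for ln in lines) / n,
        # whether any line is exactly a known reference-section heading
        "has_heading": any(ln.strip().lower() in HEADINGS for ln in lines),
    }


def f1_score(y_true, y_pred):
    """Page-level F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Labelling whole pages as "reference section" or "not reference section", rather than tagging individual citations, is what keeps the annotation effort small: each thesis needs only one binary label per page.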

info:eu-repo/classification/ddc/006

ddc:006

info:eu-repo/classification/ddc/800

ddc:800

Identifier	oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:92321
Date	26 June 2024
Creators	Gutiérrez De la Torre, Silvia E., Niekler, Andreas, Equihua, Julián, Burghardt, Manuel
Publisher	CEUR-WS.org
Source Sets	Hochschulschriftenserver (HSSS) der SLUB Dresden
Language	English
Detected Language	English
Type	info:eu-repo/semantics/publishedVersion, doc-type:conferenceObject, info:eu-repo/semantics/conferenceObject, doc-type:Text
Rights	info:eu-repo/semantics/openAccess
