Purpose. This work addresses an escalating problem within the realm of scientific publishing, that stems from accelerated publication rates of article formats difficult to process automatically. The amount of manual labour required to organise a comprehensive corpus of relevant literature has long been impractical. This has, in effect, reduced research efficiency and delayed scientific advancement. Two complementary approaches meant to alleviate this problem are detailed and improved upon beyond the current state-of-the-art, namely logical structure recovery of articles and keyphrase extraction. Methodology. The first approach targets the issue of flat-format publishing. It performs a structural analysis of the camera-ready PDF article and recognises its fine-grained organisation over logical units. The second approach is the application of a keyphrase extraction algorithm that relies on rhetorical information from the recovered structure to better contour an article’s true points of focus. A recount of the scientific article’s function, content and structure is provided, along with insights into how different logical components such as section headings or the bibliography can be automatically identified and utilised for higher-quality keyphrase extraction. Findings. Structure recovery can be carried out independently of an article’s formatting specifics, by exploiting conventional dependencies between logical components. In addition, access to an article’s logical structure is beneficial across term extraction approaches, reducing input noise and facilitating the emphasis of regions of interest. Value. The first part of this work details a novel method for recovering the rhetorical structure of scientific articles that is competitive with state-of-the-art machine learning techniques, yet requires no layout-specific tuning or prior training. The second part showcases a keyphrase extraction algorithm that outperforms other solutions in an established benchmark, yet does not rely on collection statistics or external knowledge sources in order to be proficient.
Identifer | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:618081 |
Date | January 2014 |
Creators | Constantin, Alexandru |
Publisher | University of Manchester |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | https://www.research.manchester.ac.uk/portal/en/theses/automatic-structure-and-keyphrase-analysis-of-scientific-publications(2cfe0b83-5cbb-4305-942c-031945437056).html |
Page generated in 0.0021 seconds