Natural language processing and machine learning have made it much easier to mine data from journals, conference proceedings, and other digital library documents. However, these advances do not extend well to book-length documents such as electronic theses and dissertations (ETDs). ETDs contain extensive research data, and stakeholders (including researchers, librarians, students, and educators) can benefit from increased access to this corpus. The corpus is challenging to work with because it covers a wide variety of disciplines and uses domain-specific language, and prior systems are not tuned to it. This research aims to increase the accessibility of ETDs by automatically classifying the chapters of an ETD using machine learning and deep learning techniques. The work uses an ETD-centric target classification system. It demonstrates the use of custom-trained word and document embeddings to generate better vector representations of this corpus, and it describes a methodology that leverages extractive summaries of ETD chapters to aid classification. Our findings indicate that custom embeddings and summarization techniques can increase classifier performance. The chapter-level labels generated by this research help to identify the level of interdisciplinarity in the corpus. The automatic classifiers could also be used in a search engine interface to help users find the most appropriate chapters.

Master of Science

Electronic Theses and Dissertations (ETDs) are submitted by students at the end of their academic study. These works contain research information pertinent to a given field. Increasing the accessibility of such documents will benefit many stakeholders, including students, researchers, librarians, and educators.
In recent years, a great deal of research has been conducted to better extract information from textual documents using machine learning and natural language processing. However, these advances have not been applied to increase the accessibility of ETDs. This research aims to automatically classify chapters extracted from ETDs, which will reduce the human effort required to label the key parts of these book-length documents. Additionally, when used by search engines, such categorization can help users more easily find the chapters most relevant to their research.
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/99294 |
Date | 07 July 2020 |
Creators | Jude, Palakh Mignonne |
Contributors | Computer Science, Fox, Edward A., North, Christopher L., Karpatne, Anuj |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |