1 |
Analyzing and Navigating Electronic Theses and DissertationsAhuja, Aman 21 July 2023 (has links)
Electronic Theses and Dissertations (ETDs) contain valuable scholarly information that can be of immense value to the scholarly community. Millions of ETDs are now publicly available online, often through one of many digital libraries. However, since a majority of these digital libraries are institutional repositories with the objective being content archiving, they often lack end-user services needed to make this valuable data useful for the scholarly community. To effectively utilize such data to address the information needs of users, digital libraries should support various end-user services such as document search and browsing, document recommendation, as well as services to make navigation of long PDF documents easier. In recent years, with advances in the field of machine learning for text data, several techniques have been proposed to support such end-user services. However, limited research has been conducted towards integrating such techniques with digital libraries.
This research is aimed at building tools and techniques for discovering and accessing the knowledge buried in ETDs, as well as to support end-user services for digital libraries, such as document browsing and long document navigation. First, we review several machine learning models that can be used to support such services. Next, to support a comprehensive evaluation of different models, as well as to train models that are tailored to the ETD data, we introduce several new datasets from the ETD domain. To minimize the resources required to develop high quality training datasets required for supervised training, a novel AI-aided annotation method is also discussed. Finally, we propose techniques and frameworks to support the various digital library services such as search, browsing, and recommendation. The key contributions of this research are as follows:
- A system to help with parsing long scholarly documents such as ETDs by means of object-detection methods trained to extract digital objects from long documents. The parsed documents can be used for further downstream tasks such as long document navigation, figure and/or table search, etc.
- Datasets to support supervised training of object detection models on scholarly documents of multiple types, such as born-digital and scanned. In addition to manually annotated datasets, a framework (along with the resulting dataset) for AI-aided annotation also is proposed.
- A web-based system for information extraction from long PDF theses and dissertations, into a structured format such as XML, aimed at making scholarly literature more accessible to users with disabilities.
- A topic-modeling based framework to support exploration tasks such as searching and/or browsing documents (and document portions, e.g., chapters) by topic, document recommendation, topic recommendation, and describing temporal topic trends. / Doctor of Philosophy / Electronic Theses and Dissertations (ETDs) contain valuable scholarly information that can be of immense value to the research community. Millions of ETDs are now publicly available online, often through one of many online digital libraries. However, since a majority of these digital libraries are institutional repositories with the objective being content archiving, they often lack end-user services needed to make this valuable data useful for the scholarly community. To effectively utilize such data to address the information needs of users, digital libraries should support various end-user services such as document search and browsing, document recommendation, as well as services to make navigation of long PDF documents easier and accessible. Several advances in the field of machine learning for text data in recent years have led to the development of techniques that can serve as the backbone of such end-user services. However, limited research has been conducted towards integrating such techniques with digital libraries. This research is aimed at building tools and techniques for discovering and accessing the knowledge buried in ETDs, by parsing the information contained in the long PDF documents that make up ETDs, into a more compute-friendly format. This would enable researchers and developers to build end-user services for digital libraries. We also propose a framework to support document browsing and long document navigation, which are some of the important end-user services required in digital libraries.
|
2 |
More Obstacles for the Graduate Student Author: Open Access ETDs Trigger Plagiarism DetectorsDawson, DeDe, Langrell, Kate 14 December 2023 (has links) (PDF)
Supporting graduate students as authors is one of the many services we provide at the University Library, University of Saskatchewan (USask). Graduate students often submit articles to journals based on content from their electronic theses or dissertations (ETDs). Recently, we have noticed an increase in the number of such article submissions being flagged for possible rejection on “plagiarism” or “prior publication” grounds. We suspect this may be because plagiarism detection software is increasingly being integrated into publishers’ article submission systems. This software is triggered by the existence of the student’s open access (OA) ETD in our institutional repository. This happens despite OA ETD inclusion in repositories being a common practice and despite journal policies often allowing submission of articles based on ETDs. We review common practices and guidelines around publishing of ETD content, two recent cases of journals initially rejecting such submissions by graduate student authors of our institution, and our reflections on this issue and how to address it.
|
3 |
Improving the Accessibility of Arabic Electronic Theses and Dissertations (ETDs) with Metadata and ClassificationAbdelrahman, Eman January 2021 (has links)
Much research work has been done to extract data from scientific papers, journals, and articles. However, Electronic Theses and Dissertations (ETDs) remain an unexplored genre of data in the research fields of natural language processing and machine learning. Moreover, much of the related research involved data that is in the English language. Arabic data such as news and tweets have begun to receive some attention in the past decade. However, Arabic ETDs remain an untapped source of data despite the vast number of benefits to students and future generations of scholars. Some ways of improving the browsability and accessibility of data include data annotation, indexing, parsing, translation, and classification. Classification is essential for the searchability and management of data, which can be manual or automated. The latter is beneficial when handling growing volumes of data. There are two main roadblocks to performing automatic subject classification on Arabic ETDs. The first is the unavailability of a public corpus of Arabic ETDs. The second is the Arabic language’s linguistic complexity, especially in academic documents. This research presents the Otrouha project, which aims at building a corpus of key metadata of Arabic ETDs as well as providing a methodology for their automatic subject classification. The first goal is aided by collecting data from the AskZad Digital Library. The second goal is achieved by exploring different machine learning and deep learning techniques. The experiments’ results show that deep learning using pretrained language models gave the highest classification performance, indicating that language models significantly contribute to natural language understanding. / M.S. / An Electronic Thesis or Dissertation (ETD) is an openly-accessible electronic version of a graduate student’s research thesis or dissertation. It documents their main research effort that has taken place and becomes available in the University Library instead of a paper copy. Over time, collections of ETDs have been gathered and made available online through different digital libraries. ETDs are a valuable source of information for scholars and researchers, as well as librarians. With the digitalization move in most Middle Eastern Universities, the need to make Arabic ETDs more accessible significantly increases as their numbers increase. One of the ways to improve their accessibility and searchability is through providing automatic classification instead of manual classification. This thesis project focuses on building a corpus of metadata of Arabic ETDs and building a framework for their automatic subject classification. This is expected to pave the way for more exploratory research on this valuable genre of data.
|
Page generated in 0.0218 seconds