
運用光學字元辨識技術建置數位典藏全文資料庫之評估:以明人文集為例 / The Analysis of Use Optical Character Recognition to Establish the Full-text Retrieval Database:A Case Study of the Anthology of Chinese Literature in Ming

數位典藏是將物件以數位影像的形式進行典藏,並放置在網路系統供使用者瀏覽,能達到流通推廣與保存維護的效果。但在目前資訊爆炸的時代,數位典藏若僅透過詮釋資料描述是無法有效幫助使用者獲得內容資訊,唯有將之建置成全文檢索模式,才能方便使用者快速檢索到所需資訊,而光學字元辨識技術(簡稱OCR)能協助進行全文內容的輸出。
本研究藉由實際操作OCR軟體辨識明代古籍,探究古籍版式及影像對於軟體辨識結果之影響;藉由深度訪談訪問有實際參與數位典藏全文化經驗之機構人員,探究機構或個人對於計畫施行之觀點與考量。結果發現,雖然實際辨識結果顯示古籍版式與影像會對於OCR辨識有所影響,綜合訪談內容得知目前技術層面已克服古籍版式的侷限,但對於影像品質的要求仍然很高,意指古籍影像之品質對OCR的辨識影響程度最大;雖然OCR辨識技術已經有所突破,顯示能善用此技術協助進行全文資料庫的建立,但礙於技術陌生、經費預算、人力資源等因素,使得多數機構尚未運用此技術協助執行數位典藏全文化。
本研究建議,機構日後若有興趣執行數位典藏全文化計畫,首先,需要制定經常出適合機構執行的作業流程,並且瞭解自身欲處理物件之狀況,好挑選出適合的輸入處理模式;再者,需要多與技術廠商溝通協調,瞭解所挑選之物件是否符合處理上的成本效益;最後,綜合典藏機構與使用者之需求考量下,建議未來採取與OCR廠商合作的方式,由使用者自行挑選需要物件進行OCR辨識,校對完成後將全文內容回饋給典藏機構。這樣不僅能瞭解使用者需求為何,也能降低機構全文校對所耗費的成本。 / Digital archives preserve objects as digital images and make them available on networked systems for users to browse, which supports both circulation and preservation. In an era of information explosion, however, metadata descriptions alone cannot effectively help users reach the content of archived items. Only full-text retrieval lets users quickly find the information they need, and optical character recognition (OCR) can assist in producing that full text.
This study runs OCR software on ancient books of the Ming dynasty to examine how page format and image quality affect recognition results. It also draws on in-depth interviews with staff at institutions that have hands-on experience with full-text conversion of digital archives, in order to examine institutional and individual views and considerations regarding such projects. The recognition tests show that both the format and the images of ancient books affect OCR results. The interviews indicate that current technology has overcome the limitations posed by ancient-book formats but still demands high image quality; that is, image quality is the factor that most affects OCR recognition. Although OCR has made breakthroughs and can be put to good use in building full-text databases, most institutions have not yet applied it to full-text conversion of their digital archives because of unfamiliarity with the technology, budget constraints, limited human resources, and other factors.
The study suggests that an institution interested in carrying out a full-text digital archive project should first establish a standard operating procedure (SOP) suited to the institution and understand the condition of the objects it intends to process, so that it can choose a suitable input method. Second, it should communicate and coordinate closely with technology vendors to determine whether processing the selected objects is cost-effective. Finally, considering the needs of both archiving institutions and users, the study recommends cooperating with OCR vendors in the future: users select the objects they need for OCR recognition and, after proofreading, return the full text to the archiving institution. This approach not only reveals what users need but also reduces the institution's cost of full-text proofreading.

Identifier: oai:union.ndltd.org:CHENGCHI/G0104155017
Creators: 蔡瀚緯, Tsai, Han Wei
Publisher: 國立政治大學
Source Sets: National Chengchi University Libraries
Language: Chinese
Detected Language: English
Type: text
Rights: Copyright © nccu library on behalf of the copyright holders
