Computer-Aided Optically Scanned Document Information Extraction System

This paper introduces a Computer-Aided Optically Scanned Document Information Extraction System. It extracts information, including the invoice number, issue date, and buyer, from optically scanned documents to meet the demands of customs declaration companies, and outputs the structured information to a relational database. In detail, a software architecture is designed for extracting information from optically scanned documents of diverse structure. In this system, the original document is first classified. If its template is pre-defined in the system, it is passed to template-based extraction to improve extraction performance. Then, an image enhancement method is proposed to improve image classification. This method aims to improve the accuracy of the neural network model by extracting template-related features and actively removing unrelated features. Lastly, the system is implemented. The extraction is programmed in Python, a cross-platform language. The system comprises three parts: a classification module, template-based extraction, and non-template extraction, all of which have APIs and can be run independently. This makes the system flexible and easy to customize for future demands. 445 real-world customs document images were used to evaluate the system. The results show that the system supports diverse documents through non-template extraction and achieves high overall performance with template-based extraction, indicating that the goal was basically achieved.
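The classify-then-dispatch flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the function names (`extract`, `toy_classify`, `acme_extractor`, `generic_extractor`), the template name `acme_invoice`, and the use of plain strings in place of scanned images and OCR output are all assumptions made for the example. In particular, `toy_classify` stands in for the neural network classifier, which the abstract does not specify in detail.

```python
from typing import Callable, Dict, Optional

Fields = Dict[str, str]
Extractor = Callable[[str], Fields]


def extract(document: str,
            classify: Callable[[str], Optional[str]],
            template_extractors: Dict[str, Extractor],
            fallback: Extractor) -> Fields:
    """Classify the document, then route it to the matching extractor."""
    template = classify(document)
    # Use template-based extraction when the template is pre-defined,
    # otherwise fall back to non-template extraction.
    extractor = template_extractors.get(template, fallback)
    return extractor(document)


def toy_classify(document: str) -> Optional[str]:
    # Hypothetical stand-in for the classification module.
    return "acme_invoice" if document.startswith("ACME") else None


def acme_extractor(document: str) -> Fields:
    # Hypothetical template: each field sits on a known line.
    lines = document.splitlines()
    return {"method": "template", "invoice_no": lines[1],
            "issued_date": lines[2], "buyer": lines[3]}


def generic_extractor(document: str) -> Fields:
    # Hypothetical non-template extraction: read "Key: value" pairs.
    fields = {"method": "non-template"}
    for line in document.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields


EXTRACTORS = {"acme_invoice": acme_extractor}
```

Calling `extract(text, toy_classify, EXTRACTORS, generic_extractor)` routes a document with a known template to `acme_extractor` and everything else to `generic_extractor`. Because each stage is an ordinary callable, each can also be invoked on its own, which mirrors the independently runnable modules mentioned in the abstract.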

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:miun-39190
Date: January 2020
Creators: Mei, Zhijie
Publisher: Mittuniversitetet, Institutionen för informationssystem och –teknologi
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess