Background. Information retrieval systems are becoming increasingly popular; they help people retrieve information more efficiently and speed up daily tasks. In this context, image processing technology plays an important role by transcribing the content of printed or handwritten documents into digital data for information retrieval systems. This transcription procedure is called document digitization. It employs image processing techniques such as layout analysis and word recognition to segment the document content and transcribe the image content into words. A Swedish company (ArkivDigital® AB) needs to transcribe its document data into digital data. Objectives. The aim of this study is to find an effective solution for extracting the document layout of Swedish handwritten historical documents, which are characterized by tabular forms containing handwritten content. To this end, the outcomes of applying OCRopus, OCRfeeder, traditional image processing techniques, and machine learning techniques to Swedish historical handwritten documents are compared and studied. Methods. Implementation and experiment are used to develop three comparative solutions in this study. The first is Hessian filtering with a mask operation; the second is Gabor filtering with a morphological open operation; the third is Gabor filtering with machine learning classification. In the third solution, different alternatives were explored to build the document layout extraction pipeline. First, the Hessian filter and the Gabor filter are evaluated. Second, images are filtered with the better of the two filters from the previous stage, and the filtered images are refined with the Hough line transform. Third, transfer learning features and custom features are extracted. Fourth, the extracted features are fed to a classifier and the results are analyzed.
After all the solutions were implemented, they were applied to a sample set of the Swedish historical handwritten documents, and their performance was compared by means of a survey. Results. Both open-source OCR systems, OCRopus and OCRfeeder, fail to deliver a usable outcome because they are designed to handle general document layouts rather than table layouts. The traditional image processing solutions work in more than half of the cases, but not well. Combining traditional image processing techniques with machine learning techniques gives the best result, but at a great time cost. Conclusions. The results show that existing OCR systems cannot carry out the layout analysis task on our Swedish historical handwritten documents. Traditional image processing techniques are capable of extracting the general table layout in these documents. Introducing machine learning techniques allows a better and more accurate table layout to be extracted, but at a greater time cost. / Scalable resource-efficient systems for big data analytics
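The morphological open operation named in the second solution can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the page has already been filtered and binarized into a grid of 0/1 values, and the function names and the line-length parameter `k` are hypothetical choices made for this example. Opening with a long, thin structuring element keeps only strokes at least `k` pixels long in that direction, which tends to preserve ruled table lines while discarding short handwriting strokes.

```python
def erode_rows(img, k):
    """Erode each row with a 1 x k structuring element (k odd).

    A pixel survives only if the whole k-wide window around it is foreground.
    Pixels closer than k // 2 to the left or right border are set to 0.
    """
    r = k // 2
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(r, w - r):
            if all(img[y][x + d] for d in range(-r, r + 1)):
                out[y][x] = 1
    return out


def dilate_rows(img, k):
    """Dilate each row with a 1 x k structuring element (k odd)."""
    r = k // 2
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            lo, hi = max(0, x - r), min(w, x + r + 1)
            if any(img[y][lo:hi]):
                out[y][x] = 1
    return out


def open_rows(img, k):
    """Morphological opening: erosion followed by dilation."""
    return dilate_rows(erode_rows(img, k), k)


def transpose(img):
    """Swap rows and columns so the row-wise opening can act on columns."""
    return [list(col) for col in zip(*img)]


def extract_table_lines(img, k):
    """Union of horizontal and vertical runs that are at least k pixels long."""
    horizontal = open_rows(img, k)
    vertical = transpose(open_rows(transpose(img), k))
    return [[a | b for a, b in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]
```

In practice `k` would be chosen relative to the expected minimum length of a table ruling in pixels; a value that is too small lets long handwriting strokes through, while a value that is too large drops short cell borders.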
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:bth-17643 |
Date | January 2019 |
Creators | Liang, Xusheng |
Publisher | Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |