Manual invoice processing is time-consuming and error-prone, and it can be done more efficiently with automation software that minimizes the need for human input. Amazon Textract is a software-as-a-service offering from Amazon Web Services built for that purpose. It uses machine learning models to extract data from both general and financial documents, such as receipts and invoices. The service supports several widely spoken languages, but not Swedish at the time of writing. This thesis explores the potential and accuracy of Amazon Textract in extracting data from Swedish invoices using the English setting. Specifically, it examines the accuracy of extracting line items and Swedish letters, and it explores the potential for correcting incorrectly extracted data. To do so, each invoice is tested against a set of defined categories by comparing the Amazon Textract extraction with the correctly labeled data. The categories are emptiness (no data was extracted), equality, missing and added line items, and characters missing from or added to otherwise correct line item strings. The invoices themselves are divided into two groups: structured and semi-structured. The tests are mainly conducted on the service’s dedicated API method for extracting data from financial documents, but a comparison with the table extraction API method is also made to gain more insight into Amazon Textract’s capability. The results suggest that Amazon Textract is quite inaccurate when extracting line item data from Swedish invoices, so manual post-processing of the data is generally needed to ensure its correctness. However, it performed better on structured invoices, where it scored 70% in equality and 100% in 2 out of 6 invoice layouts. The Swedish character accuracy was 66%.
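The comparison categories named in the abstract could be computed roughly as in the sketch below. This is an illustration only, not the thesis's actual implementation: the function names (`compare_line_items`, `char_diff`, `swedish_char_accuracy`) are hypothetical, equality is assumed to be order-sensitive, and Swedish character accuracy is assumed to mean the share of å, ä and ö in the labeled string that survive unchanged in the extraction.

```python
from collections import Counter
from difflib import SequenceMatcher

SWEDISH_LETTERS = set("åäöÅÄÖ")


def compare_line_items(extracted, labeled):
    """Classify one invoice's extracted line items against the labeled ones.

    Both arguments are lists of line item strings. The category
    definitions are assumptions made for this sketch.
    """
    ext, lab = Counter(extracted), Counter(labeled)
    return {
        "empty": len(extracted) == 0,               # nothing was extracted
        "equal": extracted == labeled,              # same items, same order
        "missing": sorted((lab - ext).elements()),  # labeled but not extracted
        "added": sorted((ext - lab).elements()),    # extracted but not labeled
    }


def char_diff(extracted, labeled):
    """Return (missing, added) characters for one line item string."""
    missing, added = [], []
    matcher = SequenceMatcher(None, labeled, extracted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(labeled[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(extracted[j1:j2])
    return missing, added


def swedish_char_accuracy(extracted, labeled):
    """Share of Swedish letters in `labeled` that are preserved in `extracted`.

    Returns None when the labeled string contains no Swedish letters.
    """
    total = sum(ch in SWEDISH_LETTERS for ch in labeled)
    if total == 0:
        return None
    preserved = 0
    matcher = SequenceMatcher(None, labeled, extracted, autojunk=False)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag == "equal":
            preserved += sum(ch in SWEDISH_LETTERS for ch in labeled[i1:i2])
    return preserved / total
```

For example, comparing the labeled string `"Kaffe mjölk 2 st"` with the extraction `"Kaffe mjolk 2 st"` reports `ö` as a missing character and `o` as an added one, which is the kind of near-miss that a post-processing step could try to correct.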
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-504684 |
Date | January 2022 |
Creators | Kadir, Avin |
Publisher | Uppsala universitet, Datalogi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Relation | UPTEC IT, 1401-5749 ; 23008 |