Return to search

Utilize OCR text to extract receipt data and classify receipts with common Machine Learning algorithms / Använda OCR-text för att extrahera kvittodata och klassificera kvitton med vanliga maskininlärnings algoritmer

This study investigated if it was feasible to use machine learning tools on OCR extracted text data to classify receipts and extract specific data points. Two OCR tools were evaluated, the first was Azure Computer Vision API and the second was Google Drive REST Api, where Google Drive REST Api was the main OCR tool used in the project because of its impressive performance. The classification task mainly tried to predict which of five given categories the receipts belongs to, and also a more challenging task of predicting specific subcategories inside those five larger categories. The data points we where trying to extract was the date of purchase on the receipt and the total price of the transaction. The classification was mainly done with the help of scikit-learn, while the extraction of data points was achieved by a simple custom made N-gram model. The results were promising with about 94 % cross validation score for classifying receipts based on category with the help of a LinearSVC classifier. Our custom model was successful in 72 % of cases for the price data point while the results for extracting the date was less successful with an accuracy of 50 %, which we still consider very promising given the simplistic nature of the custom model.

Identiferoai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-148350
Date January 2018
CreatorsOdd, Joel, Theologou, Emil
PublisherLinköpings universitet, Institutionen för datavetenskap, Linköpings universitet, Institutionen för datavetenskap
Source SetsDiVA Archive at Upsalla University
LanguageEnglish
Detected LanguageEnglish
TypeStudent thesis, info:eu-repo/semantics/bachelorThesis, text
Formatapplication/pdf
Rightsinfo:eu-repo/semantics/openAccess

Page generated in 0.002 seconds