Global ETD Search

Utilize OCR text to extract receipt data and classify receipts with common Machine Learning algorithms / Använda OCR-text för att extrahera kvittodata och klassificera kvitton med vanliga maskininlärnings algoritmer

This study investigated whether machine learning tools can be applied to OCR-extracted text to classify receipts and extract specific data points. Two OCR tools were evaluated: the Azure Computer Vision API and the Google Drive REST API. The Google Drive REST API became the main OCR tool in the project because of its impressive performance. The main classification task was to predict which of five given categories a receipt belongs to; a more challenging task was to predict specific subcategories within those five categories. The data points we were trying to extract were the date of purchase and the total price of the transaction. Classification was done mainly with scikit-learn, while the data points were extracted with a simple custom-made N-gram model. The results were promising: a LinearSVC classifier reached a cross-validation score of about 94% when classifying receipts by category. Our custom model extracted the price correctly in 72% of cases, while date extraction was less successful, with an accuracy of 50%, which we still consider very promising given the simplistic nature of the custom model.
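The abstract names scikit-learn and LinearSVC but does not describe the feature representation or the category names. As an illustration only, the following minimal sketch assumes TF-IDF features over word unigrams and bigrams; the receipt texts and the labels ("groceries", "restaurant", "fuel") are made up for the example and are not taken from the thesis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical OCR output and labels; the thesis data set is not reproduced here.
texts = [
    "ICA milk bread eggs total 84.50",
    "COOP milk cheese apples total 120.00",
    "pizza beer table 4 total 310.00",
    "pasta wine dessert table 2 total 455.00",
    "diesel pump 3 litres 42.1 total 690.00",
    "petrol pump 1 litres 30.5 total 512.30",
]
labels = ["groceries", "groceries", "restaurant", "restaurant", "fuel", "fuel"]

# Assumed feature choice: TF-IDF over word unigrams and bigrams,
# fed into a linear support vector classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)

# Classify two unseen (also hypothetical) receipts.
new_receipts = [
    "milk eggs bread total 59.90",
    "diesel pump 2 litres 20.0 total 330.00",
]
predictions = list(model.predict(new_receipts))
print(predictions)
```

On a realistically sized data set, a cross-validation score like the one reported in the abstract could be estimated by passing the same pipeline to scikit-learn's `cross_val_score`; the toy data above is too small for that to be meaningful.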

http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-148350

Optical character recognition

Machine learning

Receipts

Information Systems

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-148350
Date	January 2018
Creators	Odd, Joel; Theologou, Emil
Publisher	Linköpings universitet, Institutionen för datavetenskap
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
