Most documents are now stored in electronic form, and there is high demand for organizing and categorizing them efficiently. As a result, automated text classification has attracted significant attention from both science and industry, and has been applied to information retrieval, information filtering, news classification, and similar tasks. The goal of this project is to automatically classify photos taken in Visma Mobile Scanner as invoices or receipts, based on text previously extracted from them. First, several OCR tools available on the market were evaluated to find the most accurate one for text extraction, which turned out to be ABBYY FineReader. The machine learning tool WEKA was used for text classification, with a focus on the Naïve Bayes classifier. Because the Naïve Bayes implementation provided by WEKA does not support some advances in text classification, such as N-grams and Laplace smoothing, an improved Naïve Bayes classifier, specialized for text classification in general and invoice/receipt classification in particular, was implemented. The main parts of this research are improving the Naïve Bayes classifier, investigating how it can be adapted to the problem domain, and evaluating the resulting classification accuracy against the generic Naïve Bayes. Experimental results show that the specialized Naïve Bayes classifier achieves the highest accuracy. With the Fixed penalty feature applied, the best result of 95.6522% accuracy in cross-validation mode was achieved. With more accurate text extraction, the accuracy is even higher.
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:lnu-54647 |
Date | January 2016 |
Creators | Kaci, Iuliia |
Publisher | Linnéuniversitetet, Institutionen för datavetenskap (DV) |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |