Return to search

Optimising Machine Learning Models for Imbalanced Swedish Text Financial Datasets: A Study on Receipt Classification : Exploring Balancing Methods, Naive Bayes Algorithms, and Performance Tradeoffs

This thesis investigates imbalanced Swedish text financial datasets, specifically receipt classification using machine learning models. The study explores the effectiveness of under-sampling and over-sampling methods for Naive Bayes algorithms, collaborating with Fortnox for a controlled experiment. Evaluation metrics compare balancing methods regarding the accuracy, Matthews's correlation coefficient (MCC) , F1 score, precision, and recall. Findings contribute to Swedish text classification, providing insights into balancing methods. The thesis report examines balancing methods and parameter tuning on machine learning models for imbalanced datasets. Multinomial Naive Bayes (MultiNB) algorithms in Natural language processing (NLP) are studied, with potential application in image classification for assessing industrial thin component deformation. Experiments show balancing methods significantly affect MCC and recall, with a recall-MCC-accuracy tradeoff. Smaller alpha values generally improve accuracy.  Synthetic Minority Oversampling Technique  (SMOTE) and Tomek's algorithm for removing links developed in 1976 by Ivan Tomek. First Tomek, then SMOTE (TomekSMOTE)  yield promising accuracy improvements. Due to time constraints, Over-sampling using SMOTE and cleaning using Tomek links. First SMOTE, then Tomek (SMOTETomek) training is incomplete. This thesis report finds the best MCC is achieved when $\alpha$ is 0.01 on imbalanced datasets.

Identiferoai:union.ndltd.org:UPSALLA1/oai:DiVA.org:lnu-122052
Date January 2023
CreatorsHu, Li Ang, Ma, Long
PublisherLinnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM)
Source SetsDiVA Archive at Upsalla University
LanguageEnglish
Detected LanguageEnglish
TypeStudent thesis, info:eu-repo/semantics/bachelorThesis, text
Formatapplication/pdf
Rightsinfo:eu-repo/semantics/openAccess

Page generated in 0.0021 seconds