Since Altman’s 1968 discriminant analysis model for corporate bankruptcy prediction, numerous studies have applied statistical and machine learning (ML) models to bankruptcy prediction in various contexts. ML models have proven highly accurate in predicting bankruptcy up to three years before the event, more so than statistical models. A major limitation of ML models, however, is their inability to handle highly imbalanced datasets, which has led to the development of a plethora of oversampling and undersampling methods for addressing class imbalance. Yet current research on the impact of different sampling methods on the predictive performance of ML models is fragmented, inconsistent, and limited. This thesis investigated whether the choice of sampling method led to significant differences in the performance of five predictive algorithms: logistic regression, multiple discriminant analysis (MDA), random forests, Extreme Gradient Boosting (XGBoost), and support vector machines (SVM). Four oversampling methods (random oversampling (ROWR), the synthetic minority oversampling technique (SMOTE), oversampling based on propensity scores (OBPS), and oversampling based on weighted nearest neighbour (WNN)) and three undersampling methods (random undersampling (RU), undersampling based on clustering from nearest neighbour (CFNN), and undersampling based on clustering from Gaussian mixture models (GMM)) were tested. The dataset consisted of non-listed Swedish restaurant businesses (1998–2021) obtained from the business registry of Sweden, comprising 10,696 companies of which 335 were bankrupt. Results, assessed through 10-fold cross-validated AUC scores, revealed that oversampling methods generally outperformed undersampling methods. SMOTE performed best with four of the five algorithms, while WNN performed best with the random forest model. Wilcoxon’s signed-rank test showed that some differences between oversampling and undersampling methods were statistically significant, whereas differences within each group were not. Further, while XGBoost achieved the highest AUC score of all the predictive algorithms, it was also the most sensitive to the choice of sampling method; MDA was the least sensitive. Overall, it was concluded that the choice of sampling method can significantly impact the performance of different algorithms, and that users should therefore consider both the algorithm’s sensitivity and the comparative performance of the sampling methods. The thesis’s results challenge some prior findings and suggest avenues for further exploration, highlighting the importance of selecting appropriate sampling methods when working with highly imbalanced datasets.
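As a minimal sketch of the kind of evaluation setup described above (not the thesis’s actual code), the example below pairs one oversampling method (SMOTE) and one undersampling method (random undersampling) with a single classifier, compares their 10-fold cross-validated AUC scores on shared folds, and applies a Wilcoxon signed-rank test to the paired fold scores. The synthetic dataset, classifier choice, and parameters are illustrative assumptions; the Swedish registry data used in the thesis is not reproduced here.

```python
# Sketch: oversampling vs. undersampling compared via 10-fold cross-validated AUC,
# with a Wilcoxon signed-rank test on the paired fold scores. Illustrative only.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Hypothetical imbalanced data (~3% positive class), standing in for the
# roughly 335 bankruptcies among 10,696 firms reported in the thesis.
X, y = make_classification(n_samples=10_696, n_features=20,
                           weights=[0.969], random_state=42)

# Fixed folds so both pipelines are scored on identical splits (paired design).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

pipelines = {
    "SMOTE": Pipeline([("sampler", SMOTE(random_state=42)),
                       ("clf", LogisticRegression(max_iter=1000))]),
    "RU": Pipeline([("sampler", RandomUnderSampler(random_state=42)),
                    ("clf", LogisticRegression(max_iter=1000))]),
}

# Per-fold AUC scores; sampling is applied only inside each training fold.
scores = {name: cross_val_score(pipe, X, y, scoring="roc_auc", cv=cv)
          for name, pipe in pipelines.items()}

for name, s in scores.items():
    print(f"{name}: mean AUC = {s.mean():.3f}")

# Wilcoxon signed-rank test on the paired fold-level AUC scores.
stat, p = wilcoxon(scores["SMOTE"], scores["RU"])
print(f"Wilcoxon signed-rank: W = {stat:.1f}, p = {p:.3f}")
```

Using an imblearn Pipeline keeps the resampling step inside cross-validation, so the minority class is never synthetically inflated in the held-out fold; the same pattern extends to the other samplers and classifiers compared in the thesis.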
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:du-48512
Date | January 2024 |
Creators | Mahembe, Wonder |
Source Sets | DiVA Archive at Uppsala University
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |