Imbalanced Data Classification with the K-Closest Resemblance Classifier for Remote Sensing and Social Media Texts

Data imbalance has long been a challenge in many areas of automatic classification. Many popular approaches, including over-sampling, under-sampling, and the Synthetic Minority Oversampling Technique (SMOTE), have been developed and tested in previous research. A major problem with these techniques is that they address the imbalance by modifying the original data rather than enabling the classifier itself to learn from imbalanced data. The imbalanced data challenge also arises in tasks such as remote sensing and depression detection, where researchers have tried to overcome it with methods applied at the data pre-processing step. However, the main interest in these tasks is still in applying new classifiers, such as deep learning models, which have powerful classification ability but do not treat data imbalance as a prime cause of lower classification performance.

In this thesis, we demonstrate the performance of the K-Closest Resemblance (K-CR) classifier in evaluation experiments on an urban land cover classification dataset and on two depression detection datasets. The latter two datasets consist of social media texts (tweets); we therefore propose to adopt a feature selection technique, Term Frequency - Category-Based Term Weights (TF-CBTW), and various word embedding techniques (Word2Vec, FastText, GloVe, and the language model BERT). This feature selection method has not been applied before in similar settings, and we show that it helps to improve the efficiency and the results of the K-CR classifier.

Our three experiments show that K-CR can achieve comparable performance on the majority classes and better performance on the minority classes when compared to other classifiers such as Random Forest, K-Nearest Neighbour, Support Vector Machines, Multi-layer Perceptron, Convolutional Neural Networks, and Long Short-Term Memory.

Identifier: oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/41424
Date: 10 November 2020
Creators: Duan, Cheng
Contributors: Inkpen, Diana
Publisher: Université d'Ottawa / University of Ottawa
Source Sets: Université d’Ottawa
Language: English
Detected Language: English
Type: Thesis
Format: application/pdf
