
Investigating unsupervised feature learning for email spam classification

A dissertation submitted in partial fulfillment of the requirements for the degree
Master of Science.
School of Computer Science and Applied Mathematics,
Faculty of Science,
University of the Witwatersrand, Johannesburg.
November 2017 / In cyberspace, spam emails are used to extract sensitive information from
victims through social engineering. Various classification systems have
been employed previously to identify spam emails. The primary objective of email spam
classification systems is to classify incoming email as either legitimate (non-spam) or
spam. The spam classification task can thus be regarded as a two-class classification
problem. This kind of problem involves the use of various classifiers such
as Decision Trees (DTs) and Support Vector Machines (SVMs). DTs and SVMs have
been shown to perform well on email spam classification tasks. Several studies have
failed to mention how these classifiers were optimized in terms of their hyperparameters.
As a result, poor performance was encountered with complex datasets. This is
because the SVM classifier depends on the selection of the kernel function and the optimization
of kernel hyperparameters. Additionally, many studies on the spam email filtering
task use words and characters to compute a Term-Frequency (TF) based feature space.
However, a TF-based feature space leads to a sparse representation due to continuous
vocabulary growth. This problem is linked with the curse of dimensionality. Overcoming
dimensionality issues involves the use of feature reduction techniques. Traditional
feature reduction techniques, such as Information Gain (IG), may cause feature
representations to lose important features for identifying spam emails. This
study demonstrates the use of Distributed Memory (DM), Distributed Bag of Words
(DBOW), Cosine Similarity (CS) and an Autoencoder for feature representation to retain
better class separability. The generated features enable classifiers to identify spam emails
in a lower-dimensional feature space. The use of the Autoencoder for feature reduction led
to improved classification performance. Furthermore, kernel functions
and the CS measure are compared to evaluate their impact on classifiers when
employed for feature transformation. The study further shows that stemming and the removal of more
frequent words, which have been regarded as noisy words, may
negatively affect the performance of the classifiers when word order is taken into consideration.
In addition, this study investigates the performance of DT and SVM classifiers
on publicly available datasets. It further investigates the selection
of the optimal kernel function and the optimization of kernel hyperparameters for each
feature representation. It is also investigated whether the use of a Stacked Autoencoder
as a pre-processing step for a multilayer perceptron (MLP) will lead to improved
classification results. / MT 2018
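
As a rough illustration of two ideas named in the abstract, the sketch below builds sparse Term-Frequency (TF) vectors and uses Cosine Similarity (CS) to map an email into a low-dimensional feature vector. It is a minimal sketch under stated assumptions, not the method used in the dissertation: the toy emails, the whitespace tokenizer, the helper names `term_frequencies` and `cosine_similarity`, and the choice of "similarity to one reference email per class" as the transformed features are all hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count lowercase, whitespace-separated tokens in one email (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse TF vectors stored as Counters."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy, made-up emails; these are NOT taken from the datasets used in the study.
spam_ref = term_frequencies("win free prize now claim free money")
ham_ref = term_frequencies("meeting agenda for the project review tomorrow")
email = term_frequencies("claim your free prize money now")

# Hypothetical feature transformation: the email is represented by its
# similarity to each reference email, giving a 2-dimensional feature vector
# regardless of how large the vocabulary grows.
features = [cosine_similarity(email, spam_ref), cosine_similarity(email, ham_ref)]
```

In this toy case the first feature is larger than the second, since the email shares several terms with the spam reference and none with the legitimate one. A classifier such as a DT or an SVM would then be trained on vectors like `features` instead of the full sparse TF space.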

Identifier: oai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:wits/oai:wiredspace.wits.ac.za:10539/24027
Date: January 2017
Creators: Diale, Melvin
Source Sets: South African National ETD Portal
Language: English
Detected Language: English
Type: Thesis
Format: Online resource (xi, 99 leaves), application/pdf
