
Investigating unsupervised feature learning for email spam classification

Diale, Melvin January 2017
A dissertation submitted in partial fulfillment of the requirements for the degree Master of Science. School of Computer Science and Applied Mathematics, Faculty of Science, University of the Witwatersrand, Johannesburg. November 2017 / In cyberspace, spam emails are used to extract sensitive information from victims through social engineering. Various classification systems have been employed to identify spam emails. The primary objective of an email spam classification system is to classify incoming email as either legitimate (non-spam) or spam. The spam classification task can thus be regarded as a two-class classification problem. Such problems are commonly addressed with classifiers such as Decision Trees (DTs) and Support Vector Machines (SVMs), both of which have been shown to perform well on email spam classification tasks. However, several studies do not mention how these classifiers were optimized in terms of their hyperparameters. As a result, poor performance was encountered with complex datasets. This is because the SVM classifier depends on the selection of the kernel function and the optimization of the kernel hyperparameters. Additionally, many studies on spam email filtering use words and characters to compute a Term-Frequency (TF) based feature space. However, a TF-based feature space leads to a sparse representation because of continuous vocabulary growth, a problem linked to the curse of dimensionality. Overcoming dimensionality issues involves the use of feature reduction techniques. Traditional feature reduction techniques, for instance Information Gain (IG), may cause feature representations to lose features that are important for identifying spam emails. This study demonstrates the use of Distributed Memory (DM), Distributed Bag of Words (DBOW), Cosine Similarity (CS) and an Autoencoder for feature representation to retain better class separability.
The generated features enable classifiers to identify spam emails in a lower-dimensional feature space. The use of the Autoencoder for feature reduction led to improved classification performance. Furthermore, kernel functions and the CS measure are compared to evaluate their impact on classifiers when employed for feature transformation. The study further shows that removing the more frequent words, which have been regarded as noisy words, and applying stemming may negatively affect the performance of the classifiers when word order is taken into consideration. In addition, this study investigates the performance of DT and SVM classifiers on publicly available datasets, as well as the selection of the optimal kernel function and the optimization of kernel hyperparameters for each feature representation. Finally, it investigates whether using a Stacked Autoencoder as a pre-processing step for a multilayer perceptron (MLP) leads to improved classification results. / MT 2018
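
The sparsity argument and the cosine similarity measure mentioned in the abstract can be illustrated with a minimal sketch. This is not the dissertation's code: the toy corpus and the helper names (`tf_vectors`, `sparsity`, `cosine_similarity`) are assumptions made for illustration, and tokenization is plain whitespace splitting. It only shows that a TF representation gains one dimension per new vocabulary term, so most entries are zero, and how cosine similarity compares two such vectors.

```python
import math
from collections import Counter

# Hypothetical toy corpus for illustration; not data from the dissertation.
emails = [
    "win a free prize now",
    "free prize claim now",
    "meeting agenda for monday",
]


def tf_vectors(docs):
    """Return (vocabulary, term-frequency vectors) using whitespace tokens."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    # One dimension per vocabulary term; absent terms count as 0.
    return vocab, [[c[term] for term in vocab] for c in counts]


def sparsity(vectors):
    """Fraction of zero entries across all vectors."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocab, vectors = tf_vectors(emails)
print("vocabulary size:", len(vocab))
print("sparsity:", round(sparsity(vectors), 3))
print("similarity of emails 0 and 1:", round(cosine_similarity(vectors[0], vectors[1]), 3))
print("similarity of emails 0 and 2:", round(cosine_similarity(vectors[0], vectors[2]), 3))
```

Even with three short emails, more than half of the entries are zero, and every new word adds a dimension to every vector. The two emails that share vocabulary have a high cosine similarity, while the pair with no shared words has a similarity of zero.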
