Return to search

Filtrování spamových zpráv pomocí metod umělé inteligence / Email spam filtering using artificial intelligence

In the modern world, email communication defines itself as the most used technology for exchanging messages between users. It is based on three pillars which contribute to the popularity and stimulate its rapid growth. These pillars are represented by free availability, efficiency and intuitiveness during exchange of information. All of them constitute a significant advantage in the provision of communication services. On the other hand, the growing popularity of email technologies poses considerable security risks and transforms them into an universal tool for spreading unsolicited content. Potential attacks may be aimed at either a specific endpoints or whole computer infrastructures. Despite achieving high accuracy during spam filtering, traditional techniques do not often catch up to rapid growth and evolution of spam techniques. These approaches are affected by overfitting issues, converging into a poor local minimum, inefficiency in highdimensional data processing and have long-term maintainability issues. One of the main goals of this master's thesis is to develop and train deep neural networks using the latest machine learning techniques for successfully solving text-based spam classification problem belonging to the Natural Language Processing (NLP) domain. From a theoretical point of view, the master's thesis is focused on the e-mail communication area with an emphasis on spam filtering. Next parts of the thesis bring attention to the domain of machine learning and artificial neural networks, discuss principles of their operations and basic properties. The theoretical part also covers possible ways of applying described techniques to the area of text analysis and solving NLP. One of the key aspects of the study lies in a detailed comparison of current machine learning methods, their specifics and accuracy when applied to spam filtering. At the beginning of the practical part, focus will be placed on the e-mail dataset processing. This phase was divided into five stages with the motivation of maintaining key features of the raw data and increasing the final quality of the dataset. The created dataset was used for training, testing and validation of types of the chosen deep neural networks. Selected models ULMFiT, BERT and XLNet have been successfully implemented. The master's thesis includes a description of the final data adaptation, neural networks learning process, their testing and validation. In the end of the work, the implemented models are compared using a confusion matrix and possible improvements and concise conclusion are also outlined.

Identiferoai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:413069
Date January 2020
CreatorsSafonov, Yehor
ContributorsUher, Václav, Kolařík, Martin
PublisherVysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií
Source SetsCzech ETDs
LanguageCzech
Detected LanguageEnglish
Typeinfo:eu-repo/semantics/masterThesis
Rightsinfo:eu-repo/semantics/restrictedAccess

Page generated in 0.0022 seconds