Classification of spam with the LSNB method

Thesis title:	Klasifikace spamu pomocí metody LSNB
Author:	Mareš, Jiří
Thesis type:	Bachelor thesis
Supervisor:	Kliegr, Tomáš
Opponents:	Berka, Petr
Thesis language:	Czech
Abstract:	The problem of spam is becoming ever more pressing as the growth of the internet shows no sign of slowing. The aim of this bachelor thesis is to implement the newly proposed Loosely symmetric naive Bayes algorithm, which uses cognitive biases to classify spam more accurately and reliably from small and imbalanced datasets. Because the authors of the algorithm found a mismatch between the data used for training and the data the classifier works with in practical applications, LSNB is their attempt to create a model that can reliably resolve this mismatch. The implementation uses the Python programming language and builds on its scikit-learn library. The theoretical part contains an introduction to the problem of spam and describes general machine learning methods as well as the specific algorithms used in the practical part, in particular the naive Bayes classifier. It then presents the theoretical LSNB model itself, which uses cognitive bias to imitate the human ability to learn. It also covers data preprocessing methods, the most important Python libraries used, and finally the metrics by which the individual classifiers are compared in the practical part. The practical part describes the implementation in detail, from data preprocessing, through training the classifier on training data, to the classification of test data. At the end, the performance of 6 selected classifiers on 6 datasets of differing bias and size is shown according to the metrics described in the theoretical part. The eLSNB classifier implemented in this thesis achieved the best results in comparison with the others and is suitable for further testing.
Keywords:	machine learning; cognitive bias; LSNB; classifier; spam; Bayes

Thesis title:	Classification of spam with the LSNB method
Author:	Mareš, Jiří
Thesis type:	Bachelor thesis
Supervisor:	Kliegr, Tomáš
Opponents:	Berka, Petr
Thesis language:	Czech
Abstract:	Spam is a growing concern as the internet continues to expand. The aim of this bachelor thesis is to implement an algorithm based on the Loosely symmetric naïve Bayes (LSNB) method for classifying spam from small and biased datasets. The authors of the algorithm found a discrepancy between the data used to train a classifier and the data encountered in practical applications; the LSNB model was created to handle this discrepancy satisfactorily. The implementation is written in Python and builds on the scikit-learn library. The theoretical part introduces the problem of spam and general machine learning methods, then describes some specific methods in detail, in particular the naïve Bayes classifier. It then describes the theoretical LSNB model, which attempts to use cognitive bias to reproduce human-level concept learning. Data pre-processing techniques, the basics of the Python language, and some of its most important libraries are also introduced. Finally, the metrics for classifier evaluation are presented. The practical part describes the implementation in detail, from data pre-processing, through fitting the model on training data, to classifying test data. It closes with an evaluation of 6 classifiers on 6 datasets of various sizes and biases, using the metrics described in the theoretical part. The eLSNB classifier, implemented as part of this thesis, performed best among the compared classifiers and is deemed suitable for further testing.
Keywords:	Bayes; classifier; cognitive bias; LSNB; machine learning; spam
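
The abstract outlines a pipeline of pre-processing, fitting on training data, classifying test data, and scoring with evaluation metrics. The sketch below illustrates that flow with a plain multinomial naïve Bayes baseline only; it is not the LSNB or eLSNB model and does not use scikit-learn, which the thesis builds on. The class name `TinyMultinomialNB`, the whitespace tokenizer, the toy messages, and the choice of precision and recall as metrics are all assumptions made for illustration, not details taken from the thesis.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude pre-processing step)."""
    return text.lower().split()


class TinyMultinomialNB:
    """Hypothetical baseline: multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)          # documents per class
        self.total_docs = len(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, text):
        """Return log P(class) + sum of log P(word | class) for each class."""
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / self.total_docs)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen during training
                count = self.word_counts[c][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for the positive class; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy training set (invented for this example), small and imbalanced.
train_texts = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch on monday with the team",
    "project report attached",
]
train_labels = ["spam", "spam", "ham", "ham", "ham"]

model = TinyMultinomialNB().fit(train_texts, train_labels)

test_texts = ["claim your free money", "monday project meeting"]
test_labels = ["spam", "ham"]
predictions = [model.predict(t) for t in test_texts]
precision, recall = precision_recall(test_labels, predictions)
```

On real data, the same three stages would typically be handled by a vectorizer, an estimator, and a metrics routine from a library; the hand-written version is used here only to keep each step visible.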

Information about study

Study programme:	Applied Informatics/Applied Informatics
Type of study programme:	Bachelor's study programme
Assigned degree:	Bc.
Institutions assigning academic degree:	Vysoká škola ekonomická v Praze
Faculty:	Faculty of Informatics and Statistics
Department:	Department of Information and Knowledge Engineering

Information on submission and defense

Date of assignment:	10. 10. 2019
Date of submission:	11. 5. 2020
Date of defense:	15. 6. 2020
Identifier in the InSIS system:	https://insis.vse.cz/zp/71213/podrobnosti

Files for download

Main text
71213_marj39.pdf, 1.9 MB

Opponent's review
66588_berka.pdf, 66 kB

Supervisor's review
71213_klit01.pdf, 63.1 kB