Comparison Email Spam detection vectorizing using bag of word, TFIDF and Word2Vec in Multinomial Naïve Bayes

Rony Arifiandy; Hasanul Fahmi

Authors

Rony Arifiandy President University
Hasanul Fahmi President University, Indonesia

Abstract

Email has become one of the cheapest, fastest, and most popular means of communication, and it is now an official communication medium in business. Its popularity is also exploited by irresponsible people as a medium for sending fake news, committing fraud, and so on. We call this kind of email spam. Spam email can be dangerous or not dangerous. We focus on detecting dangerous spam email, of which there are two types: phishing and fraud. Phishing is a term used to define fraudulent practices in which spammers try to trick victims, which can be detrimental to the person who receives these emails. Such email may also be delivered in bulk, which is very disturbing for email users. This research seeks a better text preprocessing technique to support the Multinomial Naïve Bayes algorithm with three classes (ham, phishing, and fraud) for classifying email, in the hope of helping users classify spam emails more accurately. To do that, the email body must be vectorized during preprocessing so that a machine learning algorithm can perform calculations on it. Vectorization enables machines to understand textual content by converting it into meaningful numerical representations. The effectiveness of three text vectorization methods, namely bag of words, TF-IDF, and Word2Vec, is investigated for email spam detection using Multinomial Naïve Bayes. The paper presents a comparative analysis of these vectorization methods on a spam email dataset and identifies the best-performing vectorization method for Multinomial Naïve Bayes.
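
To make the idea of vectorization concrete, the following is a minimal sketch of how an email body could be turned into bag-of-words counts and TF-IDF weights. It is not the implementation used in this paper: the three toy emails, the function names, the whitespace tokenizer, and the unsmoothed formula idf = log(N / df) are all assumptions chosen for illustration, and real libraries often use smoothed variants.

```python
import math
from collections import Counter

# Hypothetical toy emails, invented for illustration only.
emails = [
    "verify your account now",
    "meeting agenda for monday",
    "claim your prize now",
]

def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()

# Vocabulary: every distinct token in the toy corpus, in sorted order.
vocab = sorted({tok for e in emails for tok in tokenize(e)})

def bag_of_words(text):
    # One raw count per vocabulary term.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocab]

def tfidf(text):
    # Term frequency times an unsmoothed inverse document frequency.
    n_docs = len(emails)
    vector = []
    for term, tf in zip(vocab, bag_of_words(text)):
        # df >= 1 because the vocabulary is built from the same corpus.
        df = sum(1 for e in emails if term in tokenize(e))
        vector.append(tf * math.log(n_docs / df))
    return vector

bow_matrix = [bag_of_words(e) for e in emails]
tfidf_matrix = [tfidf(e) for e in emails]
```

Each row of `bow_matrix` or `tfidf_matrix` is a numerical representation of one email that a classifier could take as input. In this sketch, a word that appears in two emails ("now") receives a lower TF-IDF weight than a word that appears in only one ("account"), which is the main difference from raw counts.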