Spam Email Classification

Rupali
6 min read · Dec 20, 2020


Electronic mail (email) has become an important medium for many kinds of group communication and is widely used by individuals and organizations alike. At the same time, email carries one of the fastest-growing and most costly problems on the internet today: spam. Spam emails are predominantly commercial, or contain attractive links to famous websites that in fact lead to intrusive sites.

Authors: Sudiksha Aapan (MT19018), Rupali (MT19095), IIIT Delhi

Email Classification System

Introduction:

Spam emails erode privacy, spread viruses, occupy space in the mailbox, and overload email servers. As a result, users waste a lot of time filtering incoming mail and deleting unwanted messages. Detecting undesired email amounts to categorizing each message as spam or non-spam (ham), so the task maps directly to a classification problem.

In this work, we build a model to classify an email as spam or ham. We used the ENRON email dataset, which consists of ham and spam emails. We applied various classification models such as Multinomial Naive Bayes, Decision Tree, AdaBoost, KNN, and Random Forest. We also built embeddings such as the count vectorizer, TF-IDF, and the hash vectorizer. The auto-classification system that labels an email as spam/ham is built using Flask.

Data Information

For this project, the dataset chosen is the ENRON email dataset. We combined all the publicly available Enron datasets, enron1 through enron5 (see the loading sketch after this list). The dataset is divided into spam and ham folders. It consists of 17,171 spam emails and 16,545 ham emails, so the dataset is almost balanced. Both the spam and ham subsets have two attributes with no missing values:

  • Email: the raw text of the email.
  • Target class: 0 for ham and 1 for spam.
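A minimal loading sketch is below. The exact folder layout, file encoding, and the column names `email` and `target` are assumptions for illustration based on the description above.

```python
# Assumed layout: enron1..enron5 each contain "ham" and "spam"
# subfolders of plain-text emails (encoding assumed latin-1).
import os
import pandas as pd

rows = []
for i in range(1, 6):
    for label, target in [("ham", 0), ("spam", 1)]:
        folder = os.path.join(f"enron{i}", label)
        for fname in os.listdir(folder):
            with open(os.path.join(folder, fname), encoding="latin-1") as fh:
                rows.append({"email": fh.read(), "target": target})

df = pd.DataFrame(rows)
print(df["target"].value_counts())  # roughly 17,171 spam and 16,545 ham
```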

Pre-Processing of Data

The preprocessing of the data included removing punctuation, converting to lowercase, lemmatization, and removing stop words. All of this is done using the NLTK library.

Preprocessing Steps Taken
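A minimal sketch of these steps with NLTK; the `preprocess` helper is illustrative, and the exact pipeline we used may differ in detail.

```python
# Requires: nltk.download("stopwords"); nltk.download("wordnet")
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = [lemmatizer.lemmatize(t) for t in text.split()
              if t not in stop_words]                                 # stop words + lemmatization
    return " ".join(tokens)

print(preprocess("Subject: Claim your FREE prizes now!!!"))  # -> "subject claim free prize"
```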

Data Analysis:

Word Cloud for the spam and ham datasets: we found that “Subject” is the most common word in spam emails and “ect” is the most prominent word in ham emails.
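The post does not name the plotting library; a common choice (an assumption here) is the third-party `wordcloud` package, using the DataFrame from the loading sketch:

```python
# Word cloud over the spam subset; df["email"] / df["target"] assumed from above.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

spam_text = " ".join(df.loc[df["target"] == 1, "email"])
wc = WordCloud(width=800, height=400, background_color="white").generate(spam_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```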

Unigrams, Bigrams, and Trigrams: from this visualization over the complete dataset we found that the most frequent unigram is “enron”, the most frequent bigram is “hou ect”, and the most frequent trigram is “hou ect ect”.
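One way (assumed, not stated in the post) to extract these frequencies is scikit-learn's CountVectorizer; `emails` is the list of preprocessed email strings.

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(emails, n, k=10):
    vec = CountVectorizer(ngram_range=(n, n))
    counts = vec.fit_transform(emails).sum(axis=0).A1  # total count per n-gram
    return sorted(zip(vec.get_feature_names_out(), counts),
                  key=lambda pair: -pair[1])[:k]

print(top_ngrams(emails, 1))  # e.g. "enron" on top
print(top_ngrams(emails, 2))  # e.g. "hou ect"
```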

Word Encodings:

  1. TF-IDF: The TF-IDF vectorizer converts a collection of raw documents into a matrix of TF-IDF features. The TF-IDF score of a term is the product of its TF and IDF values. Term frequency (TF) of a term t is the number of times t appears in a document. Document frequency (DF) of t is the number of documents that contain t; inverse document frequency (IDF) is its inverse, typically computed as log(N/DF) over a corpus of N documents.
  2. Count Vectorizer: Converts the collection of text documents to a vector of token counts. It uses only term frequency to represent a token.
  3. Hash Vectorizer: The Hashing Vectorizer applies a hashing function to term frequency counts in each document. This one is designed to be as memory efficient as possible. Instead of storing the tokens as strings, the vectorizer applies the hashing trick to encode them as numerical indexes.
  4. BERT, RoBERTa, and XLM-RoBERTa: This framework provides methods to compute dense vector representations for sentences and paragraphs (also known as sentence embeddings). The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa. We computed the embedding of each email with each of the three models and concatenated them into a final feature vector of length 2,816. (A code sketch of these encodings follows the list.)
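A sketch of the four encodings. The scikit-learn calls are standard; the sentence-transformers model names below are illustrative placeholders (the post does not list the exact checkpoints), and the per-model embedding sizes determine the final concatenated length (2,816 in our case).

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)

X_tfidf = TfidfVectorizer().fit_transform(emails)                    # TF * IDF weights
X_count = CountVectorizer().fit_transform(emails)                    # raw term counts
X_hash  = HashingVectorizer(n_features=2**18).fit_transform(emails)  # hashed indexes

# Transformer embeddings via the sentence-transformers library;
# checkpoint names are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

models = ["bert-base-nli-mean-tokens",
          "roberta-base-nli-mean-tokens",
          "xlm-r-base-en-ko-nli-ststb"]
X_bert = np.hstack([SentenceTransformer(m).encode(emails) for m in models])
```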

Machine Learning Models:

Multinomial Naive Bayes: This classifier scales to huge datasets while still giving good results. It predicts labels on the basis of conditional probability, applying Bayes’ theorem.
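For example, with scikit-learn (the feature matrix and split parameters here are placeholders, not our exact setup):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, df["target"], test_size=0.2, random_state=42)

nb = MultinomialNB().fit(X_train, y_train)
print(f1_score(y_test, nb.predict(X_test)))
```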

Decision Tree Classifier: Works through a series of test questions and conditions arranged in a tree structure. It splits the data on a criterion at each level until it can classify every sample, producing an explainable classification system.

AdaBoost Classifier: One of the ensemble boosting classifiers. AdaBoost builds a strong classifier by combining multiple weak classifiers, so the resulting accuracy is high.

KNN Classifier: Stores all the available data and classifies a new data point by the majority vote of its neighbors, assigning it to the class most common among its K nearest neighbors as measured by a distance function.

Random Forest Classifier: The “forest” it builds is an ensemble of decision trees, usually trained with the bagging method. Combining many such learners increases the overall accuracy of the model.

Support Vector Machine: The objective of the support vector machine algorithm is to find a hyperplane in N-dimensional space (where N is the number of features) that distinctly separates the data points of the two classes.

Stacking Classifier: An ensemble-based machine learning model. We stacked different classifiers: Multinomial Naive Bayes, Extra Trees, Random Forest, Logistic Regression, and SVM. Before stacking, each model was tuned over various parameters.
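A minimal sketch of this stack with scikit-learn's StackingClassifier; the hyperparameters shown are placeholders, not the tuned values.

```python
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

stack = StackingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("et", ExtraTreesClassifier(n_estimators=100)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", LinearSVC()),
    ],
    # final_estimator defaults to a logistic regression over the base predictions
)
stack.fit(X_train, y_train)
```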

The F1 scores achieved on the test data using the different machine learning models are shown in the tables below.

F1 Score with First Embedding
F1 Score with BERT, RoBERTa, and XLM-RoBERTa

User Interface

Libraries Used: Flask

We developed the auto-classification system using the Flask framework. Flask is a popular third-party Python web framework used for developing web applications. We chose Flask for its simplicity and minimality: there are very few parts of Flask that cannot be easily and safely altered.

A user can enter any email into the text field on the input page, and our trained classifier labels it as spam or ham. The output page shows each email followed by “is a spam email” / “is a ham email”. Predictions come from the stacking classifier trained over the TF-IDF vectorizer, since that combination performed best. (A minimal sketch of the app follows the screenshots below.)

User Interface Input page
User Interface Output Page
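A minimal sketch of the Flask flow described above; the route names, template fields, and the "model.pkl" filename are assumptions for illustration, and `model` is assumed to be a pipeline that bundles the TF-IDF vectorizer with the stacking classifier.

```python
# app.py
import pickle
from flask import Flask, request, render_template

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))  # TF-IDF + stacking classifier pipeline

@app.route("/")
def index():
    return render_template("input.html")      # page with a text field for the email

@app.route("/predict", methods=["POST"])
def predict():
    email = request.form["email"]
    label = model.predict([email])[0]          # pipeline vectorizes, then classifies
    verdict = "is a spam email" if label == 1 else "is a ham email"
    return render_template("output.html", email=email, verdict=verdict)

if __name__ == "__main__":
    app.run(debug=True)
```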

Contributions

Rupali: Data Visualization, Applying Machine Learning Models, Embedding Generation (BERT, RoBERTa, and XLM-RoBERTa), User Interface with Flask, Documentation.

Sudiksha Aapan: Data Extraction, Data Preprocessing, Applying Machine Learning Models, Embedding Generation (TF-IDF, Count, and Hash Vectorizer), User Interface with Flask, Documentation.

Acknowledgments:

Asst. Prof. Tanmoy Chakraborty https://www.iiitd.ac.in/tanmoy

Chhavi Jain (MTech IIITD) https://www.linkedin.com/in/chhavijain2212
