
Email has become an important part of our daily activities, and you have most likely found some odd messages in your junk or spam folder. Most email clients, such as Gmail, use machine learning among other methods to detect spam. Once a message is detected as spam, it goes to the junk folder.
We are going to see how to build a spam email detector that leverages machine learning. We will train our model on a dataset that includes sample spam content. Using the scikit-learn library, we will apply a machine learning algorithm to categorize a given input as spam or genuine.
The Dataset
Data is a crucial part of any machine learning or artificial intelligence model. We will be working with the SMS Spam Collection Dataset, available on Kaggle: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
The SMS Spam Collection comprises a curated dataset of SMS messages, specifically gathered for research on SMS spam detection. Within this collection, there exists a corpus of 5,574 SMS messages in the English language. Each message in this dataset has been meticulously labeled as either ‘ham,’ signifying its legitimacy, or ‘spam,’ indicating its unsolicited or malicious nature.
Each row of the dataset holds a message labeled as either spam or ham. We will preprocess this dataset and feed it into our ML algorithm so it can learn the patterns of a spam message. When we then give the trained model a spam message that is not in the dataset, it should be able to predict it as spam.
This can be used in any application that includes postings or some kind of mailing component. It can help reduce fraudulent activity on communication sites and similar platforms.
Coupling the code
First, we install the necessary libraries:
!pip install numpy pandas scikit-learn nltk
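Next, we import the installed libraries and download the NLTK resources that the preprocessing step relies on:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download the tokenizer models and the stop-word list used below.
# Newer NLTK releases look for "punkt_tab" instead of "punkt", so both are requested.
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")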
Next, we load the labeled dataset (spam or ham) and preprocess the text. We drop the unused extra columns and rename the CSV headers v1 and v2 to label and text for better code readability:
# Load the CSV and keep only the label and message columns
df = pd.read_csv("spam.csv", encoding="latin-1")
df = df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
df = df.rename(columns={"v1": "label", "v2": "text"})
# Preprocess text data: lowercase, tokenize, remove stop words, re-join into a string
df["text"] = df["text"].str.lower()
df["text"] = df["text"].apply(word_tokenize)
stop_words = set(stopwords.words("english"))
df["text"] = df["text"].apply(lambda x: [word for word in x if word not in stop_words])
df["text"] = df["text"].apply(lambda x: " ".join(x))
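To see what the preprocessing steps do to a single message, here is a minimal standard-library sketch. It is only an approximation of the code above: the `preprocess` helper, the regular-expression tokenizer, and the tiny `STOP_WORDS` set are assumptions made for illustration, whereas the tutorial itself uses NLTK's `word_tokenize` and its full English stop-word list.

```python
import re

# Small illustrative stop-word set (an assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "you", "have", "to", "is", "it", "for"}


def preprocess(message):
    """Lowercase, split into word tokens, drop stop words, and re-join."""
    # Simple regex tokenizer standing in for nltk.word_tokenize
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    return " ".join(kept)


cleaned = preprocess("You have WON a FREE ticket to the finals")
print(cleaned)  # won free ticket finals
```

Only the words that carry some signal ("won", "free", "ticket", "finals") survive, which is the form the text has when it reaches the vectorizer.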
Next, we extract features using count vectorization, which converts the text data into numerical features. We then split the dataset into training and testing sets, train a classification model (here, Multinomial Naive Bayes), and finally evaluate the model.
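As a rough illustration of what count vectorization produces, the sketch below builds a word-count matrix by hand for three made-up messages. It is a simplified stand-in, not scikit-learn's implementation: the `messages` list and the `to_count_vector` helper are hypothetical, and tokenization here is a plain whitespace split.

```python
from collections import Counter

# Made-up messages, for illustration only (not taken from the dataset).
messages = ["free prize free entry", "call me later", "free call"]

# The vocabulary is every distinct word, sorted alphabetically.
vocabulary = sorted({word for message in messages for word in message.split()})


def to_count_vector(message):
    """Return how often each vocabulary word occurs in the message."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]


matrix = [to_count_vector(message) for message in messages]

print(vocabulary)  # ['call', 'entry', 'free', 'later', 'me', 'prize']
for row in matrix:
    print(row)
# [0, 1, 2, 0, 0, 1]
# [1, 0, 0, 1, 1, 0]
# [1, 0, 1, 0, 0, 0]
```

Each message becomes one row and each vocabulary word one column. The code that follows applies the same idea to the full dataset with scikit-learn's CountVectorizer, which returns the counts as a sparse matrix.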
# Feature extraction
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["text"])
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, df["label"], test_size=0.2, random_state=42)
# Train a classification model, such as Multinomial Naive Bayes
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
# Evaluate the model's performance using metrics like accuracy and classification report
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(report)
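When reading the classification report, it helps to know how the per-class numbers come about. The sketch below computes accuracy, precision, and recall for the spam class by hand; the `y_true` and `y_hat` lists are hypothetical labels invented for this example, not results from the model above.

```python
# Hypothetical true labels and predictions, for illustration only.
y_true = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
y_hat = ["spam", "ham", "ham", "ham", "ham", "spam", "spam", "ham"]

pairs = list(zip(y_true, y_hat))

# Count the outcomes that matter for the "spam" class.
tp = sum(1 for t, p in pairs if t == "spam" and p == "spam")  # spam caught
fp = sum(1 for t, p in pairs if t == "ham" and p == "spam")   # ham flagged as spam
fn = sum(1 for t, p in pairs if t == "spam" and p == "ham")   # spam missed

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught

print("Accuracy:", accuracy)
print("Spam precision:", round(precision, 3))
print("Spam recall:", round(recall, 3))
```

Precision and recall are worth checking alongside accuracy because ham messages heavily outnumber spam in this dataset, so a model can score a high accuracy while still missing a noticeable share of spam.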
Next, we save the trained model so we can use it in our various applications:
import joblib
joblib.dump(classifier, 'email_spam_model.pkl')
We can now load the saved model as loaded_model and use it to make predictions on new email data:
loaded_model = joblib.load('email_spam_model.pkl')
new_email = ["Congratulations! You've won a prize. Claim it now."]
# Reuse the vectorizer fitted earlier so the features match the training data
new_email = vectorizer.transform(new_email)
prediction = loaded_model.predict(new_email)
if prediction[0] == "spam":
    print("This email is spam.")
else:
    print("This email is not spam.")
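The prediction step above depends on the fitted vectorizer still being in memory, because the saved file holds only the classifier. One possible way around this, sketched below, is to bundle the vectorizer and the classifier into a single scikit-learn pipeline and save that instead. This is an alternative to the approach in this tutorial, not part of it: the four training messages are made up, and the file name `spam_pipeline.pkl` is a placeholder.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only (not from the SMS dataset).
texts = [
    "win a free prize now",
    "claim your free cash prize",
    "are we still meeting for lunch",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes raw text and then classifies it in one object.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

# Save and reload the whole pipeline (placeholder file name in a temp folder).
path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.pkl")
joblib.dump(pipeline, path)
restored = joblib.load(path)

# The restored pipeline accepts raw text directly; no separate vectorizer is needed.
prediction = restored.predict(["free prize waiting, claim now"])
print(prediction[0])
```

With this layout, an application only needs to load one file and can pass raw message text straight to `predict`.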
In conclusion, email spam detection serves as a critical shield against the ever-persistent threat of unwanted and potentially harmful messages infiltrating our inboxes. By harnessing the power of advanced algorithms and machine learning techniques, we can efficiently sift through the deluge of emails, ensuring that only legitimate and meaningful communications reach their intended recipients.
Google Colab: https://colab.research.google.com/drive/1Dinu_7XXHJlWFlhi67EDBedVI1r5_Zbu?usp=sharing