![](https://crypto4nerd.com/wp-content/uploads/2024/04/1wsDgUPktsZ1FrvJh-Gv_6Q.png)
Big tech companies work constantly to catch spam emails and messages before they reach you. Keeping customers happy and spam-free is a top priority for them. Apple’s iMessage and Google’s Gmail are very good at catching spam, so you rarely have to deal with it yourself. If you want to build a spam detection system, this article is for you. I’ll show you how to detect spam using Machine Learning and Python.
Motivation
Features
Import Libraries
Data Loading
Data Preprocessing
Data Splitting
Model Training and Testing
Results
References
Building a spam detection system is a worthwhile project in our modern digital landscape. Big tech companies prioritize customer satisfaction and work tirelessly to combat spam, with Apple’s iMessage and Google’s Gmail setting the benchmark for spam detection. Developing a similar system using Machine Learning and Python is both a technical exercise and a way to improve the online experience: it helps users avoid digital nuisances and fosters a safer, more enjoyable online environment.
Key features of the spam detection system project:
Data Collection and Preprocessing: Implement a robust data collection mechanism to gather a diverse dataset of both spam and legitimate messages. Preprocess the data to extract relevant features and prepare it for training.
Machine Learning Models: Utilize various machine learning algorithms such as Naive Bayes, Support Vector Machines (SVM), or Neural Networks to train models for spam detection. Experiment with different models to determine the most effective approach.
Initial Accuracy: Start from a baseline accuracy of approximately 90% with the existing classifier.
Accuracy Improvement Goal: Set a target accuracy above the 90% baseline; the Naive Bayes model trained in this article reaches about 97%.
Feature Engineering: Explore different feature engineering techniques to enhance the performance of the models. This may include TF-IDF (Term Frequency-Inverse Document Frequency), word embeddings, or other text representation methods.
Model Evaluation: Develop a comprehensive evaluation strategy to assess the performance of the trained models. Utilize metrics such as accuracy, precision, recall, and F1-score to measure the effectiveness of the spam detection system.
Real-time Detection: Design the system to perform real-time spam detection, allowing users to receive immediate protection against spam messages as they arrive.
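The TF-IDF weighting mentioned under Feature Engineering can be sketched in a few lines of plain Python. This is a minimal illustration of the textbook formula (term frequency times the log of inverse document frequency), not the method used later in this article, which relies on raw word counts. The function name `tf_idf` and the three toy messages are made up for the example. Scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document using textbook TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency (share of the message) times inverse document frequency.
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical toy messages, not taken from the dataset.
messages = [
    "win a free prize now",
    "are you free for lunch",
    "lunch at noon",
]
scores = tf_idf(messages)
print(scores[0])
```

In the first message, "prize" gets a higher weight than "free" because "free" also appears in a second message and is therefore less distinctive.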
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/SMS-Spam-Detection/master/spam.csv", encoding='latin-1')
data.head()
Feature Selection
So, basically all we need from this dataset to train our spam detection model are the class and message columns. Let’s just grab those two and make it our new dataset.
data = data[["class", "message"]]
Input and Output Feature Selection
x = np.array(data["message"])
y = np.array(data["class"])
Feature Extraction
To use text data for prediction, you first split it into individual words (tokens), which is called tokenization, and optionally drop uninformative words. Then you turn those tokens into numbers, either integers or floating-point values, so a machine learning model can use them. This whole process is known as feature extraction (or vectorization).
CountVectorizer from scikit-learn turns a collection of text documents into a matrix of word counts, with one row per message and one column per word in the vocabulary. It can also clean up the text, for example by lowercasing it, before counting. It’s a handy tool for working with text data.
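To make the counting idea concrete, here is a small plain-Python sketch of a bag-of-words matrix. It only illustrates the kind of output a count vectorizer produces and is not scikit-learn's implementation. The helper name `bag_of_words` and the two sample messages are invented for the example, and it splits on whitespace only, whereas CountVectorizer's default tokenizer also strips punctuation and ignores single-character tokens.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word, sorted so column order is stable.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical sample messages.
vocabulary, matrix = bag_of_words(["free prize free entry", "call me later"])
print(vocabulary)  # ['call', 'entry', 'free', 'later', 'me', 'prize']
print(matrix)      # [[0, 1, 2, 0, 0, 1], [1, 0, 0, 1, 1, 0]]
```

Each row is one message and each column is one vocabulary word, so "free" appearing twice in the first message shows up as a 2 in that row.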
cv = CountVectorizer()
X = cv.fit_transform(x)  # learn the vocabulary and encode each message as word counts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf = MultinomialNB()
clf.fit(X_train,y_train)
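For readers curious about what a multinomial Naive Bayes classifier does with word counts, the sketch below reimplements the core idea in plain Python: per-class priors plus per-word likelihoods with additive (Laplace) smoothing, combined in log space. It is a simplified illustration under stated assumptions, not scikit-learn's code. The function names, the toy messages, and the "spam"/"ham" labels are all chosen for the example. The default `alpha=1.0` mirrors the default smoothing value of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(messages, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for message, label in zip(messages, labels):
        word_counts[label].update(message.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    model = {}
    for label, n_label in class_counts.items():
        total_words = sum(word_counts[label].values())
        # Additive smoothing keeps unseen (word, class) pairs from getting zero probability.
        denominator = total_words + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_label / len(labels)),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denominator)
                for word in vocabulary
            },
        }
    return model

def predict(model, message):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in message.lower().split():
            # Words never seen during training are skipped.
            if word in params["log_likelihood"]:
                score += params["log_likelihood"][word]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data.
toy_messages = [
    "win a free prize now",
    "free entry claim your prize",
    "are you coming to lunch",
    "see you at the meeting",
]
toy_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(toy_messages, toy_labels)
print(predict(model, "claim your free prize"))  # spam
print(predict(model, "see you at lunch"))       # ham
```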
Accuracy
y_pred_NB = clf.predict(X_test)
NB_Acc = clf.score(X_test, y_test)
print('Accuracy score= {:.4f}'.format(NB_Acc))
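Accuracy alone can look good even when a model misses many spam messages, because spam is usually the minority class. The Model Evaluation feature also lists precision, recall, and F1-score. The sketch below computes them by hand for one positive class. The function name `precision_recall_f1` and the short label lists are made up for illustration, and treating "spam" as the positive label is an assumption about how the class column is encoded.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels, not results from the dataset.
true_labels = ["spam", "ham", "ham", "spam", "ham", "spam"]
predicted_labels = ["spam", "ham", "spam", "ham", "ham", "spam"]
precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With the variables from this article, the same function could be called as `precision_recall_f1(y_test, y_pred_NB)`. Scikit-learn's `classification_report` in `sklearn.metrics` reports the same metrics for every class.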
Now let’s test this model by taking a user input as a message to detect whether it is spam or not:
sample = input('Enter a message:')
sample_vector = cv.transform([sample]).toarray()  # encode with the fitted vocabulary, without overwriting the data DataFrame
print(clf.predict(sample_vector))
The Naive Bayes model gave about 97% accuracy for spam prediction on the test set.
Here you can find the complete code of the project.