![](https://crypto4nerd.com/wp-content/uploads/2023/07/1x0rt2efyQaeWbEr5zO7Tvw.png)
In line with the growth of information technology, exchanging information and data by email is a key part of a company's internal operations. To support compliance with personal data protection law and to keep our assets safe, we need a strategy to maintain email security. According to the Verizon data breach investigation, a breach costs $3.93 million on average, and these numbers vary by company size.
We all know there are many ways to prevent cyberthreat activity: hardware, software, and people. We cannot rely on hardware and software alone, because the growth of cyberthreats is massive too; the point the writer wants to stand out is the people. The main objective is to raise awareness among people by building a system to detect phishing and spam email that can be accessed easily, reducing the potential cost of losses from cyberthreats.
Even a big company with more than 500 employees and many security layers such as "Advanced Threat Protection" cannot close its eyes to the fact that some phishing emails will still reach the mailbox. If an employee reports to IT Support or the IT Security Office, we can assume a reply within 5–7 minutes to classify the email as SPAM or HAM. How do we minimize the risk during that waiting time after an employee receives an email that needs to be classified as SPAM or HAM? In the writer's opinion, the answer is to implement a machine learning model that classifies email as SPAM or HAM.
Benefits
What are the benefits of using machine learning to detect SPAM or HAM?
- Adaptability: the model can adapt and learn new patterns and variations in spam email if we feed it data continuously.
- Feature extraction: it extracts the important features that define whether an email is SPAM or HAM.
- Improved accuracy: with a large dataset, accuracy can increase while the model also learns new patterns.
In the next chapters the writer explains the machine learning workflow end to end.
Building the Model
This chapter is more technical, and the writer also provides the link to the GitHub repository of this project: LINK.
The dataset has 5,572 emails, with a proportion of 4,825 HAM and 747 SPAM. Because the data is cleaner than "real" data in industry, the writer did not do as much EDA as usual.
For feature engineering we decided to use CountVectorizer (CV). CountVectorizer is a feature extraction technique that converts text data into a matrix representation based on word frequencies, enabling machine learning algorithms to process and analyze text data.
Here’s a step-by-step breakdown of how Count Vectorizer (CV) works:
- Tokenization: The text is first divided into individual words or tokens. It removes punctuation, converts text to lowercase, and applies other preprocessing techniques.
- Vocabulary creation: Count Vectorizer builds a vocabulary of unique words or terms from the tokenized text. Each word becomes a feature, and its index in the vocabulary represents the column in the resulting matrix.
- Counting word frequencies: The algorithm counts the occurrence of each word in each document. It creates a matrix where the rows correspond to documents, and the columns represent the words from the vocabulary. The values in the matrix denote the frequency of each word in each document.
- Vectorization: The matrix is transformed into a numerical representation suitable for machine learning algorithms. This transformation typically involves converting the word frequencies to numerical values, such as binary indicators (presence or absence of a word) or raw term frequencies.
After applying CV, we fit the result to the model we agreed to use; in this case the writer uses Multinomial Naive Bayes. The first step is to separate the data into train and test sets.
After that we instantiate the model and fit it.
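A minimal sketch of this split-and-fit step, using a tiny invented labeled corpus in place of the real 5,572-email dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy messages and labels (1 = spam, 0 = ham), invented for illustration
texts = [
    "Win a free prize now", "Claim your free discount today",
    "Cheap offer, limited time", "Free entry in a prize draw",
    "See you at the meeting tomorrow", "Can we reschedule lunch?",
    "Here are the notes from class", "Call me when you get home",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

cv = CountVectorizer()
X = cv.fit_transform(texts)

# Hold out part of the data so the model is evaluated on unseen messages
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42
)

model = MultinomialNB()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out split
```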
After that, to see how good our model is on the test data, we build a confusion matrix. Out of 1,115 test samples, we got 974 True Positives, 17 False Positives, 9 False Negatives, and 115 True Negatives.
In the writer's opinion the model has a good understanding of the data, but to make sure we continue to the next step.
In this part we want to evaluate the model's predictions by precision, recall, and F1-score, so we compute these metrics.
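A sketch of computing these metrics with scikit-learn, using invented true and predicted labels rather than the article's actual test split:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels for illustration (1 = spam, 0 = ham)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # layout: [[TN, FP], [FN, TP]]

# Precision, recall, and F1-score per class
print(classification_report(y_true, y_pred, target_names=["Ham", "Spam"]))
```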
After getting these scores, the writer is satisfied with the model, and we want to advance to the next step.
To make the model deployable on a server we should dump it to a file; in this case the writer uses pickle.
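A sketch of the dump-and-reload step. For self-containment it trains a tiny stand-in model inline; the filename `model.pkl` and the dict layout are choices for this example, not necessarily what the repository uses:

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in model so the example runs on its own
texts = ["free prize now", "claim free discount", "meeting tomorrow", "see you at lunch"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
cv = CountVectorizer()
model = MultinomialNB().fit(cv.fit_transform(texts), labels)

# Dump both the vectorizer and the model; the server needs both at inference time
with open("model.pkl", "wb") as f:
    pickle.dump({"vectorizer": cv, "model": model}, f)

# Later (e.g. inside main.py) the artifacts are loaded back the same way
with open("model.pkl", "rb") as f:
    artifacts = pickle.load(f)
pred = artifacts["model"].predict(artifacts["vectorizer"].transform(["free prize"]))
print(pred)
```

Pickling the vectorizer together with the model avoids a common mistake: refitting a new CountVectorizer at serving time, which would produce a different vocabulary than the one the model was trained on.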
To make this deployment easier to follow, the writer uses some applications: GitHub Desktop, Postman, and CMD (terminal). Readers are advised to install these applications if they want to stay in line with the context.
Create a repository on your GitHub to run this project, then go to the terminal and activate a virtual environment (for example `python -m venv .venv`, then `source .venv/bin/activate` on Unix or `.venv\Scripts\activate` on Windows).
You can check that it worked by the `(.venv)` prefix that appears before your repository path.
For this project there are some packages that need to be installed in our environment: FastAPI, scikit-learn, and pytest.
FastAPI is a package for building APIs with Python.
scikit-learn is a comprehensive machine learning library in Python.
pytest is a testing framework for Python that simplifies writing and executing tests.
We all know that the libraries we installed have dependencies and prerequisites of their own, so how do we maintain the requirements document? In Python we use:
pip freeze > requirements.txt
to automatically capture the requirements our environment needs.
FROM python:3.9-slim
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./
RUN pip install -r requirements.txt
CMD ["python", "main.py"]
These are the usual instructions when building a Docker image:
1. FROM: use the Python 3.9 slim base image.
2. ENV: define an environment variable for the app directory.
3. WORKDIR: set the directory where the program executes.
4. COPY: copy the project files into the Docker container.
5. RUN: command executed during the build process (here, installing the requirements).
6. CMD: command executed when the container starts.
The main objective of this section is to check our expectations of the source code we dumped and posted in main.py, and to unit test it using pytest.
There are 3 main unit test scenarios:
1. Prediction of SPAM
def test_predict_spam(client):
    input_data = {"text": "This is a promotion email to get discount"}
    response = client.post("/predict", json=input_data)
    assert response.status_code == 200
    assert response.json() == {"prediction": "Spam"}
2. Prediction of HAM
def test_predict_ham(client):
    input_data = {"text": "This is a test email"}
    response = client.post("/predict", json=input_data)
    assert response.status_code == 200
    assert response.json() == {"prediction": "Ham"}
3. Return an error message if the input is empty
def test_predict_empty_input(client):
    input_data = {"text": ""}
    response = client.post("/predict", json=input_data)
    assert response.status_code == 400
    assert response.json() == {"detail": "Empty input text"}
And this is the evidence that the writer's code passed the unit tests against those criteria.
For deployment on an instance, the writer uses Google Cloud Run to tackle this requirement. The writer will explain the setup step by step, but first the reader should have a Google account and go to console.cloud.google.com.
After that a page like this appears; click the hamburger menu on the left -> choose Cloud Run -> Create Service:
After that we choose continuous deployment and click "SET UP CLOUD BUILD".
Go to your GitHub account to link it and choose the right repository.
For the build configuration we use the Dockerfile -> Save.
After that we configure the instance region.
For CPU allocation we use this setting.
And for capacity we use 4 GB RAM and 2 CPUs.
On the security side we keep the defaults.
Wait a moment and the service will be up like this:
For security reasons the writer does not share the URL; the deployment is now finished.
After finishing the deployment settings, we need to check the output of our text-prediction API, and for this step the writer uses the Postman application to help with testing.
1. First case: we check SPAM detection with this value.
2. Second case: we check HAM detection with this value.
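The same check can be done from Python instead of Postman. The service URL below is a placeholder (the real one is not shared), and the example message is invented; the snippet only builds the JSON body that Postman would send:

```python
import json

# Placeholder: substitute your own Cloud Run service URL
url = "https://<your-cloud-run-service>.run.app/predict"

payload = {"text": "Congratulations! Claim your free prize now"}
body = json.dumps(payload)

# This JSON body is what Postman sends with Content-Type: application/json
print(body)

# With the `requests` package installed, the call itself would look like:
#   import requests
#   response = requests.post(url, json=payload)
#   print(response.json())
```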
In this project, we successfully developed a machine learning model for spam detection using Naive Bayes in FastAPI. However, there is still room for improvement. In the future, we plan to explore more advanced techniques such as deep learning models and ensemble methods to further enhance the accuracy and robustness of our spam detection system.
The writer has also attached the GitHub repo for reference in this link.
Thank you for reading this article; if you found insight in it, please share it with everyone who needs it.