![](https://crypto4nerd.com/wp-content/uploads/2023/06/15CAsx2l3SgTrhUJurwO5gQ-1024x543.png)
There is an ongoing debate about the relative complexity of machine learning applications on numerical data versus textual data. I feel more comfortable working with numerical data, whereas I had to revisit some topics while working on textual data.
In this article, we will explore the Passive Aggressive Classifier from the scikit-learn library and see how well it performs using some performance metrics. The dataset used is called “Fake News” and can be retrieved from here.
Flask
Flask is a web framework written in Python that allows developers to build web applications quickly and efficiently. It is known for its simplicity and flexibility, making it a popular choice for developing web applications of various sizes and complexities. With Flask, you can handle routing, request handling, and template rendering, enabling you to focus on building the core functionality of your application. It provides a solid foundation for building dynamic and interactive web applications, making it an excellent choice for implementing the back-end of our Fake News Classifier project.
Jupyter Notebook
Jupyter Notebook is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and explanatory text. It provides an interactive environment for data analysis, experimentation, and collaboration, making it a popular tool among data scientists and researchers.
Github
GitHub is a web-based platform for version control and collaboration, allowing developers to store, manage, and share their code repositories. It provides a centralized location for hosting projects, facilitating collaboration, and tracking changes made to codebases. GitHub also offers additional features such as issue tracking and pull requests, making it an essential tool for software development and open-source contributions.
Problem Statement
In today’s era of social media, where outlets race to be the first exclusive source of a story, audiences often come across news items or public statements so extreme that their authenticity becomes questionable.
Objective
Our main objective is to build a fake news classifier model and deploy it as a web app. The model should be able to classify news entered by the end user as real or fake, along with a percentage of certainty. The secondary objective is to learn the working mechanism of the Passive Aggressive Classifier.
👉 Task 1: Download and import the data set
The dataset can be downloaded from here. Let us import the dataset into our working environment and have a look at its structure.
import pandas as pd
df = pd.read_csv('news.csv')
df.head()
👉 Task 2: Let’s explore the dataset
Exploring a dataset starts with checking for null values, followed by inspecting its shape (number of rows and columns). Since this is a classification problem, the balance between the two categories should also be examined, as it plays a crucial role in training the model. The category counts show the number of real and fake news items in the dataset, and the numbers indicate it is a well-balanced dataset.
print("Shape of dataset:", df.shape)
print("\nAny null values present:\n", df.isnull().sum())
print("\nCategorical balance:\n", df['label'].value_counts())
👉 Task 3: Refining the dataset
Before applying any ML algorithm, let’s refine and prepare the dataset first. The first “Unnamed: 0” column is not necessary for the ML process, so we can drop it. Secondly, the values in the “label” column should be encoded into numeric form. Therefore, I used scikit-learn’s LabelEncoder class to do the job and produce a new column with numeric values.
df.drop(['Unnamed: 0'], axis=1, inplace=True)
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
df['Label'] = lb.fit_transform(df['label'])
👉 Task 4: Splitting the dataset
In machine learning projects, it is crucial to split the available data into separate training and testing sets. This division allows us to evaluate the model’s performance on unseen data and assess its generalization capabilities.
Employing the train_test_split function from the scikit-learn library, we divided the dataset into two subsets: X_train and X_test for the input features, and Y_train and Y_test for the corresponding output labels.
The purpose of splitting the data is to train the model on the training set and then evaluate its performance on the test set. By doing so, we can estimate how well the model is likely to perform when faced with new, unseen data.
The test_size parameter, set to 0.2 in this code, determines the proportion of the dataset allocated to the testing set. I have reserved 20% of the data for testing, while the remaining 80% will be used for training the model.
The random_state parameter, set to 0 in the code snippet, ensures reproducibility of the results. By using the same random_state value, we can obtain the same train-test split each time the code is executed, allowing for consistent evaluation and comparison of the model’s performance.
X=df.iloc[:,1]
Y=df.iloc[:,-1]
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
👉 Task 5: Using the TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfid = TfidfVectorizer(stop_words='english')
train_tfid = tfid.fit_transform(X_train)
test_tfid = tfid.transform(X_test)
The TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer is a powerful feature extraction technique widely used in natural language processing tasks. In this project, I employed the TF-IDF vectorizer from the scikit-learn library to transform textual data into numerical representations suitable for machine learning algorithms.
The TF-IDF vectorizer converts the text documents in the dataset into numerical vectors, representing the importance of each word in a document relative to the entire corpus. It takes into account both the frequency of a word in a document (term frequency) and its rarity across all documents (inverse document frequency).
In the above code snippet, I specified 'english' as the stop_words parameter of the TF-IDF vectorizer. This instructs the vectorizer to ignore common English words such as “a,” “the,” “is,” etc., which do not provide significant discriminatory power and can skew the results.
I applied the TF-IDF vectorizer’s fit_transform method to the training data (X_train), which creates the TF-IDF matrix representation of the text. This matrix captures the importance of each word in each document, enabling the machine learning algorithm to understand the text data in a numerical format.
Similarly, I used the transform method of the TF-IDF vectorizer to convert the test data (X_test) into TF-IDF matrix form. It is crucial to transform the test data using the same vectorizer as the training data to ensure consistency and compatibility during evaluation.
The TF-IDF vectorizer is essential for text-based machine learning tasks as it captures the unique characteristics of each document by assigning high importance to rare and informative words while downplaying common and less informative words. This allows the model to focus on relevant features and disregard noise, leading to improved performance and more accurate predictions.
👉 Task 6: Model Training
from sklearn.linear_model import PassiveAggressiveClassifier
ps_model = PassiveAggressiveClassifier(max_iter=50)
ps_model.fit(train_tfid,Y_train)
Now that the text data is in matrix form, we are ready to feed it to an algorithm to train the model. I used the PassiveAggressiveClassifier from the scikit-learn library to train the model on the transformed TF-IDF matrix, represented by the train_tfid variable. The classifier was instantiated with the max_iter parameter set to 50.
The Passive Aggressive classifier is based on the concept of online learning and employs an optimization algorithm to update its model parameters. The core idea is to make aggressive updates when misclassifications occur, while remaining passive and not updating the model if the predictions are correct.
Maths behind Passive Aggressive Algorithm
Let’s denote the training data as (X, y), where X represents the feature matrix (TF-IDF matrix in this case) and y represents the corresponding target labels. The Passive Aggressive algorithm aims to find a weight vector w that can accurately predict the class labels y given the input features X.
The objective function of the Passive Aggressive classifier is to minimize the hinge loss, which measures the margin between the predicted scores and the true labels. The hinge loss is defined as:
L(w) = max(0, 1 - y * (w^T * x))
where w^T denotes the transpose of the weight vector w, x represents a feature vector, and y is the corresponding true label (-1 or 1).
During training, the classifier updates the weight vector whenever the hinge loss is non-zero. In that case, an aggressive update adjusts the weight vector towards the correct label. The update rule is:
w_new = w_old + tau * y * x
where w_new and w_old represent the updated and previous weight vectors, x is the feature vector of the misclassified instance, and tau = loss / ||x||^2 is a step size chosen just large enough for the instance to be classified correctly with a margin after the update.
However, to ensure the model remains passive and doesn’t overfit to noisy data, an aggressiveness parameter C is introduced to cap the step size. If the uncapped step would exceed C, the update is scaled down to prevent overfitting. The capped update rule (known as the PA-I variant) is:
w_new = w_old + min(C, loss / ||x||^2) * y * x
This cap acts as a regularization term in the optimization process, helping to balance aggressive updates against model stability.
The algorithm iterates through the training instances, performing updates for each misclassification until convergence or a predefined number of iterations.
By iteratively updating the weight vector based on misclassifications while staying passive when predictions are correct, the Passive Aggressive classifier can adapt to changing data patterns and achieve good performance in real-time or dynamic environments.
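To make the update rule concrete, here is a minimal NumPy sketch of a single PA-I update step. It is a toy illustration of the math above, not scikit-learn’s internal implementation:

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) update step.

    w : current weight vector
    x : feature vector of the instance
    y : true label in {-1, +1}
    C : aggressiveness (regularization) parameter
    """
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss
    if loss == 0.0:
        return w                              # passive: correct with margin, no change
    tau = min(C, loss / np.dot(x, x))         # step size, capped by C
    return w + tau * y * x                    # aggressive: move towards correct label

# Toy example: one misclassified instance pulls w toward the correct label
w = np.zeros(3)
x = np.array([1.0, 0.0, 2.0])
w = pa_update(w, x, y=1)                      # w is now [0.2, 0.0, 0.4]
```

After this single update the instance sits exactly on the margin (w^T * x = 1), so a second pass over the same instance leaves the weights untouched, which is the “passive” half of the algorithm.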
👉 Task 7: Predictions & Testing
ps_predictions = ps_model.predict(test_tfid)
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
print("Accuracy score: ",accuracy_score(Y_test,ps_predictions))
print("Confusion Matrix:\n", confusion_matrix(Y_test,ps_predictions))
print("f1 score: ",f1_score(Y_test, ps_predictions, pos_label='FAKE'))
The accuracy score measures the proportion of correctly classified instances. In our case, the model achieved an impressive accuracy score of 93.37%. This indicates that the classifier accurately classified 93.37% of the test instances, highlighting the effectiveness of our model in distinguishing between fake and real news.
The confusion matrix provides a detailed breakdown of the classification results. It consists of four elements: true positives (570), true negatives (613), false positives (45), and false negatives (39). The matrix shows the number of instances correctly and incorrectly classified for each class. By examining the confusion matrix, we can gain insights into the types of errors made by our classifier. In this case, the model made 45 false positive errors and 39 false negative errors.
The F1 score combines precision and recall into a single metric, providing a balanced assessment of the model’s performance. Our model achieved an F1 score of 0.9314. This score reflects the model’s ability to accurately classify fake news articles, considering both precision and recall.
To test the capability of this model, I created a sample news headline and fed it into the trained model. Here’s the code snippet for this process:
sample_news = ['President obama is not performing good, he is a terrorist']
test1= tfid.transform(sample_news)
ps_model.predict(test1)
In this code, I used the `tfid.transform` function to transform the sample news article into a TF-IDF vector representation, which matches the format used during training. Then, we passed the transformed data to our trained Passive Aggressive classifier (`ps_model.predict`) to predict the label of the sample news, i.e. “FAKE”.
👉 Task 8: Saving the Model
We saved the trained Passive Aggressive classifier and TF-IDF vectorizer using the `pickle` module. Here’s the code snippet:
import pickle
pickle.dump(ps_model, open('classifier.pkl', 'wb'))
# Save the TF-IDF vectorizer
pickle.dump(tfid, open('tfidf_vectorizer.pkl', 'wb'))
In the code above, I used the `pickle.dump` function to save the `ps_model` and `tfid` objects to separate files. This allows us to preserve the trained model and vectorizer for future use without having to retrain them.
Saving the model and vectorizer is crucial for practical deployment scenarios where we want to utilize the trained model to make predictions on new data without the need for retraining.
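Loading the saved artifacts back is the mirror image of the dump calls above. The sketch below round-trips a stand-in object through pickle so it runs on its own; in the real deployment, the same `pickle.load` calls read the `classifier.pkl` and `tfidf_vectorizer.pkl` files produced during training:

```python
import os
import pickle
import tempfile

# Stand-in for the trained ps_model, used here only so this snippet is
# self-contained; in the deployed app the file comes from training.
model_stub = {'name': 'passive-aggressive', 'max_iter': 50}

path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')
with open(path, 'wb') as f:
    pickle.dump(model_stub, f)   # same call used to save ps_model

with open(path, 'rb') as f:      # later, e.g. at web-app start-up
    loaded = pickle.load(f)

print(loaded['name'])            # -> passive-aggressive
```

The loaded object behaves exactly like the one that was saved, which is what lets the web app call `predict` without retraining.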
👉 Task 9: Building a Web Application
Now that our machine learning pipeline and model are ready, we will start building a web application that can connect to them and generate predictions on new data in real time. There are two parts to this application:
I have developed the front-end of my project using HTML, which is a standard markup language for creating web pages. HTML allows for the structuring and presentation of content on the web. By using HTML, I designed the user interface and layout for my application, ensuring an intuitive and visually appealing experience for users. This front-end code serves as the interface through which users can interact with the features and functionalities provided by the back-end of the application.
The provided code snapshot represents the back-end implementation of the web application using Flask.
Here’s a brief explanation of the code:
- The `pickle` module is imported to load the trained classifier model and TF-IDF vectorizer from the saved files.
- The Flask framework is imported, which allows us to create and run the web application.
- The Flask app is initialized using `Flask(__name__)`, setting the current file as the main module.
- In the `home` function, `render_template` is used to render the “home.html” template, which serves as the main page of the application.
- The `predict` function is decorated with `@app.route('/predict', methods=['POST'])`. It handles the prediction process when the user submits the form with news data.
- Inside the `predict` function, the user input is obtained from the form using `request.form.values()`.
- The TF-IDF vectorizer transforms the input data using `tfid.transform()`.
- The transformed data is then passed to the trained classifier model for prediction using `classifier_model.predict()`.
- The predicted output is obtained, and `render_template` renders the “home.html” template again, passing the prediction result to be displayed in the template.
The `if __name__ == '__main__':` block ensures that the Flask app runs only when the script is executed directly, rather than imported as a module.
This back-end code sets up the Flask web application, loads the trained model and vectorizer, handles user input, and performs predictions based on the provided data.
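Since the back-end appeared only as a snapshot, here is a minimal sketch of what such an `app.py` could look like, reconstructed from the description above. The file names, the `prediction_text` variable, and the `home.html` template are assumptions based on that description, not the author’s exact code (the `try/except` guard is added only so the sketch imports cleanly before the `.pkl` files exist):

```python
import pickle
from flask import Flask, render_template, request

app = Flask(__name__)

# Load the trained classifier and TF-IDF vectorizer saved with pickle.
# The guard lets this sketch import even before the training notebook
# has produced the .pkl files.
try:
    classifier_model = pickle.load(open('classifier.pkl', 'rb'))
    tfid = pickle.load(open('tfidf_vectorizer.pkl', 'rb'))
except FileNotFoundError:
    classifier_model = tfid = None

@app.route('/')
def home():
    # Render the main page with the news input form
    return render_template('home.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Collect the news text submitted through the form
    news = [str(value) for value in request.form.values()]
    # Transform with the training-time vectorizer, then classify
    prediction = classifier_model.predict(tfid.transform(news))[0]
    return render_template('home.html', prediction_text=prediction)

if __name__ == '__main__':
    app.run(debug=True)
```

Running `python app.py` starts the development server, and submitting the form posts the text to `/predict`, which returns the page with the model’s verdict filled in.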
Conclusion
This project showcases the successful development of a Fake News Classifier using a Passive Aggressive algorithm. With an accuracy score of 93% and robust performance metrics, the model demonstrates its ability to distinguish between fake and real news articles effectively. The complete code for this project is available on my GitHub repository, where you can explore the implementation details and further enhance the classifier.