![](https://crypto4nerd.com/wp-content/uploads/2024/03/1EZMrOglV6CweC8yNG5Exzg-1024x427.png)
Machine Learning (ML) is the science (and art) of developing algorithms, executed as computer programs, so that the machine can learn from data and use what it has learned to provide meaningful outputs to its users. For example, an automated program that distinguishes spam from non-spam e-mails is an ML application — actually one of the first ML applications that became mainstream around the globe in the 1990s.
Classification problems are supervised learning tasks in which the instances in a dataset are mapped to pre-defined classes. The ML model learns from the training dataset so that it can correctly predict the class of newly introduced data. In this context, classes are also called labels, targets, or categories.
To introduce and implement ML algorithms that are commonly used for classification problems, the MNIST dataset is used in the following code examples. This dataset consists of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. The label of each image is the digit it represents.
Let’s start by getting the MNIST dataset and preparing it for use in ML algorithms.
Scikit-Learn provides many helper functions to download popular datasets including the MNIST. The following code fetches the MNIST dataset from OpenML.org:
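A minimal version of that fetch, matching the description below (the variable names `X` and `y` are this sketch’s choice):

```python
from sklearn.datasets import fetch_openml

# Fetch MNIST from OpenML.org; as_frame=False returns NumPy arrays
# instead of a pandas DataFrame/Series.
mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist.data, mnist.target

print(X.shape)  # (70000, 784)
print(y.shape)  # (70000,)  -- labels are strings, e.g. '5'
```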
In its default setting, the fetch_openml function returns the features as a pandas DataFrame and the labels as a pandas Series (unless the dataset is sparse). However, DataFrames are not ideal for the MNIST dataset, since it contains images. This is why as_frame is set to False in the code above, so that the data is retrieved as NumPy arrays.
There are 70,000 images in this dataset, and each image has 784 features: each image is 28 × 28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black).
The code below visualizes one digit from the dataset by reshaping the instance’s feature vector into a 28 × 28 array and displaying it with Matplotlib’s imshow() function. cmap="binary" gives a grayscale color map where 0 corresponds to white and 255 to black.
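A runnable sketch of that visualization (it re-fetches the data so it stands on its own; the first instance is used here):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist.data, mnist.target

# Reshape the 784-long feature vector into a 28 x 28 image.
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()

print(y[0])  # the label of the first instance, stored as a string
```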
This looks like a 5, and indeed that’s what the label indicates.
Below are some more examples from the MNIST dataset:
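A quick way to produce such a gallery (a sketch; the 5 × 5 grid of the first 25 instances is this example’s choice):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist.data, mnist.target

# Plot the first 25 digits in a 5 x 5 grid, with labels as titles.
fig, axes = plt.subplots(5, 5, figsize=(6, 6))
for ax, image, label in zip(axes.flat, X[:25], y[:25]):
    ax.imshow(image.reshape(28, 28), cmap="binary")
    ax.set_title(label, fontsize=8)
    ax.axis("off")
plt.tight_layout()
plt.show()
```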
One more step should be performed before implementing the ML classifiers: splitting the dataset into training and test (and sometimes validation) sets, so that performance can be measured both on the instances used to construct the model and on instances that are brand new to it.
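For MNIST, a common convention (assumed here) is that the data as fetched is already arranged so that the first 60,000 images serve as the training set and the last 10,000 as the test set:

```python
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", as_frame=False)
X, y = mnist.data, mnist.target

# Conventional MNIST split: first 60,000 for training, last 10,000 for test.
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

print(X_train.shape, X_test.shape)  # (60000, 784) (10000, 784)
```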
Now, the stage belongs to the ML algorithms to present their skills.
But wait, how do we compare these ML algorithms?!
To measure and compare the performance of different ML models on classification problems, there are several commonly used measures, namely accuracy, precision, recall, and F1 score. The details of these metrics are better left to another article, so that the focus of this one remains on the implementation of different ML models in Python. For now, the following code snippet is used to measure and compare the ML models:
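One plausible implementation of such a performance helper, matching the output format used throughout this article (average="weighted" for the multiclass metrics, and the returned dictionary, are this sketch’s assumptions):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def performance(y_train, y_train_pred, y_test, y_test_pred):
    """Print accuracy, precision, recall and F1 score for both datasets."""
    results = {}
    for name, y_true, y_pred in (("training", y_train, y_train_pred),
                                 ("test", y_test, y_test_pred)):
        scores = {
            "accuracy": accuracy_score(y_true, y_pred),
            # 'weighted' averaging combines the per-class scores,
            # weighted by each class's share of the instances.
            "precision": precision_score(y_true, y_pred, average="weighted"),
            "recall": recall_score(y_true, y_pred, average="weighted"),
            "f1 score": f1_score(y_true, y_pred, average="weighted"),
        }
        for metric, value in scores.items():
            print(f"{metric} on the {name} dataset: {round(value, 2)}")
        results[name] = scores
    return results
```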
Now, the stage really belongs to the ML algorithms!
Six ML algorithms commonly used in classification problems are Logistic Regression, Decision Tree, Random Forest, Gaussian Naive Bayes, Stochastic Gradient Descent, and Support Vector Machine. The same steps are followed in the implementation of each:
- import the related library from Scikit-Learn
- create a classifier of the ML algorithm
- train the classifier on the training dataset (remember that fit should never be called on any dataset other than the training set)
- use the trained classifier with the training dataset to obtain predictions for the training dataset
- use the trained classifier with the test dataset to obtain predictions for the test dataset
- call the function named performance that is defined above to obtain the performance of the ML model on the training and test datasets
(1) Logistic Regression
The idea behind Logistic Regression is to estimate the probability of an event occurring based on a given dataset. It uses a transformation function (the sigmoid, or logistic, function), which maps any real-valued number to a value between 0 and 1, to calculate that probability. In classification problems, the output of this transformation is interpreted as the probability of an instance belonging to a certain class.
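As a minimal runnable sketch of the steps listed earlier with a LogisticRegression classifier — it uses scikit-learn’s small built-in 8 × 8 digits dataset as a quick stand-in for MNIST, so it runs in seconds, and its numbers will differ from the full-dataset figures reported below:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small built-in digits dataset as a quick stand-in for MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

log_clf = LogisticRegression(max_iter=5000)  # raise max_iter so lbfgs converges
log_clf.fit(X_train, y_train)                # fit only on the training data

train_acc = accuracy_score(y_train, log_clf.predict(X_train))
test_acc = accuracy_score(y_test, log_clf.predict(X_test))
print(f"accuracy on the training dataset: {train_acc:.2f}")
print(f"accuracy on the test dataset: {test_acc:.2f}")
```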
The LogisticRegression classifier has the following performance measures:
accuracy on the training dataset: 0.94
precision on the training dataset: 0.94
recall on the training dataset: 0.94
f1 score on the training dataset: 0.94
accuracy on the test dataset: 0.92
precision on the test dataset: 0.92
recall on the test dataset: 0.92
f1 score on the test dataset: 0.92
(2) Decision Tree
The Decision Tree algorithm constructs a hierarchical structure in which decisions based on feature values recursively split the data into subsets. A prediction is then made by traversing the tree from the root (its starting point) to a leaf (the end point of a path through the tree).
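The same pattern applies with a DecisionTreeClassifier; again this sketch uses the small built-in digits dataset for speed, so its numbers will differ from the full-MNIST figures below:

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unconstrained tree grows until it fits the training data (nearly) perfectly.
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree_clf.predict(X_train))
test_acc = accuracy_score(y_test, tree_clf.predict(X_test))
print(f"accuracy on the training dataset: {train_acc:.2f}")
print(f"accuracy on the test dataset: {test_acc:.2f}")
```

Note the gap between the two accuracies: like the full-MNIST results below, the tree overfits the training data.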
The DecisionTree classifier has the following performance measures:
accuracy on the training dataset: 1.0
precision on the training dataset: 1.0
recall on the training dataset: 1.0
f1 score on the training dataset: 1.0
accuracy on the test dataset: 0.87
precision on the test dataset: 0.87
recall on the test dataset: 0.87
f1 score on the test dataset: 0.87
(3) Random Forest
The Random Forest algorithm is an ensemble extension of Decision Trees, designed to overcome their main limitation: a single tree cannot be grown to arbitrary complexity without losing generalization accuracy on unseen data, and capping its complexity usually means suboptimal accuracy. Random Forest relies on stochastic modeling: many trees are constructed on random subsets of the data and features, and the combined capacity of the ensemble can be expanded to increase accuracy on both the training and test datasets.
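A sketch with RandomForestClassifier, which aggregates the predictions of many randomized trees (small digits dataset again, as a fast stand-in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 100 trees by default, each trained on a bootstrap sample of the data.
forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, forest_clf.predict(X_train))
test_acc = accuracy_score(y_test, forest_clf.predict(X_test))
print(f"accuracy on the training dataset: {train_acc:.2f}")
print(f"accuracy on the test dataset: {test_acc:.2f}")
```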
The RandomForest classifier has the following performance measures:
accuracy on the training dataset: 1.0
precision on the training dataset: 1.0
recall on the training dataset: 1.0
f1 score on the training dataset: 1.0
accuracy on the test dataset: 0.97
precision on the test dataset: 0.97
recall on the test dataset: 0.97
f1 score on the test dataset: 0.97
(4) Gaussian Naive Bayes
The Gaussian Naive Bayes algorithm belongs to the family of Naive Bayes algorithms, which use Bayes’ theorem as their foundation. This family relies on the “naive” assumption that the features in the dataset are conditionally independent given the class label. Despite this assumption, Naive Bayes algorithms are observed to perform very well in a variety of real-world scenarios.
The Gaussian variant additionally assumes that, given the class label, each feature follows a Gaussian (normal) distribution (and the features are still assumed to be conditionally independent).
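A sketch with GaussianNB on the small built-in digits dataset (a stand-in for MNIST); note that fit here simply estimates the per-class mean and variance of each feature:

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# No hyperparameters to tune: fit estimates Gaussian parameters per class.
gnb_clf = GaussianNB()
gnb_clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, gnb_clf.predict(X_train))
test_acc = accuracy_score(y_test, gnb_clf.predict(X_test))
print(f"accuracy on the training dataset: {train_acc:.2f}")
print(f"accuracy on the test dataset: {test_acc:.2f}")
```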
The GaussianNB classifier has the following performance measures:
accuracy on the training dataset: 0.55
precision on the training dataset: 0.68
recall on the training dataset: 0.55
f1 score on the training dataset: 0.51
accuracy on the test dataset: 0.55
precision on the test dataset: 0.68
recall on the test dataset: 0.55
f1 score on the test dataset: 0.51
(5) Stochastic Gradient Descent
The Stochastic Gradient Descent (SGD) algorithm is a variant of Gradient Descent, an optimization algorithm that iteratively searches for the optimal solution of an objective function. In ML, the goal of Gradient Descent is to identify the model parameters that minimize the loss function on the training dataset.
In SGD, instead of using the entire dataset in each iteration, a single randomly selected training example is used to calculate the gradient and update the model parameters. This random selection introduces randomness into the optimization process, which is why the algorithm is called “stochastic” Gradient Descent.
One of the main advantages of SGD is the computational efficiency it provides, especially when dealing with large datasets.
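A sketch with SGDClassifier, which by default trains a linear SVM-style model (hinge loss) with SGD updates (small digits dataset again, as a fast stand-in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# random_state fixes the shuffling of examples between epochs.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, sgd_clf.predict(X_train))
test_acc = accuracy_score(y_test, sgd_clf.predict(X_test))
print(f"accuracy on the training dataset: {train_acc:.2f}")
print(f"accuracy on the test dataset: {test_acc:.2f}")
```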
The SGD classifier has the following performance measures:
accuracy on the training dataset: 0.88
precision on the training dataset: 0.89
recall on the training dataset: 0.88
f1 score on the training dataset: 0.88
accuracy on the test dataset: 0.87
precision on the test dataset: 0.88
recall on the test dataset: 0.87
f1 score on the test dataset: 0.87
(6) Support Vector Machines
The Support Vector Machine (SVM) algorithm tries to find the optimal hyperplane in an N-dimensional feature space (where N is the number of features) that separates the data points of the different classes. The optimal hyperplane is the one that maximizes the margin between the closest points of different classes (the support vectors). The dimension of the hyperplane depends on the number of features: with two input features the hyperplane is just a line, with three it becomes a 2-D plane, and beyond three it becomes difficult to visualize.
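A sketch with SVC, scikit-learn’s kernelized SVM (RBF kernel by default), once more on the small built-in digits dataset as a fast stand-in for MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# SVC with the default RBF kernel; multiclass is handled one-vs-one internally.
svm_clf = SVC(random_state=42)
svm_clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, svm_clf.predict(X_train))
test_acc = accuracy_score(y_test, svm_clf.predict(X_test))
print(f"accuracy on the training dataset: {train_acc:.2f}")
print(f"accuracy on the test dataset: {test_acc:.2f}")
```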
The SVC classifier has the following performance measures:
accuracy on the training dataset: 0.99
precision on the training dataset: 0.99
recall on the training dataset: 0.99
f1 score on the training dataset: 0.99
accuracy on the test dataset: 0.98
precision on the test dataset: 0.98
recall on the test dataset: 0.98
f1 score on the test dataset: 0.98
Based on these results on both the training and test datasets, the Random Forest and Support Vector Machine classifiers outperform the other ML algorithms implemented here.