![](https://crypto4nerd.com/wp-content/uploads/2023/02/10mUNhEgxyW2pEcotAM_rVg-1024x688.jpeg)
The Naive Bayes classifier has always fascinated me. It bends the rules, its core assumption is often wrong and it fails in many cases, yet it remains one of the most successful classification techniques ever.
It is used extensively, from spam detection in your mail client to product recommendations on your favourite websites.
Learning to code Naive Bayes from scratch could be your first step into the world of Machine learning, or perhaps your deeper step into learning the theory behind it. But without further rambling, let’s get straight to it.
To introduce the maths behind the Naive Bayes classifier, we need to start with Bayes’ theorem. Bayes’ theorem is a way to calculate the probability of an event given some evidence. It is defined as follows:
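P(C | x1, x2, …, xn) = P(x1, x2, …, xn | C) * P(C) / P(x1, x2, …, xn)

Here, P(C) is the prior probability of class C, P(x1, x2, …, xn | C) is the likelihood of the observed features given that class, and the denominator P(x1, x2, …, xn) is the evidence, which is the same for every class.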
To classify a new instance, the Naive Bayes classifier calculates the above probability for each class and chooses the class with the highest probability.
The “naive” assumption in the Naive Bayes classifier is that the features are independent of each other given the class variable. This means that we can write:
P(x1, x2, …, xn | C) = P(x1 | C) * P(x2 | C) * … * P(xn | C)
This assumption simplifies the calculations and makes the Naive Bayes classifier computationally efficient. However, it may not be true in practice, especially when the features are highly correlated.
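Combining this with Bayes’ theorem, and dropping the evidence term since it is identical for every class, the quantity the classifier actually compares across classes is:

P(C | x1, x2, …, xn) ∝ P(C) * P(x1 | C) * P(x2 | C) * … * P(xn | C)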
We will wrap everything in a class, which keeps the code reusable and the concerns separated.
import math

class CustomNaiveBayes:
    def __init__(self, laplace_smoothing=True, smoothing_factor=1):
        self._classes = None
        self._class_priors = None
        self._mean = None
        self._var = None
        self._laplace_smoothing = laplace_smoothing
        self._smoothing_factor = smoothing_factor
        self._epsilon = 1e-9  # small constant to avoid log(0) and division by zero
Notably, we will also be using Laplace smoothing. Laplace smoothing is often used to avoid the problem of zero probabilities. These can occur when a feature value does not appear in the training dataset for a certain class, which would lead to zero probabilities and distort the final classification. Laplace smoothing adds a small value to all the probabilities to ensure that none of them is zero.
Let’s start with the most basic part, and that’s finding the possible classes within our data. This particular snippet takes all the y values of the dataset and turns them into a sorted list of unique values.
    def _calculate_classes(self, y):
        # unique class labels, sorted so their order is deterministic
        self._classes = sorted(set(y))
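For example, if y = [0, 1, 1, 2, 0], then self._classes ends up as [0, 1, 2].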
Great! Let’s now move to something a bit harder, and that’s calculating the priors on the dataset.
    def _calculate_prior(self, y):
        occurances = [0] * len(self._classes)
        # count how many samples belong to each class
        for idx, cl in enumerate(self._classes):
            for i in y:
                if i == cl:
                    occurances[idx] += 1
        no_of_samples = len(y)
        if self._laplace_smoothing:
            self._class_priors = [(occurances[i] + self._smoothing_factor) / (no_of_samples + (self._smoothing_factor * len(self._classes))) for i in range(len(self._classes))]
        else:
            self._class_priors = [occurances[i] / no_of_samples for i in range(len(self._classes))]
These priors are simply the relative frequencies of the classes: the number of occurrences of each class divided by the number of samples, with the Laplace smoothing factor added to the counts and to the denominator.
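As a quick sanity check with made-up counts: with 10 samples split 5/3/2 across three classes and a smoothing factor of 1, the smoothed priors come out as (5 + 1) / (10 + 3) ≈ 0.46, (3 + 1) / 13 ≈ 0.31 and (2 + 1) / 13 ≈ 0.23, which still sum to 1.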
We can then move on to calculating the per-class mean of each feature:
    def _calculate_mean(self, X, y):
        self._mean = []
        occurances = [0] * len(self._classes)
        for idx, cl in enumerate(self._classes):
            mean = [0] * len(X[0])
            # sum up the feature values of every sample belonging to this class
            for i in range(len(X)):
                if y[i] == cl:
                    for j in range(len(X[0])):
                        mean[j] += X[i][j]
                    occurances[idx] += 1
            # divide by the number of samples in the class to get the mean
            for i in range(len(mean)):
                mean[i] /= occurances[idx]
            self._mean.append(mean)
What we’re doing here is calculating the mean of each feature for each class. Together with the variances below, these means parameterise the Gaussian likelihood P(x | C); they are not part of the prior `P(C)`, which we already computed above.
Finally, we can move on to calculating the variances, which will be the final component required during our training.
    def _calculate_variance(self, X, y):
        self._var = []
        occurances = [0] * len(self._classes)
        for idx, cl in enumerate(self._classes):
            var = [0] * len(X[0])
            # sum the squared deviations from the class mean for every sample in the class
            for i in range(len(X)):
                if y[i] == cl:
                    for j in range(len(X[0])):
                        var[j] += (X[i][j] - self._mean[idx][j]) ** 2
                    occurances[idx] += 1
            # divide by the number of samples in the class to get the variance
            for i in range(len(var)):
                var[i] /= occurances[idx]
            self._var.append(var)
And that’s it! You have successfully implemented the fitting process for the entire Naive Bayes Classifier.
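To tie these together, one possible fit wrapper (the method name here is just a convenient choice, not something fixed by the steps above) simply calls the four helpers in order:

    def fit(self, X, y):
        # convenience wrapper: run the training steps in order
        self._calculate_classes(y)
        self._calculate_prior(y)
        self._calculate_mean(X, y)
        self._calculate_variance(X, y)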
Now, there’s no use in fitting a model without using it, right? So let’s define the prediction function:
    def predict(self, X):
        posteriors = []
        for x in X:
            posterior = []
            for idx, c in enumerate(self._classes):
                # log of the class prior fitted during training
                logged_prior = math.log(self._class_priors[idx])
                # log-likelihood of each feature given this class, via the Gaussian PDF
                pdf = self.gaussian(idx, x)
                logged_pdf = [math.log(p + self._epsilon) for p in pdf]
                conditional = sum(logged_pdf)
                posterior.append(logged_prior + conditional)
            posteriors.append(posterior)
This function takes a set of instances X as input and computes, for each instance, a list of log-posterior scores, one per class; the class with the highest score is the most probable one for that instance.
For each instance x in X, a posterior list is created. For each class c in the classifier, the prior probability of the class (logged_prior) is calculated by taking the logarithm of the class prior probabilities (self._class_priors). This was fitted during training.
Then, the conditional probability of observing the feature variables given the class (conditional) is calculated using the Gaussian probability density function (pdf) for each feature variable. The Gaussian function is implemented in the self.gaussian method, which takes the index of the class (idx) and the feature vector (x) as input, and returns a list of probabilities for each feature variable.
The Gaussian method in the Naive Bayes classifier is used to estimate the probability density function (PDF) of each feature variable, given a specific class. The PDF of a feature variable describes the distribution of values that the variable can take on, and it is used to calculate the conditional probability of observing a specific set of feature values for a given class.
In the Gaussian Naive Bayes classifier, it is assumed that the PDF of each feature variable given a class follows a normal (Gaussian) distribution. This means that the PDF can be characterized by its mean and variance, which are estimated based on the training data for each class. The mean represents the center of the distribution, and the variance represents the spread of the distribution around the mean.
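Concretely, for feature xi of an instance and a class with fitted mean and variance var, the density used below is:

P(xi | C) = 1 / sqrt(2 * pi * var) * exp(-((xi - mean) ** 2) / (2 * var))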
    def gaussian(self, idx, x):
        ans = []
        for i in range(len(x)):
            # Gaussian PDF of feature i under the fitted mean and variance of class idx
            coeff = 1 / (math.sqrt(2 * math.pi * self._var[idx][i] + self._epsilon))
            exp_num = -1 * ((x[i] - self._mean[idx][i]) ** 2)
            exp_den = (2 * self._var[idx][i]) + self._epsilon
            ans.append(coeff * math.exp(exp_num / exp_den))
        return ans
Going back to the predict function, to avoid issues with zero probabilities, the logarithm of each probability value is taken (logged_pdf), and a small value (self._epsilon) is added to each probability value before taking the logarithm. The conditional probability is then calculated by summing the logarithmic values of the probabilities.
Finally, the posterior score for each class is calculated by adding the logged prior probability and the logged conditional probability. The scores for all classes are appended to the posterior list for the instance x, which is in turn collected into the posteriors list covering all instances.
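Putting it together, the score computed for each class is:

log P(C | x) = log P(C) + log P(x1 | C) + log P(x2 | C) + … + log P(xn | C) + constant

where the constant is the log of the evidence, identical for every class and therefore safe to ignore.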
From the posteriors, still inside predict, we can then use the max function to pick the class with the highest score as our prediction and return it:
        preds = []
        for i in posteriors:
            # pick the class whose log-posterior score is highest
            preds.append(self._classes[i.index(max(i))])
        return preds
Et voilà! The entire system is ready, and we can now fit and classify data using our very own Naive Bayes classifier! Feel free to play with the parameters, such as the smoothing factor, which may help you tune it to your specific needs!
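As a quick illustration with made-up numbers (and using the fit wrapper sketched earlier), training and predicting looks like this:

clf = CustomNaiveBayes(laplace_smoothing=True, smoothing_factor=1)
X_train = [[1.0, 2.1], [0.9, 1.8], [3.2, 3.9], [3.0, 4.2]]
y_train = [0, 0, 1, 1]
clf.fit(X_train, y_train)
print(clf.predict([[1.1, 2.0], [3.1, 4.0]]))  # expected: [0, 1]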
The final code, including a small test on the Iris dataset, can be found here: