![](https://crypto4nerd.com/wp-content/uploads/1DMTKJXglHnG_x5RwbKgKeQ.png)
What follows is my implementation of sentiment analysis with a from-scratch Naive Bayes model, without using machine-learning libraries such as Scikit-Learn, TensorFlow, or PyTorch.
The dataset is in the Welsh language and was provided by the professor of the NLP (Natural Language Processing) class I was enrolled in.
Introduction
The Naive Bayes classifier is one of the simplest probabilistic classifiers used for processing text. It is based on Bayes' theorem, which states:

P(A|B) = P(B|A) · P(A) / P(B)

According to this formula, we can find the probability of event A happening, given that event B has occurred. Hence, event B is the evidence and event A is the hypothesis. The classifier further assumes that the features are independent of one another, i.e. no feature depends on any other feature, which is why it is called naive.
Importing the libraries
The following basic Python libraries are imported to implement the model.
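The original import snippet is not shown, so here is a minimal sketch of the standard-library imports such a from-scratch implementation typically needs (the exact set in the original may differ):

```python
import re                             # regular expressions for cleaning the raw text
import csv                            # reading the Target/Text columns from the dataset files
import math                           # log() for summing log-likelihoods in the classifier
from collections import defaultdict   # frequency table that defaults to 0 for unseen keys
```

No third-party packages are needed; everything above ships with Python.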
Read the dataset
The dataset that I used was given by my professor and is based on the Welsh language. It was not preprocessed and contained noise. The dataset is divided into a training file and a testing file, each with a different set of sentences. The training set includes 80,000 rows with 2 columns, namely Target and Text; the testing set includes 10,000 rows with the same 2 columns.
Cleaning the dataset
The dataset is raw and contains a lot of unwanted noise that has to be removed before the data can be passed as input to the model. In this step, I remove the following using regular expressions from the ‘re’ module:
- HTML tags
- URLs
- non-alphanumeric characters
- emojis
- extra whitespace
I am not removing stop words here because I did not come across many stop words in the Welsh-language data.
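The cleaning steps listed above can be sketched as a single function. The function name `clean_text` and the exact regex patterns are my own assumptions, not necessarily those of the original snippet:

```python
import re

def clean_text(text):
    """Strip HTML tags, URLs, non-alphanumeric characters (including emojis),
    and extra whitespace from a raw sentence."""
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[^\w\s]", " ", text)                 # punctuation; emojis also fall out here
    text = re.sub(r"\s+", " ", text).strip()             # collapse runs of whitespace
    return text.lower()
```

Note that `[^\w\s]` with Python's Unicode-aware `\w` removes emojis along with punctuation, so no separate emoji pattern is strictly required.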
Counting the frequency of each {word, label} pair
I define a function that counts the frequency of each word together with its corresponding label (positive/negative).
According to the snippet above, the result is stored in a dictionary with the (word, label) pair as the key and the frequency as the value. Below is an example of the output that I got.
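A possible version of that counting function (the name `count_freqs` and the 1/0 label encoding are my assumptions):

```python
from collections import defaultdict

def count_freqs(sentences, labels):
    """Build a dictionary mapping (word, label) -> frequency.
    `labels` holds 1 for positive and 0 for negative sentences."""
    freqs = defaultdict(int)
    for sentence, label in zip(sentences, labels):
        for word in sentence.split():
            freqs[(word, label)] += 1
    return freqs
```

Each word contributes one count per occurrence under its sentence's label, so the same word can appear under both labels with different counts.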
Once the above process is done, searching the dictionary for a (word, label) pair that was never seen would raise a KeyError. To avoid this, I define a lookup function that returns the stored frequency when the pair exists and 0 otherwise.
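A helper along those lines might look like this (a hypothetical sketch; `dict.get` with a default does the safe lookup):

```python
def lookup(freqs, word, label):
    """Return the frequency of a (word, label) pair, or 0 if the pair was
    never seen, so that missing words do not raise a KeyError."""
    return freqs.get((word, label), 0)
```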
Calculating the frequency of all the unique words in the dictionary
In order to calculate the frequency of all the unique words in the dictionary, the function below is used. This function is the Naive Bayes model created from scratch.
In this step, we add up the log-likelihood of each word. If the sum is greater than 0, the text is classified as positive, and if it is less than 0, we conclude that the text is negative.
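One way to realise that step, assuming Laplace (add-one) smoothing and the (word, label) frequency dictionary built earlier; all function and variable names here are my own, not the original's:

```python
import math

def train_naive_bayes(freqs, labels):
    """Compute a log-likelihood per unique word with Laplace smoothing.
    `freqs` maps (word, label) -> count; labels are 1 (positive) / 0 (negative)."""
    vocab = {word for word, _ in freqs}
    n_pos = sum(c for (_, lab), c in freqs.items() if lab == 1)
    n_neg = sum(c for (_, lab), c in freqs.items() if lab == 0)
    loglikelihood = {}
    for word in vocab:
        p_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + len(vocab))
        p_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + len(vocab))
        loglikelihood[word] = math.log(p_pos / p_neg)
    # the log prior accounts for any class imbalance in the training labels
    logprior = math.log(labels.count(1) / labels.count(0))
    return logprior, loglikelihood

def predict(sentence, logprior, loglikelihood):
    """Sum the words' log-likelihoods: > 0 means positive, < 0 means negative."""
    score = logprior
    for word in sentence.split():
        score += loglikelihood.get(word, 0)  # unseen words contribute nothing
    return score
```

With a balanced training set the log prior is 0, so the sign of the summed log-likelihoods alone decides the class.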
I made the following assumptions while writing this function:
Independent
According to the Naive Bayes assumption, each variable is independent of the others.
Relative frequency
That said, you might come across some “clean” datasets that have been intentionally balanced to contain an equal number of positive and negative examples. Please keep in mind that data in the real world is usually noisier.
The model that I built has an accuracy of 0.781, which is moderate. But from this, we can be reasonably confident that the model is not overfitting.
Conclusion
In conclusion, the Naive Bayes model is a commonly used approach for sentiment analysis tasks, where the goal is to classify text as positive, negative, or neutral. The Naive Bayes algorithm is a probabilistic classifier that calculates the probability of a document belonging to a particular class based on the probabilities of the words in the document.
Thank you for reading!
Connect with me on LinkedIn: https://www.linkedin.com/in/vijay-varshini-ln/