Exploring N-Grams: The Building Blocks of Natural Language Understanding | by om pramod

Part 5: Delving Deeper into Advanced Smoothing Techniques.

2. Additive or Lidstone (add-k) Smoothing: Add-k smoothing generalizes add-one smoothing. Instead of incrementing each count by 1, it adds a fractional count k to each count. This shifts less probability mass from observed to unobserved events and so avoids the extreme adjustments that add-one smoothing can cause. The formula for calculating smoothed probabilities using Additive Smoothing is:

$P(w) = \frac{C(w) + k}{N + kV}$

where C(w) is the observed count of w, N is the total number of observations, k is the smoothing parameter, and V is the vocabulary size.

Let’s consider a corpus with the following word frequencies:

“Apple”: 5
“Banana”: 3
“Cherry”: 2
“Dates”: 0 (Not present in the corpus)

Let’s assume the vocabulary size is 4 (including “Apple,” “Banana,” “Cherry,” and “Dates”).

Without Smoothing:

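Without any smoothing, probabilities are calculated directly from the observed frequencies. The counts sum to N = 5 + 3 + 2 + 0 = 10, so:

P(“Apple”) = 5/10 = 0.5
P(“Banana”) = 3/10 = 0.3
P(“Cherry”) = 2/10 = 0.2
P(“Dates”) = 0/10 = 0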

Laplace (Add-One) Smoothing:

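Using Laplace smoothing with k=1, each count grows by 1 and the denominator becomes 10 + 1 × 4 = 14:

P(“Apple”) = (5 + 1)/14 ≈ 0.429
P(“Banana”) = (3 + 1)/14 ≈ 0.286
P(“Cherry”) = (2 + 1)/14 ≈ 0.214
P(“Dates”) = (0 + 1)/14 ≈ 0.071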

Add-k Smoothing (Using k=0.5):

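Using Add-k smoothing with k=0.5, each count grows by 0.5 and the denominator becomes 10 + 0.5 × 4 = 12:

P(“Apple”) = (5 + 0.5)/12 ≈ 0.458
P(“Banana”) = (3 + 0.5)/12 ≈ 0.292
P(“Cherry”) = (2 + 0.5)/12 ≈ 0.208
P(“Dates”) = (0 + 0.5)/12 ≈ 0.042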

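These calculations demonstrate how the choice of k controls the amount of probability mass moved to unseen words. The unseen word “Dates” receives 1/14 ≈ 0.071 with k=1 but only 0.5/12 ≈ 0.042 with k=0.5, while the most frequent word, “Apple”, drops from 0.5 to about 0.429 with k=1 but only to about 0.458 with k=0.5.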
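The arithmetic for this example can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `add_k_probabilities` is made up for illustration, and it assumes the unigram setting of the example, where the vocabulary is exactly the four listed words.

```python
def add_k_probabilities(counts, k, vocab_size=None):
    """Return add-k smoothed probabilities for every word in `counts`.

    `counts` maps each vocabulary word to its observed frequency
    (0 for words that never occurred). `vocab_size` defaults to the
    number of keys in `counts`, which is an assumption of this sketch.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if vocab_size is None:
        vocab_size = len(counts)
    total = sum(counts.values())
    denominator = total + k * vocab_size
    return {word: (c + k) / denominator for word, c in counts.items()}


# Word frequencies from the example above.
counts = {"Apple": 5, "Banana": 3, "Cherry": 2, "Dates": 0}

# k=0 is the unsmoothed estimate, k=1 is Laplace, k=0.5 is add-k.
for k in (0, 1, 0.5):
    probs = add_k_probabilities(counts, k)
    print(k, {word: round(p, 3) for word, p in probs.items()})
```

Whatever value of k is chosen, the smoothed probabilities still sum to 1, because the kV added to the denominator matches the k added to each of the V counts.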

By using add-k smoothing, we ensure that every word or n-gram has a non-zero probability, even if it didn’t occur in the training data. This helps to mitigate the sparsity problem: without smoothing, a single unseen word or n-gram would give any sequence containing it a probability of zero.

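Note that the smoothing parameter k must be greater than 0. When k equals 0, no smoothing is applied, and when k equals 1, add-k smoothing reduces to Laplace smoothing. Selecting the right value for k is crucial for the effectiveness of add-k smoothing. It can be tuned by trying different values and keeping the one that performs best (for example, gives the highest likelihood) on a holdout set.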
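One way to try different values of k on a holdout set is a small grid search over held-out log-likelihood. The sketch below is only illustrative: the `heldout` sample and the `candidate_ks` grid are hypothetical values invented for this example, and the helper names are not from any library.

```python
import math


def add_k_prob(word, counts, k, vocab_size):
    """Add-k probability of a single word given training counts."""
    total = sum(counts.values())
    return (counts.get(word, 0) + k) / (total + k * vocab_size)


def heldout_log_likelihood(heldout, counts, k, vocab_size):
    """Sum of log-probabilities of the held-out words under add-k smoothing."""
    return sum(math.log(add_k_prob(w, counts, k, vocab_size)) for w in heldout)


# Training counts from the example above.
train_counts = {"Apple": 5, "Banana": 3, "Cherry": 2, "Dates": 0}
vocab_size = len(train_counts)

# Hypothetical held-out sample, made up purely for illustration.
heldout = ["Apple", "Banana", "Apple", "Dates", "Cherry", "Apple"]

# Hypothetical grid of candidate values; all are > 0 so that the unseen
# word "Dates" never gets probability 0 (and log(0) is never evaluated).
candidate_ks = [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]

best_k = max(
    candidate_ks,
    key=lambda k: heldout_log_likelihood(heldout, train_counts, k, vocab_size),
)
print("best k on this held-out sample:", best_k)
```

With a sample this small the selected value means very little; the point is only the procedure. On real data the holdout set should be large enough that the likelihood differences between candidate values are stable.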

3. Good-Turing Smoothing or Good-Turing discounting: This method was developed by Alan Turing and his assistant I.J. Good during World War II as part of their work on cracking German ciphers for the Enigma machine. It’s an enhancement over traditional smoothing techniques like Laplace Smoothing. Good-Turing smoothing assumes that if two words appear the same number of times in the corpus, they have the same probability of occurring in general.

Let’s consider a simple example. Suppose we have a corpus with the following words: {the, bad, cat, the, cat}. The word “the” appears twice, and the word “cat” also appears twice, so under the Good-Turing assumption both are given the same probability. This assumption significantly reduces the number of parameters in the model: instead of estimating a unique probability for each word, we estimate a single probability for all words that occur the same number of times. That simplifies the computation, reduces the risk of overfitting, and is much more manageable when dealing with large vocabularies.

In Good-Turing smoothing, we introduce the notation $N_r$, which denotes the number of item types that occur exactly r times in the corpus. An “item type” could be a word, a bigram, a trigram, or any other unit depending on the context. Let’s return to the corpus {the, bad, cat, the, cat}. In this case:

$N_0$ is the number of vocabulary items that never appear in the corpus, so it depends on the assumed vocabulary. If the vocabulary is {the, bad, cat, dog}, $N_0$ would be 1 because only the word “dog” does not appear in the corpus at all.
$N_1$ would be 1 because only the word “bad” appears once in the corpus.
$N_2$ would be 2 because the words “the” and “cat” each appear twice in the corpus.

So, $N_r$ gives us a count of how many item types occur r times in the corpus. This is a crucial part of the Good-Turing smoothing formula, as it helps us estimate the probability of unseen events.
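The frequency-of-frequencies values for the small corpus can be computed with `collections.Counter`. This is a rough sketch: the function name `frequency_of_frequencies` is made up here, and the vocabulary containing “dog” is an assumption used only to show how the count of unseen types depends on which vocabulary is assumed.

```python
from collections import Counter


def frequency_of_frequencies(tokens, vocabulary=None):
    """Map each count r to the number of types that occur exactly r times.

    The entry for r = 0 counts vocabulary items absent from `tokens`.
    If no vocabulary is given, it is assumed to be the observed types,
    in which case no type is unseen.
    """
    type_counts = Counter(tokens)              # how often each type occurs
    n_r = Counter(type_counts.values())        # how many types share each count
    if vocabulary is None:
        vocabulary = set(type_counts)
    n_r[0] = len(set(vocabulary) - set(type_counts))
    return dict(n_r)


corpus = ["the", "bad", "cat", "the", "cat"]

# Vocabulary assumed equal to the observed words: nothing is unseen.
observed_only = frequency_of_frequencies(corpus)
print(observed_only)

# Hypothetical vocabulary that also contains "dog": one unseen type.
with_dog = frequency_of_frequencies(corpus, vocabulary={"the", "bad", "cat", "dog"})
print(with_dog)
```

The counts for r = 1 and r = 2 are fixed by the corpus, while the count for r = 0 changes with the vocabulary passed in.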

The adjusted count $c^*$ of an item that occurs c times is calculated as:

$c^* = (c + 1) \frac{N_{c+1}}{N_c}$

This formula adjusts the counts of n-grams in the training data based on the counts of other n-grams, effectively redistributing some of the probability mass from n-grams that occur frequently to those that occur rarely or not at all. This helps to mitigate the sparsity problem in language modeling.

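After adjusting counts, the probability of an event with a specific count is estimated using the adjusted count and the total number of events. The probability is calculated as:

$P_{GT}(X) = \frac{c^*}{N}$

where $c^*$ is the adjusted count of X and N is the total count of all observed events.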

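For example, let’s consider a table of bigram (X) counts, where c is the count of each bigram and $N_c$ is the number of bigrams with that count. The worked examples below use $N_0 = 4$, $N_1 = 5$, $N_2 = 1$, $N_3 = 2$, $N_4 = 2$, $N_5 = 1$, and a total count of N = 36.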

Let’s consider the event ‘BB’ from the table:

In this case, the event ‘BB’ occurs 2 times in the training data (c = 2), there is 1 event that occurs 2 times ($N_2 = 1$), and there are 2 events that occur 3 times ($N_3 = 2$). So, the adjusted count is $c^* = (2+1) \times 2/1 = 6$, and the Good-Turing probability is $P_{GT}(BB) = 6/36 \approx 0.17$.

Let’s take the n-gram “AB” as an example. It does not appear in the training data, so its original count (c) is 0. There are 4 n-grams that do not appear in the training data ($N_0 = 4$), and there are 5 n-grams that appear once ($N_1 = 5$). So, the adjusted count ($c^*$) is:

$c^* = (0+1) \times \frac{5}{4} = 1.25$

The total count of all n-grams (N) is 36, so the estimated probability of “AB” after Good-Turing smoothing is:

$P_{GT}(AB) = \frac{1.25}{36} \approx 0.03$

Consider another example, the event ‘A.’. It occurs 4 times in the training data (c = 4), there are 2 events that occur 4 times ($N_4 = 2$), and there is 1 event that occurs 5 times ($N_5 = 1$). So, the adjusted count is $c^* = (4+1) \times 1/2 = 2.5$, and the Good-Turing probability is $P_{GT}(A.) = 2.5/36 \approx 0.07$.

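Consider the first row of the table, the event ‘CC’, which occurs 10 times. No event occurs 11 times, so $N_{11} = 0$, the adjusted count is $c^* = (10+1) \times 0 / N_{10} = 0$, and $P_{GT}(CC) = 0$.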

This suggests that, according to the Good-Turing smoothing method, the event ‘CC’ is not expected to occur in unseen data, despite occurring 10 times in the training data. The reason is that the adjusted count for an event seen 10 times depends on $N_{11}$, the number of events seen 11 times, and there are none in the training data. This is a known limitation of the Good-Turing method at higher counts, where the $N_r$ values become sparse. In practice it is usually addressed by smoothing the $N_r$ values or by applying the adjustment only to small counts.
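The worked examples above can be reproduced from the frequency-of-frequencies values alone. The sketch below uses only the values stated in the text, with one assumption: that ‘CC’ is the only event with count 10, which is consistent with the stated total of 36. The function name `good_turing_adjusted_count` is made up for illustration.

```python
def good_turing_adjusted_count(c, count_of_counts):
    """Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c."""
    n_c = count_of_counts.get(c, 0)
    n_c_plus_1 = count_of_counts.get(c + 1, 0)
    if n_c == 0:
        raise ValueError(f"no events with count {c}, so c* is undefined")
    return (c + 1) * n_c_plus_1 / n_c


# N_r values quoted in the text; N_10 = 1 is an assumption of this sketch.
count_of_counts = {0: 4, 1: 5, 2: 1, 3: 2, 4: 2, 5: 1, 10: 1}

# Total count of all observed events: sum of r * N_r.
total = sum(r * n_r for r, n_r in count_of_counts.items())

# Events discussed in the text, with their original counts.
examples = [("AB", 0), ("BB", 2), ("A.", 4), ("CC", 10)]

for event, c in examples:
    c_star = good_turing_adjusted_count(c, count_of_counts)
    print(event, c, c_star, round(c_star / total, 2))
```

The last row shows the limitation directly: because no event has count 11, the adjusted count for ‘CC’ collapses to 0 even though it is the most frequent event.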

Closing note: As we conclude this segment, remember that every challenge is an opportunity for growth. Join us in Part 6: Strategies for Dealing with Out-of-Vocabulary Words, as we explore further advancements in natural language processing. Together, we’re pushing the boundaries of knowledge in NLP! Thank you for your continued involvement!