Language modeling is the task of determining the probability of a sequence of words. It is used in applications such as speech recognition and spam filtering, and it is the key idea behind many state-of-the-art Natural Language Processing models.
Methods of Language Modelling
Two methods of Language Modeling:
- Statistical Language Modelling: Statistical language modeling is the development of probabilistic models that predict the next word in a sequence given the words that precede it. N-gram language modeling is a classic example.
- Neural Language Modeling: Neural network methods achieve better results than classical methods, both as standalone language models and when incorporated into larger systems for challenging tasks like speech recognition and machine translation. A common building block of neural language models is word embeddings.
N-gram
An N-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus (a large text dataset).
For instance, N-grams can be unigrams like (“This”, “article”, “is”, “on”, “NLP”) or bigrams (“This article”, “article is”, “is on”, “on NLP”).
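As a quick illustration, the same unigrams and bigrams can be produced with NLTK's ngrams utility (a minimal sketch; the sample sentence is just the one used above):

from nltk.util import ngrams

# Sample text from the example above
tokens = "This article is on NLP".split()

# Unigrams: single tokens
print(list(ngrams(tokens, 1)))
# [('This',), ('article',), ('is',), ('on',), ('NLP',)]

# Bigrams: pairs of adjacent tokens
print(list(ngrams(tokens, 2)))
# [('This', 'article'), ('article', 'is'), ('is', 'on'), ('on', 'NLP')]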
N-gram Language Model
An N-gram language model predicts the probability of a given N-gram within any sequence of words in a language. A well-crafted N-gram model can effectively predict the next word in a sentence, which is essentially determining the value of p(w∣h), where h is the history or context and w is the word to predict.
Let’s explore how to predict the next word in a sentence. We need to calculate p(w|h), where w is the candidate for the next word. Consider the sentence ‘This article is on…’. If we want to calculate the probability of the next word being “NLP”, the probability can be expressed as:
[Tex]p(\text{“NLP”} | \text{“This”}, \text{“article”}, \text{“is”}, \text{“on”})[/Tex]
To generalize, the conditional probability of the fifth word given the first four can be written as:
[Tex]p(w_5 | w_1, w_2, w_3, w_4) \quad \text{or, in general,} \quad p(w_n | w_1, w_2, \ldots, w_{n-1})[/Tex]
This is calculated using the chain rule of probability:
[Tex]P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{and} \quad P(A \cap B) = P(A|B)P(B)[/Tex]
Now generalize this to sequence probability:
[Tex]P(X_1, X_2, \ldots, X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_1, X_2) \ldots P(X_n | X_1, X_2, \ldots, X_{n-1})[/Tex]
This yields:
[Tex]P(w_1, w_2, w_3, \ldots, w_n) = \prod_{i=1}^{n} P(w_i | w_1, w_2, \ldots, w_{i-1})[/Tex]
By applying the Markov assumption, which states that the next word depends only on a fixed number of preceding words rather than on the entire history, we simplify the formula:
[Tex]P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-k}, \ldots, w_{i-1})[/Tex]
For a unigram model (k=0), this simplifies further to:
[Tex]P(w_1, w_2, \ldots, w_n) \approx \prod_i P(w_i)[/Tex]
And for a bigram model (k=1):
[Tex]P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-1})[/Tex]
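As a concrete illustration, under the bigram assumption the probability of the sample sentence “This article is on NLP” factorizes as:
[Tex]P(\text{“This article is on NLP”}) \approx P(\text{“This”}) \cdot P(\text{“article”} | \text{“This”}) \cdot P(\text{“is”} | \text{“article”}) \cdot P(\text{“on”} | \text{“is”}) \cdot P(\text{“NLP”} | \text{“on”})[/Tex]
In practice, each factor is estimated from bigram counts in a training corpus.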
Implementing N-Gram Language Modelling in NLTK
# Import necessary libraries
import nltk
from nltk import trigrams
from nltk.corpus import reuters
from collections import defaultdict

# Download necessary NLTK resources
nltk.download('reuters')
nltk.download('punkt')

# Tokenize the text
words = nltk.word_tokenize(' '.join(reuters.words()))

# Create trigrams
tri_grams = list(trigrams(words))

# Build a trigram model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# Transform the counts into probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

# Function to predict the next word
def predict_next_word(w1, w2):
    """
    Predicts the next word based on the previous two words
    using the trained trigram model.

    Args:
        w1 (str): The first word.
        w2 (str): The second word.

    Returns:
        str: The predicted next word.
    """
    next_word = model[(w1, w2)]
    if next_word:
        # Choose the most likely next word
        predicted_word = max(next_word, key=next_word.get)
        return predicted_word
    else:
        return "No prediction available"

# Example usage
print("Next Word:", predict_next_word('the', 'stock'))
Output:
Next Word: of
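The trained trigram model can also be used to score a whole sentence by multiplying trigram probabilities. The sketch below reuses model, nltk, and trigrams from the code above; the example sentence is arbitrary and its score depends entirely on the Reuters corpus:

def sentence_probability(sentence):
    """Estimate the probability of a sentence under the trained trigram model.

    Any trigram that was never seen in the corpus has probability 0,
    which makes the whole sentence score 0.
    """
    tokens = nltk.word_tokenize(sentence)
    prob = 1.0
    for w1, w2, w3 in trigrams(tokens):
        prob *= model[(w1, w2)][w3]
    return prob

# Example usage
print(sentence_probability("the stock market closed higher today"))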
Metrics for Language Modelling
- Entropy: Entropy, as introduced by Claude Shannon, measures the average amount of information conveyed by a random variable. The formula for entropy is:
[Tex]H(p) = \sum_{x} p(x)\cdot (-\log_2(p(x)))[/Tex]
H(p) is always greater than or equal to 0.
- Cross-Entropy: Cross-entropy measures how well the trained model represents the test data [Tex]W = w_{1}^{n}[/Tex]:
[Tex]H(W) = \frac{1}{n}\sum_{i=1}^{n} \left(-\log_2(p(w_i | w_{1}^{i-1}))\right)[/Tex]
The cross-entropy is always greater than or equal to the entropy, i.e., the model's uncertainty can be no less than the true uncertainty.
- Perplexity: Perplexity measures how well a probability model predicts a sample and can be understood as a measure of uncertainty. It is calculated as 2 raised to the power of the cross-entropy:
[Tex]PP(W) = 2^{H(W)}[/Tex]
Equivalently, perplexity is the inverse of the probability that the language model assigns to the test set, normalized by the number of words N (written here with the bigram approximation):
[Tex]PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i | w_{i-1})}}[/Tex]
For Example:
- Take the sentence ‘Natural Language Processing’. Suppose the model assigns the following probabilities to candidates for the first word:
word | P(word | <start>) |
The | 0.4 |
Processing | 0.3 |
Natural | 0.12 |
Language | 0.18 |
- Now we know the probability of the first word being ‘Natural’ is 0.12. Next, what is the probability of each candidate word given the previous word ‘Natural‘?
word | P(word | ‘Natural’ ) |
The | 0.05 |
Processing | 0.3 |
Natural | 0.15 |
Language | 0.5 |
- Finally, after generating ‘Natural Language’, what is the probability of the next word being ‘Processing‘?
word | P(word | ‘Language’ ) |
The | 0.1 |
Processing | 0.7 |
Natural | 0.1 |
Language | 0.1 |
- Now, the perplexity can be calculated as:
[Tex]PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i | w_{i-1})}} = \sqrt[3]{\frac{1}{0.12 \times 0.5 \times 0.7}} \approx 2.876[/Tex]
- From the perplexity we can also recover the cross-entropy (in bits per word):
[Tex]H(W) = \log_2(2.876) \approx 1.524[/Tex]
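These numbers can be checked with a few lines of Python; the three probabilities below are the ones assumed in the tables above:

import math

# P('Natural' | <start>) = 0.12, P('Language' | 'Natural') = 0.5,
# P('Processing' | 'Language') = 0.7
probs = [0.12, 0.5, 0.7]

# Perplexity: N-th root of the inverse probability of the word sequence
N = len(probs)
perplexity = math.prod(1 / p for p in probs) ** (1 / N)
print(round(perplexity, 2))              # ~2.88

# Cross-entropy in bits per word: log base 2 of the perplexity
print(round(math.log2(perplexity), 2))   # ~1.52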
Shortcomings:
- To capture longer context, we need higher values of n, but this also increases computational and memory overhead.
- Increasing n also leads to data sparsity: many n-grams never appear in the training corpus, so their estimated probabilities are zero, as illustrated below.
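The sparsity issue can be seen directly with the trigram model trained earlier: a context that (most likely) never appears in the Reuters corpus yields no prediction. The word pair below is just an illustrative guess at an unseen context:

# Reuses `predict_next_word` from the NLTK example above.
# Every trigram starting with an unseen context has an estimated probability of zero.
print(predict_next_word('purple', 'giraffe'))   # most likely: "No prediction available"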