# Exploring NLP with Coursera - Part 4

During Week 1, I realised that taking notes that mimic the content doesn’t make sense as much, thus for Week 2, the notes cover only highlights and important outcomes.

# Basics of Language Modeling

Let’s say we have the text “This is the…” and 4 possible choices

- house
- rat
- did
- malt

For humans, this might not be a trivial question and we could say that for example did doesn’t really fit here in any context. For machines, its a bit more challenging.

We can help machines to “understand” better using probability theory. For example, what is the probability of seeing the word ‘house’ given ‘this is the’ or

\[p(house|this\ is\ the)\]In terms of language modeling we can also ask the probability of the whole sequence. Given our training data how likely are we to see a specific sequence. For example

\[p(this\ is\ the\ house)\]Mathematically, we want to predict the probability of a sequence of words *This is the house*, which we can annotate as

Thus the probability of it would be

\[p(w)=p(w_1w_2w_3...w_k)\]Using the **Chain Rule** we can rewrite the equation as such

Now the challenge here is that the list of probabilities will explode exponentially making it almost impossible to estimate probabilities. Nevertheless, we can use the Markov Assumption to only take the last *n-1* terms. So if we want to look at *trigrams* then we would condition our probabities on last 2 items in the sequence.

Using Markov Assumption we get

\[p(w_i|w_1...w_{i-1})=p(w_i|w_{i-n+1}...w_{i-1})\]Now, if we use the following probability calculation to our individual probabilities won’t add to one and for that we need to add additional start and end tokens to answer the question what is the probability of seeing specific word given its the start and end of the sequence.

\[p(w_1|start)\]This helps us normalize the probabilities for each sequence.

Thus at the for a n-gram Language Model we have the following representation

\[p(w)=\prod_{i=1}^{k+1}p(w_i|w_{i-n+1}^{i-1})\]and the model estimate is simply the countings for those word representations

To train the N-gram Language Model, we simply maximize the log likelihood

The large N represents the length of the train corpus (all words concatenated)

# Smoothing: Dealing with un-seen n-grams

Now there will be situations where in the test data we observe text (n-grams) that is not present in the training data, essentially leading to zero probabilities. To deal with these situations, various smoothing techniques could be used.

Essentially smoothing tries to pull some probability from frequent n-grams to infrequent ones.

Let’s review various smoothing techniques. This section is pretty straight forward since it simply summarizes various approaches.

## Laplacian Smoothing

## Katz Backoff

Here and *p* and *alpha* are chosen to ensure normalization.

## Interpolation Smoothing

## Absolute Discounting

The approach was developed as a result of the experiment made by Church & Gale in 1991. Experiment available here.

## Kneser-Ney Smoothing

This smoothing technique is the most popular smoothing technique mostly because it captures the diversity of contexts for the word. Example here is that the word Kong is less likely to occur with *This is the…* because it only goes with Hong Kong but the word malt can be seen in various context, thus more likely to be preceded by *This is the*.

## Perplexity Computation

Perplexity is a popular quality measure of language models. We can calculate it using the formula:

There is a nice short article on what Perplexity here.