Generalization and Zeros

Two questions

Overfitting

Moving from unigram to bigram to trigram to quadrigram models, prediction improves: a quadrigram model predicts better than a trigram model, which predicts better than a bigram model, which predicts better than a unigram model.

But N-grams only work well for word prediction if the test corpus looks like the training corpus. In real life this is often not the case, so we need to train robust models that do a better job of generalizing.
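As a rough illustration (using made-up toy sentences, not data from the original post), the sketch below counts how many of a test sentence's bigrams were ever seen in training; an out-of-domain test sentence shares almost none of them.

```python
# Minimal sketch: bigram coverage drops when the test text comes from a
# different domain than the training text. Toy corpora are assumptions.

train = "the king spoke to the queen and the queen spoke to the king".split()
test_same = "the king spoke to the queen".split()
test_other = "stock prices fell sharply after the earnings report".split()

train_bigrams = set(zip(train, train[1:]))

def coverage(tokens):
    """Fraction of the text's bigrams that were seen in training."""
    bigrams = list(zip(tokens, tokens[1:]))
    seen = sum(1 for b in bigrams if b in train_bigrams)
    return seen / len(bigrams)

print(f"in-domain coverage:     {coverage(test_same):.2f}")   # high
print(f"out-of-domain coverage: {coverage(test_other):.2f}")  # near zero
```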

Zeros

Firstly, if the vocabulary has V words and we use a bigram model, there are V^2 possible bigrams, and most of them never appear in the training data, so a lot of the probabilities are zero. What's worse, a quadrigram model has V^4 possible quadrigrams, so even more of its probabilities are zero.
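A minimal Python sketch of this sparsity, using a tiny made-up corpus: it compares the V^2 possible bigrams with the handful that actually occur in training.

```python
# Sketch: how many of the V^2 possible bigrams are ever observed?
# The toy corpus is an assumption for illustration only.
from collections import Counter

tokens = "i want to eat chinese food i want to eat british food".split()
vocab = set(tokens)
V = len(vocab)

bigram_counts = Counter(zip(tokens, tokens[1:]))

possible = V * V
observed = len(bigram_counts)
print(f"V = {V}, possible bigrams = {possible}")
print(f"observed bigrams = {observed}")
print(f"zero-probability bigrams = {possible - observed} "
      f"({(possible - observed) / possible:.0%})")
```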

Secondly, N-grams that never occurred in the training set but do occur in the test data get probability zero, so we cannot compute the perplexity of the test set at all. This is a big problem we need to solve.
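A small sketch (again with assumed toy data) of how a single unseen bigram breaks the perplexity computation: its MLE probability is zero, so the test-set perplexity is infinite.

```python
# Sketch: one unseen test bigram gives probability 0 and infinite perplexity.
import math
from collections import Counter

train = "i want to eat chinese food".split()
unigram_counts = Counter(train)
bigram_counts = Counter(zip(train, train[1:]))

def bigram_prob(w1, w2):
    """MLE estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def perplexity(tokens):
    bigrams = list(zip(tokens, tokens[1:]))
    log_prob = 0.0
    for w1, w2 in bigrams:
        p = bigram_prob(w1, w2)
        if p == 0.0:            # unseen bigram: zero probability
            return math.inf     # perplexity blows up
        log_prob += math.log(p)
    return math.exp(-log_prob / len(bigrams))

print(perplexity("i want to eat chinese food".split()))  # finite (here 1.0)
print(perplexity("i want to eat british food".split()))  # inf: "eat british" unseen
```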

Original source: https://www.cnblogs.com/chuanlong/p/3042508.html