Information Theory

We would like to quantify information in a way that formalizes this intuition. Specifically,

  • Likely events should have low information content.
  • Less likely events should have higher information content.
  • Independent events should have additive information. For example, finding out that a tossed coin has come up heads twice should convey twice as much information as finding out that a tossed coin has come up heads once.

In order to satisfy all three properties, we define the self-information of an event \(\mathrm{x} = x\) to be

\[ I(x) = -\log P(x) \]
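As a quick numerical sketch (the function name `self_information` and the choice of a base-2 logarithm are assumptions for this example, not from the text), the additivity property falls out of the logarithm: the probability of two independent events is the product of their probabilities, and the log turns that product into a sum.

```python
import math

def self_information(p, base=2.0):
    """Self-information I(x) = -log P(x); measured in bits when base is 2."""
    return -math.log(p, base)

# One fair coin toss coming up heads: P = 0.5, one bit of information.
print(self_information(0.5))   # 1.0
# Two independent tosses both coming up heads: P = 0.25, twice the information.
print(self_information(0.25))  # 2.0
```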

Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

\[ H(X) = \mathbb{E}_{X \sim P}[I(x)] \]

also denoted \(H(P)\). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2; otherwise the units differ) needed on average to encode symbols drawn from a distribution \(P\). Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy.
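For a discrete distribution the expectation expands to \(H(P) = -\sum_x P(x)\log P(x)\), which can be computed directly. The helper `entropy` and the example distributions below are illustrative, not from the text.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H(P) = -sum_x P(x) log P(x), with 0 log 0 treated as 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A nearly deterministic distribution has low entropy...
print(entropy([0.99, 0.01]))   # ~0.08 bits
# ...while the uniform distribution over the same outcomes has the most.
print(entropy([0.5, 0.5]))     # 1.0 bit
print(entropy([0.25] * 4))     # 2.0 bits
```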

If we have two separate probability distributions \(P(X)\) and \(Q(X)\) over the same random variable \(X\), we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

\[ D_{KL}(P \| Q) = \mathbb{E}_{X \sim P}\Big[\log \frac{P(x)}{Q(x)}\Big] = \mathbb{E}_{X \sim P}[\log P(x) - \log Q(x)] \]

In the case of discrete variables, it is the extra amount of information (measured in bits if we use the base-2 logarithm, but in machine learning we usually use the natural logarithm and measure in nats) needed to send a message containing symbols drawn from probability distribution \(P\) when we use a code that was designed to minimize the length of messages drawn from probability distribution \(Q\).

The KL divergence has many useful properties, most notably that it is non-negative. The KL divergence is 0 if and only if \(P\) and \(Q\) are the same distribution in the case of discrete variables, or equal "almost everywhere" in the case of continuous variables.
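These properties can be checked numerically for discrete distributions. The distributions `p` and `q` below are made up for illustration; the sketch also shows, in passing, the well-known fact that the two directions of the divergence generally give different values.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)), in nats."""
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q))  # positive, since p != q
print(kl_divergence(q, p))  # also positive, but a different value: KL is not symmetric
print(kl_divergence(p, p))  # 0.0: the divergence vanishes only when the distributions match
```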

A quantity that is closely related to the KL divergence is the cross-entropy \(H(P, Q) = H(P) + D_{KL}(P \| Q)\), which is similar to the KL divergence but lacks the \(\log P(x)\) term on the left:

\[ H(P, Q) = -\mathbb{E}_{X \sim P}[\log Q(x)] \]

Minimizing the cross-entropy with respect to \(Q\) is equivalent to minimizing the KL divergence, because \(Q\) does not participate in the omitted term.
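The decomposition can be verified numerically as well. This is a sketch under the same assumptions as above: the helper functions simply restate the definitions, and `p` and `q` are the same illustrative distributions.

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q), in nats."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -E_{x ~ P}[log Q(x)], in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
# H(P, Q) = H(P) + D_KL(P || Q); the H(P) term does not depend on Q,
# so minimizing the cross-entropy over Q also minimizes the KL divergence.
print(cross_entropy(p, q))
print(entropy(p) + kl_divergence(p, q))  # same value, up to floating-point rounding
```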

Original post: https://www.cnblogs.com/wang-haoran/p/13255833.html