Entropy, relative entropy and mutual information

Entropy

\[H(X) = -\sum_{x} p(x) \log p(x). \]

Entropy is nonnegative, attaining its minimum value \(0\) if and only if \(X\) is deterministic, i.e. \(P(X = x_0) = 1\) for some \(x_0\).

Proof:

Since \(0 < p(x) \le 1\), every term \(\log \frac{1}{p(x)} \ge \log 1 = 0\), hence

\[\begin{array}{ll} H(X) & = -\sum_{x} p(x) \log p(x) \\ & = \sum_{x} p(x) \log \frac{1}{p(x)} \\ & \ge \log 1 = 0. \end{array} \]
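
As a quick numerical check of the definition and the lower bound, here is a minimal sketch in Python (the example distributions are illustrative, not from the original):

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log p(x), in nats; 0 log 0 is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                        # skip zero-probability outcomes
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.5, 0.25, 0.25]))     # ~1.0397 nats, positive
print(entropy([1.0, 0.0, 0.0]))       # 0: deterministic X attains the minimum
```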

Joint Entropy

\[H(X, Y) := -\mathbb{E}_{p(x, y)} [\log p(x, y)] = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y). \]

Conditional Entropy

\[\begin{array}{ll} H(Y|X) &= \mathbb{E}_{p(x)} [H(Y|X=x)] \\ &= \sum_{x \in \mathcal{X}} p(x) H(Y|X=x) \\ &= - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x)p(y|x) \log p(y|x) \\ &= - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x). \end{array} \]

Note the difference between \(H(Y|X)\) and \(H(Y|X=x)\).

Chain Rule

\[H(X, Y) = H(X) + H(Y|X). \]

Proof:

From \(p(y|x) = \frac{p(x, y)}{p(x)}\) and the derivation above:

\[\begin{array}{ll} H(Y|X) &= H(X, Y) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) \\ &= H(X, Y) - H(X). \end{array} \]
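
The chain rule is easy to verify numerically from a joint table; a sketch, where the 2×2 joint `pxy` is an assumed example:

```python
import numpy as np

def H(p):
    """Entropy (in nats) of any probability table, flattened."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

pxy = np.array([[0.3, 0.2],
                [0.1, 0.4]])                 # assumed joint p(x, y)
px = pxy.sum(axis=1)                         # marginal p(x)
py_given_x = pxy / px[:, None]               # conditional p(y|x)

H_y_given_x = -np.sum(pxy * np.log(py_given_x))   # H(Y|X)
print(np.isclose(H(pxy), H(px) + H_y_given_x))    # True: H(X,Y) = H(X) + H(Y|X)
```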

Corollary:

\[H(X, Y|Z) = H(X|Z) + H(Y|X, Z). \]

\[\begin{array}{ll} H(Y|X, Z) &= \mathbb{E}_{x, z} [H(Y|x, z)] \\ &= -\sum_{x, z} p(x, z) \sum_y p(y|x, z) \log p(y|x, z) \\ &= -\sum_{x, y, z} p(x, y, z) [\log p(x, y|z) - \log p(x|z)] \\ &= \mathbb{E}_{z} [H(X, Y|z)] - \mathbb{E}_{z} [H(X|z)] = H(X, Y|Z) - H(X|Z). \end{array} \]

Mutual Information

\[I(X; Y) = H(X) - H(X|Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}. \]

  • \(I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = I(Y; X) = H(X) + H(Y) - H(X, Y) \ge 0\)

  • \(I(X; X) = H(X)\)
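
Both identities can be checked numerically; a sketch, with the joint table again an assumed example:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

pxy = np.array([[0.3, 0.2],
                [0.1, 0.4]])                          # assumed joint p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I_def = np.sum(pxy * np.log(pxy / np.outer(px, py)))  # definition of I(X;Y)
I_ent = H(px) + H(py) - H(pxy)                        # H(X) + H(Y) - H(X,Y)
print(np.isclose(I_def, I_ent), I_def >= 0)           # True True
```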

Relative Entropy

\[D(p\|q) := \mathbb{E}_p \left[\log \frac{p(x)}{q(x)}\right] = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}. \]
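
A minimal sketch of the definition, which also shows that \(D(p\|q)\) is not symmetric (the distributions `p` and `q` are illustrative assumptions):

```python
import numpy as np

def kl(p, q):
    """D(p||q) = sum_x p(x) log(p(x)/q(x)); requires q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q), kl(q, p))    # both positive, and D(p||q) != D(q||p)
```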

Chain Rules

Chain Rule for Entropy

Let \((X_1, X_2, \ldots, X_n) \sim p(x_1, x_2, \ldots, x_n)\); then:

\[H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i|X_{i-1}, \ldots, X_1). \]

Proof:

By induction, using \(H(X, Y) = H(X) + H(Y|X)\).

Chain Rule for Mutual Information

Conditional Mutual Information

Definition:

\[I(X; Y|Z) := H(X|Z) - H(X|Y, Z) = \mathbb{E}_{p(x, y, z)} \left[\log \frac{p(X, Y|Z)}{p(X|Z)p(Y|Z)}\right]. \]

Property:

\[I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^n I(X_i; Y|X_{i-1}, \ldots, X_1). \]

Proof:

\[\begin{array}{ll} I(X_1, X_2, \ldots, X_n; Y) & = H(X_1, \ldots, X_n) + H(Y) - H(X_1, \ldots, X_n, Y) \\ &= H(X_1, \ldots, X_{n-1}) + H(X_n|X_1, \ldots, X_{n-1}) + H(Y) - H(X_1, \ldots, X_n, Y) \\ &= I(X_1, \ldots, X_{n-1}; Y) + H(X_n|X_1, \ldots, X_{n-1}) - H(X_n|X_1, \ldots, X_{n-1}, Y) \\ &= I(X_1, \ldots, X_{n-1}; Y) + I(X_n; Y|X_1, \ldots, X_{n-1}). \end{array} \]

Chain Rule for Relative Entropy

Definition:

\[\begin{array}{ll} D(p(y|x) \| q(y|x)) &:= \mathbb{E}_{p(x, y)} \left[\log \frac{p(Y|X)}{q(Y|X)}\right] \\ &= \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)}. \end{array} \]

Property:

\[D(p(x, y) \| q(x, y)) = D(p(x) \| q(x)) + D(p(y|x) \| q(y|x)). \]

Proof:

\[\begin{array}{ll} D(p(x, y) \| q(x, y)) &= \sum_{x, y} p(x, y) \log \frac{p(x, y)}{q(x, y)} \\ &= \sum_{x, y} p(x, y) \log \frac{p(y|x)p(x)}{q(y|x)q(x)} \\ &= \sum_{x, y} p(x, y) \left[\log \frac{p(y|x)}{q(y|x)} + \log \frac{p(x)}{q(x)}\right] \\ &= D(p(x) \| q(x)) + D(p(y|x) \| q(y|x)). \end{array} \]

Similarly:

\[D(p(x, y) \| q(x, y)) = D(p(y) \| q(y)) + D(p(x|y) \| q(x|y)). \]

Hence, when \(p(x) = q(x)\), we obtain

\[D(p(x, y) \| q(x, y)) = D(p(y|x) \| q(y|x)) \ge D(p(y) \| q(y)). \]

  1. \(D(p(y|x) \| q(y|x)) = D(p(x, y) \| p(x)q(y|x))\)

  2. \(D(p(x_1, x_2, \ldots, x_n) \| q(x_1, x_2, \ldots, x_n)) = \sum_{i=1}^n D(p(x_i|x_{i-1}, \ldots, x_1) \| q(x_i|x_{i-1}, \ldots, x_1))\)

  3. \(D(p(y) \| q(y)) \le D(p(y|x) \| q(y|x))\) whenever \(q(x) = p(x)\).

The proofs of 1, 2 and 3 all follow from the derivations above with minor rearrangements.
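
The chain rule above can also be checked numerically; a sketch with assumed example tables `pxy` and `qxy`:

```python
import numpy as np

def kl(p, q):
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

pxy = np.array([[0.3, 0.2],
                [0.1, 0.4]])             # assumed p(x, y)
qxy = np.array([[0.25, 0.25],
                [0.25, 0.25]])           # assumed q(x, y)
px, qx = pxy.sum(axis=1), qxy.sum(axis=1)

# D(p(y|x)||q(y|x)) = sum_x p(x) D(p(y|x=i) || q(y|x=i))
d_cond = sum(px[i] * kl(pxy[i] / px[i], qxy[i] / qx[i]) for i in range(2))
print(np.isclose(kl(pxy, qxy), kl(px, qx) + d_cond))   # True
```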

Jensen's Inequality

If \(f\) is a convex function, then

\[\mathbb{E} [f(X)] \ge f(\mathbb{E}[X]). \]
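
A one-line numerical illustration with the convex function \(f(t) = t^2\) (the support and weights are assumed examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])            # assumed support
p = np.array([0.2, 0.5, 0.3])            # assumed pmf
f = lambda t: t ** 2                     # a convex function

print(np.sum(p * f(x)) >= f(np.sum(p * x)))   # True: E[f(X)] >= f(E[X])
```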

Properties

  • \(D(p\|q) \ge 0\), with equality if and only if \(p = q\).
  • \(I(X; Y) \ge 0\), with equality if and only if \(X\) and \(Y\) are independent.
  • \(D(p(y|x) \| q(y|x)) \ge 0\) (by the property above), with equality if and only if \(p(y|x) = q(y|x)\) for every \(x\) with \(p(x) > 0\).
  • \(I(X; Y|Z) \ge 0\), with equality if and only if \(X\) and \(Y\) are conditionally independent given \(Z\).
  • \(H(X|Y) \le H(X)\), with equality if and only if \(X\) and \(Y\) are independent.
  • \(H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^n H(X_i)\), with equality if and only if all the variables are independent.

Log Sum Inequality
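
The inequality this heading refers to: for nonnegative numbers \(a_1, \ldots, a_n\) and \(b_1, \ldots, b_n\),

\[\sum_{i=1}^n a_i \log \frac{a_i}{b_i} \ge \Big(\sum_{i=1}^n a_i\Big) \log \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}, \]

with equality if and only if \(a_i / b_i\) is constant. The convexity results below follow from it.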

  • \(D(p\|q)\) is convex in the pair \((p, q)\), i.e. \(\forall\, 0 \le \lambda \le 1\):

    \[D(\lambda p_1 + (1-\lambda)p_2 \,\|\, \lambda q_1 + (1-\lambda)q_2) \le \lambda D(p_1\|q_1) + (1-\lambda)D(p_2\|q_2). \]

One proof goes through the convexity of \(p \log \frac{p}{q}\); a more interesting proof constructs joint distributions \(p(x, c)\) and \(q(x, c)\) with an auxiliary variable \(c\):

\[p(x|c=0) = p_1(x), \quad p(x|c=1) = p_2(x), \quad q(x|c=0) = q_1(x), \quad q(x|c=1) = q_2(x), \\ p(c=0) = q(c=0) = \lambda, \quad p(c=1) = q(c=1) = 1-\lambda, \]

so that the marginals over \(x\) are the mixtures

\[p(x) = \lambda p_1(x) + (1-\lambda) p_2(x), \quad q(x) = \lambda q_1(x) + (1-\lambda) q_2(x), \]

and then notes that \(D(p(x) \| q(x)) \le D(p(x|c) \| q(x|c))\) (property 3 above, valid since \(p(c) = q(c)\)), whose right-hand side equals \(\lambda D(p_1\|q_1) + (1-\lambda) D(p_2\|q_2)\).

  • \(H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)\) is a concave function of \(p\).
  • \(I(X; Y) = \sum_{x, y} p(y|x)p(x) \log \frac{p(y|x)}{p(y)}\) is a concave function of \(p(x)\) for fixed \(p(y|x)\), and a convex function of \(p(y|x)\) for fixed \(p(x)\).

We prove only the second half. Given any \(p_1(y|x)\) and \(p_2(y|x)\), since \(p(x)\) is fixed, for all \(0 \le \lambda \le 1\) define:

\[p(x, y) := \lambda p_1(x, y) + (1-\lambda) p_2(x, y) = [\lambda p_1(y|x) + (1-\lambda) p_2(y|x)]\, p(x), \\ p(y) := \sum_x p(x, y) = \lambda \sum_x p_1(x, y) + (1-\lambda) \sum_{x} p_2(x, y), \\ q(x, y) := p(x)p(y) = \lambda\, p(x) \sum_{x'} p_1(x', y) + (1-\lambda)\, p(x) \sum_{x'} p_2(x', y) =: \lambda q_1(x, y) + (1-\lambda) q_2(x, y). \]

\[I(X; Y) = D(p(x, y) \| p(x)p(y)) = D(p(x, y) \| q(x, y)), \]

and since the KL divergence is convex in the pair \((p, q)\), \(I(X; Y)\) is convex in \(p(y|x)\).

Data-Processing Inequality

Suppose the data form a Markov chain \(X \rightarrow Y \rightarrow Z\), i.e. \(P(X, Y, Z) = P(X)P(Y|X)P(Z|Y)\); for example, \(Y = f(X)\), \(Z = g(Y)\).

By the chain rule for mutual information,

\[I(Y, Z; X) = I(X; Y) + I(X; Z|Y) = I(X; Z) + I(X; Y|Z), \]

and since conditioning on \(Y\) makes \(X\) and \(Z\) independent,

\[I(X; Z|Y) = \sum_{x, y, z} p(x, y, z) \log \frac{p(x, z|y)}{p(x|y)p(z|y)} = \sum_{x, y, z} p(x, y, z) \log 1 = 0, \\ I(X; Y|Z) = \sum_{x, y, z} p(x, y, z) \log \frac{p(x|y)}{p(x|z)} \ge 0. \]

Hence

\[I(X; Z) \le I(X; Y), \\ I(X; Y|Z) \le I(X; Y). \]
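
The inequality \(I(X; Z) \le I(X; Y)\) can be checked on any concrete Markov chain; a sketch, with assumed transition tables:

```python
import numpy as np

def mi(pab):
    """I(A;B) from a joint table p(a, b), in nats."""
    pa, pb = pab.sum(axis=1), pab.sum(axis=0)
    nz = pab > 0
    return np.sum(pab[nz] * np.log((pab / np.outer(pa, pb))[nz]))

# assumed Markov chain X -> Y -> Z: p(x,y,z) = p(x) p(y|x) p(z|y)
px = np.array([0.4, 0.6])
py_x = np.array([[0.9, 0.1],
                 [0.2, 0.8]])            # p(y|x): rows x, columns y
pz_y = np.array([[0.7, 0.3],
                 [0.1, 0.9]])            # p(z|y): rows y, columns z

pxy = px[:, None] * py_x                 # p(x, y)
pxz = pxy @ pz_y                         # p(x, z) = sum_y p(x, y) p(z|y)
print(mi(pxz) <= mi(pxy))                # True: I(X;Z) <= I(X;Y)
```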

Sufficient Statistics

Statistics and Mutual Information

  • A family of probability distributions \(\{f_{\theta}(x)\}\) indexed by \(\theta\).

  • Let \(X \sim f_{\theta}(x)\) and let \(T(X)\) be a statistic of \(X\); then

    \[\theta \rightarrow X \rightarrow T(X) \]

  • \[I(\theta; X) \ge I(\theta; T(X)) \]

Sufficient Statistics and Compression

Definition (sufficient statistic): a function \(T(X)\) is called a sufficient statistic for a family of distributions \(\{f_{\theta}(x)\}\) if the conditional distribution of \(X\) given \(T(X) = t\) does not depend on \(\theta\), i.e.

\[f_{\theta}(x) = f(x|t)\, f_{\theta}(t) \Rightarrow \theta \rightarrow T(X) \rightarrow X \Rightarrow I(\theta; T(X)) \ge I(\theta; X). \]

Combined with the data-processing inequality above, this gives \(I(\theta; T(X)) = I(\theta; X)\).
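
This equality can be illustrated with a Bernoulli family, where \(T(X) = \sum_i X_i\) is a sufficient statistic; a sketch, in which the two-point prior on \(\theta\) is an assumed example:

```python
import numpy as np
from itertools import product

thetas = np.array([0.3, 0.7])            # assumed two-point parameter set
prior = np.array([0.5, 0.5])             # assumed uniform prior on theta

# joint p(theta, x) for x = (x1, x2) i.i.d. Bernoulli(theta)
xs = list(product([0, 1], repeat=2))
p_joint = np.array([[prior[i] * t ** sum(x) * (1 - t) ** (2 - sum(x))
                     for x in xs] for i, t in enumerate(thetas)])

def mi(pab):
    pa, pb = pab.sum(axis=1), pab.sum(axis=0)
    nz = pab > 0
    return np.sum(pab[nz] * np.log((pab / np.outer(pa, pb))[nz]))

# collapse x onto the statistic T(x) = x1 + x2 in {0, 1, 2}
p_T = np.zeros((2, 3))
for j, x in enumerate(xs):
    p_T[:, sum(x)] += p_joint[:, j]

print(np.isclose(mi(p_joint), mi(p_T)))  # True: I(theta; T(X)) = I(theta; X)
```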

Definition (minimal sufficient statistic): a sufficient statistic \(T(X)\) is minimal if, together with every other sufficient statistic \(U(X)\) for \(\{f_{\theta}(x)\}\), it satisfies

\[\theta \rightarrow T(X) \rightarrow U(X) \rightarrow X. \]

Original post: https://www.cnblogs.com/MTandHJ/p/13860953.html