Information Entropy

That transfer of information, from what we don’t know about the system to what we know, represents a change in entropy. Insight decreases the entropy of the system. Get information, reduce entropy. This is information gain. And yes, this type of entropy is subjective, in that it depends on what we know about the system at hand. (For what it’s worth, information gain is synonymous with Kullback-Leibler divergence, which is explored briefly in the tutorial on restricted Boltzmann machines.)
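
As a small illustration of "get information, reduce entropy", the sketch below measures the entropy of a fair six-sided die before and after learning that the roll is even. The die, the variable names, and the choice of bits (log base 2) are assumptions made for this example; they are not taken from the quoted tutorial. In this particular uniform case the entropy drop also equals the KL divergence from the prior to the posterior, which is one way to read the "synonymous" remark above; the two numbers need not match for an arbitrary single observation.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Before any information: a fair six-sided die, every face equally likely.
prior = [1 / 6] * 6

# After learning "the roll is even": only faces 2, 4 and 6 remain possible.
posterior = [0.0, 1 / 3, 0.0, 1 / 3, 0.0, 1 / 3]

h_before = entropy(prior)
h_after = entropy(posterior)
information_gain = h_before - h_after

# KL divergence D(posterior || prior), skipping outcomes the posterior rules out.
kl = sum(q * math.log2(q / p) for q, p in zip(posterior, prior) if q > 0)

print(f"entropy before: {h_before:.3f} bits")
print(f"entropy after:  {h_after:.3f} bits")
print(f"information gain: {information_gain:.3f} bits, KL: {kl:.3f} bits")
```

Learning the parity of the roll halves the number of possible outcomes, so the entropy falls by one bit.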

So each principal component cutting through the scatterplot represents a decrease in the system’s entropy, in its unpredictability.

It so happens that explaining the shape of the data one principal component at a time, beginning with the component that accounts for the most variance, is similar to walking data through a decision tree. The first component of PCA, like the first if-then-else split in a properly formed decision tree, will be along the dimension that reduces unpredictability the most.
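
The decision-tree half of that analogy can be sketched in a few lines: for each candidate feature, compute how much the label entropy drops after splitting on it, then split first on the feature with the largest drop. The toy rows, the feature names `windy` and `humid`, and the helper functions are hypothetical, invented only for this illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the empirical label distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Label entropy minus the size-weighted entropy after splitting on `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: two binary features and a binary label.
rows = [
    {"windy": 0, "humid": 0},
    {"windy": 0, "humid": 1},
    {"windy": 0, "humid": 0},
    {"windy": 0, "humid": 1},
    {"windy": 1, "humid": 0},
    {"windy": 1, "humid": 1},
    {"windy": 1, "humid": 0},
    {"windy": 1, "humid": 1},
]
labels = ["play", "play", "play", "stay", "stay", "stay", "stay", "stay"]

gains = {f: information_gain(rows, labels, f) for f in ("windy", "humid")}
best = max(gains, key=gains.get)
print(gains, "-> split first on", best)
```

With this data, splitting on `windy` leaves one pure group and one nearly pure group, so it removes far more uncertainty than `humid` and is chosen as the first split.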

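KL divergence (relative entropy) definition:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

Cross-entropy formula:

$$H(p, q) = -\sum_x p(x) \log q(x)$$

Information entropy definition:

$$H(p) = -\sum_x p(x) \log p(x)$$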

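When the relative entropy (KL divergence) reaches its minimum, the cross-entropy reaches its minimum as well. The reason is that the true distribution $p(x)$ is assumed to be fixed, so its entropy $H(p)$ is a constant, and the cross-entropy $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$ differs from the relative entropy only by that constant.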
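
The relationship between cross-entropy and relative entropy can be checked numerically. The sketch below is a minimal illustration, assuming discrete distributions with natural logarithms; the distribution `p` and the candidate distributions `q` are made-up values. It verifies that cross-entropy equals entropy plus KL divergence for each candidate, and that the same candidate minimizes both quantities.

```python
import math


def entropy(p):
    """H(p) = -sum p(x) log p(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum p(x) log(p(x) / q(x)); same assumption on q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical "true" distribution, held fixed throughout.
p = [0.5, 0.3, 0.2]

# Hypothetical model distributions to compare against p.
candidates = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "close": [0.45, 0.35, 0.2],
    "exact": [0.5, 0.3, 0.2],
}

for name, q in candidates.items():
    h_pq = cross_entropy(p, q)
    kl = kl_divergence(p, q)
    # Cross-entropy and KL divergence differ only by the constant H(p).
    assert math.isclose(h_pq, entropy(p) + kl)
    print(f"{name:8s} cross-entropy={h_pq:.4f}  KL={kl:.4f}")

best_by_cross_entropy = min(candidates, key=lambda n: cross_entropy(p, candidates[n]))
best_by_kl = min(candidates, key=lambda n: kl_divergence(p, candidates[n]))
print(best_by_cross_entropy, best_by_kl)
```

Because `entropy(p)` does not depend on `q`, ranking the candidates by cross-entropy or by KL divergence gives the same order, which is why minimizing cross-entropy as a loss is equivalent to minimizing the KL divergence from the true distribution.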

Planned for translation:

https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained

https://www.zhihu.com/question/41252833