Mutual Information

Mutual Information (MI, 互信息) measures how similar or correlated two probability distributions are. It is commonly used to quantify how much information two different clustering algorithms share when run on the same dataset, i.e. how similar their clustering results are.
Given two clustering results \(X\) and \(Y\), their similarity can be measured with MI, computed as:

\[ MI(X, Y) = \sum_{u \in U} \sum_{v \in V} p(u, v)\log \frac{p(u, v)}{p(u)\,p(v)} \]

where \(U = \mathrm{set}(X)\) and \(V = \mathrm{set}(Y)\) (\(\mathrm{set}(\cdot)\) denotes deduplication, i.e. the set of distinct cluster labels), \(p(u)\) is the fraction of points assigned label \(u\) in \(X\), \(p(v)\) likewise for \(Y\), and \(p(u, v)\) is the fraction assigned \(u\) in \(X\) and \(v\) in \(Y\).
From a probabilistic point of view, the ratio \(\frac{p(u, v)}{p(u)p(v)}\) describes the dependence between \(u\) and \(v\): the stronger the dependence, the larger the ratio (greater than 1); if the two are independent, it equals 1 and the log term vanishes. Overall, the more similar the distribution patterns of \(X\) and \(Y\), the larger the MI.
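Two limiting cases follow directly from the definition and make this concrete:

\[
\begin{aligned}
X \perp Y:\quad & p(u, v) = p(u)\,p(v) \;\Rightarrow\; \log\frac{p(u, v)}{p(u)p(v)} = \log 1 = 0 \;\Rightarrow\; MI(X, Y) = 0, \\
Y = X:\quad & p(u, v) = p(u) \text{ iff } u = v \;\Rightarrow\; MI(X, X) = -\sum_{u \in U} p(u)\log p(u) = H(X).
\end{aligned}
\]

The second case shows that raw MI scales with the entropy of the clustering, which is why a normalized version is usually reported.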

The MATLAB code below, taken from http://www.cnblogs.com/ziqiao/archive/2011/12/13/2286273.html, may help make this concrete.
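Note that, despite its name, the function returns the normalized variant, NMI, computed in its last line as

\[ NMI(X, Y) = \frac{2\,MI(X, Y)}{H(X) + H(Y)}, \qquad H(X) = -\sum_{u \in U} p(u)\log p(u), \]

which rescales MI into \([0, 1]\), with 1 meaning the two clusterings are identical up to relabeling.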

function MIhat = nmi( A, B ) %NMI Normalized mutual information
% http://en.wikipedia.org/wiki/Mutual_information
% http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
% Author: http://www.cnblogs.com/ziqiao/   [2011/12/13] 
if length(A) ~= length(B)
    error('length(A) must equal length(B)');
end
total = length(A);
A_ids = unique(A); A_ids = A_ids(:).';  % force row vectors so the for-loops
B_ids = unique(B); B_ids = B_ids(:).';  % below iterate element by element

% Mutual information
MI = 0;
for idA = A_ids
    for idB = B_ids
         idAOccur = find( A == idA );
         idBOccur = find( B == idB );
         idABOccur = intersect(idAOccur,idBOccur); 
         
         px = length(idAOccur)/total;
         py = length(idBOccur)/total;
         pxy = length(idABOccur)/total;
         
         MI = MI + pxy*log2(pxy/(px*py)+eps); % eps (machine epsilon) keeps the log argument positive when pxy == 0, so 0*log2(0) never produces NaN

    end
end

% Normalized Mutual information
Hx = 0; % entropy of clustering A
for idA = A_ids
    idAOccurCount = length( find( A == idA ) );
    Hx = Hx - (idAOccurCount/total) * log2(idAOccurCount/total + eps);
end
Hy = 0; % entropy of clustering B
for idB = B_ids
    idBOccurCount = length( find( B == idB ) );
    Hy = Hy - (idBOccurCount/total) * log2(idBOccurCount/total + eps);
end

MIhat = 2 * MI / (Hx+Hy); % normalized mutual information, in [0, 1]
end

% Example :  
% (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
% A = [1 1 1 1 1 1   2 2 2 2 2 2    3 3 3 3 3];
% B = [1 2 1 1 1 1   1 2 2 2 2 3    1 1 3 3 3];
% nmi(A,B)  % ans = 0.3646
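
As a quick sanity check (assuming the listing above is saved as nmi.m on the MATLAB path), NMI should be approximately 1 for identical clusterings, up to the small bias from the eps terms, and it is invariant to relabeling the clusters:

% A = [1 1 1 1 1 1   2 2 2 2 2 2    3 3 3 3 3];
% nmi(A, A)      % ans ~= 1 : identical clusterings
% nmi(A, 4 - A)  % ans ~= 1 : relabeling 1<->3 changes nothing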
Original post: https://www.cnblogs.com/dengdan890730/p/6280051.html