pgm12

As a wrap-up of the inference part, we collect here the inference problems for a few models that are common in machine learning. Fortunately, all of them admit tractable algorithms, which spares us the nuisance of approximate inference.

HMM

The HMM in its usual sense is a generative model for discrete states and discrete observations. It consists of

  • the prior distribution of the state, $\pi_k$ (in the derivation below we absorb it into the transition probabilities);
  • the transition probability $\tau(k, k')$, a distribution over $k'$;
  • the emission probability $e(k, v)$, a distribution over $v$.

The implicit assumptions of this model are

  • the Markov property: $\Pr(s_t \mid s_1, \ldots, s_{t-1}) = \Pr(s_t \mid s_{t-1})$;
  • time-invariance: $\Pr(s_t = k' \mid s_{t-1} = k) = \tau(k, k')$ does not depend on $t$.

The joint distribution is therefore

$$\Pr(o_{1,\ldots,t}, s_{0,\ldots,t}) = \Pr(s_0)\prod_{i=1}^t \Pr(s_i \mid s_{i-1})\,\Pr(o_i \mid s_i),$$

where $\Pr(s_0) = 1$ and so can be omitted (the prior $\pi$ is absorbed into the first transition). Below we discuss message passing, belief update, and some common inference queries on this model.
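To make the factorization concrete, here is a minimal NumPy sketch that evaluates this joint probability; the parameters and sequences are illustrative assumptions, not from the post, and `pi` plays the role of the first transition out of the dummy $s_0$:

```python
import numpy as np

# Illustrative parameters (not from the post): two states, two observation symbols.
pi = np.array([0.5, 0.5])            # Pr(s_1 = k), the transition out of the dummy s_0
tau = np.array([[0.7, 0.3],
                [0.4, 0.6]])         # tau[k, k'] = Pr(s_i = k' | s_{i-1} = k)
e = np.array([[0.9, 0.1],
              [0.2, 0.8]])           # e[k, v] = Pr(o_i = v | s_i = k)

def joint_prob(states, obs):
    """Pr(o_{1..t}, s_{1..t}) = pi[s_1] e[s_1,o_1] * prod_i tau[s_{i-1},s_i] e[s_i,o_i]."""
    p = pi[states[0]] * e[states[0], obs[0]]
    for i in range(1, len(states)):
        p *= tau[states[i - 1], states[i]] * e[states[i], obs[i]]
    return p

print(joint_prob([0, 0, 1], [0, 0, 1]))   # 0.06804
```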

Message passing requires a cluster graph, which here is in fact a clique tree. Its vertices are the clusters $C_i$, each binding $o_i$, $s_{i-1}$, $s_i$ together, for $i = 1, \ldots, t$; the corresponding factors are $\psi_i(s_i, s_{i-1}, o_i) = \Pr(s_i \mid s_{i-1})\,\Pr(o_i \mid s_i)$, $i = 1, \ldots, t$. The forward messages can then be computed as

$$\delta_{1\to 2}(s_1) = \sum_{o_1} \psi_1(s_1, s_0, o_1), \quad \delta_{i\to i+1}(s_i) = \sum_{o_i}\sum_{s_{i-1}} \delta_{i-1\to i}(s_{i-1})\,\psi_i(s_i, s_{i-1}, o_i)$$

for $i = 2, \ldots, t-1$. The backward messages are

$$\delta_{t\to t-1}(s_{t-1}) = \sum_{s_t}\sum_{o_t} \psi_t(s_t, s_{t-1}, o_t), \quad \delta_{i\to i-1}(s_{i-1}) = \sum_{s_i}\sum_{o_i} \delta_{i+1\to i}(s_i)\,\psi_i(s_i, s_{i-1}, o_i)$$

for $i = t-1, \ldots, 2$. If we examine these messages carefully, we find that the forward message is in fact a marginal distribution:

$$\delta_{1\to 2}(s_1) = \sum_{o_1} \psi_1(s_1, s_0, o_1) = \sum_{o_1} \Pr(s_1 \mid s_0)\,\Pr(o_1 \mid s_1) = \Pr(s_1) = \pi_{s_1}$$

Substituting this into the subsequent messages,

$$\delta_{i\to i+1}(s_i) = \sum_{s_{i-1}}\sum_{o_i} \delta_{i-1\to i}(s_{i-1})\,\psi_i(s_i, s_{i-1}, o_i) = \sum_{s_{i-1}}\sum_{o_i} \Pr(s_{i-1})\,\Pr(s_i \mid s_{i-1})\,\Pr(o_i \mid s_i) = \Pr(s_i)$$

If the observations are given, i.e. $o_{1,\ldots,i}$ is known (so there is no sum over it), what we obtain instead is $\Pr(s_i, o_{1,\ldots,i})$. For the backward messages,

$$\delta_{t\to t-1}(s_{t-1}) = \sum_{s_t}\sum_{o_t} \psi_t(s_t, s_{t-1}, o_t) = \sum_{s_t}\sum_{o_t} \Pr(s_t \mid s_{t-1})\,\Pr(o_t \mid s_t) \equiv 1$$

and substituting into the subsequent messages gives

$$\delta_{i\to i-1}(s_{i-1}) = \sum_{s_i}\sum_{o_i} \delta_{i+1\to i}(s_i)\,\psi_i(s_i, s_{i-1}, o_i) = \sum_{s_i} \delta_{i+1\to i}(s_i)\,\Pr(s_i \mid s_{i-1}) \equiv 1$$

so they are all constants. If $o_{i,\ldots,t}$ is known, the backward message instead yields $\Pr(o_{i,\ldots,t} \mid s_{i-1})$.
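With the observations clamped, the two recursions above are exactly the classical forward-backward algorithm. A minimal sketch under the same illustrative conventions as the earlier snippet:

```python
import numpy as np

def forward_backward(obs, pi, tau, e):
    """Sum-product messages with the observations clamped.

    Row i of alpha/beta corresponds to time step i+1 of the text:
    alpha[i, k] = Pr(s = k, o_1..o_{i+1}),  beta[i, k] = Pr(o_{i+2}..o_t | s = k).
    """
    t, K = len(obs), len(pi)
    alpha = np.zeros((t, K))
    beta = np.ones((t, K))
    alpha[0] = pi * e[:, obs[0]]
    for i in range(1, t):
        alpha[i] = (alpha[i - 1] @ tau) * e[:, obs[i]]
    for i in range(t - 2, -1, -1):
        beta[i] = tau @ (e[:, obs[i + 1]] * beta[i + 1])
    return alpha, beta

# alpha[i] * beta[i] is Pr(s, o_{1..t}); normalizing it gives the smoothed
# posterior, while normalizing alpha[i] alone gives the filtering distribution.
```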

For MAP-type queries we need the max-product algorithm; the forward messages are (writing $e_k = \max_v e(k, v)$)

$$\delta_{1\to 2}(s_1) = \max_{o_1} \psi_1(s_1, s_0, o_1) = \pi_{s_1} e_{s_1}$$

$$\delta_{i\to i+1}(s_i) = \max_{o_i}\max_{s_{i-1}} \delta_{i-1\to i}(s_{i-1})\,\psi_i(s_i, s_{i-1}, o_i) = e_{s_i} \max_k \delta_{i-1\to i}(k)\,\tau(k, s_i)$$

and the backward messages are

$$\delta_{t\to t-1}(s_{t-1}) = \max_{s_t}\max_{o_t} \psi_t(s_t, s_{t-1}, o_t) = \max_k \tau(s_{t-1}, k)\,e_k$$

$$\delta_{i\to i-1}(s_{i-1}) = \max_{s_i}\max_{o_i} \delta_{i+1\to i}(s_i)\,\psi_i(s_i, s_{i-1}, o_i) = \max_k \delta_{i+1\to i}(k)\,\tau(s_{i-1}, k)\,e_k$$

For belief update, the belief is the marginal distribution over $s_{i-1}, s_i, o_i$:

$$\beta_1 = \psi_1\,\delta_{2\to 1} = \Pr(s_0)\,\Pr(s_1 \mid s_0)\,\Pr(o_1 \mid s_1), \quad \beta_i = \psi_i\,\delta_{i-1\to i}\,\delta_{i+1\to i} = \psi_i \Pr(s_{i-1}) = \Pr(s_{i-1}, s_i, o_i), \quad \beta_t = \psi_t\,\delta_{t-1\to t} = \psi_t \Pr(s_{t-1}) = \Pr(s_{t-1}, s_t, o_t)$$

and the corresponding belief-update (sepset) message is

$$\mu_{i,i+1}(s_i) = \delta_{i\to i+1}(s_i)\,\delta_{i+1\to i}(s_i) = \Pr(s_i)$$

The MAP-type counterparts can be derived similarly. For filtering, then, $\Pr(s_i \mid o_{1,\ldots,i}) \propto \Pr(s_i, o_{1,\ldots,i})$ is obtained by normalizing the forward message, while prediction uses the probability

$$\Pr(s_{i+1} \mid o_{1,\ldots,i}) = \sum_{s_i} \Pr(s_{i+1}, s_i \mid o_{1,\ldots,i}) = \sum_{s_i} \Pr(s_{i+1} \mid s_i)\,\Pr(s_i \mid o_{1,\ldots,i}) = \sum_k \tau(k, s_{i+1})\,\Pr(s_i = k \mid o_{1,\ldots,i}),$$

which is already normalized. Decoding (finding the most likely trajectory) requires $\arg\max \Pr(s_{0,\ldots,t} \mid o_{1,\ldots,t})$, which is essentially $\arg\max \Pr(s_{0,\ldots,t}, o_{1,\ldots,t})$; the two kinds of MAP-type messages above directly yield two algorithms for it, one of which is sketched below.
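A minimal sketch of the forward max-product recursion with back-pointers, i.e. the Viterbi algorithm, in log space to avoid underflow (same illustrative array conventions as the earlier snippets):

```python
import numpy as np

def viterbi(obs, pi, tau, e):
    """argmax over s_{1..t} of Pr(s_{1..t}, o_{1..t}), computed in log space."""
    t, K = len(obs), len(pi)
    delta = np.zeros((t, K))              # max-product forward messages (logs)
    back = np.zeros((t, K), dtype=int)    # back-pointers
    delta[0] = np.log(pi) + np.log(e[:, obs[0]])
    for i in range(1, t):
        scores = delta[i - 1][:, None] + np.log(tau)   # scores[k, k']
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) + np.log(e[:, obs[i]])
    path = [int(delta[-1].argmax())]      # best final state, then trace back
    for i in range(t - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```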

LDS

An LDS (linear dynamical system) has the same graph structure as an HMM, but both the states and the observations are continuous, so Gaussians are the usual modeling choice:

$$\Pr(o_{1,\ldots,t}, s_{0,\ldots,t}) = \Pr(s_0)\prod_{i=1}^{t} \Pr(s_i \mid s_{i-1})\,\Pr(o_i \mid s_i)$$

where

$$\Pr(s_0) = N(\cdot \mid \mu_0, \Gamma_0), \quad \Pr(s_i \mid s_{i-1}) = N(\cdot \mid A s_{i-1}, \Gamma), \quad \Pr(o_i \mid s_i) = N(\cdot \mid C s_i, \Sigma)$$

Another way to express these relations is with additive noise,

$$s_0 = \mu_0 + \epsilon_0,\ \epsilon_0 \sim N(0, \Gamma_0), \quad s_i = A s_{i-1} + \epsilon_i,\ \epsilon_i \sim N(0, \Gamma), \quad o_i = C s_i + \xi_i,\ \xi_i \sim N(0, \Sigma).$$

The clique tree is the same as before. The forward messages are

$$\delta_{1\to 2}(s_1) = \iint N(s_0 \mid \mu_0, \Gamma_0)\,N(s_1 \mid A s_0, \Gamma)\,N(o_1 \mid C s_1, \Sigma)\,\mathrm{d}s_0\,\mathrm{d}o_1 = N(s_1 \mid A\mu_0, A\Gamma_0 A^\top + \Gamma) = \Pr(s_1)$$

$$\delta_{i\to i+1}(s_i) = \iint \delta_{i-1\to i}(s_{i-1})\,N(s_i \mid A s_{i-1}, \Gamma)\,N(o_i \mid C s_i, \Sigma)\,\mathrm{d}s_{i-1}\,\mathrm{d}o_i = N(s_i \mid A\mu_{i-1}, A\Gamma_{i-1}A^\top + \Gamma) = \Pr(s_i),$$

where $\mu_{i-1} = \mathbb{E}\,s_{i-1}$ and $\Gamma_{i-1} = \operatorname{var} s_{i-1}$; the backward messages are again identically 1. For MAP-type queries, the forward message is

$$\delta_{1\to 2}(s_1) = \max_{s_0} N(s_0 \mid \mu_0, \Gamma_0)\,N(s_1 \mid A s_0, \Gamma)\,N(o_1 \mid C s_1, \Sigma)$$

The optimization problem over $s_0$ is

$$\min_{s_0}\; s_0^\top (\Gamma_0^{-1} + A^\top \Gamma^{-1} A)\, s_0 - 2\,(\mu_0^\top \Gamma_0^{-1} + s_1^\top \Gamma^{-1} A)\, s_0$$

whose solution is

$$s_0^\star = (\Gamma_0^{-1} + A^\top \Gamma^{-1} A)^{-1} (\Gamma_0^{-1}\mu_0 + A^\top \Gamma^{-1} s_1)$$

which is a linear function of $s_1$. The general procedure is therefore: solve the quadratic problem in $s_{i-1}$ to express $s_{i-1}$ as a linear function of $s_i$, substitute it back to obtain the message on $s_i$ (still a quadratic function), and keep propagating forward. Finally, solve for $s_t$ from its equation and back-substitute to recover all the other hidden variables.
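As a quick sanity check of the $s_0^\star$ formula (all matrices below are made-up examples), the gradient of the quadratic objective should vanish at the claimed minimizer:

```python
import numpy as np

# Made-up matrices for a 3-dimensional state.
rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d))
Gamma0, Gamma = np.eye(d), 2.0 * np.eye(d)
mu0, s1 = rng.standard_normal(d), rng.standard_normal(d)

H = np.linalg.inv(Gamma0) + A.T @ np.linalg.inv(Gamma) @ A   # quadratic form
b = np.linalg.inv(Gamma0) @ mu0 + A.T @ np.linalg.inv(Gamma) @ s1
s0_star = np.linalg.solve(H, b)                              # s_0^star

# The gradient of s0' H s0 - 2 b' s0 must vanish at the minimizer.
assert np.allclose(2 * H @ s0_star - 2 * b, 0)
```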

The beliefs are

$$\beta_1(s_0, s_1, o_1) = \psi_1\,\delta_{2\to 1} = N(s_0 \mid \mu_0, \Gamma_0)\,N(s_1 \mid A s_0, \Gamma)\,N(o_1 \mid C s_1, \Sigma)$$

$$\beta_i(s_{i-1}, s_i, o_i) = \psi_i\,\delta_{i-1\to i}\,\delta_{i+1\to i}$$

and the remaining beliefs are analogous.

For the filtering problem, given $o_1, \ldots, o_i$, we compute $\Pr(s_i \mid o_{1,\ldots,i})$ from the forward messages:

$$\Pr(s_1 \mid o_1) \propto \int \Pr(s_0)\,\Pr(s_1 \mid s_0)\,\Pr(o_1 \mid s_1)\,\mathrm{d}s_0 = N(s_1 \mid A\mu_0, A\Gamma_0 A^\top + \Gamma)\,N(o_1 \mid C s_1, \Sigma) \propto N(s_1 \mid \xi_1, \Lambda_1)$$

where, writing $\mu_1 = A\mu_0$ and $\Gamma_1 = A\Gamma_0 A^\top + \Gamma$,

$$\Lambda_1 = (\Gamma_1^{-1} + C^\top \Sigma^{-1} C)^{-1}, \quad \xi_1 = \Lambda_1 (\Gamma_1^{-1}\mu_1 + C^\top \Sigma^{-1} o_1)$$

$$\Pr(s_i \mid o_{1,\ldots,i}) \propto \int \Pr(s_{i-1} \mid o_{1,\ldots,i-1})\,\Pr(s_i \mid s_{i-1})\,\Pr(o_i \mid s_i)\,\mathrm{d}s_{i-1} = N(s_i \mid A\xi_{i-1}, A\Lambda_{i-1}A^\top + \Gamma)\,N(o_i \mid C s_i, \Sigma) \propto N(s_i \mid \xi_i, \Lambda_i),$$

where, writing $\hat\mu_i = A\xi_{i-1}$ and $\hat\Gamma_i = A\Lambda_{i-1}A^\top + \Gamma$ for the one-step predictive mean and covariance,

$$\Lambda_i = (\hat\Gamma_i^{-1} + C^\top \Sigma^{-1} C)^{-1}, \quad \xi_i = \Lambda_i (\hat\Gamma_i^{-1}\hat\mu_i + C^\top \Sigma^{-1} o_i).$$

Setting $\xi_0 = \mu_0$ and $\Lambda_0 = \Gamma_0$, we have $\hat\mu_1 = \mu_1$ and $\hat\Gamma_1 = \Gamma_1$, so the whole computation can be stated in this unified form.

For the prediction problem, given $o_1, \ldots, o_i$, $\Pr(s_{i+1} \mid o_{1,\ldots,i})$ can be computed from the filtering result:

$$\Pr(s_{i+1} \mid o_{1,\ldots,i}) = \int \Pr(s_{i+1} \mid s_i)\,\Pr(s_i \mid o_{1,\ldots,i})\,\mathrm{d}s_i = \int N(s_{i+1} \mid A s_i, \Gamma)\,N(s_i \mid \xi_i, \Lambda_i)\,\mathrm{d}s_i = N(s_{i+1} \mid A\xi_i, A\Lambda_i A^\top + \Gamma).$$
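The filtering recursion above is the Kalman filter written in information form; combined with the last equation it also gives one-step prediction. A minimal sketch, where the function name and data layout are my own conventions:

```python
import numpy as np

def lds_filter(obs, mu0, Gamma0, A, Gamma, C, Sigma):
    """Returns the list of (xi_i, Lambda_i) with Pr(s_i | o_{1..i}) = N(xi_i, Lambda_i)."""
    Sinv = np.linalg.inv(Sigma)
    xi, Lam = mu0, Gamma0                  # xi_0 = mu_0, Lambda_0 = Gamma_0
    out = []
    for o in obs:
        # predict: Pr(s_i | o_{1..i-1}) = N(A xi_{i-1}, A Lam A^T + Gamma)
        mu_hat = A @ xi
        G_hat = A @ Lam @ A.T + Gamma
        # update with o_i (information form, as in the recursion above)
        Ginv = np.linalg.inv(G_hat)
        Lam = np.linalg.inv(Ginv + C.T @ Sinv @ C)
        xi = Lam @ (Ginv @ mu_hat + C.T @ Sinv @ o)
        out.append((xi, Lam))
    # one-step prediction from the last estimate:
    # Pr(s_{t+1} | o_{1..t}) = N(A xi_t, A Lam_t A^T + Gamma)
    return out
```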

MEMM

We model $\Pr(s_i \mid o_i)$ directly with a maximum-entropy (ME, i.e. log-linear) model, but to bring in context we keep a separate ME model $\Pr_{s_{i-1}}(s_i \mid o_i)$ for each value of the previous state; in other words, the previous state selects the parameters of the ME model used at the current step. The likelihood is then

$$\Pr(s_{0,\ldots,t} \mid o_{1,\ldots,t}) = \prod_{i=1}^t \Pr_{s_{i-1}}(s_i \mid o_i).$$

The assumptions here are:

  • the Markov property: $\Pr(s_i \mid s_{0,\ldots,i-1}, o_{1,\ldots,i}) = \Pr(s_i \mid s_{i-1}, o_i)$;
  • the ME assumption: $\Pr(s_i \mid s_{i-1}, o_i) = \Pr_{s_{i-1}}(s_i \mid o_i) = p_{s_{i-1}}(s_i \mid o_i)$.

We use the same cluster graph as for the HMM. The forward messages are

$$\delta_{1\to 2}(s_1) = p_{s_0}(s_1 \mid o_1), \quad \delta_{i\to i+1}(s_i) = \sum_{s_{i-1}} \delta_{i-1\to i}(s_{i-1})\,\psi_i = \sum_{s_{i-1}} \delta_{i-1\to i}(s_{i-1})\,p_{s_{i-1}}(s_i \mid o_i).$$

The backward messages are

$$\delta_{t\to t-1}(s_{t-1}) = \sum_{s_t} p_{s_{t-1}}(s_t \mid o_t), \quad \delta_{i\to i-1}(s_{i-1}) = \sum_{s_i} \delta_{i+1\to i}(s_i)\,p_{s_{i-1}}(s_i \mid o_i).$$

For max-product message passing we only need to replace the sums with max. In belief propagation the beliefs are

$$\beta_1(s_0, s_1, o_1) = \psi_1\,\delta_{2\to 1} = \Pr(s_0, s_1 \mid o_{1,\ldots,t}), \quad \beta_i(s_{i-1}, s_i, o_i) = \psi_i\,\delta_{i+1\to i}\,\delta_{i-1\to i} = \Pr(s_{i-1}, s_i \mid o_{1,\ldots,t}), \quad \beta_t(s_{t-1}, s_t, o_t) = \psi_t\,\delta_{t-1\to t} = \Pr(s_{t-1}, s_t \mid o_{1,\ldots,t})$$

and the belief-update message is

$$\mu_{i,i+1}(s_i) = \delta_{i\to i+1}(s_i)\,\delta_{i+1\to i}(s_i) = \Pr(s_i \mid o_{1,\ldots,t})$$

The filtering, prediction, and decoding algorithms are exactly the same as for the HMM.
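A pleasant consequence of the local normalization is that the forward messages are already conditional distributions, so filtering needs no extra normalization step. A minimal sketch, where the data layout (each ME model pre-evaluated at the observed $o_i$) is my own convention:

```python
import numpy as np

def memm_forward(P, s0=0):
    """Forward messages for an MEMM.

    P[i][k, k'] = p_k(s = k' | o_i): the ME model selected by the previous
    state k, already evaluated at the observed o_i (illustrative layout).
    """
    delta = P[0][s0]              # delta_{1->2}(s_1) = p_{s_0}(s_1 | o_1)
    msgs = [delta]
    for Pi in P[1:]:
        delta = delta @ Pi        # sum out the previous state
        msgs.append(delta)
    return msgs                   # each message sums to 1: Pr(s_i | o_{1..i})
```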

CRF

Its assumptions are:

  • the Markov property, as in the MEMM;
  • invariant factors: for each transition we introduce a log-linear representation, $\phi_i(s_{i-1}, s_i, o_i) = \exp\bigl(\sum_j \lambda_j f_j(s_i, o_i, s_{i-1})\bigr)$, where the $f_j$ are the so-called features.

Messages, beliefs, and so on can be defined as before. If we need the log-likelihood, we must evaluate the partition function, which is done with the forward messages:

$$Z(o_{1,\ldots,t}) = \sum_{s_{0,\ldots,t}} \prod_{i=1}^t \exp\Bigl(\sum_j \lambda_j f_j(s_i, o_i, s_{i-1})\Bigr),$$

computing it by the forward recursion avoids the exponentially many terms of the naive sum. For the gradient,

$$\frac{\partial \mathcal{L}}{\partial \lambda_j} = \sum_{\text{samples}} \sum_{i=1}^t \Bigl( f_j(s_i, o_i, s_{i-1}) - \bigl\langle f_j(s_i, o_i, s_{i-1}) \bigr\rangle \Bigr),$$

where the expectation requires $\Pr(s_i, s_{i-1} \mid o_{1,\ldots,t})$, which is exactly the belief.
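A minimal log-space sketch of that forward recursion for $\log Z$; the data layout and the dummy start state $s_0 = 0$ are my own conventions:

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(log_phi):
    """log Z(o_{1..t}) via the forward recursion, in log space.

    log_phi[i][k, k'] = sum_j lambda_j f_j(s = k', o_i, s_prev = k),
    already evaluated at the observed o_i (illustrative layout).
    """
    alpha = log_phi[0][0]                 # message out of the first cluster (s_0 = 0)
    for M in log_phi[1:]:
        # alpha'[k'] = logsumexp_k (alpha[k] + M[k, k'])
        alpha = logsumexp(alpha[:, None] + M, axis=0)
    return logsumexp(alpha)               # finally sum out s_t
```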

——————

Original post: https://www.cnblogs.com/focus-ml/p/3775461.html