机器学习笔记之变分推断(二)公式推导过程(基于平均场假设)

机器学习笔记之变分推断——基于平均场假设的公式推导过程

引言
- 回顾：推断与变分推断
- 变分推断：公式推导过程
- - 初始转化过程
  - 最大变分 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]求解过程
  - - 基于平均场假设的求解思想
    - 求解过程实现

引言

上一节介绍了分别从频率角度和贝叶斯角度认识机器学习问题，并介绍了推断(Inference)在整个贝叶斯角度的重要作用。本节将正式介绍确定性近似推断的代表方法——变分推断(Variational Inference)

提示：本节推导过程与EM算法部分存在相似之处，请对比食用，谢谢。

回顾：推断与变分推断

关于从贝叶斯角度认识问题，本质上是给定样本集合 X \mathcal X X，针对陌生数据 x ^ \hat x x^的预测问题，即 P ( x ^ ∣ X ) P(\hat x \mid \mathcal X) P(x^∣X)。

基于上述逻辑，贝叶斯角度的具体做法是：针对样本集合 X \mathcal X X构建模型，通过模型参数 θ \theta θ作为样本集合 X \mathcal X X与陌生数据 x ^ \hat x x^之间构建关系的桥梁，将 P ( x ^ ∣ X ) P(\hat x \mid \mathcal X) P(x^∣X)表示为如下形式： P ( x ^ ∣ X ) = ∫ θ P ( x ^ , θ ∣ X ) d θ = ∫ θ P ( x ^ ∣ θ ) ⋅ P ( θ ∣ X ) d θ \begin{aligned} P(\hat x \mid \mathcal X) & = \int_{\theta} P(\hat x ,\theta \mid \mathcal X) d\theta \\ & = \int_{\theta} P(\hat x \mid \theta) \cdot P(\theta \mid \mathcal X)d\theta \end{aligned} P(x^∣X)=∫θP(x^,θ∣X)dθ=∫θP(x^∣θ)⋅P(θ∣X)dθ

基于上述公式，使用贝叶斯定理求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X)： P ( θ ∣ X ) = P ( X ∣ θ ) ⋅ P ( θ ) P ( X ) \begin{aligned} P(\theta \mid \mathcal X) & = \frac{P(\mathcal X \mid \theta) \cdot P(\theta)}{P(\mathcal X)} \end{aligned} P(θ∣X)=P(X)P(X∣θ)⋅P(θ) 至此，关于求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X)的过程称为推断；关于样本数据的边缘概率分布 P ( X ) P(\mathcal X) P(X)可看成一个积分操作： 引入‘隐变量’。 P ( X ) = ∫ Z P ( X ∣ Z ) ⋅ P ( Z ) d Z = ∫ z 1 ⋯ ∫ z K P ( X ∣ Z ) ⋅ P ( Z ) d z , ⋯ , z K \begin{aligned} P(\mathcal X) & = \int_{\mathcal Z} P(\mathcal X \mid \mathcal Z) \cdot P(\mathcal Z)d\mathcal Z \\ & = \int_{z_1} \cdots \int_{z_{\mathcal K}} P(\mathcal X \mid \mathcal Z) \cdot P(\mathcal Z) d z_,\cdots,z_{\mathcal K} \end{aligned} P(X)=∫ZP(X∣Z)⋅P(Z)dZ=∫z1⋯∫zKP(X∣Z)⋅P(Z)dz,⋯,zK 如果引入的隐变量 Z = ( z 1 , ⋯ , z K ) T \mathcal Z = (z_1,\cdots,z_{\mathcal K})^{T} Z=(z1,⋯,zK)T中的维度 K \mathcal K K过高，导致 P ( X ) P(\mathcal X) P(X)积分困难，最终使得 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X)无法求解。因此，需要使用一些方法近似求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X)。而变分推断(Variational Inference,VI)就是近似推断中，确定性近似的代表方法。

变分推断：公式推导过程

近似推断(Approximate Inference)的核心观点是针对 ∫ Z P ( X ∣ Z ) ⋅ P ( Z ) d Z \int_{\mathcal Z} P(\mathcal X \mid Z) \cdot P(\mathcal Z)d\mathcal Z ∫ZP(X∣Z)⋅P(Z)dZ积分困难的问题，通过找出一个关于隐变量 Z \mathcal Z Z的概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)去逼近后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)。即： Q ( Z ) ≈ P ( Z ∣ X ) \mathcal Q(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) Q(Z)≈P(Z∣X) 这里定义：

X \mathcal X X为观测变量(Observed Data)，即真实的样本数据；
Z \mathcal Z Z表示隐变量(Latent Data)和模型参数(Parameter)的统称。 因为‘隐变量’本身就不真实存在，它只是一个‘表达’概率模型的中间环节。因此，‘隐变量 + 模型参数’合并在一起是合理的。 该定义与EM算法中的定义式略有区分的，EM算法中，隐变量是隐变量，参数是参数。
依然将联合概率分布 ( X , Z ) (\mathcal X,\mathcal Z) (X,Z)称作完整数据(Complete Data)。

初始转化过程

在EM算法中，底层逻辑是使用极大似然估计(Maximum Likelihood Estimate,MLE)进行求解，并且求解的是模型参数 θ \theta θ；在变分推断求解过程中，模型参数 θ \theta θ合并进隐变量 Z \mathcal Z Z中。这里依然从概率模型 P ( X ) P(\mathcal X) P(X)入手，执行推导过程： 条件概率公式~为方便推导过程，依然保留‘log函数’。 log ⁡ P ( X ) = log ⁡ [ P ( X , Z ) P ( Z ∣ X ) ] = log ⁡ P ( X , Z ) − log ⁡ P ( Z ∣ X ) \begin{aligned} \log P(\mathcal X) & = \log \left[\frac{P(\mathcal X,\mathcal Z)}{P(\mathcal Z \mid \mathcal X)}\right] \\ & = \log P(\mathcal X,\mathcal Z) - \log P(\mathcal Z \mid \mathcal X) \end{aligned} logP(X)=log[P(Z∣X)P(X,Z)]=logP(X,Z)−logP(Z∣X)

和EM算法推导思路相同，引入一个关于隐变量的概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)。则有： log ⁡ P ( X ) = [ log ⁡ P ( X , Z ) − log ⁡ Q ( Z ) ] − [ log ⁡ P ( Z ∣ X ) − log ⁡ Q ( Z ) ] = log ⁡ [ P ( X , Z ) Q ( Z ) ] − log ⁡ [ P ( Z ∣ X ) Q ( Z ) ] \begin{aligned} \log P(\mathcal X) & = \left[\log P(\mathcal X,\mathcal Z) - \log \mathcal Q(\mathcal Z)\right] - \left[\log P(\mathcal Z \mid \mathcal X) - \log \mathcal Q(\mathcal Z)\right] \\ & = \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] - \log \left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] \end{aligned} logP(X)=[logP(X,Z)−logQ(Z)]−[logP(Z∣X)−logQ(Z)]=log[Q(Z)P(X,Z)]−log[Q(Z)P(Z∣X)]
等式两端分别对 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)求积分：
- 等式左端： ∫ Z log ⁡ P ( X ) ⋅ Q ( Z ) d Z = log ⁡ P ( X ) ∫ Z Q ( Z ) d Z = log ⁡ P ( X ) \int_{\mathcal Z} \log P(\mathcal X) \cdot \mathcal Q(\mathcal Z) d\mathcal Z = \log P(\mathcal X) \int_{\mathcal Z} \mathcal Q(\mathcal Z) d\mathcal Z = \log P(\mathcal X) ∫ZlogP(X)⋅Q(Z)dZ=logP(X)∫ZQ(Z)dZ=logP(X)
- 等式右端： ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( X , Z ) Q ( Z ) ] d Z − ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( Z ∣ X ) Q ( Z ) ] d Z \begin{aligned} \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z - \int_{\mathcal Z}\mathcal Q(\mathcal Z) \cdot \log\left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \end{aligned} ∫ZQ(Z)⋅log[Q(Z)P(X,Z)]dZ−∫ZQ(Z)⋅log[Q(Z)P(Z∣X)]dZ
在EM算法中定义 ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( X , Z ) Q ( Z ) ] d Z \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z ∫ZQ(Z)⋅log[Q(Z)P(X,Z)]dZ为证据下界(Evidence Lower Bound,ELBO)； − ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( Z ∣ X ) Q ( Z ) ] d Z - \int_{\mathcal Z}\mathcal Q(\mathcal Z) \cdot \log\left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z −∫ZQ(Z)⋅log[Q(Z)P(Z∣X)]dZ是关于 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)和隐变量 Z \mathcal Z Z的后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的 K L \mathcal K\mathcal L KL散度；

因此， log ⁡ P ( X ) \log P(\mathcal X) logP(X)将表示如下： log ⁡ P ( X ) = E L B O + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \log P(\mathcal X) = ELBO + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] logP(X)=ELBO+KL[Q(Z)∣∣P(Z∣X)]
将概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)看成一个变量，并将 E L B O ELBO ELBO部分定义成一个关于 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)的函数： L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]： 称 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]为 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)的变分。 log ⁡ P ( X ) = L [ Q ( Z ) ] + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \log P(\mathcal X) = \mathcal L[\mathcal Q(\mathcal Z)] + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] logP(X)=L[Q(Z)]+KL[Q(Z)∣∣P(Z∣X)] 观察上式，根据数学定义， K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] ≥ 0 \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] \geq 0 KL[Q(Z)∣∣P(Z∣X)]≥0恒成立，则有： log ⁡ P ( X ) ≥ L [ Q ( Z ) ] \log P(\mathcal X) \geq \mathcal L[\mathcal Q(\mathcal Z)] logP(X)≥L[Q(Z)]
换句话说， L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]存在上界，并且上界结果是 log ⁡ P ( X ) \log P(\mathcal X) logP(X)。

当观测变量 X \mathcal X X给定的条件下， log ⁡ P ( X ) \log P(\mathcal X) logP(X)可看作固定常数；因此 L [ Q ( Z ) ] + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal L[\mathcal Q(\mathcal Z)] + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] L[Q(Z)]+KL[Q(Z)∣∣P(Z∣X)]结果也是固定常数。

在上述条件下， L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]在什么情况下能够取得最大值？自然是 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越趋近于0，那么 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越趋近于上界 log ⁡ P ( X ) \log P(\mathcal X) logP(X)。

而 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]本身数学性质表示概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)与概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的关联关系： Q ( Z ) ≈ P ( Z ∣ X ) ⇒ K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] → 0 \mathcal Q(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) \Rightarrow \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] \to 0 Q(Z)≈P(Z∣X)⇒KL[Q(Z)∣∣P(Z∣X)]→0

至此，逻辑关系已经找到：目标：找到一个与隐变量 Z \mathcal Z Z的后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)近似的分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)，替代 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)参与运算；

基于 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]存在上界 log ⁡ P ( X ) \log P(\mathcal X) logP(X)，并且 log ⁡ P ( X ) \log P(\mathcal X) logP(X)只和观测变量 X \mathcal X X有关，视为常数。
在 log ⁡ P ( X ) \log P(\mathcal X) logP(X)固定的情况下， L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]越大， K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越小，因而 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)与 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)越近似。

从而将寻找近似分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)转化为最大化 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]的问题。

最大变分 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]求解过程

基于上述问题转化，基于隐变量的最优分布 Q ^ ( Z ) \hat {\mathcal Q}(\mathcal Z) Q^(Z)可进行如下表示： Q ^ ( Z ) = arg ⁡ max ⁡ Q ( Z ) L [ Q ( Z ) ] ⇒ Q ^ ( Z ) ≈ P ( Z ∣ X ) \hat {\mathcal Q}(\mathcal Z) = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z)] \Rightarrow \hat {\mathcal Q}(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) Q^(Z)=Q(Z)argmaxL[Q(Z)]⇒Q^(Z)≈P(Z∣X)

基于平均场假设的求解思想

对概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)进行假设：平均场假设(Mean Theory)。上面定义 Z \mathcal Z Z是隐变量和模型参数的统称。这里关于 Z \mathcal Z Z数量的统计是指概率图中状态变量的数量。假设状态变量的数量是 K \mathcal K K个，并将其划分成 M \mathcal M M个组，每个组中的状态变量之间不存在重复。数学符号表达如下： Q ( Z ) = ∏ i = 1 M Q i ( Z ( i ) ) = Q 1 ( Z ( 1 ) ) ⋅ Q 2 ( Z ( 2 ) ) ⋯ Q M ( Z ( M ) ) \begin{aligned} \mathcal Q(\mathcal Z) & = \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \\ & = \mathcal Q_1(\mathcal Z^{(1)}) \cdot \mathcal Q_2(\mathcal Z^{(2)}) \cdots \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \end{aligned} Q(Z)=i=1∏MQi(Z(i))=Q1(Z(1))⋅Q2(Z(2))⋯QM(Z(M)) 其中， Z ( i ) \mathcal Z^{(i)} Z(i)表示第 i i i个子集合包含的状态变量的具体结果，而 Q i ( Z ( i ) ) \mathcal Q_i(\mathcal Z^{(i)}) Qi(Z(i))表示向量 Z ( i ) \mathcal Z^{(i)} Z(i)对应的概率分布。也可以理解为：第 i i i个子集合所包含状态变量的联合概率分布。

具体想法是，由于各子集合之间相互独立，想要求解 Q j ( Z ( j ) ) ( j = { 1 , 2 , ⋯ , M } ) \mathcal Q_j(\mathcal Z^{(j)}) (j=\{1,2,\cdots,\mathcal M\}) Qj(Z(j))(j={1,2,⋯,M})，则需要固定 Q − j ( Z ( − j ) ) \mathcal Q_{-j}(\mathcal Z^{(-j)}) Q−j(Z(−j))分量的结果； Q − j ( Z ( − j ) ) \mathcal Q_{-j}(\mathcal Z^{(-j)}) Q−j(Z(−j))表示除去编号 j j j子集合后剩余子集合的概率分布。
基于上述步骤，会分别求解出各向量 Z ( i ) \mathcal Z^{(i)} Z(i)对应的概率分布 Q i ( Z ( i ) ) ( i = 1 , 2 , ⋯ M ) \mathcal Q_{i}(\mathcal Z^{(i)})(i=1,2,\cdots \mathcal M) Qi(Z(i))(i=1,2,⋯M)，最后相乘即可。

求解过程实现

将 Q ( Z ) = ∏ i = 1 M Q i ( Z ( i ) ) \mathcal Q(\mathcal Z) =\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) Q(Z)=∏i=1MQi(Z(i))带入 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]中： L [ Q ( Z ) ] = ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( X , Z ) Q ( Z ) ] d Z = ∫ Z Q ( Z ) log ⁡ P ( X , Z ) d Z − ∫ Z Q ( Z ) log ⁡ Q ( Z ) d Z \begin{aligned} \mathcal L[\mathcal Q(\mathcal Z)] & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \\ & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log P(\mathcal X,\mathcal Z)d \mathcal Z - \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \mathcal Q(\mathcal Z) d\mathcal Z \end{aligned} L[Q(Z)]=∫ZQ(Z)⋅log[Q(Z)P(X,Z)]dZ=∫ZQ(Z)logP(X,Z)dZ−∫ZQ(Z)logQ(Z)dZ

带入第一项： 按照上述固定要求，将 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j))提出来。 ∫ Z ∏ i = 1 M [ Q i ( Z ( i ) ) ⋅ log ⁡ P ( X , Z ) ] d Z ( 1 ) , ⋯ , d Z ( M ) = ∫ Z ( j ) Q j ( Z ( j ) ) ⋅ { ∫ Z ( 1 ) ⋯ ∫ Z ( j − 1 ) ∫ Z ( j + 1 ) ⋯ ∫ Z ( M ) ∏ i ≠ j M [ Q i ( Z ( i ) ) ⋅ log ⁡ P ( X , Z ) ] d Z ( 1 ) , ⋯ , Z ( j − 1 ) , Z ( j + 1 ) , ⋯ , Z ( M ) } d Z ( j ) \begin{aligned} & \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)}) \cdot \log P(\mathcal X,\mathcal Z) \right] d\mathcal Z^{(1)},\cdots,d\mathcal Z^{(\mathcal M)} \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \cdot \left\{\int_{\mathcal Z^{(1)}}\cdots \int_{\mathcal Z^{(j-1)}} \int_{\mathcal Z^{(j+1)}}\cdots\int_{\mathcal Z^{(\mathcal M)}} \prod_{i \neq j}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)})\cdot \log P(\mathcal X,\mathcal Z)\right] d\mathcal Z^{(1)},\cdots,\mathcal Z^{(j-1)},\mathcal Z^{(j+1)},\cdots,\mathcal Z^{(\mathcal M)}\right\} d \mathcal Z^{(j)} \end{aligned} ∫Zi=1∏M[Qi(Z(i))⋅logP(X,Z)]dZ(1),⋯,dZ(M)=∫Z(j)Qj(Z(j))⋅⎩ ⎨ ⎧∫Z(1)⋯∫Z(j−1)∫Z(j+1)⋯∫Z(M)i=j∏M[Qi(Z(i))⋅logP(X,Z)]dZ(1),⋯,Z(j−1),Z(j+1),⋯,Z(M)⎭ ⎬ ⎫dZ(j) 观察上述大括号中的项，我们可以将其看做期望形式： ∫ Z ( 1 ) ⋯ ∫ Z ( j − 1 ) ∫ Z ( j + 1 ) ⋯ ∫ Z ( M ) ∏ i ≠ j M [ Q i ( Z ( i ) ) ⋅ log ⁡ P ( X , Z ) ] d Z ( 1 ) , ⋯ , Z ( j − 1 ) , Z ( j + 1 ) , ⋯ , Z ( M ) = E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] \begin{aligned} & \int_{\mathcal Z^{(1)}}\cdots \int_{\mathcal Z^{(j-1)}} \int_{\mathcal Z^{(j+1)}}\cdots\int_{\mathcal Z^{(\mathcal M)}} \prod_{i \neq j}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)})\cdot \log P(\mathcal X,\mathcal Z)\right] d\mathcal Z^{(1)},\cdots,\mathcal Z^{(j-1)},\mathcal Z^{(j+1)},\cdots,\mathcal Z^{(\mathcal M)} \\ & = \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] \end{aligned} ∫Z(1)⋯∫Z(j−1)∫Z(j+1)⋯∫Z(M)i=j∏M[Qi(Z(i))⋅logP(X,Z)]dZ(1),⋯,Z(j−1),Z(j+1),⋯,Z(M)=E∏i=jMQi(Z(i))[logP(X,Z)] 至此，第一项表示为： ∫ Z j Q j ( Z ( j ) ) E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] d Z j \int_{\mathcal Z_j} \mathcal Q_j(\mathcal Z^{(j)}) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] d \mathcal Z_j ∫ZjQj(Z(j))E∏i=jMQi(Z(i))[logP(X,Z)]dZj
带入第二项： log部分改成求和方式~ ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ ∏ i = 1 M Q i ( Z ( i ) ) d Z = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ ∑ i = 1 M log ⁡ Q i ( Z ( i ) ) d Z \begin{aligned} & \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z \\ & = \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \sum_{i=1}^{\mathcal M} \log\mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z \end{aligned} ∫Zi=1∏MQi(Z(i))⋅logi=1∏MQi(Z(i))dZ=∫Zi=1∏MQi(Z(i))⋅i=1∑MlogQi(Z(i))dZ 继续展开(连加部分展开)： = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ [ log ⁡ Q 1 ( Z ( 1 ) ) + log ⁡ Q 2 ( Z ( 2 ) ) + ⋯ + log ⁡ Q M ( Z ( M ) ) ] d Z = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) d Z + ⋯ + ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q M ( Z ( M ) ) d Z \begin{aligned} & = \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \left[\log \mathcal Q_1(\mathcal Z^{(1)}) + \log \mathcal Q_2(\mathcal Z^{(2)}) + \cdots + \log \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \right] d\mathcal Z \\ & = \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z + \cdots + \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) d\mathcal Z \end{aligned} =∫Zi=1∏MQi(Z(i))⋅[logQ1(Z(1))+logQ2(Z(2))+⋯+logQM(Z(M))]dZ=∫Zi=1∏MQi(Z(i))⋅logQ1(Z(1))dZ+⋯+∫Zi=1∏MQi(Z(i))⋅logQM(Z(M))dZ
- 观察上式中的第一项： ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) d Z \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z ∫Zi=1∏MQi(Z(i))⋅logQ1(Z(1))dZ
- 将该项中的多重积分和连乘符号全部展开： ∫ Z ( 1 ) ∫ Z ( 2 ) ⋯ ∫ Z ( M ) [ Q 1 ( Z ( 1 ) ) ⋅ Q 2 ( Z ( 2 ) ) ⋯ Q M ( Z ( M ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) ] d Z ( 1 ) , Z ( 2 ) , ⋯ , Z ( M ) \int_{\mathcal Z^{(1)}} \int_{\mathcal Z^{(2)}} \cdots \int_{\mathcal Z^{(\mathcal M)}} \left[\mathcal Q_1(\mathcal Z^{(1)}) \cdot Q_2(\mathcal Z^{(2)}) \cdots Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)})\right] d \mathcal Z^{(1)},\mathcal Z^{(2)},\cdots,\mathcal Z^{(\mathcal M)} ∫Z(1)∫Z(2)⋯∫Z(M)[Q1(Z(1))⋅Q2(Z(2))⋯QM(Z(M))⋅logQ1(Z(1))]dZ(1),Z(2),⋯,Z(M)
- 将积分带入： ∫ Z ( 1 ) Q 1 ( Z ( 1 ) ) log ⁡ Q 1 ( Z ( 1 ) ) d Z ( 1 ) ⋅ ∫ Z ( 2 ) Q 2 ( Z ( 2 ) ) ⋯ ∫ Z ( M ) Q M ( Z ( M ) ) \int_{\mathcal Z^{(1)}} \mathcal Q_1(\mathcal Z^{(1)}) \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z^{(1)} \cdot \int_{\mathcal Z^{(2)}} \mathcal Q_2(\mathcal Z^{(2)}) \cdots \int_{\mathcal Z^{(\mathcal M)}} \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) ∫Z(1)Q1(Z(1))logQ1(Z(1))dZ(1)⋅∫Z(2)Q2(Z(2))⋯∫Z(M)QM(Z(M))
- 从第二项开始到最后一项，使用概率密度积分结果均为 1 1 1： ∫ Z ( 2 ) Q 2 ( Z ( 2 ) ) = ⋯ = ∫ Z ( M ) Q M ( Z ( M ) ) = 1 \int_{\mathcal Z^{(2)}} \mathcal Q_2(\mathcal Z^{(2)}) =\cdots = \int_{\mathcal Z^{(\mathcal M)}} \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) = 1 ∫Z(2)Q2(Z(2))=⋯=∫Z(M)QM(Z(M))=1
- 因此，则有： ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) d Z = ∫ Z ( 1 ) Q 1 ( Z ( 1 ) ) log ⁡ Q 1 ( Z ( 1 ) ) d Z ( 1 ) \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z = \int_{\mathcal Z^{(1)}} \mathcal Q_1(\mathcal Z^{(1)}) \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z^{(1)} ∫Zi=1∏MQi(Z(i))⋅logQ1(Z(1))dZ=∫Z(1)Q1(Z(1))logQ1(Z(1))dZ(1)
- 因此，整个连加部分展开式可以写成： ∑ i = 1 M ∫ Z ( i ) Q i ( Z ( i ) ) log ⁡ Q i ( Z ( i ) ) d Z ( i ) \sum_{i=1}^{\mathcal M} \int_{\mathcal Z^{(i)}} \mathcal Q_i(\mathcal Z^{(i)}) \log \mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z^{(i)} i=1∑M∫Z(i)Qi(Z(i))logQi(Z(i))dZ(i)
- 基于求解项是 Z ( j ) \mathcal Z^{(j)} Z(j)，将 Z ( j ) \mathcal Z^{(j)} Z(j)与其他向分开： 其他项不是需要求解的内容，视为常数。用 C \mathcal C C表示。 ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ Q j ( Z ( j ) ) d Z ( j ) + C \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} + \mathcal C ∫Z(j)Qj(Z(j))logQj(Z(j))dZ(j)+C
至此，观察第一项和第二项的结果： ∫ Z j Q j ( Z ( j ) ) E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] d Z ( j ) ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ Q j ( Z ( j ) ) d Z ( j ) + C \int_{\mathcal Z_j} \mathcal Q_j(\mathcal Z^{(j)}) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] d \mathcal Z^{(j)} \\ \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} + \mathcal C ∫ZjQj(Z(j))E∏i=jMQi(Z(i))[logP(X,Z)]dZ(j)∫Z(j)Qj(Z(j))logQj(Z(j))dZ(j)+C 这两项很相似，它们有共同的积分 ∫ Z ( j ) \int_{\mathcal Z^{(j)}} ∫Z(j)，第一个乘数项均是 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j))，因此，为了能够使两项合并，设定 E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] E∏i=jMQi(Z(i))[logP(X,Z)]是关于 X , Z ( j ) \mathcal X,\mathcal Z^{(j)} X,Z(j)的函数。即： 首先， Z \mathcal Z Z中包含 Z ( j ) \mathcal Z^{(j)} Z(j); 其次，由于除了 Z ( j ) \mathcal Z^{(j)} Z(j)之外，其余变量均被固定住，视作常量。因此则有 ϕ ^ ( X , Z ( j ) ) \hat \phi(\mathcal X,\mathcal Z^{(j)}) ϕ^(X,Z(j))的表达。 E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] = log ⁡ ϕ ^ ( X , Z ( j ) ) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] = \log \hat \phi(\mathcal X,\mathcal Z^{(j)}) E∏i=jMQi(Z(i))[logP(X,Z)]=logϕ^(X,Z(j)) 此时，两项合并的结果如下： ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ ϕ ^ ( X , Z ( j ) ) d Z ( j ) − ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ Q j ( Z ( j ) ) d Z ( j ) = ∫ Z ( j ) Q j ( Z ( j ) ) [ log ⁡ ϕ ^ ( X , Z ( j ) ) − log ⁡ Q j ( Z ( j ) ) ] = ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ [ ϕ ^ ( X , Z ( j ) ) Q j ( Z ( j ) ) ] d Z ( j ) + C = − K L [ ϕ ^ ( X , Z ( j ) ) ∣ ∣ Q j ( Z ( j ) ) ] \begin{aligned} & \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \hat \phi(\mathcal X,\mathcal Z^{(j)}) d\mathcal Z^{(j)} - \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \left[\log \hat \phi(\mathcal X,\mathcal Z^{(j)}) - \log \mathcal Q_j(\mathcal Z^{(j)})\right] \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \left[\frac{\hat \phi(\mathcal X,\mathcal Z^{(j)})}{\mathcal Q_j(\mathcal Z^{(j)})}\right]d \mathcal Z^{(j)} +\mathcal C \\ & = - \mathcal K\mathcal L \left[\hat \phi(\mathcal X,\mathcal Z^{(j)})||Q_j(\mathcal Z^{(j)})\right] \end{aligned} ∫Z(j)Qj(Z(j))logϕ^(X,Z(j))dZ(j)−∫Z(j)Qj(Z(j))logQj(Z(j))dZ(j)=∫Z(j)Qj(Z(j))[logϕ^(X,Z(j))−logQj(Z(j))]=∫Z(j)Qj(Z(j))log[Qj(Z(j))ϕ^(X,Z(j))]dZ(j)+C=−KL[ϕ^(X,Z(j))∣∣Qj(Z(j))] 最终： Q j ^ ( Z ( j ) ) = arg ⁡ max ⁡ Q ( Z ) L [ Q ( Z ) ] = arg ⁡ max ⁡ Q ( Z ) { − K L [ ϕ ^ ( X , Z ( j ) ) ∣ ∣ Q j ( Z ( j ) ) ] } \begin{aligned} \hat{\mathcal Q_j}(\mathcal Z^{(j)}) & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z)] \\ & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \left\{- \mathcal K\mathcal L \left[\hat \phi(\mathcal X,\mathcal Z^{(j)})||Q_j(\mathcal Z^{(j)})\right]\right\} \end{aligned} Qj^(Z(j))=Q(Z)argmaxL[Q(Z)]=Q(Z)argmax{−KL[ϕ^(X,Z(j))∣∣Qj(Z(j))]} 因此，基于平均场假设的 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j))的最优解即： Q j ^ ( Z ( j ) ) = ϕ ^ ( X , Z ( j ) ) ( j ∈ { 1 , 2 , ⋯ , M } ) \hat {\mathcal Q_j}(\mathcal Z^{(j)}) = \hat \phi(\mathcal X,\mathcal Z^{(j)}) (j \in \{1,2,\cdots,\mathcal M\}) Qj^(Z(j))=ϕ^(X,Z(j))(j∈{1,2,⋯,M}) 从而最终求得最优解 Q ^ ( Z ) \hat {\mathcal Q}(\mathcal Z) Q^(Z)： Q ^ ( Z ) = ∏ j = 1 M Q j ( Z ( j ) ) \hat {\mathcal Q}(\mathcal Z) = \prod_{j=1}^{\mathcal M} \mathcal Q_j(\mathcal Z^{(j)}) Q^(Z)=j=1∏MQj(Z(j))

基于平均场假设，关于 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的近似求解推导结束。下一节将介绍变分推断与EM算法之间的关联关系。

相关参考：变分法：理解变分机器学习-变分推断2（公式推导）

机器学习笔记之变分推断(二)公式推导过程(基于平均场假设)

最近更新

热门博客

[ 申请 ]友情链接：