- 引言
- 回顾:推断与变分推断
- 变分推断:公式推导过程
- 初始转化过程
- 最大变分 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]求解过程
- 基于平均场假设的求解思想
- 求解过程实现
上一节介绍了分别从频率角度和贝叶斯角度认识机器学习问题,并介绍了推断(Inference)在整个贝叶斯角度的重要作用。本节将正式介绍确定性近似推断的代表方法——变分推断(Variational Inference)
提示:本节推导过程与EM算法部分存在相似之处,请对比食用,谢谢。
回顾:推断与变分推断关于从贝叶斯角度认识问题,本质上是给定样本集合 X \mathcal X X,针对陌生数据 x ^ \hat x x^的预测问题,即 P ( x ^ ∣ X ) P(\hat x \mid \mathcal X) P(x^∣X)。
基于上述逻辑,贝叶斯角度的具体做法是:针对样本集合 X \mathcal X X构建模型,通过模型参数 θ \theta θ作为样本集合 X \mathcal X X与陌生数据 x ^ \hat x x^之间构建关系的桥梁,将 P ( x ^ ∣ X ) P(\hat x \mid \mathcal X) P(x^∣X)表示为如下形式: P ( x ^ ∣ X ) = ∫ θ P ( x ^ , θ ∣ X ) d θ = ∫ θ P ( x ^ ∣ θ ) ⋅ P ( θ ∣ X ) d θ \begin{aligned} P(\hat x \mid \mathcal X) & = \int_{\theta} P(\hat x ,\theta \mid \mathcal X) d\theta \\ & = \int_{\theta} P(\hat x \mid \theta) \cdot P(\theta \mid \mathcal X)d\theta \end{aligned} P(x^∣X)=∫θP(x^,θ∣X)dθ=∫θP(x^∣θ)⋅P(θ∣X)dθ
基于上述公式,使用贝叶斯定理求解
P
(
θ
∣
X
)
P(\theta \mid \mathcal X)
P(θ∣X):
P
(
θ
∣
X
)
=
P
(
X
∣
θ
)
⋅
P
(
θ
)
P
(
X
)
\begin{aligned} P(\theta \mid \mathcal X) & = \frac{P(\mathcal X \mid \theta) \cdot P(\theta)}{P(\mathcal X)} \end{aligned}
P(θ∣X)=P(X)P(X∣θ)⋅P(θ) 至此,关于求解
P
(
θ
∣
X
)
P(\theta \mid \mathcal X)
P(θ∣X)的过程称为推断; 关于样本数据的边缘概率分布
P
(
X
)
P(\mathcal X)
P(X)可看成一个积分操作: 引入‘隐变量’。
P
(
X
)
=
∫
Z
P
(
X
∣
Z
)
⋅
P
(
Z
)
d
Z
=
∫
z
1
⋯
∫
z
K
P
(
X
∣
Z
)
⋅
P
(
Z
)
d
z
,
⋯
,
z
K
\begin{aligned} P(\mathcal X) & = \int_{\mathcal Z} P(\mathcal X \mid \mathcal Z) \cdot P(\mathcal Z)d\mathcal Z \\ & = \int_{z_1} \cdots \int_{z_{\mathcal K}} P(\mathcal X \mid \mathcal Z) \cdot P(\mathcal Z) d z_,\cdots,z_{\mathcal K} \end{aligned}
P(X)=∫ZP(X∣Z)⋅P(Z)dZ=∫z1⋯∫zKP(X∣Z)⋅P(Z)dz,⋯,zK 如果引入的隐变量
Z
=
(
z
1
,
⋯
,
z
K
)
T
\mathcal Z = (z_1,\cdots,z_{\mathcal K})^{T}
Z=(z1,⋯,zK)T中的维度
K
\mathcal K
K过高,导致
P
(
X
)
P(\mathcal X)
P(X)积分困难,最终使得
P
(
θ
∣
X
)
P(\theta \mid \mathcal X)
P(θ∣X)无法求解。 因此,需要使用一些方法近似求解
P
(
θ
∣
X
)
P(\theta \mid \mathcal X)
P(θ∣X)。而变分推断(Variational Inference,VI)就是 近似推断中,确定性近似的代表方法。
近似推断(Approximate Inference)的核心观点是针对 ∫ Z P ( X ∣ Z ) ⋅ P ( Z ) d Z \int_{\mathcal Z} P(\mathcal X \mid Z) \cdot P(\mathcal Z)d\mathcal Z ∫ZP(X∣Z)⋅P(Z)dZ积分困难的问题,通过找出一个关于隐变量 Z \mathcal Z Z的概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)去逼近后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)。即: Q ( Z ) ≈ P ( Z ∣ X ) \mathcal Q(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) Q(Z)≈P(Z∣X) 这里定义:
- X \mathcal X X为观测变量(Observed Data),即真实的样本数据;
-
Z
\mathcal Z
Z表示 隐变量(Latent Data)和 模型参数(Parameter)的统称。
因为‘隐变量’本身就不真实存在,它只是一个‘表达’概率模型的中间环节。因此,‘隐变量 + 模型参数’合并在一起是合理的。
该定义与EM算法中的定义式略有区分的,EM算法中,隐变量是隐变量,参数是参数。
- 依然将联合概率分布 ( X , Z ) (\mathcal X,\mathcal Z) (X,Z)称作 完整数据(Complete Data)。
在EM算法中,底层逻辑是使用极大似然估计(Maximum Likelihood Estimate,MLE)进行求解,并且求解的是模型参数
θ
\theta
θ; 在变分推断求解过程中,模型参数
θ
\theta
θ合并进隐变量
Z
\mathcal Z
Z中。这里依然从概率模型
P
(
X
)
P(\mathcal X)
P(X)入手,执行推导过程: 条件概率公式~为方便推导过程,依然保留‘log函数’。
log
P
(
X
)
=
log
[
P
(
X
,
Z
)
P
(
Z
∣
X
)
]
=
log
P
(
X
,
Z
)
−
log
P
(
Z
∣
X
)
\begin{aligned} \log P(\mathcal X) & = \log \left[\frac{P(\mathcal X,\mathcal Z)}{P(\mathcal Z \mid \mathcal X)}\right] \\ & = \log P(\mathcal X,\mathcal Z) - \log P(\mathcal Z \mid \mathcal X) \end{aligned}
logP(X)=log[P(Z∣X)P(X,Z)]=logP(X,Z)−logP(Z∣X)
-
和EM算法推导思路相同,引入一个关于隐变量的概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)。则有: log P ( X ) = [ log P ( X , Z ) − log Q ( Z ) ] − [ log P ( Z ∣ X ) − log Q ( Z ) ] = log [ P ( X , Z ) Q ( Z ) ] − log [ P ( Z ∣ X ) Q ( Z ) ] \begin{aligned} \log P(\mathcal X) & = \left[\log P(\mathcal X,\mathcal Z) - \log \mathcal Q(\mathcal Z)\right] - \left[\log P(\mathcal Z \mid \mathcal X) - \log \mathcal Q(\mathcal Z)\right] \\ & = \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] - \log \left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] \end{aligned} logP(X)=[logP(X,Z)−logQ(Z)]−[logP(Z∣X)−logQ(Z)]=log[Q(Z)P(X,Z)]−log[Q(Z)P(Z∣X)]
-
等式两端分别对 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)求积分:
- 等式左端: ∫ Z log P ( X ) ⋅ Q ( Z ) d Z = log P ( X ) ∫ Z Q ( Z ) d Z = log P ( X ) \int_{\mathcal Z} \log P(\mathcal X) \cdot \mathcal Q(\mathcal Z) d\mathcal Z = \log P(\mathcal X) \int_{\mathcal Z} \mathcal Q(\mathcal Z) d\mathcal Z = \log P(\mathcal X) ∫ZlogP(X)⋅Q(Z)dZ=logP(X)∫ZQ(Z)dZ=logP(X)
- 等式右端: ∫ Z Q ( Z ) ⋅ log [ P ( X , Z ) Q ( Z ) ] d Z − ∫ Z Q ( Z ) ⋅ log [ P ( Z ∣ X ) Q ( Z ) ] d Z \begin{aligned} \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z - \int_{\mathcal Z}\mathcal Q(\mathcal Z) \cdot \log\left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \end{aligned} ∫ZQ(Z)⋅log[Q(Z)P(X,Z)]dZ−∫ZQ(Z)⋅log[Q(Z)P(Z∣X)]dZ
-
在EM算法中定义 ∫ Z Q ( Z ) ⋅ log [ P ( X , Z ) Q ( Z ) ] d Z \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z ∫ZQ(Z)⋅log[Q(Z)P(X,Z)]dZ为 证据下界(Evidence Lower Bound,ELBO); − ∫ Z Q ( Z ) ⋅ log [ P ( Z ∣ X ) Q ( Z ) ] d Z - \int_{\mathcal Z}\mathcal Q(\mathcal Z) \cdot \log\left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z −∫ZQ(Z)⋅log[Q(Z)P(Z∣X)]dZ是关于 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)和隐变量 Z \mathcal Z Z的后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的 K L \mathcal K\mathcal L KL散度;
因此, log P ( X ) \log P(\mathcal X) logP(X)将表示如下: log P ( X ) = E L B O + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \log P(\mathcal X) = ELBO + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] logP(X)=ELBO+KL[Q(Z)∣∣P(Z∣X)]
-
将概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)看成一个变量,并将 E L B O ELBO ELBO部分定义成一个关于 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)的函数: L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]:
称
L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]为
Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)的变分。
log P ( X ) = L [ Q ( Z ) ] + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \log P(\mathcal X) = \mathcal L[\mathcal Q(\mathcal Z)] + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] logP(X)=L[Q(Z)]+KL[Q(Z)∣∣P(Z∣X)] 观察上式,根据数学定义, K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] ≥ 0 \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] \geq 0 KL[Q(Z)∣∣P(Z∣X)]≥0恒成立,则有: log P ( X ) ≥ L [ Q ( Z ) ] \log P(\mathcal X) \geq \mathcal L[\mathcal Q(\mathcal Z)] logP(X)≥L[Q(Z)] -
换句话说, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]存在上界,并且 上界结果是 log P ( X ) \log P(\mathcal X) logP(X)。
当观测变量 X \mathcal X X给定的条件下, log P ( X ) \log P(\mathcal X) logP(X)可看作固定常数;因此 L [ Q ( Z ) ] + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal L[\mathcal Q(\mathcal Z)] + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] L[Q(Z)]+KL[Q(Z)∣∣P(Z∣X)]结果也是固定常数。
在上述条件下, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]在什么情况下能够取得最大值?自然是 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越趋近于0,那么 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越趋近于上界 log P ( X ) \log P(\mathcal X) logP(X)。
而 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]本身数学性质表示概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)与概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的关联关系: Q ( Z ) ≈ P ( Z ∣ X ) ⇒ K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] → 0 \mathcal Q(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) \Rightarrow \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] \to 0 Q(Z)≈P(Z∣X)⇒KL[Q(Z)∣∣P(Z∣X)]→0
至此,逻辑关系已经找到: 目标:找到一个与隐变量 Z \mathcal Z Z的后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)近似的分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z),替代 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)参与运算;
- 基于 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]存在上界 log P ( X ) \log P(\mathcal X) logP(X),并且 log P ( X ) \log P(\mathcal X) logP(X)只和观测变量 X \mathcal X X有关,视为常数。
- 在 log P ( X ) \log P(\mathcal X) logP(X)固定的情况下, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]越大, K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越小,因而 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)与 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)越近似。
从而将寻找近似分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)转化为最大化 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]的问题。
最大变分 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]求解过程基于上述问题转化,基于隐变量的最优分布 Q ^ ( Z ) \hat {\mathcal Q}(\mathcal Z) Q^(Z)可进行如下表示: Q ^ ( Z ) = arg max Q ( Z ) L [ Q ( Z ) ] ⇒ Q ^ ( Z ) ≈ P ( Z ∣ X ) \hat {\mathcal Q}(\mathcal Z) = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z)] \Rightarrow \hat {\mathcal Q}(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) Q^(Z)=Q(Z)argmaxL[Q(Z)]⇒Q^(Z)≈P(Z∣X)
基于平均场假设的求解思想对概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)进行假设:平均场假设(Mean Theory)。 上面定义 Z \mathcal Z Z是隐变量和模型参数 的统称。这里关于 Z \mathcal Z Z数量的统计是指 概率图中状态变量的数量。假设状态变量的数量是 K \mathcal K K个,并将其划分成 M \mathcal M M个组,每个组中的 状态变量之间不存在重复。 数学符号表达如下: Q ( Z ) = ∏ i = 1 M Q i ( Z ( i ) ) = Q 1 ( Z ( 1 ) ) ⋅ Q 2 ( Z ( 2 ) ) ⋯ Q M ( Z ( M ) ) \begin{aligned} \mathcal Q(\mathcal Z) & = \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \\ & = \mathcal Q_1(\mathcal Z^{(1)}) \cdot \mathcal Q_2(\mathcal Z^{(2)}) \cdots \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \end{aligned} Q(Z)=i=1∏MQi(Z(i))=Q1(Z(1))⋅Q2(Z(2))⋯QM(Z(M)) 其中, Z ( i ) \mathcal Z^{(i)} Z(i)表示第 i i i个子集合包含的 状态变量的具体结果,而 Q i ( Z ( i ) ) \mathcal Q_i(\mathcal Z^{(i)}) Qi(Z(i))表示向量 Z ( i ) \mathcal Z^{(i)} Z(i)对应的概率分布。也可以理解为:第 i i i个子集合所包含状态变量的联合概率分布。
- 具体想法是,由于各子集合之间相互独立,想要求解
Q
j
(
Z
(
j
)
)
(
j
=
{
1
,
2
,
⋯
,
M
}
)
\mathcal Q_j(\mathcal Z^{(j)}) (j=\{1,2,\cdots,\mathcal M\})
Qj(Z(j))(j={1,2,⋯,M}),则需要固定
Q
−
j
(
Z
(
−
j
)
)
\mathcal Q_{-j}(\mathcal Z^{(-j)})
Q−j(Z(−j))分量的结果;
Q
−
j
(
Z
(
−
j
)
)
\mathcal Q_{-j}(\mathcal Z^{(-j)})
Q−j(Z(−j))
表示除去编号
j j j子集合后剩余子集合的概率分布。
- 基于上述步骤,会分别求解出各向量 Z ( i ) \mathcal Z^{(i)} Z(i)对应的概率分布 Q i ( Z ( i ) ) ( i = 1 , 2 , ⋯ M ) \mathcal Q_{i}(\mathcal Z^{(i)})(i=1,2,\cdots \mathcal M) Qi(Z(i))(i=1,2,⋯M),最后相乘即可。
将 Q ( Z ) = ∏ i = 1 M Q i ( Z ( i ) ) \mathcal Q(\mathcal Z) =\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) Q(Z)=∏i=1MQi(Z(i))带入 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]中: L [ Q ( Z ) ] = ∫ Z Q ( Z ) ⋅ log [ P ( X , Z ) Q ( Z ) ] d Z = ∫ Z Q ( Z ) log P ( X , Z ) d Z − ∫ Z Q ( Z ) log Q ( Z ) d Z \begin{aligned} \mathcal L[\mathcal Q(\mathcal Z)] & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \\ & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log P(\mathcal X,\mathcal Z)d \mathcal Z - \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \mathcal Q(\mathcal Z) d\mathcal Z \end{aligned} L[Q(Z)]=∫ZQ(Z)⋅log[Q(Z)P(X,Z)]dZ=∫ZQ(Z)logP(X,Z)dZ−∫ZQ(Z)logQ(Z)dZ
-
带入第一项:
按照上述固定要求,将
Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j))提出来。
∫ Z ∏ i = 1 M [ Q i ( Z ( i ) ) ⋅ log P ( X , Z ) ] d Z ( 1 ) , ⋯ , d Z ( M ) = ∫ Z ( j ) Q j ( Z ( j ) ) ⋅ { ∫ Z ( 1 ) ⋯ ∫ Z ( j − 1 ) ∫ Z ( j + 1 ) ⋯ ∫ Z ( M ) ∏ i ≠ j M [ Q i ( Z ( i ) ) ⋅ log P ( X , Z ) ] d Z ( 1 ) , ⋯ , Z ( j − 1 ) , Z ( j + 1 ) , ⋯ , Z ( M ) } d Z ( j ) \begin{aligned} & \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)}) \cdot \log P(\mathcal X,\mathcal Z) \right] d\mathcal Z^{(1)},\cdots,d\mathcal Z^{(\mathcal M)} \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \cdot \left\{\int_{\mathcal Z^{(1)}}\cdots \int_{\mathcal Z^{(j-1)}} \int_{\mathcal Z^{(j+1)}}\cdots\int_{\mathcal Z^{(\mathcal M)}} \prod_{i \neq j}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)})\cdot \log P(\mathcal X,\mathcal Z)\right] d\mathcal Z^{(1)},\cdots,\mathcal Z^{(j-1)},\mathcal Z^{(j+1)},\cdots,\mathcal Z^{(\mathcal M)}\right\} d \mathcal Z^{(j)} \end{aligned} ∫Zi=1∏M[Qi(Z(i))⋅logP(X,Z)]dZ(1),⋯,dZ(M)=∫Z(j)Qj(Z(j))⋅⎩ ⎨ ⎧∫Z(1)⋯∫Z(j−1)∫Z(j+1)⋯∫Z(M)i=j∏M[Qi(Z(i))⋅logP(X,Z)]dZ(1),⋯,Z(j−1),Z(j+1),⋯,Z(M)⎭ ⎬ ⎫dZ(j) 观察上述大括号中的项,我们可以将其看做期望形式: ∫ Z ( 1 ) ⋯ ∫ Z ( j − 1 ) ∫ Z ( j + 1 ) ⋯ ∫ Z ( M ) ∏ i ≠ j M [ Q i ( Z ( i ) ) ⋅ log P ( X , Z ) ] d Z ( 1 ) , ⋯ , Z ( j − 1 ) , Z ( j + 1 ) , ⋯ , Z ( M ) = E ∏ i ≠ j M Q i ( Z ( i ) ) [ log P ( X , Z ) ] \begin{aligned} & \int_{\mathcal Z^{(1)}}\cdots \int_{\mathcal Z^{(j-1)}} \int_{\mathcal Z^{(j+1)}}\cdots\int_{\mathcal Z^{(\mathcal M)}} \prod_{i \neq j}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)})\cdot \log P(\mathcal X,\mathcal Z)\right] d\mathcal Z^{(1)},\cdots,\mathcal Z^{(j-1)},\mathcal Z^{(j+1)},\cdots,\mathcal Z^{(\mathcal M)} \\ & = \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] \end{aligned} ∫Z(1)⋯∫Z(j−1)∫Z(j+1)⋯∫Z(M)i=j∏M[Qi(Z(i))⋅logP(X,Z)]dZ(1),⋯,Z(j−1),Z(j+1),⋯,Z(M)=E∏i=jMQi(Z(i))[logP(X,Z)] 至此,第一项表示为: ∫ Z j Q j ( Z ( j ) ) E ∏ i ≠ j M Q i ( Z ( i ) ) [ log P ( X , Z ) ] d Z j \int_{\mathcal Z_j} \mathcal Q_j(\mathcal Z^{(j)}) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] d \mathcal Z_j ∫ZjQj(Z(j))E∏i=jMQi(Z(i))[logP(X,Z)]dZj -
带入第二项:
log部分改成求和方式~
∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ∏ i = 1 M Q i ( Z ( i ) ) d Z = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ ∑ i = 1 M log Q i ( Z ( i ) ) d Z \begin{aligned} & \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z \\ & = \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \sum_{i=1}^{\mathcal M} \log\mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z \end{aligned} ∫Zi=1∏MQi(Z(i))⋅logi=1∏MQi(Z(i))dZ=∫Zi=1∏MQi(Z(i))⋅i=1∑MlogQi(Z(i))dZ 继续展开(连加部分展开): = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ [ log Q 1 ( Z ( 1 ) ) + log Q 2 ( Z ( 2 ) ) + ⋯ + log Q M ( Z ( M ) ) ] d Z = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log Q 1 ( Z ( 1 ) ) d Z + ⋯ + ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log Q M ( Z ( M ) ) d Z \begin{aligned} & = \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \left[\log \mathcal Q_1(\mathcal Z^{(1)}) + \log \mathcal Q_2(\mathcal Z^{(2)}) + \cdots + \log \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \right] d\mathcal Z \\ & = \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z + \cdots + \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) d\mathcal Z \end{aligned} =∫Zi=1∏MQi(Z(i))⋅[logQ1(Z(1))+logQ2(Z(2))+⋯+logQM(Z(M))]dZ=∫Zi=1∏MQi(Z(i))⋅logQ1(Z(1))dZ+⋯+∫Zi=1∏MQi(Z(i))⋅logQM(Z(M))dZ- 观察上式中的第一项: ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log Q 1 ( Z ( 1 ) ) d Z \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z ∫Zi=1∏MQi(Z(i))⋅logQ1(Z(1))dZ
- 将该项中的多重积分和连乘符号全部展开: ∫ Z ( 1 ) ∫ Z ( 2 ) ⋯ ∫ Z ( M ) [ Q 1 ( Z ( 1 ) ) ⋅ Q 2 ( Z ( 2 ) ) ⋯ Q M ( Z ( M ) ) ⋅ log Q 1 ( Z ( 1 ) ) ] d Z ( 1 ) , Z ( 2 ) , ⋯ , Z ( M ) \int_{\mathcal Z^{(1)}} \int_{\mathcal Z^{(2)}} \cdots \int_{\mathcal Z^{(\mathcal M)}} \left[\mathcal Q_1(\mathcal Z^{(1)}) \cdot Q_2(\mathcal Z^{(2)}) \cdots Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)})\right] d \mathcal Z^{(1)},\mathcal Z^{(2)},\cdots,\mathcal Z^{(\mathcal M)} ∫Z(1)∫Z(2)⋯∫Z(M)[Q1(Z(1))⋅Q2(Z(2))⋯QM(Z(M))⋅logQ1(Z(1))]dZ(1),Z(2),⋯,Z(M)
- 将积分带入: ∫ Z ( 1 ) Q 1 ( Z ( 1 ) ) log Q 1 ( Z ( 1 ) ) d Z ( 1 ) ⋅ ∫ Z ( 2 ) Q 2 ( Z ( 2 ) ) ⋯ ∫ Z ( M ) Q M ( Z ( M ) ) \int_{\mathcal Z^{(1)}} \mathcal Q_1(\mathcal Z^{(1)}) \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z^{(1)} \cdot \int_{\mathcal Z^{(2)}} \mathcal Q_2(\mathcal Z^{(2)}) \cdots \int_{\mathcal Z^{(\mathcal M)}} \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) ∫Z(1)Q1(Z(1))logQ1(Z(1))dZ(1)⋅∫Z(2)Q2(Z(2))⋯∫Z(M)QM(Z(M))
- 从第二项开始到最后一项,使用概率密度积分结果均为 1 1 1: ∫ Z ( 2 ) Q 2 ( Z ( 2 ) ) = ⋯ = ∫ Z ( M ) Q M ( Z ( M ) ) = 1 \int_{\mathcal Z^{(2)}} \mathcal Q_2(\mathcal Z^{(2)}) =\cdots = \int_{\mathcal Z^{(\mathcal M)}} \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) = 1 ∫Z(2)Q2(Z(2))=⋯=∫Z(M)QM(Z(M))=1
- 因此,则有: ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log Q 1 ( Z ( 1 ) ) d Z = ∫ Z ( 1 ) Q 1 ( Z ( 1 ) ) log Q 1 ( Z ( 1 ) ) d Z ( 1 ) \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z = \int_{\mathcal Z^{(1)}} \mathcal Q_1(\mathcal Z^{(1)}) \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z^{(1)} ∫Zi=1∏MQi(Z(i))⋅logQ1(Z(1))dZ=∫Z(1)Q1(Z(1))logQ1(Z(1))dZ(1)
- 因此,整个 连加部分展开式 可以写成: ∑ i = 1 M ∫ Z ( i ) Q i ( Z ( i ) ) log Q i ( Z ( i ) ) d Z ( i ) \sum_{i=1}^{\mathcal M} \int_{\mathcal Z^{(i)}} \mathcal Q_i(\mathcal Z^{(i)}) \log \mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z^{(i)} i=1∑M∫Z(i)Qi(Z(i))logQi(Z(i))dZ(i)
- 基于求解项是
Z
(
j
)
\mathcal Z^{(j)}
Z(j),将
Z
(
j
)
\mathcal Z^{(j)}
Z(j)与其他向分开:
其他项不是需要求解的内容,视为常数。用
C \mathcal C C表示。
∫ Z ( j ) Q j ( Z ( j ) ) log Q j ( Z ( j ) ) d Z ( j ) + C \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} + \mathcal C ∫Z(j)Qj(Z(j))logQj(Z(j))dZ(j)+C
-
至此,观察第一项和第二项的结果: ∫ Z j Q j ( Z ( j ) ) E ∏ i ≠ j M Q i ( Z ( i ) ) [ log P ( X , Z ) ] d Z ( j ) ∫ Z ( j ) Q j ( Z ( j ) ) log Q j ( Z ( j ) ) d Z ( j ) + C \int_{\mathcal Z_j} \mathcal Q_j(\mathcal Z^{(j)}) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] d \mathcal Z^{(j)} \\ \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} + \mathcal C ∫ZjQj(Z(j))E∏i=jMQi(Z(i))[logP(X,Z)]dZ(j)∫Z(j)Qj(Z(j))logQj(Z(j))dZ(j)+C 这两项很相似,它们有共同的积分 ∫ Z ( j ) \int_{\mathcal Z^{(j)}} ∫Z(j),第一个乘数项均是 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j)),因此,为了能够使两项合并,设定 E ∏ i ≠ j M Q i ( Z ( i ) ) [ log P ( X , Z ) ] \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] E∏i=jMQi(Z(i))[logP(X,Z)]是关于 X , Z ( j ) \mathcal X,\mathcal Z^{(j)} X,Z(j)的函数。即:
首先,
Z \mathcal Z Z中包含
Z ( j ) \mathcal Z^{(j)} Z(j);其次,由于除了
Z ( j ) \mathcal Z^{(j)} Z(j)之外,其余变量均被固定住,视作常量。因此则有
ϕ ^ ( X , Z ( j ) ) \hat \phi(\mathcal X,\mathcal Z^{(j)}) ϕ^(X,Z(j))的表达。
E ∏ i ≠ j M Q i ( Z ( i ) ) [ log P ( X , Z ) ] = log ϕ ^ ( X , Z ( j ) ) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] = \log \hat \phi(\mathcal X,\mathcal Z^{(j)}) E∏i=jMQi(Z(i))[logP(X,Z)]=logϕ^(X,Z(j)) 此时,两项合并的结果如下: ∫ Z ( j ) Q j ( Z ( j ) ) log ϕ ^ ( X , Z ( j ) ) d Z ( j ) − ∫ Z ( j ) Q j ( Z ( j ) ) log Q j ( Z ( j ) ) d Z ( j ) = ∫ Z ( j ) Q j ( Z ( j ) ) [ log ϕ ^ ( X , Z ( j ) ) − log Q j ( Z ( j ) ) ] = ∫ Z ( j ) Q j ( Z ( j ) ) log [ ϕ ^ ( X , Z ( j ) ) Q j ( Z ( j ) ) ] d Z ( j ) + C = − K L [ ϕ ^ ( X , Z ( j ) ) ∣ ∣ Q j ( Z ( j ) ) ] \begin{aligned} & \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \hat \phi(\mathcal X,\mathcal Z^{(j)}) d\mathcal Z^{(j)} - \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \left[\log \hat \phi(\mathcal X,\mathcal Z^{(j)}) - \log \mathcal Q_j(\mathcal Z^{(j)})\right] \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \left[\frac{\hat \phi(\mathcal X,\mathcal Z^{(j)})}{\mathcal Q_j(\mathcal Z^{(j)})}\right]d \mathcal Z^{(j)} +\mathcal C \\ & = - \mathcal K\mathcal L \left[\hat \phi(\mathcal X,\mathcal Z^{(j)})||Q_j(\mathcal Z^{(j)})\right] \end{aligned} ∫Z(j)Qj(Z(j))logϕ^(X,Z(j))dZ(j)−∫Z(j)Qj(Z(j))logQj(Z(j))dZ(j)=∫Z(j)Qj(Z(j))[logϕ^(X,Z(j))−logQj(Z(j))]=∫Z(j)Qj(Z(j))log[Qj(Z(j))ϕ^(X,Z(j))]dZ(j)+C=−KL[ϕ^(X,Z(j))∣∣Qj(Z(j))] 最终: Q j ^ ( Z ( j ) ) = arg max Q ( Z ) L [ Q ( Z ) ] = arg max Q ( Z ) { − K L [ ϕ ^ ( X , Z ( j ) ) ∣ ∣ Q j ( Z ( j ) ) ] } \begin{aligned} \hat{\mathcal Q_j}(\mathcal Z^{(j)}) & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z)] \\ & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \left\{- \mathcal K\mathcal L \left[\hat \phi(\mathcal X,\mathcal Z^{(j)})||Q_j(\mathcal Z^{(j)})\right]\right\} \end{aligned} Qj^(Z(j))=Q(Z)argmaxL[Q(Z)]=Q(Z)argmax{−KL[ϕ^(X,Z(j))∣∣Qj(Z(j))]} 因此,基于平均场假设的 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj(Z(j))的最优解即: Q j ^ ( Z ( j ) ) = ϕ ^ ( X , Z ( j ) ) ( j ∈ { 1 , 2 , ⋯ , M } ) \hat {\mathcal Q_j}(\mathcal Z^{(j)}) = \hat \phi(\mathcal X,\mathcal Z^{(j)}) (j \in \{1,2,\cdots,\mathcal M\}) Qj^(Z(j))=ϕ^(X,Z(j))(j∈{1,2,⋯,M}) 从而最终求得最优解 Q ^ ( Z ) \hat {\mathcal Q}(\mathcal Z) Q^(Z): Q ^ ( Z ) = ∏ j = 1 M Q j ( Z ( j ) ) \hat {\mathcal Q}(\mathcal Z) = \prod_{j=1}^{\mathcal M} \mathcal Q_j(\mathcal Z^{(j)}) Q^(Z)=j=1∏MQj(Z(j))
基于平均场假设,关于 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的近似求解推导结束。下一节将介绍变分推断与EM算法之间的关联关系。
相关参考: 变分法:理解变分 机器学习-变分推断2(公式推导)