您当前的位置: 首页 >  机器学习

静静的喝酒

暂无认证

  • 5浏览

    0关注

    99博文

    0收益

  • 0浏览

    0点赞

    0打赏

    0留言

私信
关注
热门博文

机器学习笔记之变分推断(二)公式推导过程(基于平均场假设)

静静的喝酒 发布时间:2022-09-15 16:45:30 ,浏览量:5

机器学习笔记之变分推断——基于平均场假设的公式推导过程
  • 引言
    • 回顾:推断与变分推断
    • 变分推断:公式推导过程
      • 初始转化过程
      • 最大变分 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]求解过程
        • 基于平均场假设的求解思想
        • 求解过程实现

引言

上一节介绍了分别从频率角度和贝叶斯角度认识机器学习问题,并介绍了推断(Inference)在整个贝叶斯角度的重要作用。本节将正式介绍确定性近似推断的代表方法——变分推断(Variational Inference)

提示:本节推导过程与EM算法部分存在相似之处,请对比食用,谢谢。

回顾:推断与变分推断

关于从贝叶斯角度认识问题,本质上是给定样本集合 X \mathcal X X,针对陌生数据 x ^ \hat x x^的预测问题,即 P ( x ^ ∣ X ) P(\hat x \mid \mathcal X) P(x^∣X)。

基于上述逻辑,贝叶斯角度的具体做法是:针对样本集合 X \mathcal X X构建模型,通过模型参数 θ \theta θ作为样本集合 X \mathcal X X与陌生数据 x ^ \hat x x^之间构建关系的桥梁,将 P ( x ^ ∣ X ) P(\hat x \mid \mathcal X) P(x^∣X)表示为如下形式: P ( x ^ ∣ X ) = ∫ θ P ( x ^ , θ ∣ X ) d θ = ∫ θ P ( x ^ ∣ θ ) ⋅ P ( θ ∣ X ) d θ \begin{aligned} P(\hat x \mid \mathcal X) & = \int_{\theta} P(\hat x ,\theta \mid \mathcal X) d\theta \\ & = \int_{\theta} P(\hat x \mid \theta) \cdot P(\theta \mid \mathcal X)d\theta \end{aligned} P(x^∣X)​=∫θ​P(x^,θ∣X)dθ=∫θ​P(x^∣θ)⋅P(θ∣X)dθ​

基于上述公式,使用贝叶斯定理求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X): P ( θ ∣ X ) = P ( X ∣ θ ) ⋅ P ( θ ) P ( X ) \begin{aligned} P(\theta \mid \mathcal X) & = \frac{P(\mathcal X \mid \theta) \cdot P(\theta)}{P(\mathcal X)} \end{aligned} P(θ∣X)​=P(X)P(X∣θ)⋅P(θ)​​ 至此,关于求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X)的过程称为推断; 关于样本数据的边缘概率分布 P ( X ) P(\mathcal X) P(X)可看成一个积分操作: 引入‘隐变量’。 P ( X ) = ∫ Z P ( X ∣ Z ) ⋅ P ( Z ) d Z = ∫ z 1 ⋯ ∫ z K P ( X ∣ Z ) ⋅ P ( Z ) d z , ⋯   , z K \begin{aligned} P(\mathcal X) & = \int_{\mathcal Z} P(\mathcal X \mid \mathcal Z) \cdot P(\mathcal Z)d\mathcal Z \\ & = \int_{z_1} \cdots \int_{z_{\mathcal K}} P(\mathcal X \mid \mathcal Z) \cdot P(\mathcal Z) d z_,\cdots,z_{\mathcal K} \end{aligned} P(X)​=∫Z​P(X∣Z)⋅P(Z)dZ=∫z1​​⋯∫zK​​P(X∣Z)⋅P(Z)dz,​⋯,zK​​ 如果引入的隐变量 Z = ( z 1 , ⋯   , z K ) T \mathcal Z = (z_1,\cdots,z_{\mathcal K})^{T} Z=(z1​,⋯,zK​)T中的维度 K \mathcal K K过高,导致 P ( X ) P(\mathcal X) P(X)积分困难,最终使得 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X)无法求解。 因此,需要使用一些方法近似求解 P ( θ ∣ X ) P(\theta \mid \mathcal X) P(θ∣X)。而变分推断(Variational Inference,VI)就是 近似推断中,确定性近似的代表方法。

变分推断:公式推导过程

近似推断(Approximate Inference)的核心观点是针对 ∫ Z P ( X ∣ Z ) ⋅ P ( Z ) d Z \int_{\mathcal Z} P(\mathcal X \mid Z) \cdot P(\mathcal Z)d\mathcal Z ∫Z​P(X∣Z)⋅P(Z)dZ积分困难的问题,通过找出一个关于隐变量 Z \mathcal Z Z的概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)去逼近后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)。即: Q ( Z ) ≈ P ( Z ∣ X ) \mathcal Q(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) Q(Z)≈P(Z∣X) 这里定义:

  • X \mathcal X X为观测变量(Observed Data),即真实的样本数据;
  • Z \mathcal Z Z表示 隐变量(Latent Data)和 模型参数(Parameter)的统称。 因为‘隐变量’本身就不真实存在,它只是一个‘表达’概率模型的中间环节。因此,‘隐变量 + 模型参数’合并在一起是合理的。 该定义与EM算法中的定义式略有区分的,EM算法中,隐变量是隐变量,参数是参数。
  • 依然将联合概率分布 ( X , Z ) (\mathcal X,\mathcal Z) (X,Z)称作 完整数据(Complete Data)。
初始转化过程

在EM算法中,底层逻辑是使用极大似然估计(Maximum Likelihood Estimate,MLE)进行求解,并且求解的是模型参数 θ \theta θ; 在变分推断求解过程中,模型参数 θ \theta θ合并进隐变量 Z \mathcal Z Z中。这里依然从概率模型 P ( X ) P(\mathcal X) P(X)入手,执行推导过程: 条件概率公式~为方便推导过程,依然保留‘log函数’。 log ⁡ P ( X ) = log ⁡ [ P ( X , Z ) P ( Z ∣ X ) ] = log ⁡ P ( X , Z ) − log ⁡ P ( Z ∣ X ) \begin{aligned} \log P(\mathcal X) & = \log \left[\frac{P(\mathcal X,\mathcal Z)}{P(\mathcal Z \mid \mathcal X)}\right] \\ & = \log P(\mathcal X,\mathcal Z) - \log P(\mathcal Z \mid \mathcal X) \end{aligned} logP(X)​=log[P(Z∣X)P(X,Z)​]=logP(X,Z)−logP(Z∣X)​

  • 和EM算法推导思路相同,引入一个关于隐变量的概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)。则有: log ⁡ P ( X ) = [ log ⁡ P ( X , Z ) − log ⁡ Q ( Z ) ] − [ log ⁡ P ( Z ∣ X ) − log ⁡ Q ( Z ) ] = log ⁡ [ P ( X , Z ) Q ( Z ) ] − log ⁡ [ P ( Z ∣ X ) Q ( Z ) ] \begin{aligned} \log P(\mathcal X) & = \left[\log P(\mathcal X,\mathcal Z) - \log \mathcal Q(\mathcal Z)\right] - \left[\log P(\mathcal Z \mid \mathcal X) - \log \mathcal Q(\mathcal Z)\right] \\ & = \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] - \log \left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] \end{aligned} logP(X)​=[logP(X,Z)−logQ(Z)]−[logP(Z∣X)−logQ(Z)]=log[Q(Z)P(X,Z)​]−log[Q(Z)P(Z∣X)​]​

  • 等式两端分别对 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)求积分:

    • 等式左端: ∫ Z log ⁡ P ( X ) ⋅ Q ( Z ) d Z = log ⁡ P ( X ) ∫ Z Q ( Z ) d Z = log ⁡ P ( X ) \int_{\mathcal Z} \log P(\mathcal X) \cdot \mathcal Q(\mathcal Z) d\mathcal Z = \log P(\mathcal X) \int_{\mathcal Z} \mathcal Q(\mathcal Z) d\mathcal Z = \log P(\mathcal X) ∫Z​logP(X)⋅Q(Z)dZ=logP(X)∫Z​Q(Z)dZ=logP(X)
    • 等式右端: ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( X , Z ) Q ( Z ) ] d Z − ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( Z ∣ X ) Q ( Z ) ] d Z \begin{aligned} \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z - \int_{\mathcal Z}\mathcal Q(\mathcal Z) \cdot \log\left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \end{aligned} ∫Z​Q(Z)⋅log[Q(Z)P(X,Z)​]dZ−∫Z​Q(Z)⋅log[Q(Z)P(Z∣X)​]dZ​
  • 在EM算法中定义 ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( X , Z ) Q ( Z ) ] d Z \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z ∫Z​Q(Z)⋅log[Q(Z)P(X,Z)​]dZ为 证据下界(Evidence Lower Bound,ELBO); − ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( Z ∣ X ) Q ( Z ) ] d Z - \int_{\mathcal Z}\mathcal Q(\mathcal Z) \cdot \log\left[\frac{P(\mathcal Z \mid \mathcal X)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z −∫Z​Q(Z)⋅log[Q(Z)P(Z∣X)​]dZ是关于 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)和隐变量 Z \mathcal Z Z的后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的 K L \mathcal K\mathcal L KL散度;

    因此, log ⁡ P ( X ) \log P(\mathcal X) logP(X)将表示如下: log ⁡ P ( X ) = E L B O + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \log P(\mathcal X) = ELBO + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] logP(X)=ELBO+KL[Q(Z)∣∣P(Z∣X)]

  • 将概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)看成一个变量,并将 E L B O ELBO ELBO部分定义成一个关于 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)的函数: L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]: L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)] Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)的变分。 log ⁡ P ( X ) = L [ Q ( Z ) ] + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \log P(\mathcal X) = \mathcal L[\mathcal Q(\mathcal Z)] + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] logP(X)=L[Q(Z)]+KL[Q(Z)∣∣P(Z∣X)] 观察上式,根据数学定义, K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] ≥ 0 \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] \geq 0 KL[Q(Z)∣∣P(Z∣X)]≥0恒成立,则有: log ⁡ P ( X ) ≥ L [ Q ( Z ) ] \log P(\mathcal X) \geq \mathcal L[\mathcal Q(\mathcal Z)] logP(X)≥L[Q(Z)]

  • 换句话说, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]存在上界,并且 上界结果是 log ⁡ P ( X ) \log P(\mathcal X) logP(X)。

    当观测变量 X \mathcal X X给定的条件下, log ⁡ P ( X ) \log P(\mathcal X) logP(X)可看作固定常数;因此 L [ Q ( Z ) ] + K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal L[\mathcal Q(\mathcal Z)] + \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] L[Q(Z)]+KL[Q(Z)∣∣P(Z∣X)]结果也是固定常数。

    在上述条件下, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]在什么情况下能够取得最大值?自然是 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越趋近于0,那么 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越趋近于上界 log ⁡ P ( X ) \log P(\mathcal X) logP(X)。

    而 K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]本身数学性质表示概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)与概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的关联关系: Q ( Z ) ≈ P ( Z ∣ X ) ⇒ K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] → 0 \mathcal Q(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) \Rightarrow \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] \to 0 Q(Z)≈P(Z∣X)⇒KL[Q(Z)∣∣P(Z∣X)]→0

至此,逻辑关系已经找到: 目标:找到一个与隐变量 Z \mathcal Z Z的后验概率分布 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)近似的分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z),替代 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)参与运算;

  • 基于 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]存在上界 log ⁡ P ( X ) \log P(\mathcal X) logP(X),并且 log ⁡ P ( X ) \log P(\mathcal X) logP(X)只和观测变量 X \mathcal X X有关,视为常数。
  • 在 log ⁡ P ( X ) \log P(\mathcal X) logP(X)固定的情况下, L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]越大, K L [ Q ( Z ) ∣ ∣ P ( Z ∣ X ) ] \mathcal K\mathcal L[\mathcal Q(\mathcal Z) || P(\mathcal Z \mid \mathcal X)] KL[Q(Z)∣∣P(Z∣X)]越小,因而 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)与 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)越近似。

从而将寻找近似分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)转化为最大化 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]的问题。

最大变分 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]求解过程

基于上述问题转化,基于隐变量的最优分布 Q ^ ( Z ) \hat {\mathcal Q}(\mathcal Z) Q^​(Z)可进行如下表示: Q ^ ( Z ) = arg ⁡ max ⁡ Q ( Z ) L [ Q ( Z ) ] ⇒ Q ^ ( Z ) ≈ P ( Z ∣ X ) \hat {\mathcal Q}(\mathcal Z) = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z)] \Rightarrow \hat {\mathcal Q}(\mathcal Z) \approx P(\mathcal Z \mid \mathcal X) Q^​(Z)=Q(Z)argmax​L[Q(Z)]⇒Q^​(Z)≈P(Z∣X)

基于平均场假设的求解思想

对概率分布 Q ( Z ) \mathcal Q(\mathcal Z) Q(Z)进行假设:平均场假设(Mean Theory)。 上面定义 Z \mathcal Z Z是隐变量和模型参数 的统称。这里关于 Z \mathcal Z Z数量的统计是指 概率图中状态变量的数量。假设状态变量的数量是 K \mathcal K K个,并将其划分成 M \mathcal M M个组,每个组中的 状态变量之间不存在重复。 数学符号表达如下: Q ( Z ) = ∏ i = 1 M Q i ( Z ( i ) ) = Q 1 ( Z ( 1 ) ) ⋅ Q 2 ( Z ( 2 ) ) ⋯ Q M ( Z ( M ) ) \begin{aligned} \mathcal Q(\mathcal Z) & = \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \\ & = \mathcal Q_1(\mathcal Z^{(1)}) \cdot \mathcal Q_2(\mathcal Z^{(2)}) \cdots \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \end{aligned} Q(Z)​=i=1∏M​Qi​(Z(i))=Q1​(Z(1))⋅Q2​(Z(2))⋯QM​(Z(M))​ 其中, Z ( i ) \mathcal Z^{(i)} Z(i)表示第 i i i个子集合包含的 状态变量的具体结果,而 Q i ( Z ( i ) ) \mathcal Q_i(\mathcal Z^{(i)}) Qi​(Z(i))表示向量 Z ( i ) \mathcal Z^{(i)} Z(i)对应的概率分布。也可以理解为:第 i i i个子集合所包含状态变量的联合概率分布。

  • 具体想法是,由于各子集合之间相互独立,想要求解 Q j ( Z ( j ) ) ( j = { 1 , 2 , ⋯   , M } ) \mathcal Q_j(\mathcal Z^{(j)}) (j=\{1,2,\cdots,\mathcal M\}) Qj​(Z(j))(j={1,2,⋯,M}),则需要固定 Q − j ( Z ( − j ) ) \mathcal Q_{-j}(\mathcal Z^{(-j)}) Q−j​(Z(−j))分量的结果; Q − j ( Z ( − j ) ) \mathcal Q_{-j}(\mathcal Z^{(-j)}) Q−j​(Z(−j))表示除去编号 j j j子集合后剩余子集合的概率分布。
  • 基于上述步骤,会分别求解出各向量 Z ( i ) \mathcal Z^{(i)} Z(i)对应的概率分布 Q i ( Z ( i ) ) ( i = 1 , 2 , ⋯ M ) \mathcal Q_{i}(\mathcal Z^{(i)})(i=1,2,\cdots \mathcal M) Qi​(Z(i))(i=1,2,⋯M),最后相乘即可。
求解过程实现

将 Q ( Z ) = ∏ i = 1 M Q i ( Z ( i ) ) \mathcal Q(\mathcal Z) =\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) Q(Z)=∏i=1M​Qi​(Z(i))带入 L [ Q ( Z ) ] \mathcal L[\mathcal Q(\mathcal Z)] L[Q(Z)]中: L [ Q ( Z ) ] = ∫ Z Q ( Z ) ⋅ log ⁡ [ P ( X , Z ) Q ( Z ) ] d Z = ∫ Z Q ( Z ) log ⁡ P ( X , Z ) d Z − ∫ Z Q ( Z ) log ⁡ Q ( Z ) d Z \begin{aligned} \mathcal L[\mathcal Q(\mathcal Z)] & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \cdot \log \left[\frac{P(\mathcal X,\mathcal Z)}{\mathcal Q(\mathcal Z)}\right] d\mathcal Z \\ & = \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log P(\mathcal X,\mathcal Z)d \mathcal Z - \int_{\mathcal Z} \mathcal Q(\mathcal Z) \log \mathcal Q(\mathcal Z) d\mathcal Z \end{aligned} L[Q(Z)]​=∫Z​Q(Z)⋅log[Q(Z)P(X,Z)​]dZ=∫Z​Q(Z)logP(X,Z)dZ−∫Z​Q(Z)logQ(Z)dZ​

  • 带入第一项: 按照上述固定要求,将 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj​(Z(j))提出来。 ∫ Z ∏ i = 1 M [ Q i ( Z ( i ) ) ⋅ log ⁡ P ( X , Z ) ] d Z ( 1 ) , ⋯   , d Z ( M ) = ∫ Z ( j ) Q j ( Z ( j ) ) ⋅ { ∫ Z ( 1 ) ⋯ ∫ Z ( j − 1 ) ∫ Z ( j + 1 ) ⋯ ∫ Z ( M ) ∏ i ≠ j M [ Q i ( Z ( i ) ) ⋅ log ⁡ P ( X , Z ) ] d Z ( 1 ) , ⋯   , Z ( j − 1 ) , Z ( j + 1 ) , ⋯   , Z ( M ) } d Z ( j ) \begin{aligned} & \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)}) \cdot \log P(\mathcal X,\mathcal Z) \right] d\mathcal Z^{(1)},\cdots,d\mathcal Z^{(\mathcal M)} \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \cdot \left\{\int_{\mathcal Z^{(1)}}\cdots \int_{\mathcal Z^{(j-1)}} \int_{\mathcal Z^{(j+1)}}\cdots\int_{\mathcal Z^{(\mathcal M)}} \prod_{i \neq j}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)})\cdot \log P(\mathcal X,\mathcal Z)\right] d\mathcal Z^{(1)},\cdots,\mathcal Z^{(j-1)},\mathcal Z^{(j+1)},\cdots,\mathcal Z^{(\mathcal M)}\right\} d \mathcal Z^{(j)} \end{aligned} ​∫Z​i=1∏M​[Qi​(Z(i))⋅logP(X,Z)]dZ(1),⋯,dZ(M)=∫Z(j)​Qj​(Z(j))⋅⎩ ⎨ ⎧​∫Z(1)​⋯∫Z(j−1)​∫Z(j+1)​⋯∫Z(M)​i=j∏M​[Qi​(Z(i))⋅logP(X,Z)]dZ(1),⋯,Z(j−1),Z(j+1),⋯,Z(M)⎭ ⎬ ⎫​dZ(j)​ 观察上述大括号中的项,我们可以将其看做期望形式: ∫ Z ( 1 ) ⋯ ∫ Z ( j − 1 ) ∫ Z ( j + 1 ) ⋯ ∫ Z ( M ) ∏ i ≠ j M [ Q i ( Z ( i ) ) ⋅ log ⁡ P ( X , Z ) ] d Z ( 1 ) , ⋯   , Z ( j − 1 ) , Z ( j + 1 ) , ⋯   , Z ( M ) = E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] \begin{aligned} & \int_{\mathcal Z^{(1)}}\cdots \int_{\mathcal Z^{(j-1)}} \int_{\mathcal Z^{(j+1)}}\cdots\int_{\mathcal Z^{(\mathcal M)}} \prod_{i \neq j}^{\mathcal M} \left[\mathcal Q_i(\mathcal Z^{(i)})\cdot \log P(\mathcal X,\mathcal Z)\right] d\mathcal Z^{(1)},\cdots,\mathcal Z^{(j-1)},\mathcal Z^{(j+1)},\cdots,\mathcal Z^{(\mathcal M)} \\ & = \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] \end{aligned} ​∫Z(1)​⋯∫Z(j−1)​∫Z(j+1)​⋯∫Z(M)​i=j∏M​[Qi​(Z(i))⋅logP(X,Z)]dZ(1),⋯,Z(j−1),Z(j+1),⋯,Z(M)=E∏i=jM​Qi​(Z(i))​[logP(X,Z)]​ 至此,第一项表示为: ∫ Z j Q j ( Z ( j ) ) E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] d Z j \int_{\mathcal Z_j} \mathcal Q_j(\mathcal Z^{(j)}) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] d \mathcal Z_j ∫Zj​​Qj​(Z(j))E∏i=jM​Qi​(Z(i))​[logP(X,Z)]dZj​

  • 带入第二项: log部分改成求和方式~ ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ ∏ i = 1 M Q i ( Z ( i ) ) d Z = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ ∑ i = 1 M log ⁡ Q i ( Z ( i ) ) d Z \begin{aligned} & \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z \\ & = \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \sum_{i=1}^{\mathcal M} \log\mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z \end{aligned} ​∫Z​i=1∏M​Qi​(Z(i))⋅logi=1∏M​Qi​(Z(i))dZ=∫Z​i=1∏M​Qi​(Z(i))⋅i=1∑M​logQi​(Z(i))dZ​ 继续展开(连加部分展开): = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ [ log ⁡ Q 1 ( Z ( 1 ) ) + log ⁡ Q 2 ( Z ( 2 ) ) + ⋯ + log ⁡ Q M ( Z ( M ) ) ] d Z = ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) d Z + ⋯ + ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q M ( Z ( M ) ) d Z \begin{aligned} & = \int_{\mathcal Z} \prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \left[\log \mathcal Q_1(\mathcal Z^{(1)}) + \log \mathcal Q_2(\mathcal Z^{(2)}) + \cdots + \log \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \right] d\mathcal Z \\ & = \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z + \cdots + \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) d\mathcal Z \end{aligned} ​=∫Z​i=1∏M​Qi​(Z(i))⋅[logQ1​(Z(1))+logQ2​(Z(2))+⋯+logQM​(Z(M))]dZ=∫Z​i=1∏M​Qi​(Z(i))⋅logQ1​(Z(1))dZ+⋯+∫Z​i=1∏M​Qi​(Z(i))⋅logQM​(Z(M))dZ​

    • 观察上式中的第一项: ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) d Z \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z ∫Z​i=1∏M​Qi​(Z(i))⋅logQ1​(Z(1))dZ
    • 将该项中的多重积分和连乘符号全部展开: ∫ Z ( 1 ) ∫ Z ( 2 ) ⋯ ∫ Z ( M ) [ Q 1 ( Z ( 1 ) ) ⋅ Q 2 ( Z ( 2 ) ) ⋯ Q M ( Z ( M ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) ] d Z ( 1 ) , Z ( 2 ) , ⋯   , Z ( M ) \int_{\mathcal Z^{(1)}} \int_{\mathcal Z^{(2)}} \cdots \int_{\mathcal Z^{(\mathcal M)}} \left[\mathcal Q_1(\mathcal Z^{(1)}) \cdot Q_2(\mathcal Z^{(2)}) \cdots Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)})\right] d \mathcal Z^{(1)},\mathcal Z^{(2)},\cdots,\mathcal Z^{(\mathcal M)} ∫Z(1)​∫Z(2)​⋯∫Z(M)​[Q1​(Z(1))⋅Q2​(Z(2))⋯QM​(Z(M))⋅logQ1​(Z(1))]dZ(1),Z(2),⋯,Z(M)
    • 将积分带入: ∫ Z ( 1 ) Q 1 ( Z ( 1 ) ) log ⁡ Q 1 ( Z ( 1 ) ) d Z ( 1 ) ⋅ ∫ Z ( 2 ) Q 2 ( Z ( 2 ) ) ⋯ ∫ Z ( M ) Q M ( Z ( M ) ) \int_{\mathcal Z^{(1)}} \mathcal Q_1(\mathcal Z^{(1)}) \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z^{(1)} \cdot \int_{\mathcal Z^{(2)}} \mathcal Q_2(\mathcal Z^{(2)}) \cdots \int_{\mathcal Z^{(\mathcal M)}} \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) ∫Z(1)​Q1​(Z(1))logQ1​(Z(1))dZ(1)⋅∫Z(2)​Q2​(Z(2))⋯∫Z(M)​QM​(Z(M))
    • 从第二项开始到最后一项,使用概率密度积分结果均为 1 1 1: ∫ Z ( 2 ) Q 2 ( Z ( 2 ) ) = ⋯ = ∫ Z ( M ) Q M ( Z ( M ) ) = 1 \int_{\mathcal Z^{(2)}} \mathcal Q_2(\mathcal Z^{(2)}) =\cdots = \int_{\mathcal Z^{(\mathcal M)}} \mathcal Q_{\mathcal M}(\mathcal Z^{(\mathcal M)}) = 1 ∫Z(2)​Q2​(Z(2))=⋯=∫Z(M)​QM​(Z(M))=1
    • 因此,则有: ∫ Z ∏ i = 1 M Q i ( Z ( i ) ) ⋅ log ⁡ Q 1 ( Z ( 1 ) ) d Z = ∫ Z ( 1 ) Q 1 ( Z ( 1 ) ) log ⁡ Q 1 ( Z ( 1 ) ) d Z ( 1 ) \int_{\mathcal Z}\prod_{i=1}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)}) \cdot \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z = \int_{\mathcal Z^{(1)}} \mathcal Q_1(\mathcal Z^{(1)}) \log \mathcal Q_1(\mathcal Z^{(1)}) d\mathcal Z^{(1)} ∫Z​i=1∏M​Qi​(Z(i))⋅logQ1​(Z(1))dZ=∫Z(1)​Q1​(Z(1))logQ1​(Z(1))dZ(1)
    • 因此,整个 连加部分展开式 可以写成: ∑ i = 1 M ∫ Z ( i ) Q i ( Z ( i ) ) log ⁡ Q i ( Z ( i ) ) d Z ( i ) \sum_{i=1}^{\mathcal M} \int_{\mathcal Z^{(i)}} \mathcal Q_i(\mathcal Z^{(i)}) \log \mathcal Q_i(\mathcal Z^{(i)}) d\mathcal Z^{(i)} i=1∑M​∫Z(i)​Qi​(Z(i))logQi​(Z(i))dZ(i)
    • 基于求解项是 Z ( j ) \mathcal Z^{(j)} Z(j),将 Z ( j ) \mathcal Z^{(j)} Z(j)与其他向分开: 其他项不是需要求解的内容,视为常数。用 C \mathcal C C表示。 ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ Q j ( Z ( j ) ) d Z ( j ) + C \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} + \mathcal C ∫Z(j)​Qj​(Z(j))logQj​(Z(j))dZ(j)+C
  • 至此,观察第一项和第二项的结果: ∫ Z j Q j ( Z ( j ) ) E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] d Z ( j ) ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ Q j ( Z ( j ) ) d Z ( j ) + C \int_{\mathcal Z_j} \mathcal Q_j(\mathcal Z^{(j)}) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] d \mathcal Z^{(j)} \\ \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} + \mathcal C ∫Zj​​Qj​(Z(j))E∏i=jM​Qi​(Z(i))​[logP(X,Z)]dZ(j)∫Z(j)​Qj​(Z(j))logQj​(Z(j))dZ(j)+C 这两项很相似,它们有共同的积分 ∫ Z ( j ) \int_{\mathcal Z^{(j)}} ∫Z(j)​,第一个乘数项均是 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj​(Z(j)),因此,为了能够使两项合并,设定 E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] E∏i=jM​Qi​(Z(i))​[logP(X,Z)]是关于 X , Z ( j ) \mathcal X,\mathcal Z^{(j)} X,Z(j)的函数。即: 首先, Z \mathcal Z Z中包含 Z ( j ) \mathcal Z^{(j)} Z(j); 其次,由于除了 Z ( j ) \mathcal Z^{(j)} Z(j)之外,其余变量均被固定住,视作常量。因此则有 ϕ ^ ( X , Z ( j ) ) \hat \phi(\mathcal X,\mathcal Z^{(j)}) ϕ^​(X,Z(j))的表达。 E ∏ i ≠ j M Q i ( Z ( i ) ) [ log ⁡ P ( X , Z ) ] = log ⁡ ϕ ^ ( X , Z ( j ) ) \mathbb E_{\prod_{i \neq j}^{\mathcal M} \mathcal Q_i(\mathcal Z^{(i)})}[\log P(\mathcal X,\mathcal Z)] = \log \hat \phi(\mathcal X,\mathcal Z^{(j)}) E∏i=jM​Qi​(Z(i))​[logP(X,Z)]=logϕ^​(X,Z(j)) 此时,两项合并的结果如下: ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ ϕ ^ ( X , Z ( j ) ) d Z ( j ) − ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ Q j ( Z ( j ) ) d Z ( j ) = ∫ Z ( j ) Q j ( Z ( j ) ) [ log ⁡ ϕ ^ ( X , Z ( j ) ) − log ⁡ Q j ( Z ( j ) ) ] = ∫ Z ( j ) Q j ( Z ( j ) ) log ⁡ [ ϕ ^ ( X , Z ( j ) ) Q j ( Z ( j ) ) ] d Z ( j ) + C = − K L [ ϕ ^ ( X , Z ( j ) ) ∣ ∣ Q j ( Z ( j ) ) ] \begin{aligned} & \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \hat \phi(\mathcal X,\mathcal Z^{(j)}) d\mathcal Z^{(j)} - \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \mathcal Q_j(\mathcal Z^{(j)}) d\mathcal Z^{(j)} \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \left[\log \hat \phi(\mathcal X,\mathcal Z^{(j)}) - \log \mathcal Q_j(\mathcal Z^{(j)})\right] \\ & = \int_{\mathcal Z^{(j)}} \mathcal Q_j(\mathcal Z^{(j)}) \log \left[\frac{\hat \phi(\mathcal X,\mathcal Z^{(j)})}{\mathcal Q_j(\mathcal Z^{(j)})}\right]d \mathcal Z^{(j)} +\mathcal C \\ & = - \mathcal K\mathcal L \left[\hat \phi(\mathcal X,\mathcal Z^{(j)})||Q_j(\mathcal Z^{(j)})\right] \end{aligned} ​∫Z(j)​Qj​(Z(j))logϕ^​(X,Z(j))dZ(j)−∫Z(j)​Qj​(Z(j))logQj​(Z(j))dZ(j)=∫Z(j)​Qj​(Z(j))[logϕ^​(X,Z(j))−logQj​(Z(j))]=∫Z(j)​Qj​(Z(j))log[Qj​(Z(j))ϕ^​(X,Z(j))​]dZ(j)+C=−KL[ϕ^​(X,Z(j))∣∣Qj​(Z(j))]​ 最终: Q j ^ ( Z ( j ) ) = arg ⁡ max ⁡ Q ( Z ) L [ Q ( Z ) ] = arg ⁡ max ⁡ Q ( Z ) { − K L [ ϕ ^ ( X , Z ( j ) ) ∣ ∣ Q j ( Z ( j ) ) ] } \begin{aligned} \hat{\mathcal Q_j}(\mathcal Z^{(j)}) & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \mathcal L[\mathcal Q(\mathcal Z)] \\ & = \mathop{\arg\max}\limits_{\mathcal Q(\mathcal Z)} \left\{- \mathcal K\mathcal L \left[\hat \phi(\mathcal X,\mathcal Z^{(j)})||Q_j(\mathcal Z^{(j)})\right]\right\} \end{aligned} Qj​^​(Z(j))​=Q(Z)argmax​L[Q(Z)]=Q(Z)argmax​{−KL[ϕ^​(X,Z(j))∣∣Qj​(Z(j))]}​ 因此,基于平均场假设的 Q j ( Z ( j ) ) \mathcal Q_j(\mathcal Z^{(j)}) Qj​(Z(j))的最优解即: Q j ^ ( Z ( j ) ) = ϕ ^ ( X , Z ( j ) ) ( j ∈ { 1 , 2 , ⋯   , M } ) \hat {\mathcal Q_j}(\mathcal Z^{(j)}) = \hat \phi(\mathcal X,\mathcal Z^{(j)}) (j \in \{1,2,\cdots,\mathcal M\}) Qj​^​(Z(j))=ϕ^​(X,Z(j))(j∈{1,2,⋯,M}) 从而最终求得最优解 Q ^ ( Z ) \hat {\mathcal Q}(\mathcal Z) Q^​(Z): Q ^ ( Z ) = ∏ j = 1 M Q j ( Z ( j ) ) \hat {\mathcal Q}(\mathcal Z) = \prod_{j=1}^{\mathcal M} \mathcal Q_j(\mathcal Z^{(j)}) Q^​(Z)=j=1∏M​Qj​(Z(j))

基于平均场假设,关于 P ( Z ∣ X ) P(\mathcal Z \mid \mathcal X) P(Z∣X)的近似求解推导结束。下一节将介绍变分推断与EM算法之间的关联关系。

相关参考: 变分法:理解变分 机器学习-变分推断2(公式推导)

关注
打赏
1664446683
查看更多评论
立即登录/注册

微信扫码登录

0.0474s