Machine Learning Notes on the Gaussian Distribution: Biased and Unbiased Parameter Estimation
- Contents
- Biased and Unbiased Estimation
- Introduction
- Misconceptions About Biased and Unbiased Estimation
- Biased and Unbiased Estimation of Gaussian Parameters
- Review
- Derivation
- Why $\sigma_{MLE}^2$ Is Biased
Biased and Unbiased Estimation
Introduction
As introduced in Machine Learning Notes — Maximum Likelihood Estimation and Maximum A Posteriori Estimation, a sample set $\mathcal X = \{ x^{(1)},x^{(2)},\dots,x^{(N)} \}$ containing $N$ samples can be understood as a collection of samples $x^{(i)}$ generated from a probability model $P(\mathcal X \mid \theta)$ with some fixed parameter $\theta$.

Once the probability model $P(\mathcal X \mid \theta)$ is fixed, its samples can never be exhausted (infinitely many samples can be generated). Conversely, if we want to estimate the parameter $\theta$ of $P(\mathcal X \mid \theta)$, we can only rely on a finite number of samples. Biased and unbiased estimation are therefore defined as follows:

Ask whether the estimate $\hat\theta$ obtained from finite samples carries a systematic error relative to the true parameter $\theta$ of the probability model $P(\mathcal X \mid \theta)$, i.e. whether the expectation of the estimator, $\mathbb E[\hat\theta]$, equals the true parameter $\theta$:

- If the two are equal, $\hat\theta$ is called an unbiased estimator of $\theta$; we also say $\hat\theta$ has unbiasedness:
$$\mathbb E[\hat\theta] = \theta$$
- Otherwise, $\hat\theta$ is called a biased estimator:
$$\mathbb E[\hat\theta] \neq \theta$$
Unbiasedness essentially means that the estimation of the statistic carries no systematic error. The error of statistical inference consists of two kinds, systematic error and random error:

- No matter what method we use, the resulting $\hat\theta$ always deviates from the true parameter $\theta$ of $P(\mathcal X \mid \theta)$. Because the samples can never all be drawn, this error cannot be eliminated.
- But if these deviations average out to zero in expectation, the estimator has only random error and no systematic error.

In other words, $\hat\theta$ and $\mathbb E[\hat\theta]$ are two different things.
Example:

Suppose a data set $\mathcal X = \left\{x^{(1)},x^{(2)},\dots,x^{(N)} \right\}$ contains $N$ samples, each drawn independently from the uniform distribution on $(0,100)$.

First compute the sample mean $\theta$ of the data set $\mathcal X$:
$$\theta = \frac{1}{N} \sum_{i=1}^N x^{(i)}$$

Next, repeatedly draw a batch of samples $X$ from the data set $\mathcal X$ with replacement, and take the mean of $X$ as $\hat\theta$. Treated as a random variable, $\hat\theta$ accumulates more realizations as the number of draws grows. Finally, take the expectation $\mathbb E[\hat\theta]$ of these realizations and observe the relationship between $\mathbb E[\hat\theta]$ and $\theta$.
The code is as follows. Since the samples are i.i.d., the expectation is replaced by the mean; an incremental update is used to compute the running mean, so the evolution of the expectation can be watched over time.

```python
import random
import matplotlib.pyplot as plt

random.seed(1)

# Data set X: 10000 samples drawn from the uniform distribution on (0, 100)
a = [round(random.uniform(0, 100), 2) for _ in range(10000)]
mean_a = round(sum(a) / len(a), 4)   # sample mean theta of the whole data set

sampling_times = 10000   # number of repeated draws
sampling_num = 88        # batch size of each draw

plt.figure(figsize=(15, 4))
# Blue line: the sample mean of the data set
plt.plot(range(sampling_times), [mean_a] * sampling_times)

u = 0   # running mean of the estimates theta_hat
k = 1   # number of estimates seen so far
for i in range(sampling_times):
    s = random.sample(a, sampling_num)      # one batch of samples
    sampling_res = sum(s) / len(s)          # theta_hat of this batch
    u = u + (1 / k) * (sampling_res - u)    # incremental mean update
    k += 1
    plt.scatter(i, u, c="#ff7f0e", s=2)     # orange dots: running expectation
plt.show()
```
The resulting plot:

The blue line is the sample mean of the data set; the orange dots show how the running expectation of $\hat\theta$ changes as the number of draws increases.

The plot shows that the expectation is unstable at first, then stabilizes as the number of draws grows, converging to the sample mean of the data set $\mathcal X$.
Misconceptions About Biased and Unbiased Estimation

The following common beliefs are all mistaken:

- An unbiased estimator always gives a correct estimate;
- An unbiased estimator always exists;
- An unbiased estimator is always better than a biased one;
- An unbiased estimator is always a good estimator.

Unbiasedness is indeed a desirable property, but in practice its actual value must be judged case by case.
Biased and Unbiased Estimation of Gaussian Parameters
Review
The previous note introduced computing the optimal parameters of a Gaussian distribution via maximum likelihood estimation.

Take the one-dimensional Gaussian distribution as an example: the samples $x^{(i)}\ (i=1,2,\dots,N)$ in a data set $\mathcal X$ of $N$ samples follow a one-dimensional Gaussian distribution and are mutually independent (where $\mu,\sigma$ are the mean and standard deviation of the distribution):
$$x^{(i)} \overset{\text{iid}}{\sim} \mathcal N(\mu,\sigma^2)$$

Since the samples are independent and identically distributed, every sample has the same expectation:
$$\mathbb E[x^{(i)}] = \mu$$
Treating $\mathcal X$ as a sample set generated from the probability model $P(\mathcal X;\theta)$ with parameter $\theta$, we have:
$$\theta = (\mu,\sigma)$$
Using maximum likelihood estimation, $\theta_{MLE} = \mathop{\arg\max}\limits_{\theta} \log P(\mathcal X;\theta)$, the optimal parameters $\mu_{MLE},\sigma_{MLE}^2$ are:
$$\begin{aligned} \mu_{MLE} & = \frac{1}{N}\sum_{i=1}^N x^{(i)} \\ \sigma_{MLE}^2 & = \frac{1}{N} \sum_{i=1}^N (x^{(i)} - \mu_{MLE})^2 \end{aligned}$$

We now examine whether $\mu_{MLE}$ and $\sigma_{MLE}^2$ are biased or unbiased estimators.
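As a quick numerical sketch (not from the original note), both estimates can be computed directly from a synthetic Gaussian sample; the true parameters $\mu = 5$, $\sigma = 2$ and the sample size below are arbitrary illustration values.

```python
import random

random.seed(0)
mu, sigma = 5.0, 2.0              # assumed "true" parameters, for illustration only
N = 100_000
x = [random.gauss(mu, sigma) for _ in range(N)]

# mu_MLE = (1/N) * sum of x^(i)
mu_mle = sum(x) / N
# sigma^2_MLE = (1/N) * sum of (x^(i) - mu_MLE)^2
var_mle = sum((xi - mu_mle) ** 2 for xi in x) / N

print(mu_mle, var_mle)   # should land near mu = 5 and sigma^2 = 4
```

With a sample this large both estimates sit close to the true values; the bias of $\sigma_{MLE}^2$ only becomes visible for small $N$.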
Derivation

First, examine whether $\mu_{MLE}$ is unbiased. By the description of unbiasedness above, it suffices to show:
$$\mathbb E[\mu_{MLE}] = \mu$$

The proof is as follows (for moving the expectation inside the sum, see the properties of mathematical expectation):
$$\begin{aligned} \mathbb E[\mu_{MLE}] & = \mathbb E \left[\frac{1}{N} \sum_{i=1}^N x^{(i)} \right] \\ & = \frac{1}{N} \sum_{i=1}^N \mathbb E[x^{(i)}] \end{aligned}$$

Since $\mathbb E[x^{(i)}] = \mu$, and $\mu$ is a parameter of the probability model $P(\mathcal X;\theta)$ that does not depend on $i$, the expression simplifies further:
$$\begin{aligned} & = \frac{1}{N} \sum_{i=1}^N \mu \\ & = \frac{1}{N} \cdot N \cdot \mu \\ & = \mu \end{aligned}$$

Thus $\mathbb E[\mu_{MLE}] = \mu$: the maximum likelihood estimate of the parameter $\mu$ is an unbiased estimator.
Next, examine whether $\sigma_{MLE}^2$ is unbiased, i.e. whether:
$$\mathbb E[\sigma_{MLE}^2] \overset{\text{?}}{=} \sigma^2$$

The derivation is as follows. First expand $\sigma_{MLE}^2$:
$$\begin{aligned} \sigma_{MLE}^2 & = \frac{1}{N} \sum_{i=1}^N (x^{(i)} - \mu_{MLE})^2 \\ & = \frac{1}{N} \sum_{i=1}^N \left\{ [x^{(i)}]^2 - 2 \cdot x^{(i)} \cdot\mu_{MLE} + \mu_{MLE}^2 \right\} \\ & = \frac{1}{N}\sum_{i=1}^N [x^{(i)}]^2 - \frac{1}{N} \sum_{i=1}^N 2 \cdot x^{(i)} \cdot\mu_{MLE} + \frac{1}{N} \sum_{i=1}^N \mu_{MLE}^2 \end{aligned}$$
- Look at the second term. $\mu_{MLE}$ likewise does not depend on $i$:
$$\begin{aligned} \frac{1}{N} \sum_{i=1}^N 2 \cdot x^{(i)} \cdot\mu_{MLE} & = 2 \cdot \left(\frac{1}{N} \sum_{i=1}^N x^{(i)} \right)\cdot\mu_{MLE} \\ & = 2 \cdot \mu_{MLE} \cdot \mu_{MLE} \\ & = 2 \cdot \mu_{MLE}^2 \end{aligned}$$
- Look at the third term:
$$\begin{aligned} \frac{1}{N} \sum_{i=1}^N \mu_{MLE}^2 = \frac{1}{N} \cdot N \cdot \mu_{MLE}^2 = \mu_{MLE}^2 \end{aligned}$$
- Combining the three terms:
$$\begin{aligned} \sigma_{MLE}^2 & = \frac{1}{N}\sum_{i=1}^N [x^{(i)}]^2 - 2 \cdot \mu_{MLE}^2 + \mu_{MLE}^2 \\ & = \frac{1}{N}\sum_{i=1}^N [x^{(i)}]^2 - \mu_{MLE}^2 \end{aligned}$$
Based on this, transform $\mathbb E[\sigma_{MLE}^2]$ as follows, using a small trick (adding and subtracting $\mu^2$):
$$\begin{aligned} \mathbb E[\sigma_{MLE}^2] & = \mathbb E \left\{ \frac{1}{N}\sum_{i=1}^N [x^{(i)}]^2 - \mu_{MLE}^2 \right\} \\ & = \mathbb E \left\{ \frac{1}{N}\sum_{i=1}^N [x^{(i)}]^2 - \mu^2 + \mu^2 -\mu_{MLE}^2 \right\} \\ & = \mathbb E \left\{ \left[\frac{1}{N}\sum_{i=1}^N (x^{(i)})^2 - \mu^2 \right] -(\mu_{MLE}^2 - \mu^2) \right\} \\ & = \mathbb E \left[ \frac{1}{N}\sum_{i=1}^N (x^{(i)})^2 - \mu^2 \right] -\mathbb E[\mu_{MLE}^2 - \mu^2] \end{aligned}$$
- Look at the first term. $\mu^2$ is a constant, so $\mathbb E[\mu^2] = \mu^2$; note that $\mathbb E \left\{[x^{(i)}]^2 \right\} - \mu^2 = \mathbb E \left\{[x^{(i)}]^2 \right\} - (\mathbb E[x^{(i)}])^2 = \text{Var}(x^{(i)})$, the variance of $x^{(i)}$:
$$\begin{aligned} \mathbb E \left[ \frac{1}{N}\sum_{i=1}^N (x^{(i)})^2 - \mu^2 \right] & = \frac{1}{N}\sum_{i=1}^N \left\{\mathbb E [x^{(i)}]^2 - \mathbb E[\mu^2] \right\} \\ & = \frac{1}{N}\sum_{i=1}^N \left\{\mathbb E [x^{(i)}]^2 - \mu^2 \right\} \\ & = \frac{1}{N}\sum_{i=1}^N \text{Var}[x^{(i)}] \\ & = \sigma^2 \end{aligned}$$
- Look at the second term. Using the unbiasedness shown above, $\mathbb E[\mu_{MLE}] = \mu$, replace $\mu$ with $\mathbb E[\mu_{MLE}]$:
$$\begin{aligned} \mathbb E[\mu_{MLE}^2 - \mu^2] & = \mathbb E[\mu_{MLE}^2] - \mathbb E[\mu^2] \\ & = \mathbb E[\mu_{MLE}^2] - \mu^2 \\ & = \mathbb E[\mu_{MLE}^2] - (\mathbb E[\mu_{MLE}])^2 \\ & = \text{Var}(\mu_{MLE}) \end{aligned}$$
So how should $\text{Var}(\mu_{MLE})$ be understood? Expand it, using the scaling property of variance:
$$\begin{aligned} \text{Var} \left[\frac{1}{N}\sum_{i=1}^N x^{(i)} \right] & = \mathbb E \left[ \left(\frac{1}{N}\sum_{i=1}^N x^{(i)} \right)^2 \right] - \left(\mathbb E \left[\frac{1}{N}\sum_{i=1}^N x^{(i)} \right]\right)^2 \\ & = \frac{1}{N^2} \mathbb E \left[ \left(\sum_{i=1}^N x^{(i)} \right)^2 \right] - \left(\frac{1}{N} \mathbb E \left[\sum_{i=1}^N x^{(i)} \right] \right)^2 \\ & = \frac{1}{N^2} \mathbb E\left[\left(\sum_{i=1}^N x^{(i)}\right)^2\right] - \frac{1}{N^2} \left(\mathbb E\left[\sum_{i=1}^N x^{(i)}\right]\right)^2 \\ & = \frac{1}{N^2} \left[\mathbb E\left[\left(\sum_{i=1}^N x^{(i)}\right)^2\right] - \left(\mathbb E\left[\sum_{i=1}^N x^{(i)}\right]\right)^2\right] \\ & = \frac{1}{N^2} \text{Var} \left(\sum_{i=1}^N x^{(i)} \right) \end{aligned}$$
The additivity of variance is used next: if $X$ and $Y$ are independent random variables, then:
$$\begin{aligned} \text{Var}(X+Y) & = \mathbb E[(X+Y)^2] - [\mathbb E(X+Y)]^2 \\ & = \mathbb E[X^2 + 2XY + Y^2] - (\mathbb E[X]+\mathbb E[Y])^2 \\ & = \mathbb E[X^2] + 2\cdot\mathbb E[X]\mathbb E[Y] + \mathbb E[Y^2] - \left((\mathbb E[X])^2 + (\mathbb E[Y])^2 + 2\cdot\mathbb E[X]\mathbb E[Y]\right)\\ & = \left(\mathbb E[X^2] - (\mathbb E[X])^2\right) + \left(\mathbb E[Y^2] - (\mathbb E[Y])^2\right) \\ & = \text{Var}(X) + \text{Var}(Y) \end{aligned}$$
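This additivity can be checked numerically; everything below (distribution choices, sample size) is an arbitrary illustration, not part of the original derivation.

```python
import random

random.seed(1)
n = 200_000
# Two independent random variables with known variances 4 and 9
xs = [random.gauss(0, 2) for _ in range(n)]
ys = [random.gauss(0, 3) for _ in range(n)]

def var(v):
    # population variance: E[v^2] - (E[v])^2
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / len(v)

sums = [x + y for x, y in zip(xs, ys)]
# Var(X + Y) should come out near Var(X) + Var(Y) = 13
print(var(sums), var(xs) + var(ys))
```

The identity relies on independence: without it the cross term $2\,\mathbb E[XY]$ no longer cancels against $2\,\mathbb E[X]\mathbb E[Y]$, and a covariance term remains.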
Therefore:
$$\begin{aligned} \frac{1}{N^2} \text{Var} \left[\sum_{i=1}^N x^{(i)} \right] & = \frac{1}{N^2}\sum_{i=1}^N \text{Var}[x^{(i)}] \end{aligned}$$
Returning to $\text{Var}(\mu_{MLE})$:
$$\begin{aligned} \text{Var}[\mu_{MLE}] & = \text{Var} \left[ \frac{1}{N}\sum_{i=1}^N x^{(i)} \right] \\ & = \frac{1}{N^2}\sum_{i=1}^N \text{Var}(x^{(i)}) \\ & = \frac{1}{N} \left[\frac{1}{N}\sum_{i=1}^N \text{Var}(x^{(i)}) \right] \\ & = \frac{1}{N^2} \cdot N \cdot \sigma^2 \\ & = \frac{1}{N}\sigma^2 \end{aligned}$$
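A small simulation (illustration values only, not from the original note) confirms that the variance of $\mu_{MLE}$ shrinks as $\sigma^2/N$:

```python
import random

random.seed(2)
mu, sigma, N = 0.0, 2.0, 50    # assumed true parameters and sample size
trials = 20_000

# Draw many independent samples of size N and record mu_MLE for each
estimates = []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(N)]
    estimates.append(sum(sample) / N)

m = sum(estimates) / trials
var_mu_mle = sum((e - m) ** 2 for e in estimates) / trials
print(var_mu_mle, sigma**2 / N)   # both should be near 0.08
```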
A personal aside, where my understanding differs from the video: $\begin{aligned}\frac{1}{N}\sum_{i=1}^N \text{Var}[x^{(i)}] \end{aligned}$ contains $x^{(i)}$, and solving it as below is not quite right (although the result is the same):
$$\begin{aligned} \frac{1}{N}\sum_{i=1}^N \text{Var}[x^{(i)}] & = \frac{1}{N}\cdot N \cdot \sigma^2 \\ & = \sigma^2 \end{aligned}$$
The correct way to understand it:
$$\begin{aligned} \frac{1}{N}\sum_{i=1}^N \text{Var}[x^{(i)}] & = \frac{1}{N} \left\{\text{Var}[x^{(1)}] + \text{Var}[x^{(2)}] + \cdots + \text{Var}[x^{(N)}] \right\} \\ & = \frac{1}{N} \left[(x^{(1)} - \bar{x})^2 + (x^{(2)} - \bar{x})^2 + \cdots + (x^{(N)} - \bar{x})^2 \right] \\ & = \frac{1}{N}\sum_{i=1}^N(x^{(i)} - \bar{x})^2 \\ & = \sigma^2 \end{aligned}$$
Back to the main thread. Therefore, the original expression = first term - second term:
$$\begin{aligned}\sigma^2 - \frac{1}{N} \sigma^2 = \frac{N-1}{N} \sigma^2\end{aligned}$$
In summary:
$$\mathbb E[\sigma_{MLE}^2] = \frac{N-1}{N} \sigma^2$$
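The factor $\frac{N-1}{N}$ can be verified by simulation; the parameters below are illustration values, and $N$ is kept small so the bias is visible.

```python
import random

random.seed(3)
mu, sigma, N = 0.0, 1.0, 5     # small N makes the (N-1)/N bias easy to see
trials = 200_000

# Average sigma^2_MLE over many independent samples to approximate its expectation
total = 0.0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(N)]
    m = sum(sample) / N
    total += sum((x - m) ** 2 for x in sample) / N   # biased MLE variance

e_var_mle = total / trials
print(e_var_mle, (N - 1) / N * sigma**2)   # both should be near 0.8
```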
We find that the expectation of $\sigma_{MLE}^2$ obtained via maximum likelihood estimation does not equal $\sigma^2$, so $\sigma_{MLE}^2$ is a biased estimator.

So what is the actual unbiased estimator of $\sigma^2$?
$$\hat \sigma_{MLE}^2 = \frac{1}{N -1} \sum_{i=1}^N (x^{(i)} - \mu_{MLE})^2$$
Since $\hat \sigma_{MLE}^2 = \frac{N}{N-1}\,\sigma_{MLE}^2$, taking expectations in the derivation above gives:
$$\mathbb E [\hat \sigma_{MLE}^2] = \frac{N}{N-1} \cdot \frac{N-1}{N}\, \sigma^2 = \sigma^2$$
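In practice this correction is what NumPy's `ddof` argument controls: `np.var(x)` divides by $N$ (the biased MLE), while `np.var(x, ddof=1)` divides by $N-1$ (the unbiased estimator). A minimal sketch, with an arbitrary sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=10)   # one small Gaussian sample (illustration values)

var_biased = np.var(x)              # divides by N     -> sigma^2_MLE
var_unbiased = np.var(x, ddof=1)    # divides by N - 1 -> unbiased estimator

N = len(x)
# The two differ exactly by the factor N / (N - 1)
print(var_unbiased, var_biased * N / (N - 1))
```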
Why $\sigma_{MLE}^2$ Is Biased
Recall the maximum likelihood estimate $\sigma_{MLE}^2$:
$$\sigma_{MLE}^2 = \frac{1}{N} \sum_{i=1}^N (x^{(i)} - \mu_{MLE})^2$$
The truly unbiased result, by contrast, can be written using the true mean $\mu$:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^N (x^{(i)} - \mu)^2$$
Compare $\mu$ with $\mu_{MLE}$:

- $\mu$ is a parameter of a given Gaussian distribution; once the distribution is fixed, the value of $\mu$ objectively exists and is determined;
- $\mu_{MLE}$ is an estimate of $\mu$ produced by maximum likelihood estimation, i.e. $\mu_{MLE} \approx \mu$.
As long as $\mu_{MLE}$ differs from $\mu$, $\sigma_{MLE}^2$ is a biased estimator.

One point to note: unbiasedness means the expectation of the estimate is unbiased; any individual estimate still deviates from the truth.

In theory, to obtain $\mu$ exactly one would have to take out all samples of the probability model and estimate from them; but if all samples really could be taken out, it would no longer be called estimation, it would be certainty. As mentioned above, once a probability model is fixed its samples can never be exhausted $\to$ we can only go the other way and estimate the model from finite samples.

That is why the derivation above never directly replaced $\mu_{MLE}$ with $\mu$; everything was done via $\mathbb E[\mu_{MLE}] = \mu$.
References:
- 无偏估计 (Unbiased estimation), Baidu Baike
- 机器学习-白板推导系列(二)-数学基础 (Machine Learning Whiteboard Derivation Series (2): Mathematical Foundations)