前言
本文介绍SignSGD与SignUM的非凸收敛性。由于SignSGD原文给出的证明不太好读,本文主要参考[1],其刻画了梯度的$\varphi-norm$的收敛性,一定程度上反映了梯度收敛至一个稳定点的速度。
SignSGD与SignUM
上一篇文章中我们介绍了SignSGD与SignUM:
$$ \begin{align} &\text{SignSGD}:\boldsymbol{\theta}^{(\tau)}=\boldsymbol{\theta}^{(\tau-1)}-\eta^{(\tau)}\cdot\text{sign}(\boldsymbol{g}^{(\tau)})\\ &\text{SignUM}:\begin{cases} \boldsymbol{m}^{(\tau)}=\beta\boldsymbol{m}^{(\tau-1)}+(1-\beta)\boldsymbol{g}^{(\tau)}\\ \boldsymbol{\theta}^{(\tau)}=\boldsymbol{\theta}^{(\tau-1)}-\eta^{(\tau)}\cdot\text{sign}(\boldsymbol{m}^{(\tau)}) \end{cases} \end{align}\tag{1} $$
记号与相关知识回顾
首先回顾一下我们的记号。记满足$k$阶连续可微且其$p(\leq k)$阶导函数Lipschitz连续,即:
$$ \|f^{(p)}(\boldsymbol{x})-f^{(p)}(\boldsymbol{y}) \|\leq L\|\boldsymbol{x} -\boldsymbol{y}\| $$
的所有函数$f$构成的集合为$\mathcal C_{L}^{k,p}$,特别地,若$f\in\mathcal C^{1,1}_L$,则称$f$为Lipschitz光滑,简称$L$-光滑。$||\cdot||$默认指标准内积诱导的$L_2$范数。期望算子$\mathbb E[\cdot]$定义为:$\mathbb{E}[\cdot]=\mathbb{E}_{\boldsymbol{\xi}_{i_1}}\cdots \mathbb{E}_{\boldsymbol{\xi}_{i_{\tau}}} [\cdot]$。
考虑迭代:
$$ \boldsymbol{\theta}^{(\tau)}=\boldsymbol{\theta}^{(\tau-1)}-\eta^{(\tau)}\cdot \boldsymbol{u}^{(\tau)}\tag{2} $$
其在非凸情况下的收敛性我们通常会假定损失函数满足$\mathcal L\in \mathcal C_{L}^{1,1}$,同时考察其收敛至一个稳定点的速度:
$$ \text{Convergence rate:}\ \mathbb{E}\|\nabla \mathcal{L}(\boldsymbol{\theta}^{(\tau)}) \|^2_p\quad or\quad\mathbb{E}\|\boldsymbol{g}^{(\tau)} \|^2_p $$
或者放宽一点考察:
$$ \text{Convergence rate:}\ \min_{k=1,\cdots,\tau} \mathbb{E}\|\nabla \mathcal{L}(\boldsymbol{\theta}^{(k)}) \|^2_p\quad or\quad \min_{k=1,\cdots,\tau} \mathbb{E}\|\boldsymbol{g}^{(\tau)}\|^2_p $$
假设与定义
接下来我们介绍一些证明需要用到的假设与定义。
(假设1) 设随机梯度$\boldsymbol{g}^{(\tau)}$是一个单峰对称分布,且$\mathbb{E}[\boldsymbol{g}^{(\tau)}]=\nabla \mathcal L(\boldsymbol{\theta}^{(\tau)})$,方差$\mathbb{D}[g_i^{(\tau)}]$记为$\left(\sigma^{(\tau)}_i\right)^2$,$g_i^{(\tau)}\neq 0$,且有:
$$ \forall i\in\{0,\cdots,d\},\exists c_i,\ \text{s.t.}\ \left(\sigma^{(\tau)}_i\right)^2\leq c_i\left(g^{(\tau)}_i\right)^2\tag{3} $$
(Success Probability) 记:$\boldsymbol{\varphi}^{(\tau)}=\textbf{P}[\text{sign}(\boldsymbol{g}^{(\tau)})=\text{sign}(\nabla \mathcal L(\boldsymbol{\theta}^{(\tau)}))]$为success probabilities。
(假设2) $\forall i\in \{0,\cdots,d\}$,$\varphi_i^{(\tau)}>\frac 12$。
(假设3) $\mathcal L\in \mathcal C_{L}^{1,1}$,也可等价表述为:$0 \leq \mathcal L(\boldsymbol{\theta_1})-\mathcal L(\boldsymbol{\theta_2}) - \left\langle \nabla \mathcal L(\boldsymbol{\theta_2}),\ \boldsymbol{\theta_1}-\boldsymbol{\theta_2} \right\rangle \leq \frac L2 \left\| \boldsymbol{\theta_1}-\boldsymbol{\theta_2} \right\|^2$
($\varphi-norm$) 定义:$||\boldsymbol{g}^{(\tau)}||_\varphi:=\sum\limits_{i=1}^d (2\varphi_i^{(\tau)}-1)|g_i^{(\tau)}|$,称其为随机梯度$\boldsymbol{g}^{(\tau)}$的$\varphi$-“范数”。
引理
然后我们介绍证明需要用到的引理。引理的证明就不展开了,感兴趣的读者可以参阅原论文。
(引理1) $\forall i \in \{0,\cdots,d\}$,有:
$$ \varphi_i^{(\tau)}\geq \frac12 +\frac{1}{2}\frac{|g_i^{(\tau)}|}{|g_i^{(\tau)}|+\sqrt 3\sigma_i^{(\tau)}}>\frac 12\tag{4} $$
(引理2) 在Mini-batch的框架下,若batch size满足$b>2\max_i c_i$,则有:
$$ \varphi_i^{(\tau)}\geq 1-\frac{c_i}{b}>\frac 12\tag{5} $$
(引理3) 设$\left(\nu_i^{(\tau)} \right)^3=\mathbb{E}[(g_i^{(\tau)}-\mathbb{E}[g_i^{(\tau)}])^3]$为随机梯度$g_i^{(\tau)}$的三阶中心距,则若batch size满足:
$$ b>2\min\left\{\frac{\left(\sigma_i^{(\tau)} \right)^2}{\left(g_i^{(\tau)} \right)^2},\frac{\left(\nu_i^{(\tau)} \right)^3}{|g_i^{(\tau)} |\left(\sigma_i^{(\tau)} \right)^2} \right\}\tag{6} $$
则$\varphi_i^{(\tau)}>\frac 12$。
在证明之前
在正式的证明之前,我们需要来分析分析所谓的$\varphi-norm$究竟是什么,以及我们为什么需要这个定义。代入引理1到$\varphi-norm$的定义中,则有:
$$ \begin{align} ||\boldsymbol{g}^{(\tau)}||_\varphi&=\sum\limits_{i=1}^d (2\varphi_i^{(\tau)}-1)|g_i^{(\tau)}|\\ &\geq\sum\limits_{i=1}^d \frac{\left|g_i^{(\tau)} \right|^2}{(1+\sqrt{3c_i})\left|g_i^{(\tau)} \right|} \end{align}\tag{7} $$
显然这本质上是缩放后的$L_1$范数。同时如果迭代到了某个稳定点附件,即$\max_{1\leq i\leq d}|g_i^{(\tau)}|\leq \tilde c$,则有:
$$ ||\boldsymbol{g}^{(\tau)}||_\varphi\geq ||\boldsymbol{g}^{(\tau)}||_2^2\cdot\sum_{i=1}^d \frac{1}{(1+\sqrt{3c_i})\tilde c}\tag{8} $$
所以$\varphi$同时刻画了$L_1$与$L_2$的性质,因此如果我们考察$||\boldsymbol{g}^{(\tau)}||_\varphi$是合理的。现在我们搞懂了$\varphi-norm$的含义,那么还剩下一个核心问题:我们为什么需要它? 我们首先来回顾SGD在L-光滑非凸情况下的收敛性证明过程。由L-光滑:
$$ \mathcal L(\boldsymbol{\theta}^{(\tau)})\leq\mathcal L(\boldsymbol{\theta}^{(\tau-1)})+\left\langle \nabla\mathcal L(\boldsymbol{\theta}^{(\tau)}),\boldsymbol{\theta}^{(\tau)}-\boldsymbol{\theta}^{(\tau-1)} \right\rangle+\frac L2 \|\boldsymbol{\theta}^{(\tau)}-\boldsymbol{\theta}^{(\tau-1)} \|^2\tag{9} $$
代入SGD的参数迭代式:$\boldsymbol{\theta}^{(\tau)}=\boldsymbol{\theta}^{(\tau-1)}-\eta^{(\tau)}\cdot\boldsymbol{g}^{(\tau)}$并在不等式两边对第$\tau$次迭代的随机性变量$\boldsymbol{\xi}_{i_\tau}$取期望,则有:
$$ \mathbb{E}_{\boldsymbol{\xi}_{i_\tau}}[\mathcal L(\boldsymbol{\theta}^{(\tau)})]\leq \mathcal L(\boldsymbol{\theta}^{(\tau-1)})-\eta^{(\tau)}\cdot {\color{Red}\mathbb{E}_{\boldsymbol{\xi}_{i_\tau}}\left[\left\langle\nabla\mathcal L(\boldsymbol{\theta}^{(\tau)}),\boldsymbol{g}^{(\tau)} \right\rangle\right]}+\frac{L\left(\eta^{(\tau)} \right)^2}{2}{\color{Cyan}\mathbb{E}_{\boldsymbol{\xi}_{i_\tau}} ||\boldsymbol{g}^{(\tau)}||^2}\tag{10} $$
这样我们便自然地得到了$||\boldsymbol{g}^{(\tau)}||^2$,可以很方便地刻画其收敛性,如果对随机梯度的方差进行一些假设也可以得到$||\nabla \mathcal L(\boldsymbol{\theta}^{(\tau)})||^2$的收敛性。然而我们把SignSGD的迭代式(1)代入,首先最后的$||\boldsymbol{g}^{(\tau)}||^2$不存在了,这是由于$\text{sign}^2(\boldsymbol{g}^{(\tau)})\equiv \boldsymbol{1}$;而且中间的点积变成了:$\mathbb{E}_{\boldsymbol{\xi}_{i_\tau}}\left[\left\langle \nabla\mathcal L(\boldsymbol{\theta}^{(\tau)}),\text{sign}(\boldsymbol{g}^{(\tau)}) \right\rangle\right]$,也即:
$$ \mathbb{E}_{\boldsymbol{\xi}_{i_\tau}}[\mathcal L(\boldsymbol{\theta}^{(\tau)})]\leq \mathcal L(\boldsymbol{\theta}^{(\tau-1)})-\eta^{(\tau)}\cdot \mathbb{E}_{\boldsymbol{\xi}_{i_\tau}}\left[\left\langle\nabla\mathcal L(\boldsymbol{\theta}^{(\tau)}),\text{sign}(\boldsymbol{g}^{(\tau)}) \right\rangle\right]+\frac{dL}{2}\left(\eta^{(\tau)} \right)^2\tag{11} $$
那么此时我们该刻画谁的收敛性呢?在引入$\varphi-norm$后,我们有一个核心结论:
事实上这个结论的证明也并不复杂:($\text{sign}(g_i^{(\tau)})$是一个两点分布)
$$ \begin{align} \mathbb{E}_{\boldsymbol{\xi}_{i_\tau}}\left[\left\langle \nabla\mathcal L(\boldsymbol{\theta}^{(\tau)}),\text{sign}(\boldsymbol{g}^{(\tau)}) \right\rangle\right]&=\sum_{i=1}^d\nabla\mathcal L(\boldsymbol{\theta}^{(\tau)})_i\cdot\mathbb{E}_{\boldsymbol{\xi}_{i_\tau}}[\text{sign}(g_i^{(\tau)})]\\ &=\sum_{i=1}^d \nabla\mathcal L(\boldsymbol{\theta}^{(\tau)})_i\cdot\left[\varphi_i^{(\tau)}\text{sign}(\nabla\mathcal L(\boldsymbol{\theta}^{(\tau)})_i)+(1-\varphi_i^{(\tau)})(-\text{sign}(\nabla\mathcal L(\boldsymbol{\theta}^{(\tau)})_i)) \right]\\ &=\sum_{i=1}^d (2\varphi_i^{(\tau)}-1)|g_i^{(\tau)}|=||\boldsymbol{g}^{(\tau)}||_\varphi \end{align} $$
SignSGD非凸收敛性
将引理4代入(11),再取全期望则有:
$$ \sum_{k=1}^\tau \eta^{(k)}\mathbb{E}||\boldsymbol{g}^{(k)} ||_\varphi\leq \epsilon(0)+\frac{dL}{2}\sum_{k=1}^\tau \left(\eta^{(\tau)} \right)^2\tag{12} $$
于是我们有如下结论:
学习率 | $\mathbb{E} | \boldsymbol{g}^{(\tau)} | _\varphi$ | 阶数 | ||
---|---|---|---|---|---|---|
固定,$\eta^{(\tau)}\equiv\eta>0$ | $\frac{\epsilon(0)}{\eta\tau}+\frac{\eta dL}{2}$ | $O(\frac{1}{\tau})+C$ | ||||
衰减,$\eta^{(\tau)}=\frac{\eta}{\sqrt\tau}$ | $\frac{2\epsilon(0)+dL\eta^2(1+\ln \tau)}{2\eta\sqrt \tau}$ | $O(\frac{\ln \tau}{\sqrt{\tau}})$ | ||||
衰减,$\eta^{(\tau)}=\frac \eta{\tau}$ | $O(\frac{\epsilon(0)+\pi^2 dL\eta^2/12}{\ln \tau})$ | $O(\frac{1}{\ln \tau})$ |
具体的证明留给读者完成。提示:
$$ \sum_{k=1}^\tau \frac{1}{\sqrt k}\geq \sqrt\tau,\ \ln\tau\leq \sum_{k=1}^\tau \frac 1\tau\leq 1+\ln\tau,\ \lim_{\tau\to \infty}\sum_{k=1}^\tau \frac{1}{k^2}=\frac {\pi^2} {6} $$
SignUM非凸收敛性
我们再来看SignUM的收敛性,同样由L-光滑:
$$ \begin{align} \mathcal L(\boldsymbol{\theta}^{(\tau)})&\leq \mathcal L(\boldsymbol{\theta}^{(\tau-1)})-\eta^{(\tau)}\cdot \left\langle\nabla\mathcal L(\boldsymbol{\theta}^{(\tau)}),\text{sign}({\color{Red}\boldsymbol{m}^{(\tau)}}) \right\rangle+\frac{dL}{2}\left(\eta^{(\tau)} \right)^2\\ &\leq \mathcal L(\boldsymbol{\theta}^{(\tau)})-\eta^{(\tau)}\cdot{\color{Red}\frac{\left\langle\nabla \mathcal L(\boldsymbol{\theta}^{(\tau)}),\boldsymbol{m}^{(\tau)} \right\rangle}{||\boldsymbol{m}^{(\tau)}||}}+\frac{dL}{2}\left(\eta^{(\tau)} \right)^2 \end{align}\tag{13} $$
我们引入一个引理:
(引理5) 设有$\boldsymbol{a},\boldsymbol{b}\in \mathbb{R}^d\neq \boldsymbol{0}$,则:
$$ -\frac{\langle \boldsymbol{a}, \boldsymbol{b}\rangle}{\|\boldsymbol{a}\|} \leq-\frac{1}{3}\|\boldsymbol{b}\|+\frac{8}{3}\|\boldsymbol{a}-\boldsymbol{b}\|\tag{14} $$
所以我们有:
$$ \mathcal L(\boldsymbol{\theta}^{(\tau)})\leq \mathcal L(\boldsymbol{\theta}^{(\tau-1)})-\frac{\eta^{(\tau)}}{3}\|\nabla \mathcal L(\boldsymbol{\theta}^{(\tau)})\|+\frac{8\eta^{(\tau)}}{3}\|\boldsymbol{m}^{(\tau)}-\nabla \mathcal L(\boldsymbol{\theta}^{(\tau)}) \|+\frac{dL}{2}\left(\eta^{(\tau)} \right)\tag{15} $$
我们现在的关键是如何对$\|\boldsymbol{m}^{(\tau)}- \nabla \mathcal L(\boldsymbol{\theta}^{(\tau)}) \|$进行放缩。我们记:$\tilde {\boldsymbol{\epsilon}}^{(\tau)}=\boldsymbol{m}^{(\tau)}- \nabla \mathcal L(\boldsymbol{\theta}^{(\tau)})$,$\boldsymbol{\epsilon}^{(\tau)}=\boldsymbol{g}^{(\tau)}-\nabla \mathcal L(\boldsymbol{\theta}^{(\tau)})$,注意到:
$$ \begin{align} \tilde{\boldsymbol{\epsilon}}^{(\tau+1)}&=\beta\boldsymbol{m}^{(\tau)}+(1-\beta)\boldsymbol{g}^{(\tau+1)}-\nabla\mathcal L(\boldsymbol{\theta}^{(\tau)})\\ &=\beta \tilde{\boldsymbol{\epsilon}}^{(\tau)}+(1-\beta)\boldsymbol{\epsilon}^{(\tau)}+\beta(\nabla\mathcal L(\boldsymbol{\theta}^{(\tau)})-\nabla\mathcal L(\boldsymbol{\theta}^{(\tau+1)}))\\ &=\beta^\tau\tilde{\boldsymbol{\epsilon}}^{(1)}+(1-\beta)\sum_{k=1}^\tau \beta^{k-1}\boldsymbol{\epsilon}^{(\tau-k+1)} +\beta\sum_{k=1}^\tau \beta^{k-1}(\nabla\mathcal L(\boldsymbol{\theta}^{(\tau-k)})-\nabla\mathcal L(\boldsymbol{\theta}^{(\tau-k+1)})) \end{align}\tag{16} $$
为了进一步放缩,我们需要引入假设:
(假设4) $\mathbb{E}\|\boldsymbol{g}^{(\tau)}-\nabla \mathcal L(\boldsymbol{\theta}^{(\tau)}) \|^2\leq \sigma^2$。
假设4的一个直接推论是:
$$ \mathbb{E}\left[\left\langle\boldsymbol{\epsilon}^{(k)}, \boldsymbol{\epsilon}^{(k')}\right\rangle\right]=\begin{cases} \sigma^2&,k=k'\\ 0&,k\neq k' \end{cases}\tag{17} $$
由L-光滑的假设:
$$ \|\nabla\mathcal L(\boldsymbol{\theta}^{(\tau-1)})-\nabla\mathcal L(\boldsymbol{\theta}^{(\tau)}) \|\leq L\|\boldsymbol{\theta}^{(\tau-1)}-\boldsymbol{\theta}^{(\tau)} \|=L\eta^{(\tau)}\|\text{sign}(\boldsymbol{m}^{(\tau)}) \|\leq L\eta^{(\tau)}\sqrt{d}\tag{18} $$
现在对(16)式等式两边同时取$L_2$范数:($\tilde{\boldsymbol{\epsilon}}^{(1)}=\boldsymbol{\epsilon}^{(1)}$)
$$ \begin{align} \mathbb{E}\|\tilde{\boldsymbol{\epsilon}}^{(\tau)} \|&\leq\beta^\tau\sigma+L\sqrt{d}\sum_{k=1}^\tau \beta^k\eta^{(\tau-k+1)}+(1-\beta)\sqrt{\mathbb{E}\left\|\sum_{k=1}^\tau \beta^{k-1}\boldsymbol{\epsilon}^{(\tau-k+1)}\right\|^2}&代入(18)\\ &\leq \beta^\tau\sigma+L\sqrt{d}\sum_{k=1}^\tau \beta^k\eta^{(\tau-k+1)}+(1-\beta)\sqrt{\sum_{k=1}^\tau \beta^{2k}\sigma^2}&代入(17)\\ &\leq \beta^\tau\sigma+L\sqrt{d}\sum_{k=1}^\tau \beta^k\eta^{(\tau-k+1)}+\sigma\beta\sqrt{\frac{1-\beta}{1+\beta}}\\ &\leq \beta^\tau\sigma+\frac{L\sqrt{d}\beta}{1-\beta}+\sigma\beta\sqrt{\frac{1-\beta}{1+\beta}}&\eta^{(\tau)}<1 \end{align} $$
再对(15)式累和则得到了最后的结论,取:$\beta^{(\tau)}=1-\frac{1}{\tau},\eta^{(\tau)}=\frac{1}{\tau^{3/4}}$,则:
$$ \min_{1\leq k\leq \tau}\mathbb{E}\|\nabla\mathcal L(\boldsymbol{\theta}^{(k)}) \|\leq O(\frac{C+\sqrt{d}}{\tau^{\frac 14}}) $$
References
- Stochastic Sign Descent Methods: New Algorithms and Better Theory,Mher Safaryan, Peter Richtárik,ICML 2019,2019 [[ICML PDF]](http://proceedings.mlr.press/v139/safaryan21a/safaryan21a.pdf) [[Newest Version PDF]](https://arxiv.org/pdf/1905.12938.pdf)