An Overview of Gradient Descent Algorithms
Today we share an overview of gradient descent algorithms, following the 2016 survey paper "An overview of gradient descent optimization algorithms" and adding some analysis and commentary of our own along the way.
<h2 style="color: black; text-align: left; margin-bottom: 10px;">Background</h2>Gradient descent is one of the most classical optimization algorithms and occupies a very important place in the field of optimization. It was first proposed by Cauchy and is the most fundamental first-order optimization algorithm. Suppose our goal is to find a minimizer of a smooth objective function $f(\theta)$, where $\theta$ denotes the model parameters. The gradient descent update rule is $$\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t),$$ where $\theta_t$ denotes the numerical solution obtained at the $t$-th iteration, $\nabla f(\theta_t)$ denotes the first derivative (i.e., the gradient) of the objective with respect to the parameters, and the hyperparameter $\eta > 0$ is called the step size or learning rate. "Gradient descent" means searching for a minimizer by moving along the negative gradient direction of the objective (the direction in which the objective decreases fastest at the current point); how far each step travels is determined by the learning rate, see Figure 1. Since the gradient is zero at a minimizer, intuition suggests that gradient descent eventually stops at a point with zero gradient, which under certain assumptions is a minimizer. This gives rise to a series of theoretical questions about gradient descent, such as its convergence rate, the choice of learning rate, and how to accelerate convergence.<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_jpg/1y1ObuUF34zXAnmw5JaaHDmSeGK4H91L8VBQhUBoZWdDkPJqxFINZ8UGCbusce1mM9TSSj6fiafEI9OWic0DicQGg/640?wx_fmt=jpeg&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;">Figure 1: Illustration of gradient descent</p>
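<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">To make the update rule concrete, here is a minimal sketch of gradient descent in Python. The quadratic objective, the learning rate 0.1, and the function names below are illustrative choices of ours, not taken from the survey.</p>

```python
def gradient_descent(grad, theta0, lr=0.1, n_iters=100):
    """Iterate theta_{t+1} = theta_t - lr * grad(theta_t)."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - lr * grad(theta)
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
# and whose unique minimizer is theta = 3.
theta_star = gradient_descent(lambda th: 2.0 * (th - 3.0), theta0=0.0)
print(round(theta_star, 4))  # prints 3.0
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">With this step size the distance to the minimizer contracts by a constant factor at every iteration; too large a step size would make the iterates diverge, which is exactly the learning-rate trade-off discussed above.</p>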
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">With the rise of deep learning in recent years, gradient-based algorithms have become the most important methods for optimizing neural networks. All major deep learning frameworks (such as Caffe, Keras, TensorFlow, and PyTorch) support a wide variety of gradient-based optimizers. The main reason gradient methods matter so much in deep learning is that the millions or tens of millions of trainable parameters in deep neural networks make higher-order methods such as Newton and quasi-Newton methods infeasible, whereas gradient methods can be computed efficiently with the support of GPU resources. There is no free lunch, however: the price paid for the computational feasibility of first-order algorithms is their slower convergence rate, and when the objective is close to ill-conditioned, convergence becomes slower still. Moreover, because the loss functions of deep neural networks are extremely complex and non-convex, the properties of the parameters obtained by gradient methods are unknown and depend on the choice of algorithm hyperparameters and initialization. As a result, the gradient methods used in deep learning today are, in a certain sense, black-box algorithms, and their interpretability remains a major challenge.</p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;">Variants of Gradient Descent</h2>In the background section we referred to $f(\theta)$ as the objective function, where $f$ has a fixed, deterministic form. In practical machine learning problems, we usually consider stochastic optimization instead. Specifically, suppose a random sample $\xi$ follows some distribution $P$, and we are given a loss $\ell(\theta; \xi)$, where $\theta$ is the parameter of interest. The true parameter $\theta^*$ satisfies $$\theta^* = \arg\min_{\theta} \mathbb{E}_{\xi \sim P}\, \ell(\theta; \xi).$$ Since the distribution $P$ is unknown, this optimization problem cannot be solved directly. If we obtain a set of observations $\xi_1, \dots, \xi_n$ (usually assumed i.i.d.), we instead consider the empirical risk minimization problem $$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; \xi_i),$$ where $\ell$ is typically called the loss function, for example the squared loss or the cross-entropy loss. The minimizer $\hat{\theta}$ is itself an estimator of $\theta^*$, and what our algorithms actually search for is $\hat{\theta}$. Compared with a deterministic objective, the loss here introduces a new quantity, the sample size $n$, and using different sample sizes to compute the gradient yields different gradient descent algorithms.<h3 style="color: black; text-align: left; margin-bottom: 10px;">Batch Gradient Descent</h3>Batch gradient descent (BGD), also called full gradient (FG) in some references, computes the gradient on the full sample. If the parameter $\theta \in \mathbb{R}^d$ and the full sample has size $n$, then a single BGD iteration costs about $O(nd)$. The pros and cons of BGD are clear-cut. Its main advantage is good theoretical behavior: it is guaranteed to find a minimizer. If the loss is convex (strongly convex), BGD finds the global minimizer at a sublinear (linear) convergence rate. Its drawback is low computational efficiency when the sample size is large. For example, the ImageNet dataset in computer vision has more than a million training examples, an enormous per-iteration cost even if we ignore the dimension $d$. In addition, because BGD requires the full sample, it is unsuitable for online learning.<h3 style="color: black; text-align: left; margin-bottom: 10px;">Stochastic Gradient Descent</h3>Stochastic gradient descent (SGD), unlike BGD, uses only a single sample to compute the gradient at each update, so one SGD iteration costs about $O(d)$, a large efficiency gain over BGD. SGD also suits online learning: each arriving sample can trigger one update. The price SGD pays is the extra variance it introduces into the gradient. Although, conditional on the full sample, the single-sample gradient is an unbiased estimate of the full gradient, the direction of a single-sample gradient rarely agrees with the full-sample direction, so SGD updates fluctuate heavily. With a constant learning rate, SGD cannot locate the minimizer exactly even for convex (strongly convex) losses; the distance between the final iterate and the minimizer is governed by the learning rate and an upper bound on the gradient variance.<h3 style="color: black; text-align: left; margin-bottom: 10px;">Mini-batch Gradient Descent</h3>Mini-batch gradient descent (MGD) can be viewed as a compromise between BGD and SGD, and it is in practice the most widely used gradient descent variant in deep learning; what many papers call "SGD" is actually MGD. Each update uses a batch of size $m$ (the batch size) to compute the gradient, so one MGD iteration costs about $O(md)$. Compared with BGD the efficiency improves, while the variance of the injected gradient noise is smaller than SGD's, giving better stability. MGD has its own issue: it introduces a new hyperparameter $m$. In practical deep learning tasks, researchers have found that the choice of $m$ affects not only convergence efficiency but also the final out-of-sample accuracy. Common choices of $m$ range from a few dozen to a few hundred (e.g., 50 to 256). Finally, we give a schematic comparison of the three gradient descent variants that use different sample sizes.<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/1y1ObuUF34zXAnmw5JaaHDmSeGK4H91LpPNI9X9q0IjfbAWl0w28TcmVH14Vt7rE4MWbZJtD8tWXJVSvfOyYxA/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;">Figure 2: Comparison of the three gradient descent variants</p>
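少于">
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The three variants above differ only in which index set is used to form the gradient. As a sketch, here is mini-batch gradient descent on a synthetic least-squares problem; the data, step size, and batch size are illustrative assumptions of ours, not prescriptions from the survey.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: y = X @ w_true + noise (illustrative setup).
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def grad(w, idx):
    """Mean-squared-error gradient computed only on the rows in `idx`."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(d)
lr, batch_size = 0.05, 32
for epoch in range(50):
    perm = rng.permutation(n)              # reshuffle once per pass over the data
    for start in range(0, n, batch_size):  # mini-batch gradient descent (MGD)
        idx = perm[start:start + batch_size]
        w -= lr * grad(w, idx)

# BGD would use idx = np.arange(n); SGD would use batches of size 1.
print(np.allclose(w, w_true, atol=1e-2))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Only the index set changes between the variants, so the per-iteration cost scales with the batch size exactly as described above, while the noise in each update shrinks as the batch grows.</p>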
<h2 style="color: black; text-align: left; margin-bottom: 10px;">Improved Gradient Descent Algorithms</h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">As noted above, gradient descent, being a first-order algorithm, has inherent limitations: updating with only first-order information about the loss is a rather "short-sighted" strategy, and convergence on ill-conditioned problems is extremely slow. Researchers have therefore proposed a series of methods to accelerate gradient descent, mainly by modifying its two key components: the update direction and the learning rate. We next introduce several of the most important such algorithms. Note that the original versions of the algorithms below were all discussed in the non-stochastic or BGD setting, whereas practical deep learning applications usually employ their MGD versions; we do not distinguish these details in what follows.</p>
<h3 style="color: black; text-align: left; margin-bottom: 10px;">Gradient Descent with Momentum</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Gradient descent with momentum, also known as the heavy-ball (HB) method, is a classical improvement of gradient descent, with the following update rule:</p>$$v_t = \gamma v_{t-1} + \eta \nabla f(\theta_t), \qquad \theta_{t+1} = \theta_t - v_t,$$ where $\gamma$ is called the momentum parameter, with a recommended value of 0.9 in the literature. The core idea is to adjust the update direction of gradient descent by using an exponentially weighted average of historical gradients as the direction of the parameter update. Momentum brings two main improvements. (1) Acceleration. The term "momentum" comes from physics: we can picture an object sliding down a slope, whose velocity at any moment is the vector sum of the downhill direction at the current position (the current gradient direction) and the object's inertia (the historical velocity direction). Taking historical gradient information into account therefore makes gradient descent move downhill faster. (2) Damping oscillations. Gradient descent is affected by the condition number of the Hessian of the loss (the ratio of its largest to smallest eigenvalue); the larger the condition number, the more ill-conditioned the Hessian, and the gradient becomes extremely sensitive to certain directions in parameter space, so the update path oscillates violently. Momentum accumulates the gradients along the correct direction of progress and cancels part of the oscillation amplitude along the sensitive directions. On the theoretical side, one can show that momentum mainly corrects the effect of the Hessian condition number $\kappa$: the optimal contraction factor improves from $\frac{\kappa - 1}{\kappa + 1}$ to $\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}$, thereby accelerating convergence.<h3 style="color: black; text-align: left; margin-bottom: 10px;">Nesterov Accelerated Gradient</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The Nesterov accelerated gradient (NAG) method is an improved version of momentum, with the following update rule:</p>$$v_t = \gamma v_{t-1} + \eta \nabla f(\theta_t - \gamma v_{t-1}), \qquad \theta_{t+1} = \theta_t - v_t,$$ where the recommended value of the momentum parameter $\gamma$ is again 0.9. NAG models a "smarter" ball: instead of short-sightedly using the gradient at the current point, it looks ahead to the gradient at the future position reached by following the momentum direction, i.e., the gradient at $\theta_t - \gamma v_{t-1}$. When the gradient direction changes, the correction mechanism of plain momentum often needs several steps to accumulate, whereas NAG can correct earlier. This look-ahead lets NAG suppress oscillations further and thus accelerate convergence. For convex losses, NAG provably improves the convergence rate from $O(1/t)$ to $O(1/t^2)$.<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_jpg/1y1ObuUF34zXAnmw5JaaHDmSeGK4H91LX4ZbKrsBViaGpbq3tap4tjddooNdE6ePQ3SicZSaBMrQg0F9icjibj9sYw/640?wx_fmt=jpeg&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;">Figure 3: Comparison of momentum and NAG</p>
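<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The two updates above can be sketched side by side on an ill-conditioned quadratic. The test function, step size, and iteration count are illustrative choices of ours; the velocity here accumulates raw gradients, which is equivalent to the formulas above up to a rescaling by $\eta$.</p>

```python
import numpy as np

# f(x) = 0.5 * x^T diag(1, 100) x: condition number 100, minimizer at 0.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x

def heavy_ball(x0, lr=0.009, beta=0.9, n_iters=300):
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(n_iters):
        v = beta * v + grad(x)                  # exponentially weighted gradient history
        x = x - lr * v
    return x

def nesterov(x0, lr=0.009, beta=0.9, n_iters=300):
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(n_iters):
        v = beta * v + grad(x - lr * beta * v)  # "look-ahead" gradient
        x = x - lr * v
    return x

x0 = np.array([1.0, 1.0])
print(np.linalg.norm(heavy_ball(x0)) < 1e-3,
      np.linalg.norm(nesterov(x0)) < 1e-3)  # prints: True True
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Both methods tolerate the condition number of 100 here; plain gradient descent with the same step size would converge far more slowly along the flat direction.</p>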
<h3 style="color: black; text-align: left; margin-bottom: 10px;">AdaGrad</h3>AdaGrad accelerates gradient descent from the learning-rate side. The reason classical gradient descent suffers on ill-conditioned problems is, at bottom, that all parameters share the same learning rate: to guarantee convergence, the learning rate is constrained by the parameter directions with the largest Hessian eigenvalues and thus becomes very small, so the steps in the other directions are too small and optimization is slow. If we could give each parameter its own learning rate, we might greatly accelerate convergence. Moreover, AdaGrad adjusts the per-parameter learning rates automatically during the iterations, which is known as an adaptive learning rate method. Writing $g_t = \nabla f(\theta_t)$, its update rule is $$G_t = G_{t-1} + g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t,$$ where $\odot$ denotes element-wise multiplication, the square root also acts element-wise, and $\epsilon$ is a small constant added for the numerical stability of the division. AdaGrad keeps an element-wise running sum of squared historical gradients and uses the inverse square root of this sum to weight the learning rate. AdaGrad's main contribution is this adaptive learning-rate adjustment, together with a modest acceleration of gradient descent. It has an obvious drawback, however: the denominator sums the entire history, so it grows ever larger with the iterations, making the learning rate decay too fast. In practice AdaGrad often stops prematurely before reaching a minimizer.<h3 style="color: black; text-align: left; margin-bottom: 10px;">AdaDelta</h3>AdaDelta is an extension and refinement of AdaGrad, designed mainly to address the monotone growth of AdaGrad's accumulated sum of squared gradients. Instead of the entire history, AdaDelta proposes using only the gradients inside a fixed window to compute the accumulated sum of squares. Since a window of size $w$ would require storing $w$ past squared gradients, AdaDelta instead accumulates the history with exponential weighting: $$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)\, g_t \odot g_t,$$ where the recommended value of the decay parameter $\rho$ is 0.9. This yields the iteration $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t.$$ The author further points out that the update rules of the previous gradient methods (including vanilla gradient descent, momentum, and AdaGrad) are not consistent in units: $\theta$ and $\nabla f(\theta)$ each carry their own units and scales, a mismatch that earlier algorithms ignored. To correct this, AdaDelta's final update becomes $$\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t, \qquad E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1-\rho)\, \Delta\theta_t \odot \Delta\theta_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t.$$ The numerator restores unit consistency and at the same time plays the role of the learning rate; AdaDelta therefore no longer requires a learning rate to be set.<h3 style="color: black; text-align: left; margin-bottom: 10px;">RMSprop</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">RMSprop is very similar in spirit to AdaDelta; the two algorithms were proposed in the same year by Hinton and Zeiler respectively, and, amusingly, Hinton was Zeiler's advisor. RMSprop first appeared in Hinton's Coursera course and was never formally published. Its update rule coincides with the first stage of AdaDelta:</p>$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t,$$ where the recommended value of the decay parameter $\rho$ is 0.9 and the recommended learning rate $\eta$ is 0.001.<h3 style="color: black; text-align: left; margin-bottom: 10px;">Adam</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The adaptive moment estimation algorithm (Adam) combines improvements to both the search direction and the learning rate; its original paper is the most cited paper officially published at ICLR. Adam's idea is very clean: for the search direction, borrow the exponentially weighted gradients of momentum; for the learning rate, borrow the adaptive adjustment of RMSprop. The "moment estimation" part corrects the bias introduced by the exponential weighting. The update rule is as follows:</p>$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t \odot g_t, \quad \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t.$$ The authors recommend $\beta_1 = 0.9$, $\beta_2 = 0.999$, and the small constant $\epsilon = 10^{-8}$. In practice Adam usually converges faster than the other gradient-based optimizers, which is a major reason for its popularity. Compared with the earlier algorithms, Adam records two pieces of history, $m_t$ and $v_t$, so like AdaDelta it must store one extra vector of length $d$.<h3 style="color: black; text-align: left; margin-bottom: 10px;">AdaMax and Nadam</h3>In this subsection we briefly introduce two important variants of Adam: AdaMax and Nadam. AdaMax is an extension proposed by the Adam authors themselves later in the original paper. The idea is to keep the search direction unchanged but to adjust the adaptive learning-rate weights, generalizing from the element-wise 2-norm to an element-wise $p$-norm, and even the infinity norm: $$u_t = \max(\beta_2 u_{t-1},\, |g_t|), \qquad \theta_{t+1} = \theta_t - \frac{\eta}{1-\beta_1^t} \cdot \frac{m_t}{u_t}.$$ For the infinity-norm case, the authors advise not applying bias correction to $u_t$.<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Nadam replaces the search-direction part of Adam with Nesterov momentum. Before giving its update rule, we first present an equivalent form of NAG.</p>Writing $g_t = \nabla f(\theta_t)$, NAG can be rewritten as $$v_t = \gamma v_{t-1} + \eta g_t, \qquad \theta_{t+1} = \theta_t - (\gamma v_t + \eta g_t).$$ This equivalent form differs from momentum only in the last step: momentum updates the parameters by $v_t$, while NAG uses $\gamma v_t + \eta g_t$. Recalling Adam's update (ignoring the $\epsilon$ term), $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}}\, \hat{m}_t,$$ and imitating NAG's equivalent form, we replace $\hat{m}_t$ with $\beta_1 \hat{m}_t + \frac{(1-\beta_1)\, g_t}{1-\beta_1^t}$ to obtain Nadam's update: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}} \left(\beta_1 \hat{m}_t + \frac{(1-\beta_1)\, g_t}{1-\beta_1^t}\right).$$<h3 style="color: black; text-align: left; margin-bottom: 10px;">Some Visualization Results</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">We show two figures comparing the behavior of the different optimization algorithms, see Figure 4.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/1y1ObuUF34zXAnmw5JaaHDmSeGK4H91Ldnk3qmsfRkUKVG4nibekQicqpprzMFxmVHPcq9EPGg0psDv17dD2uQeg/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;">Figure 4: Visualizations of the gradient descent algorithms</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Figure 4-(a) shows the optimization paths of six improved gradient descent algorithms on the surface of the Beale function. The contours changing from red to blue indicate function values going from large to small, and the minimum is marked with a star. All algorithms start from the same point. The adaptive learning-rate algorithms (AdaGrad, AdaDelta, and RMSprop) head straight toward the minimum on the right, whereas momentum (green curve) and NAG (purple curve) first wander into the narrow elongated region elsewhere before turning toward the minimum, with NAG correcting course faster than momentum.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Figure 4-(b) shows the optimization paths of the six algorithms near a saddle point. A saddle point is a point that satisfies the first-order condition (zero gradient) but whose Hessian has both positive and negative eigenvalues. Since gradient descent relies only on the first-order condition, in theory it can be trapped by saddle points. As shown, SGD (red curve) stops at the saddle point; momentum (green curve) and NAG (purple curve) linger near it for a while before gradually escaping; and the three adaptive learning-rate algorithms escape the saddle point quickly.</p>
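<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The adaptive learning-rate updates visualized above can be sketched directly from their formulas. The ill-conditioned quadratic and all hyperparameter values below are illustrative assumptions of ours; RMSprop corresponds to using Adam's second-moment denominator alone, without the momentum numerator.</p>

```python
import numpy as np

eps = 1e-8
# Gradient of f(x) = 0.5 * (x1^2 + 100 * x2^2), minimizer at 0.
grad = lambda x: np.array([1.0, 100.0]) * x

def adagrad(x, lr=0.5, n_iters=500):
    G = np.zeros_like(x)                        # running sum of squared gradients
    for _ in range(n_iters):
        g = grad(x)
        G += g * g
        x = x - lr * g / (np.sqrt(G) + eps)     # per-coordinate step size
    return x

def adam(x, lr=0.1, b1=0.9, b2=0.999, n_iters=500):
    m, v = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, n_iters + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g               # first-moment estimate
        v = b2 * v + (1 - b2) * g * g           # second-moment estimate
        m_hat = m / (1 - b1 ** t)               # bias correction
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

x0 = np.array([1.0, 1.0])
print(np.linalg.norm(adagrad(x0)) < 1e-2, np.linalg.norm(adam(x0)) < 0.05)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Note how the per-coordinate denominators let the badly scaled coordinate take usable steps despite the condition number of 100, which is exactly the behavior seen in the figure.</p>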
<h2 style="color: black; text-align: left; margin-bottom: 10px;">Related Research on Gradient Methods</h2>
<h3 style="color: black; text-align: left; margin-bottom: 10px;">Parallel and Distributed SGD</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In today's era of ubiquitous big data, data are no longer stored and used in a single uniform way. Correspondingly, when data remain on a single machine, researchers study how to parallelize SGD for additional speed; when data are scattered across different nodes, how to run SGD in a distributed fashion; and, going further, in an age with enormous numbers of data nodes and user privacy to protect, federated learning versions of SGD are likewise an active topic. For reasons of space we do not expand on these here.</p>
<h3 style="color: black; text-align: left; margin-bottom: 10px;">Training Strategies for Gradient Methods</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Beyond algorithmic improvements, the training strategies and supporting techniques that accompany an algorithm also have a large influence on model training. For example:</p><strong style="color: blue;">Data shuffling and curriculum learning</strong>. Some researchers have noted that avoiding a fixed, single order of data entering training improves a model's generalization, so in deep learning the data are shuffled after every pass over the full sample. On the other hand, others have pointed out that feeding data in some meaningful order can make a model converge faster and perform well; such methods are called curriculum learning.<strong style="color: blue;">Batch normalization</strong>. Batch normalization is a technique commonly used in convolutional neural networks that normalizes each mini-batch at certain nodes of the model. Extensive experiments show that it accelerates the convergence of gradient methods, allows larger initial learning rates, and reduces the influence of parameter initialization.<strong style="color: blue;">Early stopping</strong>. Monitor an accuracy metric on a validation set to decide whether to terminate training early.<strong style="color: blue;">Gradient noise</strong>. Adding independent Gaussian noise to the gradients has been proposed to make gradient methods more robust; the authors conjecture that the added noise helps the algorithm escape saddle points and suboptimal local minima.<h3 style="color: black; text-align: left; margin-bottom: 10px;">Remaining Challenges for Gradient Methods</h3>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Although we have introduced a great many improved gradient methods, gradient-based algorithms still face numerous challenges.</p>Choosing a suitable learning rate remains very difficult. Traditional optimization and the adaptive learning-rate algorithms discussed above provide many ways to address this, yet in training deep models these existing methods still run into trouble, so the choice continues to require care and extensive trial and error. Although gradient descent enjoys many theoretical properties and results, the neural networks obtained in practice often violate the assumptions behind those results. When optimizing extremely high-dimensional non-convex losses, guaranteeing that the algorithm is not trapped by saddle points or flat regions remains highly challenging. For such high-dimensional losses, gradient methods behave like black-box optimizers, and there is currently no particularly good way to visualize and analyze this kind of high-dimensional optimization.<h2 style="color: black; text-align: left; margin-bottom: 10px;">References</h2><p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Augustin-Louis Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes rendus des séances de l'Académie des sciences de Paris, pages 536–538, 10 1847.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Xi Chen, Jason D Lee, Xin T Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic
gradient descent. arXiv preprint arXiv:1610.08637, 2016.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Haskell B. Curry. The method of steepest descent for non-linear minimization problems. Quarterly of Applied
Mathematics, pages 258–261, 2 1944.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Timothy Dozat. Incorporating Nesterov momentum into Adam. ICLR Workshop, 2016.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic
optimization. Journal of Machine Learning Research, 12(7):257–269, 2011.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Igor Gitman, Hunter Lang, Pengchuan Zhang, and Lin Xiao. Understanding the role of momentum in stochastic gradient methods. In Proceedings of 33rd Conference and Workshop on Neural Information Processing Systems,
2019.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Machine Learning (ICML), 2016.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. In Proceedings of the 32nd annual international conference on machine learning, 2015.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Neural Information Processing Systems Conference (NIPS 2012), pages 1–11, 2012.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In International Conference
on Learning Representations, pages 1–13, 2015.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">H. Brendan Mcmahan and Matthew Streeter. Delay-tolerant algorithms for asynchronous distributed online
learning. In Neural Information Processing Systems Conference (NIPS 2014), pages 1–9, 2014.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, and J. Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR, 269, 1983.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">F. Niu, B. Recht, C. Re, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient
descent. In Neural Information Processing Systems Conference (NIPS 2011), pages 1–22, 2011.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic
optimization. In International Conference on Machine Learning, 2012.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Sixin Zhang, Anna Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Neural Information Processing Systems Conference (NIPS 2015), pages 1–9, 2015.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">R. S. Sutton. Two problems with backpropagation and other steepest-descent learning procedures for networks. In Proceedings of the Annual Conference of the Cognitive Science Society, 1986.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, pages 26–31, 2012.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Panos Toulis and Edoardo M. Airoldi. Asymptotic and finite-sample properties of estimators based on stochastic gradients. Annals of Statistics, 45(4):1694–1727, 2017.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv preprint arXiv:1406.2572, 2014.</p>