wloe2gf posted on 2024-07-28 01:18:52

Lan Chang (蓝昶): Google's Practice in Optimizing Distributed Machine Learning Training

[Figure]
[Figure]
Speaker: Dr. Lan Chang (蓝昶), Google

Edited by: He Wenting (何文婷), ByteDance

Produced by: DataFunTalk
Introduction: As machine learning models and datasets keep growing, the performance of large-scale distributed training has become an increasingly pressing concern for public-cloud users. This article describes a series of optimizations the Google Cloud Vertex AI platform has made to distributed machine learning training performance.

The talk is organized around the following topics:

Background of training optimization

Fast Socket: a high-performance network stack for NCCL

Accelerating gradient aggregation with Reduction Server

01 Background of Training Optimization
1. Overview of the Google Vertex AI Platform

Vertex AI is Google's one-stop managed cloud service: a machine learning and serving platform that integrates AutoML and AI Platform. It covers the whole workflow from data and feature engineering, through model training and hyperparameter tuning, to model prediction and explainability support. This talk focuses on the model-acceleration part of the stack.

2. Background of Training Optimization
[Figure]
Let's first review why distributed training is needed and why we, as a platform, care about this problem. In recent years deep models first achieved SOTA breakthroughs on a range of academic tasks; more recently these applications have also reached enterprise users and delivered real business gains.

The workloads on the cloud platform have likewise become much more diverse. Starting from the early vision models, language models and multimodal models now make up a growing share, and as large-scale training has driven breakthroughs for language models across tasks, we have seen users' model workloads grow exponentially. Supporting the training of these massive models brings new challenges. Simply put, single-GPU training is no longer the mainstream: the compute needed to train large models grows exponentially, so a single GPU cannot deliver enough compute in a reasonable amount of time, and because of GPU memory limits, naive single-GPU training is not even feasible except through tricks that swap data with host memory. In this situation distributed training, as a horizontally scalable solution, is the mainstream choice today. In terms of how training is scaled out, the first limit we hit was compute, which is why the mainstream frameworks first added extensive data-parallel APIs; now, as model sizes keep growing, model parallelism and hybrid parallelism have also entered the development roadmaps of the mainstream AI frameworks.

3. The Challenge of Horizontal Scaling: the Memory Wall

As Figure 2 illustrates, a fundamental challenge for horizontal scaling is the memory wall, and it has two aspects. First, over the long run memory bandwidth grows much more slowly than compute: in the past 20 years compute has improved by roughly 90,000x while memory bandwidth has improved by only about 30x, a fundamental gap that keeps widening. Second, memory bandwidth in the broad sense includes on-chip memory bandwidth, chip-to-chip interconnect bandwidth, and network bandwidth, and the long-term trend of all three runs into the same memory wall. The figure also shows that memory bandwidth, high-speed interconnect bandwidth, and network bandwidth differ from one another by orders of magnitude, so as the training scale keeps growing, the performance bottleneck ultimately lands on the network. Because of these two effects, if we relied on hardware evolution alone, model scale would quickly hit the memory wall; this is the broader motivation for doing more work at the framework and algorithm level. From a cloud platform's perspective, the primary goal of our performance work is of course to improve performance and lower TCO for users; beyond that, we want the optimizations to be non-intrusive and framework-agnostic, so that users keep the freedom to choose their own framework.

In a concrete scenario we can see how the memory wall affects training performance. The figure above compares the per-step training time of the same model on a single node versus multiple nodes. Since the time spent on computation and parameter updates is roughly constant in both settings, what really slows down a step is gradient aggregation, which takes about two thirds of the time; all-reduce is therefore often the performance bottleneck of distributed training.

4. Technical Approaches to Training Optimization
[Figure]
Work on distributed training of deep models spans the technology stack from shallow layers to deep ones. The first three classes of optimization mostly live at the framework layer, where the platform provides the basic framework support. For example, for parallelization strategies over the computation graph, GSPMD and GPipe give us a general abstraction layer covering data parallelism, model parallelism, and pipeline parallelism. We also support DeepSpeed, using ZeRO (Zero Redundancy Optimizer) to reduce the optimizer's memory footprint, and low-precision compression can be used to speed up parameter synchronization. In this talk I will focus on the last class: optimizations at the collective-communication layer. It gets relatively little attention in AI framework design, but for a platform it is a very important class of optimization.

First, optimizations at the infrastructure layer usually bring a global improvement; second, they are completely transparent to users and to the upper-layer frameworks, so they can land without any changes to user code. Next we will go through some concrete examples of how to optimize starting from the collective-communication layer.

5. A Brief Introduction to NCCL
[Figure]
In GPU training, collective communication is practically synonymous with NCCL. NCCL is NVIDIA's collective communication library; it provides collective primitives such as all-reduce, all-gather, and broadcast with high-performance implementations to support multi-node, multi-GPU training. For intra-node communication it supports NVLink, PCIe, and device P2P; for inter-node communication it supports network protocols such as sockets and InfiniBand, and the protocol layer can be extended through plugins. From the software-stack point of view, NCCL supports all the major training frameworks and is their default GPU communication library. To borrow an analogy from the network protocol stack, NCCL sits at the narrow waist of the stack, much as the IP protocol does.
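To make NCCL's role concrete, here is a minimal sketch (added for this write-up, not part of the talk) of how a training framework hands an all-reduce to NCCL, using PyTorch's torch.distributed wrapper with the nccl backend. The one-GPU-per-process setup and the launch via torchrun are assumptions of the example.

```python
# Minimal sketch: an all-reduce that is executed by NCCL when the "nccl" backend is
# selected. Assumes one GPU per process and a launch such as
# `torchrun --nproc_per_node=<num_gpus> this_script.py`.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # NCCL handles the GPU collectives
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Pretend this is a gradient produced by the local worker.
    grad = torch.full((1024,), float(rank), device="cuda")

    # Sum the gradients of all workers in place; every rank ends up with the same result.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    print(f"rank {rank}: grad[0] = {grad[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```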
We therefore consider NCCL a particularly good, framework-agnostic place to apply performance improvements. Concretely, collective-communication optimizations fall into two classes: optimizations of the underlying network stack, and optimizations of the collective-communication algorithms. The next two examples cover the work we have done in each direction.

02 Fast Socket: A High-Performance Network Stack for NCCL

The first piece of work is a high-performance low-level network stack we implemented for NCCL. Why is this needed? As mentioned above, NCCL uses many clever designs and low-level optimizations to deliver high-performance collective communication, but its concrete implementation still leaves plenty of room for improvement. We discuss large messages and small messages separately.

1. Improving Throughput for Large Messages
[Figure]
For large messages the priority is throughput, but our measurements show that in high-bandwidth, non-RDMA environments NCCL's network throughput is often unsatisfactory: on a 100G Ethernet network the achieved bandwidth falls far short of line rate, so there is a lot of headroom. NCCL's default implementation does try to exploit high-bandwidth networks: for each ring between nodes it opens multiple TCP connections to transfer data in parallel, and it also uses multiple rings to serve collective-communication requests (a ring is NCCL's own concept; think of it as an independent channel, and each channel also takes up its own CPU and GPU kernel resources). However, this simple strategy of opening many connections does not help large-message throughput much, for two reasons:

Using many connections and rings consumes CPU and GPU resources, which slows down the computation itself and adds performance jitter.

In a real network, different TCP flows do not receive equal bandwidth, which causes a straggler effect: the flows that have already finished must wait for the slowest one to complete, and that flow becomes the performance bottleneck.
[Figure]
Let's look more closely at NCCL's transport-layer implementation, abstracted into the structure shown in the figure on the right. When NCCL receives a transfer request from the layer above, a proxy thread slices the message to be sent and hands the slices, in round-robin order, to helper threads, which send them over different sockets to the peer node. Ideally the sockets would share bandwidth evenly, so this scheme looks fine. But many things affect bandwidth in a real data-center network: with multi-path routing, different connections land on different paths, routers' buffer occupancy differs, and so the connections end up with unequal bandwidth. Large-message throughput is then dragged down by these straggler connections.
[Figure]
To solve this problem we use dynamic load balancing. First, on top of NCCL's existing architecture, we slice the data at a finer granularity and process the slices in a streaming fashion. Second, we replace the proxy thread's round-robin assignment: by sensing each socket's current load and progress, we hand work to the fastest socket. The difficulty is how to sense a socket's load. Some background: the proxy thread and the helper threads exchange data through lock-free queues, so a natural idea is to look at the length of each thread's queue: the longer the backlog, the slower the corresponding socket, and vice versa. That is what our first version did, but in practice the gain was limited. The reason is that, besides the user-space lock-free queue, data can also sit in the kernel's socket send buffer, so the queue length alone does not accurately reflect the socket's current load. The fix is simple: by tuning the kernel socket options we effectively disable the send buffer, which makes the feedback signal much more accurate. Shrinking the buffers also keeps kernel memory usage under control when a large number of sockets are in use, which helps overall performance as well.
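The scheduling idea can be sketched in a few lines of Python (a toy simulation written for this article, not the Fast Socket code): each slice goes to whichever helper socket currently has the smallest outstanding backlog, which is exactly the signal that becomes reliable once the kernel send buffer is effectively disabled. The HelperSocket class and its numbers are hypothetical stand-ins.

```python
# Toy simulation of round-robin vs. least-loaded slice assignment across helper sockets.
# "backlog" models bytes handed to a socket but not yet on the wire.
import itertools
from dataclasses import dataclass

@dataclass
class HelperSocket:
    name: str
    bandwidth: float        # bytes drained per scheduling tick (simulated link speed)
    backlog: float = 0.0    # bytes queued on this socket but not yet sent
    assigned: int = 0       # number of slices this socket was given

def simulate(policy: str, ticks: int = 300, slices_per_tick: int = 2, slice_size: float = 8.0):
    # One straggler connection (s2) with much lower bandwidth than the others.
    socks = [HelperSocket("s0", 10.0), HelperSocket("s1", 10.0), HelperSocket("s2", 2.0)]
    rr = itertools.cycle(range(len(socks)))
    for _ in range(ticks):
        for _ in range(slices_per_tick):
            if policy == "round_robin":
                target = socks[next(rr)]
            else:  # "least_loaded": approximate "fastest right now" by smallest backlog
                target = min(socks, key=lambda s: s.backlog)
            target.backlog += slice_size
            target.assigned += 1
        for s in socks:
            s.backlog = max(0.0, s.backlog - s.bandwidth)
    return socks

for policy in ("round_robin", "least_loaded"):
    socks = simulate(policy)
    summary = ", ".join(f"{s.name}: {s.assigned} slices, backlog {s.backlog:.0f}" for s in socks)
    print(f"{policy:12s} -> {summary}")
```

Under round-robin the slow socket accumulates a large backlog and stalls the whole transfer; under least-loaded assignment the backlog stays near zero because work flows to the faster sockets.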
[Figure]
Besides dynamic load balancing, we also improved the way NCCL handles requests. The original implementation cannot process transfer requests in parallel: it only starts the next request after the current one has completed. We changed this by adding look-ahead over the request queue, so pending requests can be processed in a streaming, pipelined fashion. In addition, we enable zero-copy on the send side to cut the cost of copying from user space into the kernel; in our measurements this gives a clear improvement for messages larger than about 10 KB.

2. Reducing Latency for Small Messages
[Figure]
For small messages, the main concern is latency. As mentioned earlier, sending a message requires a handoff from the proxy thread to the helper threads, and every send triggers several thread wake-ups and context switches, which is relatively expensive. Our solution is to let the proxy thread drive the socket send directly, avoiding the thread-switching overhead.

To make this work we also redesigned NCCL's control-message structure and the way control messages are transferred. When the data message is small enough, we can effectively inline the payload into the control message and send it directly through the proxy.
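As a purely hypothetical illustration of the inlining idea (the wire format below is invented for this sketch and is not NCCL's actual control-message layout), a small payload can ride inside the control message so that a single send covers both control and data:

```python
# Hypothetical wire format: a control header that can carry a small payload inline.
import struct

HEADER_FMT = "!IHH"          # request id, flags, inline payload length (network byte order)
INLINE_LIMIT = 64            # payloads up to this size ride inside the control message

def pack_control(request_id: int, payload: bytes) -> bytes:
    if len(payload) <= INLINE_LIMIT:
        flags = 1            # bit 0: payload is inline
        return struct.pack(HEADER_FMT, request_id, flags, len(payload)) + payload
    # Large payload: send only the control header; the data goes through the normal path.
    return struct.pack(HEADER_FMT, request_id, 0, 0)

def unpack_control(msg: bytes):
    request_id, flags, length = struct.unpack(HEADER_FMT, msg[:8])
    inline_payload = msg[8:8 + length] if flags & 1 else None
    return request_id, inline_payload

hdr = pack_control(42, b"tiny gradient chunk")
print(unpack_control(hdr))   # (42, b'tiny gradient chunk')
```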
Finally, we enable kernel busy polling on the sockets; letting the kernel actively poll the sockets noticeably reduces the latency and jitter of small messages.

3. End-to-End Benchmarks
[Figure]
With these optimizations in place, we benchmarked NCCL's all-reduce network throughput at message sizes from 64 MB to 1 GB. With Fast Socket, NCCL achieves a speedup of more than 60% at every message size in the bandwidth tests, and on a real 100G Ethernet network the achieved bandwidth comes close to line rate.
[Figure]
Fast Socket also yields a clear end-to-end speedup. The figure shows that when fine-tuning BERT-Large, Fast Socket improves training throughput, in steps per second, by roughly 30% or more. The speedup applies platform-wide, so users get the benefit without making any changes on their side.

4. Summary
[Figure]
To sum up, Fast Socket is an optimized network stack we built for NCCL in high-bandwidth network environments. Because the optimizations live in NCCL's communication layer, they support all mainstream distributed frameworks and accelerate the whole platform. Fast Socket currently ships as a plugin in Google Cloud machine learning environments such as Deep Learning VM and Vertex AI, and the implementation has been open-sourced; you are welcome to try it out and share feedback.

03 Accelerating Gradient Aggregation with Reduction Server
Fast Socket was mostly careful low-level engineering applied to the collective-communication layer. For the next piece of work we switch perspective and look at how to accelerate gradient aggregation by improving the collective-communication algorithm itself.

1. A Brief Introduction to All-Reduce
[Figure]
Let's start with a concrete example to recall the semantics of all-reduce. Each worker node holds an array of the same length, which in training usually corresponds to the gradient of some parameter. In data-parallel training every worker consumes a different batch of input data, so the gradient values differ, and we need to sum the gradients across batches; this is exactly the all-reduce operation. Once it completes, every worker holds the same result.
So, to summarize, the semantics of all-reduce is:

reduce the arrays on all nodes, and return the result to every node.

All-reduce has many concrete implementations; it is usually built from two steps, reduce-scatter followed by all-gather. After reduce-scatter, each node holds one N-th of the fully reduced data.

In the example on the right, each node must send at least (n-1)/n of its data, which is easy to see. Nodes 1, 2, and 3 each need a different quarter of node 0's data, so node 0 must send at least 3/4 of its data; likewise, the quarter that node 0 is responsible for reducing needs the corresponding blocks from nodes 1, 2, and 3, so node 0 must also receive at least 3/4 of the data.

The next step, all-gather, distributes each node's quarter of the reduced result to all nodes, which is effectively equivalent to four broadcasts. It is just as easy to show that in all-gather each node again sends and receives (n-1)/n of the data, so in an all-reduce each node transfers roughly twice the size of its original input. This is a key conclusion, and we will come back to it when we look at how to optimize it.
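The data-volume argument is easy to check with a small simulation (added for this write-up, not from the talk) of the ring variant of reduce-scatter plus all-gather: with n nodes and s elements per node, each node ends up sending about 2*(n-1)/n * s elements.

```python
# Simulate ring all-reduce as reduce-scatter + all-gather and count how much each node sends.
import numpy as np

def ring_all_reduce(arrays):
    n = len(arrays)
    chunks = [np.array_split(a.astype(float), n) for a in arrays]
    sent = [0] * n                                   # elements sent by each node

    # Reduce-scatter: after n-1 steps, node i owns the fully reduced chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                       # chunk being passed along the ring
            dst = (i + 1) % n
            chunks[dst][c] += chunks[i][c]
            sent[i] += chunks[i][c].size

    # All-gather: circulate each node's reduced chunk around the ring to everyone else.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            dst = (i + 1) % n
            chunks[dst][c] = chunks[i][c]
            sent[i] += chunks[i][c].size

    return [np.concatenate(ch) for ch in chunks], sent

n, s = 4, 1000
inputs = [np.full(s, rank) for rank in range(n)]
outputs, sent = ring_all_reduce(inputs)
assert all(np.allclose(o, sum(range(n))) for o in outputs)   # every node holds the sum
print("sent per node:", sent, "| 2*(n-1)/n*s =", 2 * (n - 1) / n * s)
```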
2. Performance Analysis of All-Reduce
We can analyze all-reduce performance with a simple model. Define the algorithm bandwidth as the input data size divided by the time the all-reduce takes; for example, if each node's input is 1 GB and the all-reduce takes one second, the algorithm bandwidth is 1 GB/s.

The algorithm bandwidth can be factored into two terms. The first is the input size divided by the amount of data actually transferred over the network, which we call the algorithm efficiency; for a given algorithm this is a constant: for ring all-reduce it is n/(2(n-1)), which approaches 1/2 as the number of nodes n becomes large. The second term is the data actually transferred divided by the execution time, i.e. the real throughput of the network or bus, which we call the bus bandwidth; this term is bounded by the actual hardware and protocol stack.
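Written out with the definitions above (S is the per-node input size, D the data each node actually transfers, t the execution time, and n the number of nodes; this is the same algbw/busbw convention that NCCL's benchmark tools use):

```latex
\mathrm{algbw} = \frac{S}{t}
= \underbrace{\frac{S}{D}}_{\text{algorithm efficiency}}
\times \underbrace{\frac{D}{t}}_{\text{bus bandwidth}},
\qquad
\left.\frac{S}{D}\right|_{\text{ring all-reduce}}
= \frac{n}{2(n-1)} \xrightarrow{\;n \to \infty\;} \frac{1}{2}
```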
There are two ways to raise the overall algorithm bandwidth:

Use newer, better hardware to raise the bus bandwidth, combined with a better-optimized network stack, for example Fast Socket as described above, or InfiniBand RDMA. But given the memory-wall trend discussed earlier, hardware bandwidth growth will always be limited.

Raise the algorithm efficiency, which in this setting means reducing the amount of data transferred. It can be shown that the theoretical upper bound on all-reduce algorithm efficiency is what ring all-reduce already achieves, about 1/2 when the number of workers n is large. Our Reduction Server work changes one assumption of the all-reduce setting and, with a slightly different approach, doubles the algorithm efficiency.
3. Reduction Server
Reduction Server is inspired by the communication pattern of the parameter server. In a parameter-server architecture, a worker node transmits the parameter data only once per iteration, which raised the question: within the all-reduce setting, can we perform the collective reduction the same way? Our approach introduces reduction server nodes. The communication topology is the same as a parameter server's, but the nodes are much simpler to implement: they do not store parameters and do not compute gradients; in each iteration they only reduce the gradient data arriving from the workers and return the result to them. With this scheme each worker sends and receives the complete gradient exactly once, and by implementing streaming reduction we can hide the latency between a worker's sends and receives and make full use of bandwidth in both directions.

Another important point: although we add extra nodes, they are lightweight CPU nodes. At public-cloud prices their total cost is within 10% of the GPU nodes, and they can also be co-located with other nodes whose network utilization is low, so in practice the added cost is negligible. The table summarizes the advantages of Reduction Server over traditional all-reduce algorithms: it halves the amount of data transferred, so the algorithm bandwidth is effectively doubled, and it lowers the latency term from ring all-reduce's O(n) to O(1), which also matters a great deal for small-message performance.

On the implementation side: on the worker side we implemented, underneath NCCL's communication layer, a transport to the reduction server, so users can switch seamlessly from all-reduce to Reduction Server and get the speedup without any framework changes. On the reduction server side, we built a high-performance network communication layer on top of Fiber, and above it a lightweight reduction engine whose main job is to reduce incoming data with SIMD-optimized kernels. The reduction engine supports the full range of data types and can handle mixed-precision compression and reduction.
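The end-to-end data flow can be sketched with a toy in-process example (written for this article, not the actual implementation; threads and queues stand in for the real Fiber-based network layer): each worker sends its gradient once and receives the reduced result once, so it transfers roughly 1x the gradient size instead of the roughly 2x required by all-reduce.

```python
# Toy sketch of the Reduction Server data flow using in-process queues.
import numpy as np
from queue import Queue
from threading import Thread

NUM_WORKERS, GRAD_SIZE = 4, 1024
to_server = Queue()
to_worker = [Queue() for _ in range(NUM_WORKERS)]

def reduction_server():
    total = np.zeros(GRAD_SIZE)
    for _ in range(NUM_WORKERS):                 # receive one gradient per worker
        rank, grad = to_server.get()
        total += grad                            # streaming reduction across workers
    for q in to_worker:                          # return the reduced gradient to everyone
        q.put(total.copy())

def worker(rank: int):
    grad = np.full(GRAD_SIZE, float(rank))       # pretend this came from backprop
    to_server.put((rank, grad))                  # send the full gradient once ...
    reduced = to_worker[rank].get()              # ... and receive the reduced result once
    assert reduced[0] == sum(range(NUM_WORKERS))

threads = [Thread(target=reduction_server)] + [
    Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("all workers received the reduced gradient")
```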
4. Training Performance & TCO
Finally, let's look at the training-performance gains from Reduction Server. It delivers a clear improvement in all-reduce performance across all message sizes.

In end-to-end tests, compared with traditional ring all-reduce, Reduction Server improves training speed by roughly 75%. Beyond the raw performance gain, the table below shows that because training runs faster and takes less time, the total cost also drops: even with the extra CPU nodes factored in, users can still cut their training TCO substantially.
Reduction Server is now integrated into the Vertex AI platform, and users can enable it for their existing distributed training jobs very easily, without changing any code. We have also published a blog post, documentation, and a sample Notebook on the Vertex AI website for anyone who wants to look further.

04 Summary and Outlook
To wrap up today's talk: the memory wall is a problem that large-scale model training can hardly avoid, and it is very likely a long-term one. Relying on the natural evolution of hardware alone will not be enough, and that is the broader context for the performance work we do, on top of the frameworks, across every layer of the stack. As business models evolve, we also expect to see more optimization work built around other parallelization strategies. Today I mainly took the cloud platform's point of view on what kinds of performance optimization are needed, focusing on framework-agnostic low-level optimizations. But there is also a lot of performance work tightly coupled to a framework or even to a specific model, such as DeepSpeed or Horovod. These represent two different paradigms and raise many interesting questions: when we design for performance, should we try to stay platform-agnostic, or should a particular optimization justify a new framework? Going a step further, is better performance a core competitive advantage for an AI framework? I think these are all very interesting questions. That's the end of my talk; questions are welcome. Thank you.

That concludes today's sharing. Thank you all.

gao808909 posted on 2024-08-22 16:16:55

A sign of the times, used to mock works made purely to chase trends, play on memes, and grab attention.

啊呀呀 posted on 2024-09-04 16:10:28

Your words are as warm as spring; I'm truly grateful.

4lqedz posted on 2024-10-13 01:41:26

Looking back at history, we are filled with emotion; looking to the future, we are full of confidence.

1fy07h posted on 2024-11-06 10:27:39

Looking back at history, we are filled with emotion; looking to the future, we are full of confidence.

nqkk58 posted yesterday at 16:15

I completely agree with your view; very thoughtful analysis.