CUHK proposes EdgeViT | Surpassing MobileViT and MobileNet, bringing Transformers to real time on CPU
<img src="https://mmbiz.qpic.cn/mmbiz_png/5ooHoYt0tgkox7WbPFibpZlESSsgx9vbKibm74R6nYRyibZKrWe571tmro1FBKQOeYS3bUwnxHhWdoKvyJDeVufow/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"><span style="color: black;">在计算机视觉<span style="color: black;">行业</span>,基于</span><span style="color: black;">Self-attention</span><span style="color: black;">的模型(如(</span><span style="color: black;">ViTs</span><span style="color: black;">))<span style="color: black;">已然</span><span style="color: black;">作为</span></span><span style="color: black;">CNN</span><span style="color: black;">之外的一种极具竞争力的架构。尽管越来越强的变种<span style="color: black;">拥有</span>越来越高的识别精度,但<span style="color: black;">因为</span></span><span style="color: black;">Self-attention</span><span style="color: black;">的二次<span style="color: black;">繁杂</span>度,现有的</span><span style="color: black;">ViT</span><span style="color: black;">在计算和模型<span style="color: black;">体积</span>方面都有较高的<span style="color: black;">需求</span>。</span><span style="color: black;">虽然之前的</span><span style="color: black;">CNN</span><span style="color: black;">的<span style="color: black;">有些</span>成功的设计<span style="color: black;">选取</span>(例如,卷积和分层结构)<span style="color: black;">已然</span>被引入到<span style="color: black;">近期</span>的</span><span style="color: black;">ViT</span><span style="color: black;">中,但它们仍然不足以满足移动设备有限的计算资源<span style="color: black;">需要</span>。这<span style="color: black;">促进</span>人们<span style="color: black;">近期</span>尝试<span style="color: black;">研发</span>基于最先进的</span><span style="color: black;">MobileNet-v2</span><span style="color: black;">的轻型</span><span style="color: black;">MobileViT</span><span style="color: black;">,但</span><span style="color: black;">MobileViT</span><span style="color: black;">与</span><span style="color: black;">MobileNet-v2</span><span style="color: black;">仍然存在性能差距。</span><span style="color: black;">在这项工作中,作者进一步推进这一<span style="color: black;">科研</span>方向,引入了</span><span style="color: black;">EdgeViTs</span><span style="color: black;">,一个新的轻量级</span><span style="color: black;">ViTs</span><span style="color: black;">家族,<span style="color: black;">亦</span>是首次使基于</span><span style="color: black;">Self-attention</span><span style="color: black;">的视觉模型在准确性和设备效率之间的权衡中达到最佳轻量级</span><span style="color: black;">CNN</span><span style="color: black;">的性能。</span><span style="color: black;">这是<span style="color: black;">经过</span>引入一个基于</span><span style="color: black;">Self-attention</span><span style="color: black;">和卷积的最优集成的高成本的local-global-local(</span><span style="color: black;">LGL</span><span style="color: black;">)信息交换瓶颈来实现的。<span style="color: black;">针对</span>移动设备专用的<span style="color: black;">评定</span>,不依赖于不准确的</span><span style="color: black;">proxies</span><span style="color: black;">,如</span><span style="color: black;">FLOPs</span><span style="color: black;">的数量或</span><span style="color: black;">参数</span><span style="color: black;">,而是采用了一种直接关注设备延迟和能源效率的实用<span style="color: black;">办法</span>。</span><img src="https://mmbiz.qpic.cn/mmbiz_png/5ooHoYt0tgkox7WbPFibpZlESSsgx9vbKbdeWZaU1Hn94Ou5xOibiceuLmQlZfstuU4DGOiaJibeObp5KkLsGVdxPCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"><span style="color: black;">在图像<span style="color: black;">归类</span>、<span style="color: black;">目的</span>检测和语义分割方面的<span style="color: black;">海量</span>实验验证了</span><span style="color: black;">EdgeViTs</span><span style="color: black;">在移动硬件上的准确性-效率权衡方面与最先进的<span style="color: 
black;">有效</span></span><span style="color: black;">CNN</span><span style="color: black;">和</span><span style="color: black;">ViTs</span><span style="color: black;">相比<span style="color: black;">拥有</span>更高的性能。<span style="color: black;">详细</span>地说,</span><span style="color: black;">EdgeViTs</span><span style="color: black;">在<span style="color: black;">思虑</span>精度-延迟和精度-能量权衡时是帕累托最优的,几乎在所有<span style="color: black;">状况</span>下都实现了对其他</span><span style="color: black;">ViT</span><span style="color: black;">的超越,并<span style="color: black;">能够</span>达到最<span style="color: black;">有效</span></span><span style="color: black;">CNN</span><span style="color: black;">的性能。</span><strong style="color: blue;"><span style="color: black;">1</span></strong><strong style="color: blue;">EdgeViTs</strong><h3 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">1.1 总体架构</span></strong></span></h3><span style="color: black;">为了设计适用于移动/边缘设备的轻量级</span><span style="color: black;">ViT</span><span style="color: black;">,作者采用了<span style="color: black;">近期</span></span><span style="color: black;">ViT</span><span style="color: black;">变体中<span style="color: black;">运用</span>的分层金字塔结构(图2(a))。</span><span style="color: black;">Pyramid Transformer</span><span style="color: black;">模型<span style="color: black;">一般</span>在<span style="color: black;">区别</span><span style="color: black;">周期</span>降低了空间分辨率<span style="color: black;">同期</span><span style="color: black;">亦</span>扩展了通道维度。<span style="color: black;">每一个</span><span style="color: black;">周期</span>由多个基于</span><span style="color: black;">Transformer Block</span><span style="color: black;">处理相同形状的张量,类似</span><span style="color: black;">ResNet</span><span style="color: black;">的层次设计结构。</span><span style="color: black;">基于</span><span style="color: black;">Transformer Block</span><span style="color: black;">严重依赖于<span style="color: black;">拥有</span>二次<span style="color: black;">繁杂</span>度的</span><span style="color: black;">Self-attention</span><span style="color: black;">操作,其<span style="color: black;">繁杂</span>度与视觉特征的空间分辨率呈2次关系。<span style="color: black;">经过</span>逐步聚集空间</span><span style="color: black;">Token</span><span style="color: black;">,</span><span style="color: black;">Pyramid Transformer</span><span style="color: black;">可能比各向同性模型(</span><span style="color: black;">ViT</span><span style="color: black;">)更有效。</span><img src="https://mmbiz.qpic.cn/mmbiz_png/5ooHoYt0tgkox7WbPFibpZlESSsgx9vbKnOy6PLEyavrUDCZGSsstxLUlXsGVicH3YO30cEp6GelRqXrJOWWB4Pw/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"><span style="color: black;">在这项工作中,作者深入到</span><span style="color: black;">Transformer Block</span><span style="color: black;">,并引入了一个比较划算的</span><span style="color: black;">Bottlneck</span><span style="color: black;">,</span><span style="color: black;">Local-Global-Local</span><span style="color: black;">(</span><span style="color: black;">LGL</span><span style="color: black;">)(图2(b))。</span><span style="color: black;">LGL</span><span style="color: black;"><span style="color: black;">经过</span>一个稀疏<span style="color: black;">重视</span>力模块进一步减少了</span><span style="color: black;">Self-attention</span><span style="color: black;">的开销(图2(c)),实现了更好的准确性-延迟平衡。</span>
<h3 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;">1.2 Local-Glo</strong><strong style="color: blue;">bal-Local bottleneck</strong></span></h3><span style="color: black;">Self-attention</span><span style="color: black;">已被证明是非常有效的学习全局信息或长距离空间依赖性的<span style="color: black;">办法</span>,这是视觉识别的关键。另一方面,<span style="color: black;">因为</span>图像<span style="color: black;">拥有</span>高度的空间冗余(例如,附近的</span><span style="color: black;">Patch</span><span style="color: black;">在语义上是<span style="color: black;">类似</span>的),将<span style="color: black;">重视</span>力集中到所有的空间</span><span style="color: black;">Patch</span><span style="color: black;">上,即使是在一个下采样的特征映射中,<span style="color: black;">亦</span>是低效的。</span><span style="color: black;"><span style="color: black;">因此呢</span>,与以前在<span style="color: black;">每一个</span>空间位置执行</span><span style="color: black;">Self-attention</span><span style="color: black;">的</span><span style="color: black;">Transformer Block</span><span style="color: black;">相比,</span><span style="color: black;">LGL Bottleneck</span><span style="color: black;">只对输入</span><span style="color: black;">Token</span><span style="color: black;">的子集计算</span><span style="color: black;">Self-attention</span><span style="color: black;">,但支持完整的空间交互,如在标准的</span><span style="color: black;">Multi-Head Self-attention</span><span style="color: black;">(MHSA)中。既会减少</span><span style="color: black;">Token</span><span style="color: black;">的<span style="color: black;">功效</span>域,<span style="color: black;">同期</span><span style="color: black;">亦</span><span style="color: black;">保存</span>建模全局和局部上下文的底层信息流。</span><span style="color: black;">为了实现这一点,作者将</span><span style="color: black;">Self-attention</span><span style="color: black;">分解为连续的模块,处理<span style="color: black;">区别</span>范围内的空间</span><span style="color: black;">Token</span><span style="color: black;">(图2(b))。</span><span style="color: black;"><span style="color: black;">这儿</span>引入了3种有效的操作:</span><span style="color: black;">Local aggregation:仅集成来自局部近似</span><span style="color: black;">Token</span><span style="color: black;">信号的局部聚合</span><span style="color: black;">Global sparse attention:建模一组<span style="color: black;">表率</span>性</span><span style="color: black;">Token</span><span style="color: black;">之间的<span style="color: black;">长时间</span>关系,其中<span style="color: black;">每一个</span></span><span style="color: black;">Token</span><span style="color: black;">都被视为一个局部窗口的<span style="color: black;">表率</span>;</span><span style="color: black;">Local propagation:将<span style="color: black;">拜托</span>学习到的全局上下文信息扩散到<span style="color: black;">拥有</span>相同窗口的非<span style="color: black;">表率</span></span><span style="color: black;">Token</span><span style="color: black;">。</span><span style="color: black;">将这些结合起来,</span><span style="color: black;">LGL Bottleneck</span><span style="color: black;">就能够以低计算成本在同一特征映射中的任何一对</span><span style="color: black;">Token</span><span style="color: black;">之间进行信息交换。下面将<span style="color: black;">仔细</span>说明每一个<span style="color: black;">构成</span>部分:</span><img src="https://mmbiz.qpic.cn/mmbiz_png/5ooHoYt0tgkox7WbPFibpZlESSsgx9vbKLyrFW8tMsaWib7bd8ZpicHNDVJJhEheB0AibWUibGKmZz3knEQc6mqXdeg/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"><span style="color: black;">
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">1、Local aggregation</p>
</span><span style="color: black;"><span style="color: black;">针对</span><span style="color: black;">每一个</span></span><span style="color: black;">Token</span><span style="color: black;">,利用</span><span style="color: black;">Depth-wise</span><span style="color: black;">和</span><span style="color: black;">Point-wise</span><span style="color: black;">卷积在<span style="color: black;">体积</span>为k×k的局部窗口中聚合信息(图3(a))。</span><span style="color: black;">2、Global sparse attention</span><span style="color: black;">对均匀分布在空间中的稀疏<span style="color: black;">表率</span>性</span><span style="color: black;">Token</span><span style="color: black;">集进行采样,<span style="color: black;">每一个</span>r×r窗口有一个<span style="color: black;">表率</span>性</span><span style="color: black;">Token</span><span style="color: black;">。<span style="color: black;">这儿</span>,r<span style="color: black;">暗示</span>子样本率。<span style="color: black;">而后</span>,只对这些被<span style="color: black;">选取</span>的</span><span style="color: black;">Token</span><span style="color: black;">应用</span><span style="color: black;">Self-attention</span><span style="color: black;">(图3(b))。这与所有现有的</span><span style="color: black;">ViTs</span><span style="color: black;"><span style="color: black;">区别</span>,在那里,所有的空间</span><span style="color: black;">Token</span><span style="color: black;">都<span style="color: black;">做为</span></span><span style="color: black;">Self-attention</span><span style="color: black;">计算中的query被<span style="color: black;">触及</span>到。</span><span style="color: black;">3、Local propagation</span><span style="color: black;"><span style="color: black;">经过</span>转置卷积将<span style="color: black;">表率</span>性</span><span style="color: black;">Token</span><span style="color: black;">中编码的全局上下文信息传播到它们的相邻的</span><span style="color: black;">Token</span><span style="color: black;">中(图3(c))。</span><span style="color: black;"><span style="color: black;">最后</span>,</span><span style="color: black;">LGL bottleneck</span><span style="color: black;"><span style="color: black;">能够</span>表达为:</span><img src="https://mmbiz.qpic.cn/mmbiz_png/5ooHoYt0tgkox7WbPFibpZlESSsgx9vbK8fNVSt3CpPSHeuRibBh55RbbjvLfIibAyxmzibQkeWknRoibIJjh7D5tfQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"><span style="color: black;"><span style="color: black;">这儿</span>,<span style="color: black;">暗示</span>输入张量。</span><span style="color: black;">Norm</span><span style="color: black;">是</span><span style="color: black;">Layer Normalization</span><span style="color: black;">操作。</span><span style="color: black;">LocalAgg</span><span style="color: black;"><span style="color: black;">暗示</span>局部聚合算子,</span><span style="color: black;">FFN</span><span style="color: black;">是一个双层感知器。</span><span style="color: black;">GlobalSparseAttn</span><span style="color: black;">是全局稀疏</span><span style="color: black;">Self-attention</span><span style="color: black;">。</span><span style="color: black;">LocalProp</span><span style="color: black;">是局部传播运算符。为简单起见,<span style="color: black;">这儿</span>省略了位置编码。<span style="color: black;">重视</span>,所有这些操作符都<span style="color: black;">能够</span><span style="color: black;">经过</span>在标准深度学习平台上的常用和高度优化的操作来实现。<span style="color: black;">因此呢</span>,</span><span style="color: black;">LGL bottleneck</span><span style="color: black;"><span style="color: black;">针对</span>实现是友好的。</span><span style="color: black;">Pytorch实现</span><span style="color: black;"><span style="color: black;"><span style="color: black;">class</span> <span style="color: 
black;">LocalAgg()</span>:</span> <span style="color: black;"><span style="color: black;">def</span> <span style="color: black;">__init__</span><span style="color: black;">(self, dim)</span>:</span> self.conv1 = Conv2d(dim, dim, <span style="color: black;">1</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">)</p> self.conv2 = Conv2d(im, dim, <span style="color: black;">3</span>, padding=<span style="color: black;">1</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">, groups=dim)</p> self.conv3 = Conv2d(dim, dim, <span style="color: black;">1</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> self.norm1 = BatchNorm2d(dim)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> self.norm2 = BatchNorm2d(dim)</p> <span style="color: black;"><span style="color: black;">def</span> <span style="color: black;">forward</span><span style="color: black;">(self, x)</span>:</span> <span style="color: black;">
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">"""</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> = x.shape</p> """
</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = self.conv1(self.norm1(x))</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = self.conv2(x)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = self.conv3(self.norm2(x))</p> <span style="color: black;">return</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x</p><span style="color: black;"><span style="color: black;">class</span> <span style="color: black;">GlobalSparseAttn()</span>:</span> <span style="color: black;"><span style="color: black;">def</span> <span style="color: black;">__init__</span><span style="color: black;">(self, dim, sample_rate, scale)</span>:</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> self.scale = scale</p> self.qkv = Linear(dim, dim * <span style="color: black;">3</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">)</p> self.sampler = AvgPool2d(<span style="color: black;">1</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">, stride=sample_rate)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> kernel_size=sr_ratio</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">self.LocalProp = ConvTranspose2d(dim, dim, kernel_size, stride=sample_rate, groups=dim</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> )</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> self.norm = LayerNorm(dim)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> self.proj = Linear(dim, dim)</p> <span style="color: black;"><span style="color: black;">def</span> <span style="color: black;">forward</span><span style="color: black;">(self, x)</span>:</span> <span style="color: black;">
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">"""</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> = x.shape</p> """
</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = self.sampler(x)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> q, k, v = self.qkv(x)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> attn = q @ k * self.scale</p> attn = attn.softmax(dim=<span style="color: black;">-1</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = attn @ v</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = self.LocalProp(x)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">x = self.proj(self.norm(x))</p> <span style="color: black;">return</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x</p><span style="color: black;"><span style="color: black;">class</span> <span style="color: black;">DownSampleLayer()</span>:</span> <span style="color: black;"><span style="color: black;">def</span> <span style="color: black;">__init__</span><span style="color: black;">(self, dim_in, dim_out, downsample_rate)</span>:</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">self.downsample = Conv2d(dim_in, dim_out, kernel_size=downsample_rate, stride=</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> downsample_rate)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> self.norm = LayerNorm(dim_out)</p> <span style="color: black;"><span style="color: black;">def</span> <span style="color: black;">forward</span><span style="color: black;">(self, x)</span>:</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = self.downsample(x)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = self.norm(x)</p> <span style="color: black;">return</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x</p><span style="color: black;"><span style="color: black;">class</span> <span style="color: black;">PatchEmbed()</span>:</span> <span style="color: black;"><span style="color: black;">def</span> <span style="color: black;">__init__</span><span style="color: black;">(self, dim)</span>:</span> self.embed = Conv2d(dim, dim, <span style="color: black;">3</span>, padding=<span style="color: black;">1</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">, groups=dim)</p> <span style="color: black;"><span style="color: black;">def</span> <span style="color: black;">forward</span><span style="color: black;">(self, x)</span>:</span> <span style="color: black;">return</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x + self.embed(x)</p><span style="color: black;"><span style="color: black;">class</span> <span style="color: black;">FFN()</span>:</span> <span style="color: black;"><span style="color: black;">def</span> <span style="color: black;">__init__</span><span style="color: black;">(self, dim)</span>:</span> self.fc1 = nn.Linear(dim, dim*<span style="color: black;">4</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">)</p> self.fc2 = nn.Linear(dim*<span style="color: black;">4</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">, dim)</p> <span style="color: black;"><span style="color: black;">def</span> <span style="color: black;">forward</span><span style="color: black;">(self, x)</span>:</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = self.fc1(x)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = GELU(x)</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x = self.fc2(x)</p> <span style="color: black;">return</span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"> x</p>
Comparison with other classic designs

The LGL bottleneck shares a similar goal with the recent PVT and Twins-SVT models, which also try to reduce the cost of self-attention, but it differs in the core design. PVT performs self-attention in which the number of keys and values is reduced by strided convolutions while the number of queries stays unchanged; in other words, PVT still performs self-attention at every grid position. In this work, the authors question the necessity of position-wise self-attention and explore how well the information exchange enabled by the LGL bottleneck can approximate standard MHSA. Twins-SVT combines local-window self-attention with PVT's globally pooled attention. This differs from the hybrid design of the LGL bottleneck, which uses both self-attention and convolution operations distributed over a local-global-local sequence. As the experiments show (Tables 2 and 3), the LGL design achieves a better trade-off between model performance and computational overhead (e.g., latency and energy consumption).
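As a back-of-the-envelope illustration of this difference (numbers only, not either paper's implementation): for an H x W feature map with reduction/sampling ratio r, PVT-style spatial-reduction attention keeps every position as a query and only shrinks the keys/values, whereas EdgeViT's global sparse attention sub-samples the queries as well.

```python
H = W = 28
r = 4
n = H * W                                      # 784 tokens at full resolution
pvt_q,  pvt_kv  = n,            n // (r * r)   # 784 queries x 49 keys/values
edge_q, edge_kv = n // (r * r), n // (r * r)   # 49 queries  x 49 keys/values
print(pvt_q * pvt_kv, edge_q * edge_kv)        # attention-matrix entries: 38416 vs 2401
```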
<h3 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;">1.3 结构</strong><strong style="color: blue;">变体</strong></span></h3><img src="https://mmbiz.qpic.cn/mmbiz_png/5ooHoYt0tgkox7WbPFibpZlESSsgx9vbKsnYxGZH8gUD9REG17Xx5ohToS7Avic3KCKibt5pssBLIaG6ZI2Ek7y1Q/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"><strong style="color: blue;"><span style="color: black;">2</span></strong><strong style="color: blue;">实验</strong>
<h3 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;">2.1 Image</strong></span><span style="color: black;"><strong style="color: blue;">NeT精度SoTA</strong></span></h3><img src="https://mmbiz.qpic.cn/mmbiz_png/5ooHoYt0tgkox7WbPFibpZlESSsgx9vbKibficyATPBoQkT9bBabWfLkqRtiapyial3tRhrN91Fj2971EgWDWZKuVxw/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;">
<h3 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;">2.2 实时性</strong></span><span style="color: black;"><strong style="color: blue;">与精度对比</strong></span></h3><img src="https://mmbiz.qpic.cn/mmbiz_png/5ooHoYt0tgkox7WbPFibpZlESSsgx9vbKMY4LwOX6bHeJXzQNfgCsBSXo8I9oShcNbAic14OQ8CJGT2nN9VkObjg/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;">
<h3 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;">2.3 <span style="color: black;">目的</span></strong></span><span style="color: black;"><strong style="color: blue;">检测任务</strong></span></h3><img src="https://mmbiz.qpic.cn/mmbiz_png/5ooHoYt0tgkox7WbPFibpZlESSsgx9vbKUMrnSJmZgcd5hn8gb6wO18bCzbWGn13V2Q8iaYX40ZNC0vk3tqoepDQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;">
<h3 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;">2.4 语义</strong></span><span style="color: black;"><strong style="color: blue;">分割任务</strong></span></h3><img src="https://mmbiz.qpic.cn/mmbiz_png/5ooHoYt0tgkox7WbPFibpZlESSsgx9vbKQ8nKXEkYUOM8jvwb9QWV7cFPt57HvY3IicjryzzQ40ydg0AEK9sXyWw/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;">
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;">参考资料:</span></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">.EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_jpg/cNFA8C0uVPswlj8gwvbAQ7hlxgNw4Uib7BEaFBkSGnLXamfMMWyAibfIE1eXSTdnXUwPia9fnsJ3Dol1q0OfbWsibA/640?wx_fmt=jpeg&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>