Using Multi-Scale Attention for Semantic Segmentation

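One important technology is commonly used in autonomous driving, medical imaging, and even Zoom virtual backgrounds: semantic segmentation. This is the process of labelling each pixel in an image as belonging to one of N classes (N being any number of classes), where the classes can be things like cars, roads, people, or trees. For medical images, the classes correspond to different organs or anatomical structures.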

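NVIDIA Research works on semantic segmentation because it is a broadly applicable technology. We also believe that techniques that improve semantic segmentation may help to improve many other dense prediction tasks, such as optical flow prediction (predicting the motion of objects) and image super-resolution. We developed a new method for semantic segmentation that achieves record-setting state-of-the-art results on two common benchmarks, Cityscapes and Mapillary Vistas, as shown in the following tables. IOU is intersection over union, a metric that describes the accuracy of a semantic prediction.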
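As a rough illustration of the metric, the following sketch computes mean IOU from a confusion matrix with NumPy. The function name `mean_iou` and the ignore value of 255 are assumptions made for this example, not details taken from the benchmarks; the benchmark tables report the same quantity as a percentage.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection over union across the classes that occur.

    pred, target: integer label maps of the same shape.
    ignore_index: hypothetical label value that is excluded from scoring.
    """
    valid = target != ignore_index
    pred = pred[valid].astype(np.int64)
    target = target[valid].astype(np.int64)
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps
    return float((intersection[present] / union[present]).mean())

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In this toy case, class 0 has an IOU of 1/2 and class 1 has an IOU of 2/3, so `score` is their average.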

On Cityscapes, this method achieves 85.4 IOU on the test set, a sizable improvement over the other entries given how close those scores are to one another.

Table 1. Results on Cityscapes test set.
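On Mapillary, we achieve 61.1 IOU on the validation set with a single model, compared to the next-best result of 58.7, which was achieved with an ensemble.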

Table 2. Results on Mapillary Vistas semantic segmentation validation set.

Research journey

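To develop this new method, we considered which specific areas of an image needed improvement. Figure 1 shows the two biggest failure modes of current semantic segmentation models: errors in fine detail and class confusion.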

Figure 1. Illustration of common failure modes for semantic segmentation as they relate to inference scale. In the first row, the thin posts are inconsistently segmented in the scaled-down (0.5x) image, but better predicted in the scaled-up (2.0x) image. In the second row, the large road / divider region is better segmented at lower resolution (0.5x).

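In this example, two issues are present: fine detail and class confusion.

The fine detail of the posts in the first image is best resolved in the 2x scale prediction, but poorly resolved at 0.5x scale.

The coarse separation of the road from the median (divider) is better resolved at 0.5x scale than at 2x scale, where there is class confusion.

Our solution performs better on both issues: class confusion all but disappears, and the prediction of fine detail is smoother and more consistent.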

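After identifying these failure modes, the team experimented with many different strategies, including different network trunks (for example, WiderResnet-38, EfficientNet-B4, Xception-71) and different segmentation decoders (for example, DeeperLab). We decided to adopt HRNet as the network trunk and RMI as the primary loss function.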

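HRNet has been shown to be well suited to computer vision tasks, in part because it maintains a representation with 2x higher resolution than the previous network, WiderResnet38. The RMI loss provides a way to obtain a structural loss without resorting to something like a conditional random field. Both HRNet and the RMI loss helped to address fine detail and class confusion.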

To further address the primary failure modes, we innovated on two approaches: multi-scale attention and auto-labelling.

Multi-scale attention

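To achieve the best results, it is common practice to use multi-scale inference in computer vision models. Multiple image scales are run through the network, and the results are combined with average pooling.

Using average pooling as the combination strategy treats all scales as equally important. However, fine detail is usually best predicted at higher scales, while large objects are better predicted at lower scales, where the receptive field of the network is better able to understand the scene.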
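The baseline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the work: `toy_model` is a hypothetical stand-in for a segmentation network, and `resize_nearest` uses nearest-neighbour resizing only to keep the example dependency-free (real pipelines typically use bilinear interpolation).

```python
import numpy as np

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of an array whose last two axes are (H, W)."""
    h, w = x.shape[-2:]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[..., rows[:, None], cols[None, :]]

def toy_model(image):
    """Hypothetical network: two-class 'probabilities' from pixel brightness."""
    p = image.mean(axis=0)
    return np.stack([1.0 - p, p])

def average_pool_inference(model, image, scales=(0.5, 1.0, 2.0)):
    """Run the model at each scale and average the results at input size."""
    h, w = image.shape[-2:]
    total = 0.0
    for s in scales:
        scaled = resize_nearest(image, int(round(h * s)), int(round(w * s)))
        probs = model(scaled)                        # (classes, h*s, w*s)
        total = total + resize_nearest(probs, h, w)  # back to (classes, h, w)
    return total / len(scales)  # every scale gets the same weight

image = np.random.default_rng(0).random((3, 8, 8))
fused = average_pool_inference(toy_model, image)
```

The final division is the point of the example: each scale contributes equally at every pixel, regardless of whether that scale is reliable there.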

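Learning how to combine multi-scale predictions at the pixel level can help to address this issue. There is prior work on this strategy, with the Attention to Scale method of Chen, et al. being the closest. In Chen's approach, attention is learned for all scales simultaneously. We refer to this as the explicit approach, shown in Figure 2.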

Figure 2. The explicit approach of Chen, et al. learns a dense attention mask for a fixed set of scales to combine them to form a final semantic prediction.
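To make the explicit approach concrete, here is a small sketch that fuses a fixed set of scales with per-pixel attention weights. It assumes that the attention logits are normalized with a softmax over the scale axis and that all predictions have already been resized to a common resolution; the function names are illustrative and are not taken from any published implementation.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def explicit_fusion(preds, attn_logits):
    """Fuse S scales with one dense attention mask per scale.

    preds:       (S, C, H, W) predictions at a common resolution.
    attn_logits: (S, H, W) unnormalized attention, one map per scale.
    """
    weights = softmax(attn_logits, axis=0)  # sums to 1 over scales per pixel
    return (weights[:, None] * preds).sum(axis=0)

rng = np.random.default_rng(0)
preds = rng.random((3, 4, 5, 6))
uniform = explicit_fusion(preds, np.zeros((3, 5, 6)))
```

With all logits equal, the weights are uniform and the result reduces to average pooling. Because one mask is learned per scale, the set of scales is fixed at training time.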

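Motivated by Chen's approach, we propose a multi-scale attention model that also learns to predict a dense mask to combine multi-scale predictions. However, in this method, we learn a relative attention mask that attends between one scale and the next higher scale, as shown in Figure 3. We refer to this as the hierarchical approach.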

Figure 3. Our hierarchical multi-scale attention method. Top: During training, our model learns to predict attention between two adjacent scale pairs. Bottom: Inference is done in a chained/hierarchical manner in order to combine multiple scales of predictions together. Lower scale attention determines the contribution of the next higher scale.

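The primary benefits of this approach are as follows:

The theoretical training cost is reduced by about 4x compared to Chen's approach.

While training uses only pairs of scales, inference is flexible and can be done with any number of scales.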
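The chained inference described above can be sketched as follows. This is one plausible reading of the combination rule in Figure 3, where the attention predicted at a lower scale decides how much of the next higher scale is used; it is not the authors' implementation. For brevity, the sketch assumes that every prediction and attention map has already been resized to a common resolution, and `hierarchical_fusion` is a hypothetical name.

```python
import numpy as np

def hierarchical_fusion(preds, attns):
    """Chain pairwise attention from the highest scale down to the lowest.

    preds: list of (C, H, W) predictions, ordered from lowest to highest scale.
    attns: list of (H, W) relative attention maps in [0, 1], one for every
           scale except the highest, in the same order as preds.
    """
    if len(attns) != len(preds) - 1:
        raise ValueError("need one attention map per scale except the highest")
    fused = preds[-1]  # start from the highest scale
    for pred, attn in zip(reversed(preds[:-1]), reversed(attns)):
        # The lower scale keeps a share `attn`; the remainder goes to the
        # result already fused from the higher scales.
        fused = attn[None] * pred + (1.0 - attn[None]) * fused
    return fused

shape = (1, 2, 2)
preds = [np.full(shape, 1.0), np.full(shape, 2.0), np.full(shape, 3.0)]
attns = [np.full((2, 2), 0.5), np.full((2, 2), 0.25)]
fused = hierarchical_fusion(preds, attns)
```

In this toy case the effective per-scale weights are 0.5, 0.125, and 0.375, which sum to 1.0. The same function accepts any number of scales, even though each attention map only relates one adjacent pair.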

Table 3. Comparison of the hierarchical multi-scale attention method vs. other approaches on the Mapillary validation set. The network architecture is DeepLab V3+ with a ResNet-50 trunk. Eval scales: scales used for multi-scale evaluation. FLOPS: the relative amount of flops consumed by the network for training. This method achieves the best validation score, but with only a moderate cost as compared to the explicit approach.

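Figure 4 shows some examples of our method, along with the learned attention masks. For the thin posts in the image on the left, there is very little attention to the 0.5x prediction, but very strong attention to the 2.0x scale prediction. Conversely, for the very large road/divider region in the image on the right, the attention mechanism learns to rely most on the lower scale (0.5x) and very little on the erroneous 2.0x prediction.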

Figure 4. Semantic and attention predictions at every scale level for two different scenes. The scene on the left illustrates a fine detail problem while the scene on the right illustrates a large region segmentation problem. A white color for attention indicates a high value (close to 1.0). The attention values for a given pixel across all scales sum to 1.0. Left: The thin road-side posts are best resolved at 2x scale, and the attention successfully attends more to that scale than other scales, as evidenced by the white color for the posts in the 2x attention image. Right: The large road/divider region is best predicted at 0.5x scale, and the attention does successfully focus most heavily on the 0.5x scale for that region.

Auto-labelling

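A common approach to improving semantic segmentation results on Cityscapes is to leverage its large set of coarsely labelled data. This data is roughly 7x as large as the baseline fine data. Previous SOTA approaches to Cityscapes used the coarse labels either to pretrain the network or by mixing them in with the fine data.

However, the coarse labels present a challenge because they are noisy and imprecise. The ground truth coarse labels are shown in Figure 5 as "Original coarse label."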

Figure 5. Example of our auto-generated coarse image labels. Auto-generated coarse labels (right) provide finer detail of labeling than the original ground truth coarse labels (middle). This finer labeling improves the distribution of the labels since both small and large items are now represented, as opposed to primarily large items.

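Inspired by recent work, we pursued auto-labelling as a means to generate much richer labels that fill in the labelling gaps in the ground truth coarse labels. The generated auto-labels show much finer detail than the baseline coarse labels, as seen in Figure 5. We believe that this helps generalization by filling in the gaps in the data distribution for long-tail classes.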

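A naive approach to auto-labelling, such as using the multi-class probabilities from a teacher network to guide the student, would be very costly in disk space. Generating labels for the 20,000 coarse images, each at 1920×1080 resolution with 19 classes, would take roughly 2 TB of storage. The biggest impact of such a large footprint would be reduced training performance.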
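A back-of-the-envelope estimate shows where a figure of this magnitude comes from. The number of bytes per stored probability is not stated in the text, so the sketch below simply brackets the estimate with two assumed encodings, 16-bit and 32-bit floats.

```python
def soft_label_bytes(num_images, height, width, num_classes, bytes_per_value):
    """Uncompressed size of per-pixel, per-class probabilities."""
    return num_images * height * width * num_classes * bytes_per_value

# The two encodings are assumptions; the text does not state the storage dtype.
for name, nbytes in (("float16", 2), ("float32", 4)):
    total = soft_label_bytes(20_000, 1080, 1920, 19, nbytes)
    print(f"{name}: {total / 1e12:.2f} TB")
# prints "float16: 1.58 TB" then "float32: 3.15 TB"
```

Both values are on the order of the roughly 2 TB quoted above.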

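We used a hard-thresholding approach instead of a soft one, which vastly reduces the footprint of the generated labels from 2 TB to 600 MB. In this approach, teacher predictions with a probability greater than 0.5 are treated as valid, and predictions with a lower probability are assigned to an "ignore" class. Table 4 shows the benefit of adding the coarse data to the fine data and training a new student with the fused dataset.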
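The thresholding step can be sketched in a few lines of NumPy. The ignore value of 255 and the function name `hard_threshold_labels` are assumptions for illustration; only the 0.5 threshold comes from the text.

```python
import numpy as np

IGNORE_LABEL = 255  # assumed value for the "ignore" class

def hard_threshold_labels(probs, threshold=0.5, ignore_label=IGNORE_LABEL):
    """Turn teacher probabilities into a single-channel hard label map.

    probs: (C, H, W) per-class probabilities from the teacher network.
    Returns an (H, W) uint8 map: the most likely class where the teacher is
    confident, and ignore_label everywhere else.
    """
    labels = probs.argmax(axis=0).astype(np.uint8)
    confident = probs.max(axis=0) > threshold
    labels[~confident] = ignore_label
    return labels

probs = np.array([[[0.9, 0.40]],
                  [[0.1, 0.35]],
                  [[0.0, 0.25]]])  # 3 classes, 1x2 image
labels = hard_threshold_labels(probs)
```

The first pixel is confidently class 0, while the second has no class above 0.5 and is marked as ignored. Each pixel is then stored as one byte instead of one probability per class.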

Table 4. The baseline method shown here uses HRNet-OCR as the trunk and our multi-scale attention method. We compare two regimes: training with ground truth fine + ground truth coarse labels to ground truth fine + auto-coarse labels (our method). The regime including the auto-coarse labels improves on the baseline by 0.9 IOU.

Figure 6. Qualitative example of auto-generated coarse image labels.