【BigNAS】2020-ECCV-BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models - Paper Reading Notes


  • Institute: Google Brain
  • Author: Jiahui Yu
  • GitHub: / (not open-sourced)
  • Citation: 20+

Introduction

Train a single-stage supernet; child models sampled from it can be sliced out and deployed directly, without retraining or fine-tuning.

(figure omitted)


Motivation

  • 2019-ICLR-Slimmable Neural Networks
    • Scaling dimensions:
      • channel num (network-wise, top-n index)
    • Per batch:
      • channel num (4 fixed widths: 0.25x, 0.5x, 0.75x, 1.0x)

  • 2019-ICCV-US-Net
    • Scaling dimensions:
      • channel num (network-wise, top-n index)
    • Per batch:
      • channel num (4 sampled widths: min, 2 random, max)
    • Inplace distillation: $CE(\text{max}, \hat{y}),\ CE(\text{min}, y_{\text{max}}),\ CE(\text{random}_{1,2}, y_{\text{max}})$

  • 2020-ECCV-MutualNet
    • Scaling dimensions:
      • input resolution (network-wise)
      • channel num (network-wise, top-n index)
    • Per batch:
      • input resolution (the same single crop, resized to 4 resolutions)
      • channel num (4 sampled widths: min, 2 random, max)
    • Inplace distillation: $CE(\text{max}, \hat{y}),\ KLDiv(\text{random}_{1,2}, y_{\text{max}}),\ KLDiv(\text{min}, y_{\text{max}})$

  • 2020-ECCV-RS-Net
    • Scaling dimensions:
      • input resolution (network-wise)
    • Per batch:
      • input resolution (the same single crop, resized to N resolutions $S_1 > S_2 > \dots > S_N$)
    • Inplace distillation: $CE(\text{ensemble}, \hat{y}),\ CE(S_1, \text{ensemble}),\ CE(S_2, S_1), \dots$

  • 2020-ICLR-Once for All
    • Scaling dimensions:
      • input resolution (network-wise)
      • channel num (layer-wise, channels ranked and selected by L1 norm)
      • layer num (stage-wise, keep the first n layers)
      • kernel size (layer-wise, center crop of the large kernel + a per-layer FC transformation)
    • Train the full network first, then progressively shrink it
    • KD: $CE(ps_1, \text{full}),\ CE(ps_2, ps_1), \dots$

  • 2020-ECCV-BigNAS
    • Scaling dimensions:
      • input resolution (network-wise)
      • channel num (layer-wise, top-n select)
      • layer num (stage-wise, top-n select)
      • kernel size (layer-wise, center crop of the largest kernel)
    • Per batch:
      • input resolution (the same single crop, resized to 4 resolutions)
      • sample 4 child models per batch following the sandwich rule: full, min, and 2 random
    • Inplace distillation: $CE(\text{full}, \hat{y}),\ KLDiv(\text{min}, y_{\text{full}}),\ KLDiv(\text{random}_{1,2}, y_{\text{full}})$ (a per-batch training sketch follows this list)
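Below is a minimal PyTorch-style sketch of that per-batch sandwich-rule step with inplace distillation. The `supernet.set_child(cfg)` method, the `sample_random_config()` helper, and the config objects are placeholders assumed for illustration, not the paper's actual API; input-resolution resampling is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def sandwich_train_step(supernet, images, labels, optimizer,
                        min_config, max_config, sample_random_config,
                        num_random=2):
    """One sandwich-rule step with inplace distillation (sketch)."""
    optimizer.zero_grad()

    # 1) Biggest (full) child: supervised by the ground-truth labels.
    supernet.set_child(max_config)
    full_logits = supernet(images)
    F.cross_entropy(full_logits, labels).backward()

    # Soft labels from the full model supervise every other child.
    soft_targets = F.softmax(full_logits.detach(), dim=-1)

    # 2) Smallest child + N random children: trained with soft labels only.
    for cfg in [min_config] + [sample_random_config() for _ in range(num_random)]:
        supernet.set_child(cfg)
        child_logits = supernet(images)
        F.kl_div(F.log_softmax(child_logits, dim=-1),
                 soft_targets, reduction='batchmean').backward()

    # Gradients from all sampled children accumulate, then one update.
    optimizer.step()
```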

(figure omitted)


Contribution


Method

Training a High-Quality Single-Stage Model

Sandwich Rule (previous work): in each step, train the smallest child, the biggest (full) child, and a few randomly sampled children together.

Inplace Distillation (previous work): the full child is trained with the ground-truth labels; all other sampled children are trained with the full child's soft predictions.

Batch Norm Calibration (previous work): before evaluating a child model, re-estimate its BN statistics on a few batches of training data, since the shared running statistics do not match any single child.
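A minimal sketch of BN calibration, reusing the hypothetical `supernet.set_child(cfg)` helper from above: running statistics are reset and re-accumulated by forwarding a few training batches in train mode without any gradient updates.

```python
import torch

@torch.no_grad()
def calibrate_bn(supernet, child_config, data_loader, num_batches=50):
    """Re-estimate BN running mean/var for one child model (sketch)."""
    supernet.set_child(child_config)

    for m in supernet.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None = cumulative moving average

    supernet.train()  # BN updates running stats only in train mode
    for i, (images, _) in enumerate(data_loader):
        if i >= num_batches:
            break
        supernet(images)
    supernet.eval()
```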


Initialization

  1. With He initialization, training of both small (left) and big (right) child models collapses (the training accuracy drops to zero) after a few thousand steps during the learning-rate warm-up.
  2. The single-stage model is able to converge if the learning rate is reduced to 30% of its original value.
  3. If the initialization is modified according to Section 3.1, the model learns much faster at the beginning of training (shown in Figure 4) and reaches better performance at the end of training (shown in Figure 5).

Section 3.1:

we initialize the output of each residual block (before skip connection) to an all-zeros tensor by setting the learnable scaling coefficient γ = 0 in the last Batch Normalization [20] layer of each residual block.

We also add a skip connection in each stage transition when either resolutions or channels differ (using 2 × 2 average pooling and/or 1 × 1 convolution if necessary) to explicitly construct an identity mapping.
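A minimal sketch of the γ = 0 trick for generic PyTorch residual blocks; the `last_bn` attribute is a placeholder for however the block exposes its final BatchNorm layer, not the paper's code:

```python
import torch.nn as nn

def zero_init_last_bn(model):
    """Zero the gamma of the last BN in every residual block (sketch)."""
    for block in model.modules():
        last_bn = getattr(block, 'last_bn', None)  # placeholder attribute
        if isinstance(last_bn, nn.BatchNorm2d):
            nn.init.zeros_(last_bn.weight)  # gamma = 0 -> block outputs zeros
```

Together with the extra skip connections at stage transitions, every residual block then starts as an identity mapping, which stabilizes the early warm-up phase for both small and big children.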

(figures omitted)


Convergence Behavior

Small child models converge more slowly and need more training.

==> use an exponentially decaying learning rate with a constant ending: the learning rate decays exponentially as usual, but is kept constant at a small value for the final part of training (a schedule sketch follows the figures).

(figures omitted)
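A minimal sketch of such a schedule; the decay constants and the 5% floor below are illustrative values, not the paper's exact hyper-parameters:

```python
def exp_decay_with_constant_ending(step, base_lr,
                                   decay_rate=0.97, decay_every=2400,
                                   end_fraction=0.05):
    """Exponential LR decay clamped to a constant small value at the end."""
    lr = base_lr * (decay_rate ** (step // decay_every))
    return max(lr, end_fraction * base_lr)
```

The constant tail gives the slowly converging small children extra training at a non-vanishing learning rate.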


Regularization

We compare the effect of regularization (weight decay and dropout) under two rules:

  1. apply regularization to all child models
  2. apply regularization only to the full (biggest) network

BigNAS adopts rule 2: the smaller children share their weights with the full network and are already implicitly regularized, so explicit regularization is applied only to the biggest child (a sketch follows the figure).

(figure omitted)
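A minimal sketch of rule 2, wired into the sandwich loop above; the function name and arguments are placeholders for illustration:

```python
import torch.nn as nn

def set_child_regularization(supernet, optimizer, is_full_model,
                             weight_decay=1e-5, dropout_rate=0.2):
    """Enable weight decay and dropout only for the biggest child (sketch)."""
    # Weight decay: active only on the pass that trains the full model.
    for group in optimizer.param_groups:
        group['weight_decay'] = weight_decay if is_full_model else 0.0

    # Dropout: active for the full model, disabled for all other children.
    for m in supernet.modules():
        if isinstance(m, nn.Dropout):
            m.p = dropout_rate if is_full_model else 0.0
```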


Coarse-to-fine Architecture Selection

Scaling dimensions:

  • input resolution (network-wise)
  • channel num (layer-wise, top-n select)
  • layer num (stage-wise, top-n select)
  • kernel size (layer-wise, center crop of the largest kernel)

For the coarse stage, we pre-define (a selection sketch follows the figure below):

  • five input resolutions (network-wise: {192, 224, 256, 288, 320})
  • four depth configurations (stage-wise)
  • two channel configurations (stage-wise)
  • four kernel-size configurations (stage-wise)

(figure omitted)
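A minimal sketch of the coarse-to-fine selection loop, assuming hypothetical `benchmark(cfg)` (accuracy and FLOPs of a BN-calibrated child) and `fine_mutations(cfg)` (nearby layer-wise variants) helpers; neither is the paper's actual interface:

```python
import itertools

def coarse_to_fine_search(benchmark, fine_mutations,
                          resolutions, depths, channels, kernels,
                          flops_budget, num_fine_rounds=2):
    """Coarse grid sweep followed by local fine-grained mutation (sketch)."""
    # Coarse stage: enumerate the pre-defined grid, keep the best feasible config.
    grid = [dict(resolution=r, depth=d, channel=c, kernel=k)
            for r, d, c, k in itertools.product(resolutions, depths,
                                                channels, kernels)]
    best_acc, best_cfg = -1.0, None
    for cfg in grid:
        acc, flops = benchmark(cfg)
        if flops <= flops_budget and acc > best_acc:
            best_acc, best_cfg = acc, cfg

    # Fine stage: mutate the best coarse candidate for a few rounds.
    for _ in range(num_fine_rounds):
        for cfg in fine_mutations(best_cfg):
            acc, flops = benchmark(cfg)
            if flops <= flops_budget and acc > best_acc:
                best_acc, best_cfg = acc, cfg
    return best_cfg, best_acc
```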


Experiments

Cost

8×8 TPUv3

Training one single-stage model takes roughly 36 hours.


ImageNet

(figure omitted)


Ablation Study

Finetuning child models

(figure omitted)


Training child models from scratch

(figure omitted)


Conclusion

Summary

  • This work brings one-shot supernet training to a fairly complete form; by comparison, OFA still needs multiple stages of progressive distillation and fine-tuning, so it is less truly one-shot.

  • It carries the "no fine-tuning needed" motivation all the way through.

  • The main reason the whole pipeline works is probably the supernet training scheme (top-n selection + self knowledge distillation); the remaining parts are small tweaks each worth only one or two points.

  • The training cost is also acceptable (compared with Once-for-All).

  • Neither fine-tuning nor training a child model from scratch improves accuracy further, which is a bit counter-intuitive and suggests the supernet training pipeline really does benefit the child models.

  • The search space appears to be carefully hand-designed.

  • Not open-sourced.


To Read

Reference

Original post: https://www.cnblogs.com/chenbong/p/14353937.html