【BigNAS】2020-ECCV-BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models - Paper Reading Notes


  • Institute: Google Brain
  • Author: Jiahui Yu
  • GitHub: / (not open-sourced)
  • Citation: 20+

Introduction

Train a single-stage supernet; child models sampled from it can be sliced out and deployed directly, without retraining or fine-tuning.

(figure omitted)


Motivation

  • 2019-ICLR-Slimmable Neural Networks
    • Scaling dimensions:
      • channel num (network-wise, top-n index)
    • Per batch:
      • channel num (4 fixed widths: 0.25x, 0.5x, 0.75x, 1.0x)

  • 2019-ICCV-US-Net
    • Scaling dimensions:
      • channel num (network-wise, top-n index)
    • Per batch:
      • channel num (4 sampled widths: min, 2 random, max)
    • Inplace distillation: $CE(\text{max}, \hat{y}),\ CE(\text{min}, y_{\text{max}}),\ CE(\text{random}_{1,2}, y_{\text{max}})$

  • 2020-ECCV-MutualNet
    • Scaling dimensions:
      • input resolution (network-wise)
      • channel num (network-wise, top-n index)
    • Per batch:
      • input resolution (the same single crop, resized to 4 resolutions)
      • channel num (4 sampled widths: min, 2 random, max)
    • Inplace distillation: $CE(\text{max}, \hat{y}),\ KLDiv(\text{random}_{1,2}, y_{\text{max}}),\ KLDiv(\text{min}, y_{\text{max}})$

  • 2020-ECCV-RS-Net
    • Scaling dimensions:
      • input resolution (network-wise)
    • Per batch:
      • input resolution (the same single crop, resized to N resolutions $S_1 > S_2 > \dots > S_N$)
    • Inplace distillation: $CE(\text{ensemble}, \hat{y}),\ CE(S_1, \text{ensemble}),\ CE(S_2, S_1), \dots$

  • 2020-ICLR-Once for All
    • Scaling dimensions:
      • input resolution (network-wise)
      • channel num (layer-wise, channels ranked and selected by L1 norm)
      • layer num (stage-wise, keep the first n layers)
      • kernel size (layer-wise, center crop of the large kernel + a per-layer FC transformation)
    • Train the full network first, then progressively shrink it
    • KD: $CE(ps_1, \text{full}),\ CE(ps_2, ps_1), \dots$

  • 2020-ECCV-BigNAS
    • Scaling dimensions:
      • input resolution (network-wise)
      • channel num (layer-wise, top-n select)
      • layer num (stage-wise, top-n select)
      • kernel size (layer-wise, center crop of the largest kernel)
    • Per batch:
      • input resolution (the same single crop, resized to 4 resolutions)
      • sample 4 child models per batch following the sandwich rule: full, min, and 2 random
    • Inplace distillation: $CE(\text{full}, \hat{y}),\ KLDiv(\text{min}, y_{\text{full}}),\ KLDiv(\text{random}_{1,2}, y_{\text{full}})$ (a per-batch training sketch follows this list)
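Below is a minimal PyTorch-style sketch of that per-batch sandwich-rule step with inplace distillation. The `supernet.set_child(cfg)` method, the `sample_random_config()` helper, and the config objects are placeholders assumed for illustration, not the paper's actual API; input-resolution resampling is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def sandwich_train_step(supernet, images, labels, optimizer,
                        min_config, max_config, sample_random_config,
                        num_random=2):
    """One sandwich-rule step with inplace distillation (sketch)."""
    optimizer.zero_grad()

    # 1) Biggest (full) child: supervised by the ground-truth labels.
    supernet.set_child(max_config)
    full_logits = supernet(images)
    F.cross_entropy(full_logits, labels).backward()

    # Soft labels from the full model supervise every other child.
    soft_targets = F.softmax(full_logits.detach(), dim=-1)

    # 2) Smallest child + N random children: trained with soft labels only.
    for cfg in [min_config] + [sample_random_config() for _ in range(num_random)]:
        supernet.set_child(cfg)
        child_logits = supernet(images)
        F.kl_div(F.log_softmax(child_logits, dim=-1),
                 soft_targets, reduction='batchmean').backward()

    # Gradients from all sampled children accumulate, then one update.
    optimizer.step()
```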

(figure omitted)


Contribution


Method

Training a High-Quality Single-Stage Model

Sandwich Rule (previous work): in each step, train the smallest child, the biggest (full) child, and a few randomly sampled children together.

Inplace Distillation (previous work): the full child is trained with the ground-truth labels; all other sampled children are trained with the full child's soft predictions.

Batch Norm Calibration (previous work): before evaluating a child model, re-estimate its BN statistics on a few batches of training data, since the shared running statistics do not match any single child.
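A minimal sketch of BN calibration, reusing the hypothetical `supernet.set_child(cfg)` helper from above: running statistics are reset and re-accumulated by forwarding a few training batches in train mode without any gradient updates.

```python
import torch

@torch.no_grad()
def calibrate_bn(supernet, child_config, data_loader, num_batches=50):
    """Re-estimate BN running mean/var for one child model (sketch)."""
    supernet.set_child(child_config)

    for m in supernet.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None = cumulative moving average

    supernet.train()  # BN updates running stats only in train mode
    for i, (images, _) in enumerate(data_loader):
        if i >= num_batches:
            break
        supernet(images)
    supernet.eval()
```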


Initialization

  1. With He initialization, training of both small (left) and big (right) child models collapses (the training accuracy drops to zero) after a few thousand steps during the learning-rate warm-up.
  2. The single-stage model is able to converge if the learning rate is reduced to 30% of its original value.
  3. If the initialization is modified according to Section 3.1, the model learns much faster at the beginning of training (shown in Figure 4) and reaches better performance at the end of training (shown in Figure 5).

Section 3.1:

we initialize the output of each residual block (before skip connection) to an all-zeros tensor by setting the learnable scaling coefficient γ = 0 in the last Batch Normalization [20] layer of each residual block.

We also add a skip connection in each stage transition when either resolutions or channels differ (using 2 × 2 average pooling and/or 1 × 1 convolution if necessary) to explicitly construct an identity mapping.
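A minimal sketch of the γ = 0 trick for generic PyTorch residual blocks; the `last_bn` attribute is a placeholder for however the block exposes its final BatchNorm layer, not the paper's code:

```python
import torch.nn as nn

def zero_init_last_bn(model):
    """Zero the gamma of the last BN in every residual block (sketch)."""
    for block in model.modules():
        last_bn = getattr(block, 'last_bn', None)  # placeholder attribute
        if isinstance(last_bn, nn.BatchNorm2d):
            nn.init.zeros_(last_bn.weight)  # gamma = 0 -> block outputs zeros
```

Together with the extra skip connections at stage transitions, every residual block then starts as an identity mapping, which stabilizes the early warm-up phase for both small and big children.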

(figures omitted)


Convergence Behavior

Small child models converge more slowly and need more training.

==> use an exponentially decaying learning rate with a constant ending: the learning rate decays exponentially as usual, but is kept constant at a small value for the final part of training (a schedule sketch follows the figures).

(figures omitted)
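A minimal sketch of such a schedule; the decay constants and the 5% floor below are illustrative values, not the paper's exact hyper-parameters:

```python
def exp_decay_with_constant_ending(step, base_lr,
                                   decay_rate=0.97, decay_every=2400,
                                   end_fraction=0.05):
    """Exponential LR decay clamped to a constant small value at the end."""
    lr = base_lr * (decay_rate ** (step // decay_every))
    return max(lr, end_fraction * base_lr)
```

The constant tail gives the slowly converging small children extra training at a non-vanishing learning rate.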


Regularization

We compare the effect of regularization (weight decay and dropout) under two rules:

  1. apply regularization to all child models
  2. apply regularization only to the full (biggest) network

BigNAS adopts rule 2: the smaller children share their weights with the full network and are already implicitly regularized, so explicit regularization is applied only to the biggest child (a sketch follows the figure).

(figure omitted)
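A minimal sketch of rule 2, wired into the sandwich loop above; the function name and arguments are placeholders for illustration:

```python
import torch.nn as nn

def set_child_regularization(supernet, optimizer, is_full_model,
                             weight_decay=1e-5, dropout_rate=0.2):
    """Enable weight decay and dropout only for the biggest child (sketch)."""
    # Weight decay: active only on the pass that trains the full model.
    for group in optimizer.param_groups:
        group['weight_decay'] = weight_decay if is_full_model else 0.0

    # Dropout: active for the full model, disabled for all other children.
    for m in supernet.modules():
        if isinstance(m, nn.Dropout):
            m.p = dropout_rate if is_full_model else 0.0
```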


Coarse-to-fine Architecture Selection

Scaling dimensions:

  • input resolution (network-wise)
  • channel num (layer-wise, top-n select)
  • layer num (stage-wise, top-n select)
  • kernel size (layer-wise, center crop of the largest kernel)

For the coarse stage, we pre-define (a selection sketch follows the figure below):

  • five input resolutions (network-wise: {192, 224, 256, 288, 320})
  • four depth configurations (stage-wise)
  • two channel configurations (stage-wise)
  • four kernel-size configurations (stage-wise)

(figure omitted)
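A minimal sketch of the coarse-to-fine selection loop, assuming hypothetical `benchmark(cfg)` (accuracy and FLOPs of a BN-calibrated child) and `fine_mutations(cfg)` (nearby layer-wise variants) helpers; neither is the paper's actual interface:

```python
import itertools

def coarse_to_fine_search(benchmark, fine_mutations,
                          resolutions, depths, channels, kernels,
                          flops_budget, num_fine_rounds=2):
    """Coarse grid sweep followed by local fine-grained mutation (sketch)."""
    # Coarse stage: enumerate the pre-defined grid, keep the best feasible config.
    grid = [dict(resolution=r, depth=d, channel=c, kernel=k)
            for r, d, c, k in itertools.product(resolutions, depths,
                                                channels, kernels)]
    best_acc, best_cfg = -1.0, None
    for cfg in grid:
        acc, flops = benchmark(cfg)
        if flops <= flops_budget and acc > best_acc:
            best_acc, best_cfg = acc, cfg

    # Fine stage: mutate the best coarse candidate for a few rounds.
    for _ in range(num_fine_rounds):
        for cfg in fine_mutations(best_cfg):
            acc, flops = benchmark(cfg)
            if flops <= flops_budget and acc > best_acc:
                best_acc, best_cfg = acc, cfg
    return best_cfg, best_acc
```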


Experiments

Cost

8×8 TPUv3

Training one single-stage model takes roughly 36 hours.


ImageNet

(figure omitted)


Ablation Study

Finetuning child models

(figure omitted)


Training child models from scratch

(figure omitted)


Conclusion

Summary

  • This work brings one-shot supernet training to a fairly complete form; by comparison, OFA still needs multiple stages of progressive distillation and fine-tuning, so it is less truly one-shot.

  • It carries the "no fine-tuning needed" motivation all the way through.

  • The main reason the whole pipeline works is probably the supernet training scheme (top-n selection + self knowledge distillation); the remaining parts are small tweaks each worth only one or two points.

  • The training cost is also acceptable (compared with Once-for-All).

  • Neither fine-tuning nor training a child model from scratch improves accuracy further, which is a bit counter-intuitive and suggests the supernet training pipeline really does benefit the child models.

  • The search space appears to be carefully hand-designed.

  • Not open-sourced.


To Read

Reference

Original post: https://www.cnblogs.com/chenbong/p/14353937.html