Faster R-CNN

Girshick R., Donahue J., Darrell T. and Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

Girshick R. Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2015.

Ren S., He K., Girshick R. and Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.

Lin T., Dollár P., Girshick R., He K., Hariharan B. and Belongie S. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

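I recently read through the object detection papers and PyTorch's official code. The number of tricks and the sheer amount of code involved are overwhelming, and I seriously doubt I will be able to recall this pipeline later, so here is a brief walkthrough.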

Input: a list of images \(x\) and a list of targets \(y\) (boxes, labels, area, image_id, iscrowd);
Apply the transform:
1. normalize: \(\frac{x - \mu}{\sigma}\);
2. resize: rescale each image to a similar size while keeping its aspect ratio, and scale the boxes by the same factor;
3. pad: embed every image into a blank 'image' of one common size so that they can be stacked into a batch;
4. this gives \(x \in \mathbb{R}^{N \times 3 \times H \times W}\);
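The resize and pad steps above can be sketched in plain Python. This is only a minimal sketch: the values `min_size=800`, `max_size=1333` and `divisor=32` are assumptions (to my knowledge they match the defaults of torchvision's detection transform, but check your version), and the function names are made up for illustration.

```python
import math

def resize_scale(height, width, min_size=800, max_size=1333):
    """Scale factor that brings the shorter side to min_size, unless that
    would push the longer side beyond max_size (assumed default values)."""
    return min(min_size / min(height, width), max_size / max(height, width))

def scale_boxes(boxes, scale):
    """Scale (x1, y1, x2, y2) boxes by the same factor as their image."""
    return [tuple(c * scale for c in box) for box in boxes]

def padded_batch_shape(sizes, divisor=32):
    """Common (H, W) of the batch: the per-axis maximum over all images,
    rounded up to a multiple of `divisor` (assumed value)."""
    max_h = max(h for h, _ in sizes)
    max_w = max(w for _, w in sizes)
    return (math.ceil(max_h / divisor) * divisor,
            math.ceil(max_w / divisor) * divisor)

# A 480x640 image: the shorter side decides the scale (800 / 480).
scale = resize_scale(480, 640)
boxes = scale_boxes([(10, 20, 30, 40)], scale)

# Two resized images of different sizes are padded to one common shape.
batch_hw = padded_batch_shape([(800, 1067), (750, 1333)])  # (800, 1344)
```

The aspect ratio is never changed by the resize, which is why a single scalar is enough to update the boxes.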
Pass through an ordinary encoder such as resnet50 and collect the features of its different stages:

\[[z_1, z_2, z_3, z_4], \quad z_i \in \mathbb{R}^{N \times D_i \times H_i \times W_i}; \]

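FPN processing:
1. first, \(1\times 1\) convolutions map the different \(D_i\) to one common \(D\) (e.g. 256);
2. \(z_4\) is upsampled by interpolation to twice its spatial size, giving \(z_4'\), which is added to \(z_3\):
\[z_3 = z_3 + z_4', \]
and similarly
\[z_2 = z_2 + z_3',\\ z_1 = z_1 + z_2', \]
3. every merged map then goes through a \(3\times 3\) convolution, giving the new features \(z_1, z_2, z_3, z_4\);
4. an extra block (usually a pooling layer) is applied to \(z_4\) to obtain \(z_5\);
5. the final features: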
\[[z_1, z_2, z_3, z_4, z_5]; \]
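To make the top-down merge concrete, here is a small sketch on single-channel 2-D maps stored as nested lists. It assumes nearest-neighbour interpolation and that each level is exactly twice the size of the next coarser one; the \(1\times 1\) and \(3\times 3\) convolutions are left out, and all function names are hypothetical.

```python
def upsample2x(feature):
    """Nearest-neighbour upsampling of a 2-D map (list of rows) by 2."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out

def add_maps(a, b):
    """Element-wise sum of two maps of identical shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def top_down(laterals):
    """laterals: maps ordered from fine to coarse, already reduced to the
    same channel count. Returns the merged maps in the same order."""
    merged = [laterals[-1]]                        # coarsest map is kept
    for lateral in reversed(laterals[:-1]):
        # merged[0] is the already-merged next-coarser level
        merged.insert(0, add_maps(lateral, upsample2x(merged[0])))
    return merged

# Three levels for brevity: 4x4, 2x2 and 1x1.
z2 = [[0] * 4 for _ in range(4)]
z3 = [[10, 10], [10, 10]]
z4 = [[1]]
m2, m3, m4 = top_down([z2, z3, z4])
# m3 == [[11, 11], [11, 11]]; m2 is a 4x4 map filled with 11
```

Note that each level receives the merged coarser level, not the raw one, so information from \(z_4\) reaches all the way down to \(z_1\).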
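Use the anchor generator to produce the anchor boxes from which the proposals are later decoded:
1. for \(z_1, z_2, z_3, z_4, z_5\), generate base anchors with areas \(32^2, 64^2, 128^2, 256^2, 512^2\) respectively and aspect ratios \(0.5, 1, 2\), i.e. every feature level has \(K=3\) anchor shapes (only the base area differs between levels);
2. replicate these base anchors at every location, i.e. shift their centers, so that each level has \(H_l \times W_l \times K\) anchors (per image);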
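The two anchor steps above can be sketched as follows. Two things are assumptions here: the aspect ratio is taken to mean height / width, and neighbouring feature cells are assumed to be `stride` image pixels apart; the function names are illustrative only.

```python
import math

def base_anchors(size, aspect_ratios=(0.5, 1.0, 2.0)):
    """Zero-centred (x1, y1, x2, y2) anchors with area size**2.
    ratio is interpreted as height / width (an assumption)."""
    anchors = []
    for ratio in aspect_ratios:
        h = size * math.sqrt(ratio)
        w = size / math.sqrt(ratio)   # keeps w * h == size**2
        anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors

def grid_anchors(base, feat_h, feat_w, stride):
    """Copy the base anchors to every feature location by shifting
    their centres; returns feat_h * feat_w * len(base) boxes."""
    out = []
    for i in range(feat_h):
        for j in range(feat_w):
            sx, sy = j * stride, i * stride
            for x1, y1, x2, y2 in base:
                out.append((x1 + sx, y1 + sy, x2 + sx, y2 + sy))
    return out

# Finest level: area 32^2 on a (tiny) 4x6 feature map with stride 4.
anchors = grid_anchors(base_anchors(32), feat_h=4, feat_w=6, stride=4)
print(len(anchors))  # 4 * 6 * 3 = 72
```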
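Use the RPN head to compute, for every anchor, the probability that it contains an object (binary classification) and its predicted offsets \(\hat{\delta}\):
1. preprocess \(z\) with a [3, 1, 1] convolution (kernel size 3, stride 1, padding 1);
2. a [1, 1, 0] convolution gives logits of shape \(N\times K \times H_l \times W_l\);
3. another [1, 1, 0] convolution gives \(\hat{\delta}\) of shape \(N \times 4K \times H_l \times W_l\);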
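A quick shape check of the head, reading [k, s, p] as kernel size, stride and padding (my interpretation of the notation). The helper names are hypothetical; the point is only that both kinds of convolution preserve \(H_l \times W_l\), so there is one score and four offsets per anchor and location.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def rpn_head_shapes(n, h, w, k=3):
    """Shapes of the logits and offsets for one feature level."""
    # [3, 1, 1] convolution: spatial size unchanged
    h1, w1 = conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1)
    # [1, 1, 0] convolutions: spatial size unchanged again
    h2, w2 = conv_out(h1, 1, 1, 0), conv_out(w1, 1, 1, 0)
    return (n, k, h2, w2), (n, 4 * k, h2, w2)

logits_shape, deltas_shape = rpn_head_shapes(n=2, h=50, w=68)
# logits_shape == (2, 3, 50, 68), deltas_shape == (2, 12, 50, 68)
```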
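RPN:
1. decode the proposals via
\[\hat{P}_x = A_x + \hat{\delta}_x A_w \\ \hat{P}_y = A_y + \hat{\delta}_y A_h \\ \hat{P}_w = A_w \times \exp(\hat{\delta}_w) \\ \hat{P}_h = A_h \times \exp(\hat{\delta}_h) \]
(note that \((x, y)\) denotes the box center here; the proposals play no part in the losses below and are only used by the later stages);
2. filter the proposals: keep the high-confidence ones, drop those that are too small, apply non-maximum suppression, etc.;
3. match every anchor to a suitable ground-truth box and compute the target \(\delta\) via:
\[\delta_x = (G_x - A_x) / A_w \\ \delta_y = (G_y - A_y) / A_h \\ \delta_w = \log (G_w / A_w) \\ \delta_h = \log (G_h / A_h); \]
4. sample a batch of \(\hat{\delta}, \delta\) pairs (keeping a suitable ratio of positives to negatives);
5. compute the binary classification loss on the logits and the regression loss (smooth_l1_loss) between \(\hat{\delta}\) and \(\delta\);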
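The encoding and decoding formulas above are inverses of each other, which the following sketch checks on one box. Boxes are assumed to be stored as corners (x1, y1, x2, y2) and converted to center form first; `beta` in the smooth L1 loss is a hyperparameter whose value here is just a placeholder, and no weighting of the deltas is applied.

```python
import math

def to_center(box):
    """(x1, y1, x2, y2) -> (center x, center y, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def encode(gt, anchor):
    """Target delta of a ground-truth box G relative to an anchor A."""
    gx, gy, gw, gh = to_center(gt)
    ax, ay, aw, ah = to_center(anchor)
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))

def decode(delta, anchor):
    """Apply a delta to an anchor and return the box as corners."""
    ax, ay, aw, ah = to_center(anchor)
    dx, dy, dw, dh = delta
    px, py = ax + dx * aw, ay + dy * ah
    pw, ph = aw * math.exp(dw), ah * math.exp(dh)
    return (px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2)

def smooth_l1(pred, target, beta=1.0):
    """Summed smooth L1 loss: quadratic below beta, linear above."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total

anchor = (0.0, 0.0, 10.0, 10.0)
gt = (2.0, 2.0, 12.0, 22.0)
delta = encode(gt, anchor)          # (0.2, 0.7, 0.0, log 2)
recovered = decode(delta, anchor)   # equals gt up to rounding error
loss = smooth_l1((0.0, 0.0, 0.0, 0.0), delta)
```

Dividing the center offsets by the anchor size and taking logs of the size ratios makes the targets roughly scale-invariant, so the same regression head can serve anchors of very different sizes.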
Assign a class label \(\{0, 1, 2,\cdots\}\) (0 denotes background) to each proposal extracted in the previous step, then sample a batch, i.e.:
- proposals: List[Tensor], \(N * B \times 4\);
- labels: List[Tensor], \(N * B\);
- regression_deltas: List[Tensor], \(N * B \times 4\);
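The feature region corresponding to each proposal is pooled into an \(h \times w\) feature (commonly \(7\times 7\)); this is the job of roi pooling (or the more precise roi align), giving features of shape \((N\times B) \times C \times h \times w\) in total;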
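As an illustration of the pooling step, here is a sketch of plain roi max pooling on a single-channel 2-D map. It follows the quantised binning of the original roi pooling as I understand it, not roi align (which samples with bilinear interpolation instead of rounding); `spatial_scale` and the function name are assumptions for the example.

```python
import math

def roi_max_pool(feature, box, out_h=7, out_w=7, spatial_scale=1.0):
    """feature: 2-D list (one channel); box: (x1, y1, x2, y2) in image
    coordinates. Returns an out_h x out_w grid of per-bin maxima."""
    # Project the box onto the feature map and quantise it.
    x1, y1, x2, y2 = (round(c * spatial_scale) for c in box)
    roi_w = max(x2 - x1 + 1, 1)
    roi_h = max(y2 - y1 + 1, 1)
    height, width = len(feature), len(feature[0])
    pooled = []
    for i in range(out_h):
        # Row range of bin i, clipped to the feature map.
        hs = min(max(y1 + math.floor(i * roi_h / out_h), 0), height)
        he = min(max(y1 + math.ceil((i + 1) * roi_h / out_h), 0), height)
        row = []
        for j in range(out_w):
            ws = min(max(x1 + math.floor(j * roi_w / out_w), 0), width)
            we = min(max(x1 + math.ceil((j + 1) * roi_w / out_w), 0), width)
            cells = [feature[y][x]
                     for y in range(hs, he) for x in range(ws, we)]
            row.append(max(cells) if cells else 0.0)  # empty bin -> 0
        pooled.append(row)
    return pooled

feature = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
pooled = roi_max_pool(feature, (0, 0, 3, 3), out_h=2, out_w=2)
# pooled == [[5, 7], [13, 15]]
```

Whatever the size of the proposal, the output grid has a fixed shape, which is what lets the following MLP consume all proposals in one batch.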
These features are refined by a two-layer MLP and then passed through the predictor, which outputs:
- logits: Tensor, \((N \times B) \times \mathrm{classes}\);
- \(\hat{\delta}\): \((N \times B) \times 4\mathrm{classes}\);
In training, two losses (classification and regression) are obtained, just as in the RPN;
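Since \(\hat{\delta}\) holds four offsets for every class, the regression loss has to pick the four that belong to each proposal's label. A minimal sketch, assuming the \(4\,\mathrm{classes}\) values are laid out class by class and that only foreground proposals (label > 0) contribute to the regression loss; the function names are hypothetical.

```python
def class_deltas(flat_deltas, label):
    """Pick the 4 offsets of class `label` from a flat vector of
    4 * num_classes values laid out class by class (an assumption)."""
    start = 4 * label
    return tuple(flat_deltas[start:start + 4])

def regression_pairs(pred_rows, labels, targets):
    """Pair predicted and target deltas for foreground proposals only;
    background proposals (label 0) have no box to regress to."""
    pairs = []
    for flat, label, target in zip(pred_rows, labels, targets):
        if label > 0:
            pairs.append((class_deltas(flat, label), tuple(target)))
    return pairs

flat = list(range(12))                      # 3 classes x 4 offsets
print(class_deltas(flat, 2))                # (8, 9, 10, 11)
pairs = regression_pairs([flat, flat], [0, 1],
                         [(0, 0, 0, 0), (1, 1, 1, 1)])
# Only the second proposal (label 1) survives.
```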
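In inference:
1. decode \(P, \hat{\delta}\) into the predicted bounding boxes;
2. compute the corresponding confidence scores;
3. post-process the bounding boxes (drop background, low-confidence boxes, etc.);
4. apply non-maximum suppression.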
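The last three inference steps can be sketched as below. This is a simplified version under a few assumptions: the confidence is taken as the softmax of the logits, the thresholds are placeholder values, and all boxes are treated as one class, whereas in practice suppression is usually done per class.

```python
import math

def softmax(logits):
    """Turn class logits into confidence scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns kept indices, best first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
score_thresh = 0.05                          # placeholder threshold
candidates = [i for i, s in enumerate(scores) if s > score_thresh]
kept = nms([boxes[i] for i in candidates], [scores[i] for i in candidates])
print(kept)  # [0, 2]: the second box overlaps the first too much
```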