SSD-tensorflow源码阅读

之前写的一篇SSD论文学习笔记因为没保存丢掉了，然后不想重新写，直接进行下一步吧。SSD延续了yolo系列的思路，引入了Faster-RCNN anchor的概念。不同特征层采样，多anchor. SSD源码阅读 https://github.com/balancap/SSD-Tensorflow

ssd_vgg_300.py为主要程序。其中ssd_net函数为定义网络结构。先简单解释下SSD是如何提取feature map的。如下图，利用VGG-16，采用多尺度提取，提取不同卷积层的特征网络。一般为6个，层数大小分别为conv4 ==> 64 x 64，conv7 ==> 32 x 32，conv8 ==> 16 x 16，conv9 ==> 8 x 8，conv10 ==> 4 x 4，conv11 ==> 2 x 2，conv12 ==> 1 x 1。

 1 ###定义网络结构，将不同卷积层存储在end_points中。此部分用了tensorflow.slim模块，类似于keras
       end_points = {}
 2     with tf.variable_scope(scope, 'ssd_300_vgg', [inputs], reuse=reuse):
 3         # Original VGG-16 blocks.
 4         net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
 5         end_points['block1'] = net
 6         net = slim.max_pool2d(net, [2, 2], scope='pool1')
 7         # Block 2.
 8         net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
 9         end_points['block2'] = net
10         net = slim.max_pool2d(net, [2, 2], scope='pool2')
11         # Block 3.
12         net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
13         end_points['block3'] = net
14         net = slim.max_pool2d(net, [2, 2], scope='pool3')
15         # Block 4.
16         net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
17         end_points['block4'] = net
18         net = slim.max_pool2d(net, [2, 2], scope='pool4')
19         # Block 5.
20         net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
21         end_points['block5'] = net
22         net = slim.max_pool2d(net, [3, 3], stride=1, scope='pool5')
23 
24         # Additional SSD blocks.
25         # Block 6: let's dilate the hell out of it!
26         net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
27         end_points['block6'] = net
28         net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)
29         # Block 7: 1x1 conv. Because the fuck.
30         net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
31         end_points['block7'] = net
32         net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)
33 
34         # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
35         end_point = 'block8'
36         with tf.variable_scope(end_point):
37             net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
38             net = custom_layers.pad2d(net, pad=(1, 1))
39             net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
40         end_points[end_point] = net
41         end_point = 'block9'
42         with tf.variable_scope(end_point):
43             net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
44             net = custom_layers.pad2d(net, pad=(1, 1))
45             net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
46         end_points[end_point] = net
47         end_point = 'block10'
48         with tf.variable_scope(end_point):
49             net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
50             net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
51         end_points[end_point] = net
52         end_point = 'block11'
53         with tf.variable_scope(end_point):
54             net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
55             net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
56         end_points[end_point] = net

接下来ssd_multibox_layer 函数为按每一层feature map('block4', 'block7', 'block8', 'block9', 'block10', 'block11')生成不同的anchor进行预测。源码中生成anchor方式与前面所述不太一样。论文中方式提取网络后在不同feature map设置不同大小的anchor，基准的size大小计算方式为，k为不同的特征层取值，比conv4是k为1.Smax=0.9，Smin为0.2. 每个feature map，以基准SIZE生成4-6个不同比例的anchor，比例分别为{1,2,3,1/2,1/3}，其中比例为1时，size为Sk*Sk+1。以输入为300X300尺寸，conv4层的feature map为例。S1=0.2*300=60，选取的比例分别为{1,2,1/2,1‘’}。不同anchor的w分别为{60,60*1.42,60*0.7,112.5}. 但实际函数中不是按这种方法来计算的。接下来分析源码中的计算方式。源码中直接给出了每一层的大小及比例。此函数作用为提取feature map生成预测的位置及类别。此项涉及到提取的feature map数据流通方式。此函数中有两条路线，经过一次batchnorm和卷积，生成类别信息（21*num_anchor*w*h）及位置信息的预测。实际应有三条线？分别生成代码如下：

 1 def ssd_multibox_layer(inputs,
 2                        num_classes,
 3                        sizes,
 4                        ratios=[1],
 5                        normalization=-1,
 6                        bn_normalization=False):
 7     """Construct a multibox layer, return a class and localization predictions.
 8     """
 9     net = inputs
10     if normalization > 0:
11         net = custom_layers.l2_normalization(net, scaling=True)
12     # Number of anchors.
13     num_anchors = len(sizes) + len(ratios) ###4~6,两个sizes代表例为1:1的，sizes代表其他比例的anchor,整体代表一个feature map有几个anchor
14 
15     # Location. 对位置进行预测
16     num_loc_pred = num_anchors * 4
17     loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
18                            scope='conv_loc')
19     loc_pred = custom_layers.channel_to_last(loc_pred)
20     loc_pred = tf.reshape(loc_pred,
21                           tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])
22     # Class prediction. 对类别进行预测
23     num_cls_pred = num_anchors * num_classes
24     cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
25                            scope='conv_cls')
26     cls_pred = custom_layers.channel_to_last(cls_pred)
27     cls_pred = tf.reshape(cls_pred,
28                           tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes])
29     return cls_pred, loc_pred  ###生成每个feature map每个anchor的预测

接下来是利用上式结果生成默认的anchor.

 1 def ssd_anchor_one_layer(img_shape,
 2                          feat_shape,
 3                          sizes,
 4                          ratios,
 5                          step,
 6                          offset=0.5,
 7                          dtype=np.float32):
 8     ##函数作用：生成每一层feature map的不同方格的不同anchor的中心坐标和w,h并返回
 9     ##生成每层feature map中每个小方框的中心坐标位置 *step/img_shape结果为在原图中相对位置
10     y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
11     y = (y.astype(dtype) + offset) * step / img_shape[0]
12     x = (x.astype(dtype) + offset) * step / img_shape[1]
13 
14     # Expand dims to support easy broadcasting.
15     y = np.expand_dims(y, axis=-1)
16     x = np.expand_dims(x, axis=-1)
17 
18     # Compute relative height and width.
19     # Tries to follow the original implementation of SSD for the order.
20     ###每个feature map的每个小方格，有4-6个anchor,这4-6个anchor比例不同，分别为{1,2,3,1/2,1/3}。但是同一个feature map的不同小方格，对应的anchor
21     ####w,h是相通的
22     num_anchors = len(sizes) + len(ratios)  ###anchor个数
23     h = np.zeros((num_anchors, ), dtype=dtype)
24     w = np.zeros((num_anchors, ), dtype=dtype)
25     # Add first anchor boxes with ratio=1. 1:1的anchor的w,h
26     h[0] = sizes[0] / img_shape[0]
27     w[0] = sizes[0] / img_shape[1]
28     di = 1
29     if len(sizes) > 1: ###另外一个1:1的anchor的w,h
30         h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
31         w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
32         di += 1
33     for i, r in enumerate(ratios): ####其他比例的anchor的w,h比如{2,3,1/2,1/3}计算方式已写
34         h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)
35         w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)
36     return y, x, h, w   
37 
38 
39 def ssd_anchors_all_layers(img_shape,
40                            layers_shape,
41                            anchor_sizes,
42                            anchor_ratios,
43                              anchor_steps,
44                            offset=0.5,
45                            dtype=np.float32):
46     """Compute anchor boxes for all feature layers.
47     生成不同层feature map的anchor并返回
48     """
49     layers_anchors = []
50     for i, s in enumerate(layers_shape):
51         anchor_bboxes = ssd_anchor_one_layer(img_shape, s,
52                                              anchor_sizes[i],
53                                              anchor_ratios[i],
54                                              anchor_steps[i],
55                                              offset=offset, dtype=dtype)
56         layers_anchors.append(anchor_bboxes)
57     return layers_anchors

上面通过网络生成了预测的anchor坐标接下来便是ground Truth的处理，用到的函数主要为tf_ssd_bboxes_encode_layer。此函数的作用是对每一层feature map的预测框进行处理，去除掉不满足要求的预测框(即设为0），同时对满足要求的预测框找出与真实框的对应关系。

  1 def tf_ssd_bboxes_encode_layer(labels,
  2                                bboxes,
  3                                anchors_layer,
  4                                num_classes,
  5                                no_annotation_label,
  6                                ignore_threshold=0.5,
  7                                prior_scaling=[0.1, 0.1, 0.2, 0.2],
  8                                dtype=tf.float32):
  9     """Encode  groundtruth labels and bounding boxes using SSD anchors from
 10     one layer.
 11 
 12     Arguments:
 13       labels: 1D Tensor(int64) containing groundtruth labels;
 14       bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
 15       anchors_layer: Numpy array with layer anchors;
 16       matching_threshold: Threshold for positive match with groundtruth bboxes;
 17       prior_scaling: Scaling of encoded coordinates.
 18 
 19     Return:
 20       (target_labels, target_localizations, target_scores): Target Tensors.
 21     """
 22     # Anchors coordinates and volume.
 23     yref, xref, href, wref = anchors_layer ###固定生成的anchor的中心坐标及w,h等
 24     ymin = yref - href / 2.
 25     xmin = xref - wref / 2.
 26     ymax = yref + href / 2.
 27     xmax = xref + wref / 2.
 28     vol_anchors = (xmax - xmin) * (ymax - ymin) ###预测框四个角的坐标及面积
 29 
 30     # Initialize tensors...
 31     shape = (yref.shape[0], yref.shape[1], href.size) ###S*S*(4-6）
 32     feat_labels = tf.zeros(shape, dtype=tf.int64) ##每个预测框的标签
 33     feat_scores = tf.zeros(shape, dtype=dtype)##每个预测框的得分
 34     ###每个预测框四个点的坐标
 35     feat_ymin = tf.zeros(shape, dtype=dtype)
 36     feat_xmin = tf.zeros(shape, dtype=dtype)
 37     feat_ymax = tf.ones(shape, dtype=dtype)
 38     feat_xmax = tf.ones(shape, dtype=dtype)
 39     ####计算预测框与真实框的IOU ，box为真实框的坐标
 40     def jaccard_with_anchors(bbox):
 41         """Compute jaccard score between a box and the anchors.
 42         """
 43         int_ymin = tf.maximum(ymin, bbox[0])
 44         int_xmin = tf.maximum(xmin, bbox[1])
 45         int_ymax = tf.minimum(ymax, bbox[2])
 46         int_xmax = tf.minimum(xmax, bbox[3])
 47         h = tf.maximum(int_ymax - int_ymin, 0.)
 48         w = tf.maximum(int_xmax - int_xmin, 0.)
 49         # Volumes.
 50         inter_vol = h * w
 51         union_vol = vol_anchors - inter_vol 
 52             + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
 53         jaccard = tf.div(inter_vol, union_vol)
 54         return jaccard
 55     ####score得分即为重叠部分/预测框面积
 56     def intersection_with_anchors(bbox):
 57         """Compute intersection between score a box and the anchors.
 58         """
 59         int_ymin = tf.maximum(ymin, bbox[0])
 60         int_xmin = tf.maximum(xmin, bbox[1])
 61         int_ymax = tf.minimum(ymax, bbox[2])
 62         int_xmax = tf.minimum(xmax, bbox[3])
 63         h = tf.maximum(int_ymax - int_ymin, 0.)
 64         w = tf.maximum(int_xmax - int_xmin, 0.)
 65         inter_vol = h * w
 66         scores = tf.div(inter_vol, vol_anchors)
 67         return scores
 68 
 69     def condition(i, feat_labels, feat_scores,
 70                   feat_ymin, feat_xmin, feat_ymax, feat_xmax):
 71         """Condition: check label index.
 72         """
 73         r = tf.less(i, tf.shape(labels))
 74         return r[0]
 75 
 76     def body(i, feat_labels, feat_scores,
 77              feat_ymin, feat_xmin, feat_ymax, feat_xmax):
 78         """Body: update feature labels, scores and bboxes.
 79         Follow the original SSD paper for that purpose:
 80           - assign values when jaccard > 0.5;
 81           - only update if beat the score of other bboxes.
 82         """
 83         # Jaccard score.
 84         label = labels[i]
 85         bbox = bboxes[i]
 86         jaccard = jaccard_with_anchors(bbox)
 87         # Mask: check threshold + scores + no annotations + num_classes.
 88         mask = tf.greater(jaccard, feat_scores)
 89         # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
 90         mask = tf.logical_and(mask, feat_scores > -0.5)
 91         mask = tf.logical_and(mask, label < num_classes) ####逻辑判断，那些项IOU大于阈值
 92         imask = tf.cast(mask, tf.int64)
 93         fmask = tf.cast(mask, dtype)
 94         # Update values using mask.更新那些满足要求的预测框，使他们类别，四个点的坐标位置和置信度分别为真实框的值，否则为0
 95         feat_labels = imask * label + (1 - imask) * feat_labels
 96         feat_scores = tf.where(mask, jaccard, feat_scores)
 97 
 98         feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
 99         feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
100         feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
101         feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax
102 
103         # Check no annotation label: ignore these anchors...
104         # interscts = intersection_with_anchors(bbox)
105         # mask = tf.logical_and(interscts > ignore_threshold,
106         #                       label == no_annotation_label)
107         # # Replace scores by -1.
108         # feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)
109 
110         return [i+1, feat_labels, feat_scores,
111                 feat_ymin, feat_xmin, feat_ymax, feat_xmax]
112     # Main loop definition.
113     i = 0
114     [i, feat_labels, feat_scores,
115      feat_ymin, feat_xmin,
116      feat_ymax, feat_xmax] = tf.while_loop(condition, body,
117                                            [i, feat_labels, feat_scores,
118                                             feat_ymin, feat_xmin,
119                                             feat_ymax, feat_xmax])
120     # Transform to center / size.
121     feat_cy = (feat_ymax + feat_ymin) / 2.
122     feat_cx = (feat_xmax + feat_xmin) / 2.
123     feat_h = feat_ymax - feat_ymin
124     feat_w = feat_xmax - feat_xmin
125     # Encode features.
126     feat_cy = (feat_cy - yref) / href / prior_scaling[0]
127     feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
128     feat_h = tf.log(feat_h / href) / prior_scaling[2]
129     feat_w = tf.log(feat_w / wref) / prior_scaling[3]
130     # Use SSD ordering: x / y / w / h instead of ours.  此处返回的不是坐标值，而是偏差值。此处与SSD不同
131     feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
132     return feat_labels, feat_localizations, feat_scores

接下来便是最重要的部分，即损失函数源码阅读。损失函数在论文中定义如下

分为类别置信度偏差和坐标位移偏差。上式已经有进过网络的的提取的值及经过groundTruth处理后的值，现在把两者结合，进行loss计算。主要的函数为ssd_losses。

 1 def ssd_losses(logits, localisations,
 2                gclasses, glocalisations, gscores,
 3                match_threshold=0.5,
 4                negative_ratio=3.,
 5                alpha=1.,
 6                label_smoothing=0.,
 7                device='/cpu:0',
 8                scope=None):
 9     with tf.name_scope(scope, 'ssd_losses'):
10         lshape = tfe.get_shape(logits[0], 5)
11         num_classes = lshape[-1]
12         batch_size = lshape[0]
13 
14         # Flatten out all vectors! 对预测框与groundTruth分别进行reshape，然后组合
15         flogits = []
16         fgclasses = []
17         fgscores = []
18         flocalisations = []
19         fglocalisations = []
20         for i in range(len(logits)):
21             flogits.append(tf.reshape(logits[i], [-1, num_classes]))
22             fgclasses.append(tf.reshape(gclasses[i], [-1]))
23             fgscores.append(tf.reshape(gscores[i], [-1]))
24             flocalisations.append(tf.reshape(localisations[i], [-1, 4]))
25             fglocalisations.append(tf.reshape(glocalisations[i], [-1, 4]))
26         # And concat the crap!
27         logits = tf.concat(flogits, axis=0)
28         gclasses = tf.concat(fgclasses, axis=0)
29         gscores = tf.concat(fgscores, axis=0)
30         localisations = tf.concat(flocalisations, axis=0)
31         glocalisations = tf.concat(fglocalisations, axis=0)
32         dtype = logits.dtype
33 
34         # Compute positive matching mask...
35         ###筛选IOU>0.5的预测框
36         pmask = gscores > match_threshold
37         fpmask = tf.cast(pmask, dtype)
38         n_positives = tf.reduce_sum(fpmask)
39 
40         # Hard negative mining...
41         ###对于IOU《0.5的归为负类，即背景，预测项为第0项
42         no_classes = tf.cast(pmask, tf.int32)
43         predictions = slim.softmax(logits)
44         nmask = tf.logical_and(tf.logical_not(pmask),
45                                gscores > -0.5)
46         fnmask = tf.cast(nmask, dtype)
47         nvalues = tf.where(nmask,
48                            predictions[:, 0],
49                            1. - fnmask)
50         nvalues_flat = tf.reshape(nvalues, [-1])
51         # Number of negative entries to select.
52         ###负类最大比例为正类的3倍
53         max_neg_entries = tf.cast(tf.reduce_sum(fnmask), tf.int32)
54         n_neg = tf.cast(negative_ratio * n_positives, tf.int32) + batch_size
55         n_neg = tf.minimum(n_neg, max_neg_entries)
56 
57         val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
58         max_hard_pred = -val[-1]
59         # Final negative mask.
60         nmask = tf.logical_and(nmask, nvalues < max_hard_pred)
61         fnmask = tf.cast(nmask, dtype)
62 
63         # Add cross-entropy loss.正类和负类的类别损失函数计算方式不同，主要是因为标签不一样
64         with tf.name_scope('cross_entropy_pos'):
65             loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
66                                                                   labels=gclasses)
67             loss = tf.div(tf.reduce_sum(loss * fpmask), batch_size, name='value')
68             tf.losses.add_loss(loss)
69 
70         with tf.name_scope('cross_entropy_neg'):
71             loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
72                                                                   labels=no_classes)
73             loss = tf.div(tf.reduce_sum(loss * fnmask), batch_size, name='value')
74             tf.losses.add_loss(loss)
75 
76         # Add localization loss: smooth L1, L2, ...
77         with tf.name_scope('localization'): ###预测预测损失函数
78             # Weights Tensor: positive mask + random negative.
79             weights = tf.expand_dims(alpha * fpmask, axis=-1)
80             loss = custom_layers.abs_smooth(localisations - glocalisations)
81             loss = tf.div(tf.reduce_sum(loss * weights), batch_size, name='value')
82             tf.losses.add_loss(loss)  ###最终的loss

最后一部分就是前面的图像处理及预测之后的图像处理函数了。ssd_vgg_preprocessing.py是对训练或者预测图像进行预处理。就是图像增强这类的工作。

ssd_common.py中tf_ssd_bboxes_decode_layer 函数是对预测后的坐标进行处理，在图像中标出预测框的位置。而np_methods.py中基本是对预测框进行筛选，nms等，找出最合适的预测框

 1 def tf_ssd_bboxes_decode_layer(feat_localizations,
 2                                anchors_layer,
 3                                prior_scaling=[0.1, 0.1, 0.2, 0.2]):
 4     """Compute the relative bounding boxes from the layer features and
 5     reference anchor bounding boxes.
 6 
 7     Arguments:
 8       feat_localizations: Tensor containing localization features.
 9       anchors: List of numpy array containing anchor boxes.
10 
11     Return:
12       Tensor Nx4: ymin, xmin, ymax, xmax
13     """
14     yref, xref, href, wref = anchors_layer
15 
16     # Compute center, height and width 基本就是前面处理坐标的逆向过程。anchores_layer为不同anchor的坐标，
17     # feat_locations为预测框的偏差，反过来可以倒推预测框的坐标
18     cx = feat_localizations[:, :, :, :, 0] * wref * prior_scaling[0] + xref
19     cy = feat_localizations[:, :, :, :, 1] * href * prior_scaling[1] + yref
20     w = wref * tf.exp(feat_localizations[:, :, :, :, 2] * prior_scaling[2])
21     h = href * tf.exp(feat_localizations[:, :, :, :, 3] * prior_scaling[3])
22     # Boxes coordinates.
23     ymin = cy - h / 2.
24     xmin = cx - w / 2.
25     ymax = cy + h / 2.
26     xmax = cx + w / 2.
27     bboxes = tf.stack([ymin, xmin, ymax, xmax], axis=-1)
28     return bboxes