YOLOV3算法详解

YOLOV3

YOLO3主要的改进有：调整了网络结构；利用多尺度特征进行对象检测；对象分类用Logistic取代了softmax。

新的网络结构Darknet -53

darknet-53借用了resnet的思想，在网络中加入了残差模块，这样有利于解决深层次网络的梯度问题，每个残差模块由两个卷积层和一个shortcut connections,

1,2,8,8,4代表有几个重复的残差模块，整个v3结构里面，没有池化层和全连接层，网络的下采样是通过设置卷积的stride为2来达到的，每当通过这个卷积层之后

图像的尺寸就会减小到一半。而每个卷积层的实现又是包含卷积+BN+Leaky relu ,每个残差模块之后又要加上一个zero padding,具体实现可以参考下面的一张图。

论文中所给的网络结构如下，由卷积模块和残差模块组成；

模型可视化

具体的全部模型结构可以从这个网站的工具进行可视化分析：

https://lutzroeder.github.io/netron/

从Yolo的官网上下载yolov3的权重文件，然后通过官网上的指导转化为H5文件，然后可以再这个浏览器工具里直接看yolov3的每一层是如何分布的；类似下边截图是一部分网络（最后的拼接部分）；

整体来看Yolov3的输入与输出形式如下：

输入416*416*3的图像，通过darknet网络得到三种不同尺度的预测结果，每个尺度都对应N个通道，包含着预测的信息；

每个网格每个尺寸的anchors的预测结果。

对比下yolov1,有7*7*2个预测；

对比下yolov2，有13*13*5个预测；

YOLOv3共有13*13*3 + 26*26*3 + 52*52*3个预测。

每个预测对应85维，分别是4（坐标值）、1（置信度分数）、80（coco类别数）

多尺度检测：

对于多尺度检测来说，采用多个尺度进行预测，具体形式是在网络预测的最后某些层进行上采样拼接的操作来达到；对于分辨率对预测的影响如下解释：

分辨率信息直接反映的就是构成object的像素的数量。一个object，像素数量越多，它对object的细节表现就越丰富越具体，也就是说分辨率信息越丰富。这也就是为什么大尺度feature map提供的是分辨率信息了。语义信息在目标检测中指的是让object区分于背景的信息，即语义信息是让你知道这个是object，其余是背景。在不同类别中语义信息并不需要很多细节信息，分辨率信息大，反而会降低语义信息，因此小尺度feature map在提供必要的分辨率信息下语义信息会提供的更好。(而对于小目标，小尺度feature map无法提供必要的分辨率信息，所以还需结合大尺度的feature map)

YOLO3更进一步采用了3个不同尺度的特征图来进行对象检测。能够检测的到更加细粒度的特征。

对于这三种检测的结果并不是同样的东西，这里的粗略理解是不同给的尺度检测不同大小的物体。

网络的最终输出有3个尺度分别为1/32，1/16，1/8；

在第79层之后经过几个卷积操作得到的是1/32 （13*13）的预测结果，下采样倍数高，这里特征图的感受野比较大，因此适合检测图像中尺寸比较大的对象。

然后这个结果通过上采样与第61层的结果进行concat,再经过几个卷积操作得到1/16的预测结果；它具有中等尺度的感受野，适合检测中等尺度的对象。

91层的结果经过上采样之后在于第36层的结果进行concat，经过几个卷积操作之后得到的是1/8的结果，它的感受野最小，适合检测小尺寸的对象。

concat：张量拼接。将darknet中间层和后面的某一层的上采样进行拼接。拼接的操作和残差层add的操作是不一样的，拼接会扩充张量的维度，而add只是直接相加不会导致张量维度的改变。

使用Kmeans聚类的方法来决定anchors的尺寸大小：

YOLO2已经开始采用K-means聚类得到先验框的尺寸，YOLO3延续了这种方法，为每种下采样尺度设定3种先验框，总共聚类出9种尺寸的先验框。

在COCO数据集这9个先验框是：(10x13)，(16x30)，(33x23)，(30x61)，(62x45)，(59x119)，(116x90)，(156x198)，(373x326)。

分配上，在最小的13*13特征图上（有最大的感受野）应用较大的先验框(116x90)，(156x198)，(373x326)，适合检测较大的对象。中等的26*26

特征图上（中等感受野）应用中等的先验框(30x61)，(62x45)，(59x119)，适合检测中等大小的对象。较大的52*52特征图上（较小的感受野）应用

较小的先验框(10x13)，(16x30)，(33x23)，适合检测较小的对象。

yolo v3对bbox进行预测的时候，采用了logistic regression。yolo v3每次对b-box进行predict时，输出和v2一样都是(tx,ty,tw,th,to) ,然后通过公式1计算出绝对的(x, y, w, h, c)。

logistic回归用于对anchor包围的部分进行一个目标性评分(objectness score)，（用于NMS），即这块位置是目标的可能性有多大。

公式一：

yolo_v3只会对1个prior进行操作，也就是那个最佳prior。而logistic回归就是用来从9个anchor priors中找到objectness score(目标存在可能性得分)最高的那一个。

对象分类softmax改成logistic

预测对象类别时不使用softmax，改成使用logistic的输出进行预测。这样能够支持多标签对象（比如一个人有Woman 和 Person两个标签）。

YOLO的LOSS函数：

 1         # K.binary_crossentropy is helpful to avoid exp overflow.
 2         xy_loss = object_mask * box_loss_scale * K.binary_crossentropy(raw_true_xy, raw_pred[...,0:2], from_logits=True)
 3         wh_loss = object_mask * box_loss_scale * 0.5 * K.square(raw_true_wh-raw_pred[...,2:4])
 4         confidence_loss = object_mask * K.binary_crossentropy(object_mask, raw_pred[...,4:5], from_logits=True)+ 
 5             (1-object_mask) * K.binary_crossentropy(object_mask, raw_pred[...,4:5], from_logits=True) * ignore_mask
 6         class_loss = object_mask * K.binary_crossentropy(true_class_probs, raw_pred[...,5:], from_logits=True)
 7 
 8         xy_loss = K.sum(xy_loss) / mf
 9         wh_loss = K.sum(wh_loss) / mf
10         confidence_loss = K.sum(confidence_loss) / mf
11         class_loss = K.sum(class_loss) / mf
12         loss += xy_loss + wh_loss + confidence_loss + class_loss
13         if print_loss:
14             loss = tf.Print(loss, [loss, xy_loss, wh_loss, confidence_loss, class_loss, K.sum(ignore_mask)], message='loss: ')
15     return loss

yolov3再keras中定义的Loss函数如上，基本可以看出，对于回归预测的部分是采用多个mse均方差相加来进行的（类似于上边所提到的v1中的Loss函数），对于分类部分和置信度是采用K.binary_crossentropy来进行的，最后把两种Loss相加得出最终的loss

部分源码分析：

 1 def yolo_eval(yolo_outputs,
 2               anchors,
 3               num_classes,
 4               image_shape,
 5               max_boxes=20,
 6               score_threshold=.6,
 7               iou_threshold=.5):
 8     """Evaluate YOLO model on given input and return filtered boxes."""
'''此函数的作用是对于Yolo模型的输出进行筛选，通过yolo_boxes_and_scores来给yolo_outputs打分，
然后通过（box_scores >= score_threshold）来初步把分数较低的预测去掉，再通过NMS操作对每个对象的anchors进行筛选，最终为每个对象得出一个分数最大且不重复的预测结果。'''


 9     num_layers = len(yolo_outputs)
10     anchor_mask = [[6,7,8], [3,4,5], [0,1,2]] if num_layers==3 else [[3,4,5], [1,2,3]] # default setting
11     input_shape = K.shape(yolo_outputs[0])[1:3] * 32
12     boxes = []
13     box_scores = []
14     for l in range(num_layers):
15         _boxes, _box_scores = yolo_boxes_and_scores(yolo_outputs[l],
16             anchors[anchor_mask[l]], num_classes, input_shape, image_shape)
17         boxes.append(_boxes)
18         box_scores.append(_box_scores)
19     boxes = K.concatenate(boxes, axis=0)
20     box_scores = K.concatenate(box_scores, axis=0)
21 
22     mask = box_scores >= score_threshold
'''设定分数的阈值'''
23     max_boxes_tensor = K.constant(max_boxes, dtype='int32')
24     boxes_ = []
25     scores_ = []
26     classes_ = []
27     for c in range(num_classes):
28         # TODO: use keras backend instead of tf.
29         class_boxes = tf.boolean_mask(boxes, mask[:, c])
30         class_box_scores = tf.boolean_mask(box_scores[:, c], mask[:, c])
31         nms_index = tf.image.non_max_suppression(
32             class_boxes, class_box_scores, max_boxes_tensor, iou_threshold=iou_threshold)
'''进行NMS非极大值抑制操作，设置iou_threshold,筛选得出最佳结果；    这里使用的是tensorflow自带的nms方法，pytorch中一般都是自定义的nms'''
33         class_boxes = K.gather(class_boxes, nms_index)
34         class_box_scores = K.gather(class_box_scores, nms_index)
35         classes = K.ones_like(class_box_scores, 'int32') * c
36         boxes_.append(class_boxes)
37         scores_.append(class_box_scores)
38         classes_.append(classes)
39     boxes_ = K.concatenate(boxes_, axis=0)
40     scores_ = K.concatenate(scores_, axis=0)
41     classes_ = K.concatenate(classes_, axis=0)
42 
43     return boxes_, scores_, classes_

下面这个函数是加载模型调用上面的yolo_eval函数产生boxes,scores,classes：

 1     def generate(self):
 2         model_path = os.path.expanduser(self.model_path)
 3         assert model_path.endswith('.h5'), 'Keras model or weights must be a .h5 file.'
 4 
 5         # Load model, or construct model and load weights.
 6         num_anchors = len(self.anchors)
 7         num_classes = len(self.class_names)
 8         is_tiny_version = num_anchors==6 # default setting
 9         try:
10             self.yolo_model = load_model(model_path, compile=False)
11         except:
12             self.yolo_model = tiny_yolo_body(Input(shape=(None,None,3)), num_anchors//2, num_classes) 
13                 if is_tiny_version else yolo_body(Input(shape=(None,None,3)), num_anchors//3, num_classes)
14             self.yolo_model.load_weights(self.model_path) # make sure model, anchors and classes match
15         else:
16             assert self.yolo_model.layers[-1].output_shape[-1] == 
17                 num_anchors/len(self.yolo_model.output) * (num_classes + 5), 
18                 'Mismatch between model and given anchor and class sizes'
19 
20         print('{} model, anchors, and classes loaded.'.format(model_path))
21 
22         # Generate colors for drawing bounding boxes.
23         hsv_tuples = [(x / len(self.class_names), 1., 1.)
24                       for x in range(len(self.class_names))]
25         self.colors = list(map(lambda x: colorsys.hsv_to_rgb(*x), hsv_tuples))
26         self.colors = list(
27             map(lambda x: (int(x[0] * 255), int(x[1] * 255), int(x[2] * 255)),
28                 self.colors))
29         np.random.seed(10101)  # Fixed seed for consistent colors across runs.
30         np.random.shuffle(self.colors)  # Shuffle colors to decorrelate adjacent classes.
31         np.random.seed(None)  # Reset seed to default.
32 
33         # Generate output tensor targets for filtered bounding boxes.
34         self.input_image_shape = K.placeholder(shape=(2, ))
35         if self.gpu_num>=2:
36             self.yolo_model = multi_gpu_model(self.yolo_model, gpus=self.gpu_num)
37         boxes, scores, classes = yolo_eval(self.yolo_model.output, self.anchors,
38                 len(self.class_names), self.input_image_shape,
39                 score_threshold=self.score, iou_threshold=self.iou)
40         return boxes, scores, classes

our system only assigns one bounding box prior for each ground truth object.论文原话，v3预测bbox只预测一个最准确的；在训练过程中yolov3是预测9个anchors的，对于loss的计算是找到iou最大的哪个，但是预测的时候只选一个最精确的。

 1 def preprocess_true_boxes(true_boxes, input_shape, anchors, num_classes):
 2     '''Preprocess true boxes to training input format
 3 
 4     Parameters
 5     ----------
 6     true_boxes: array, shape=(m, T, 5)
 7         Absolute x_min, y_min, x_max, y_max, class_id relative to input_shape.
 8     input_shape: array-like, hw, multiples of 32
 9     anchors: array, shape=(N, 2), wh
10     num_classes: integer
11 
12     Returns
13     -------
14     y_true: list of array, shape like yolo_outputs, xywh are reletive value
15 
16     '''
17     assert (true_boxes[..., 4]<num_classes).all(), 'class id must be less than num_classes'
18     num_layers = len(anchors)//3 # default setting
19     anchor_mask = [[6,7,8], [3,4,5], [0,1,2]] if num_layers==3 else [[3,4,5], [1,2,3]]
20 
21     true_boxes = np.array(true_boxes, dtype='float32')
22     input_shape = np.array(input_shape, dtype='int32')
23     boxes_xy = (true_boxes[..., 0:2] + true_boxes[..., 2:4]) // 2
24     boxes_wh = true_boxes[..., 2:4] - true_boxes[..., 0:2]
25     true_boxes[..., 0:2] = boxes_xy/input_shape[::-1]
26     true_boxes[..., 2:4] = boxes_wh/input_shape[::-1]
27 
28     m = true_boxes.shape[0]
29     grid_shapes = [input_shape//{0:32, 1:16, 2:8}[l] for l in range(num_layers)]
30     y_true = [np.zeros((m,grid_shapes[l][0],grid_shapes[l][1],len(anchor_mask[l]),5+num_classes),
31         dtype='float32') for l in range(num_layers)]
32 
33     # Expand dim to apply broadcasting.
34     anchors = np.expand_dims(anchors, 0)
35     anchor_maxes = anchors / 2.
36     anchor_mins = -anchor_maxes
37     valid_mask = boxes_wh[..., 0]>0
38 
39     for b in range(m):
40         # Discard zero rows.
41         wh = boxes_wh[b, valid_mask[b]]
42         if len(wh)==0: continue
43         # Expand dim to apply broadcasting.
44         wh = np.expand_dims(wh, -2)
45         box_maxes = wh / 2.
46         box_mins = -box_maxes
47 
48         intersect_mins = np.maximum(box_mins, anchor_mins)
49         intersect_maxes = np.minimum(box_maxes, anchor_maxes)
50         intersect_wh = np.maximum(intersect_maxes - intersect_mins, 0.)
51         intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
52         box_area = wh[..., 0] * wh[..., 1]
53         anchor_area = anchors[..., 0] * anchors[..., 1]
54         iou = intersect_area / (box_area + anchor_area - intersect_area)
55 
56         # Find best anchor for each true box
57         best_anchor = np.argmax(iou, axis=-1)
58         '''获取最大Iou的那个anchor'''
59         for t, n in enumerate(best_anchor):
60             for l in range(num_layers):
61                 if n in anchor_mask[l]:
62                     i = np.floor(true_boxes[b,t,0]*grid_shapes[l][1]).astype('int32')
63                     j = np.floor(true_boxes[b,t,1]*grid_shapes[l][0]).astype('int32')
64                     k = anchor_mask[l].index(n)
65                     c = true_boxes[b,t, 4].astype('int32')
66                     y_true[l][b, j, i, k, 0:4] = true_boxes[b,t, 0:4]
67                     y_true[l][b, j, i, k, 4] = 1
68                     y_true[l][b, j, i, k, 5+c] = 1
69 
70     return y_true