Dropout
Implementing the forward pass
import numpy as np

def dropout_forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.
    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We keep each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.
    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.
    NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
    See http://cs231n.github.io/neural-networks-2/#reg for more details.
    NOTE 2: Keep in mind that p is the probability of **keeping** a neuron
    output; this might be contrary to some sources, where it is referred to
    as the probability of dropping a neuron output.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])
    mask = None
    out = None
    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase forward pass for inverted dropout.   #
        # Store the dropout mask in the mask variable.                        #
        #######################################################################
        # np.random.rand(*x.shape) draws a matrix with the same shape as the
        # input x (the scores after the activation), uniformly from [0, 1).
        # Comparing it with keep_prob, the probability that a neuron is kept,
        # gives a random boolean array that serves as the dropout mask.
        # Vanilla dropout: training discards part of the activations, so
        # out = mask * x lowers the expected value of the output, and the
        # activations must be multiplied by keep_prob at prediction time to
        # keep the two distributions consistent.
        # Inverted dropout (used here): divide by keep_prob during training
        # instead, so test mode needs no extra work and the data simply
        # passes through the dropout layer.
        # NOTE: the docstring defines p as the keep probability, but this
        # implementation treats p as the drop probability (keep_prob = 1 - p).
        # The test output below, where the fraction of zeroed outputs is
        # about p, reflects that choice. To follow the docstring literally,
        # use keep_prob = p.
        keep_prob = 1 - p
        mask = (np.random.rand(*x.shape) < keep_prob) / keep_prob
        out = mask * x
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test phase forward pass for inverted dropout.   #
        #######################################################################
        out = x
        #######################################################################
        #                            END OF YOUR CODE                         #
        #######################################################################
    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)
    return out, cache
Running tests with p = 0.25
Mean of input: 10.000207878477502
Mean of train-time output: 9.998198947788465
Mean of test-time output: 10.000207878477502
Fraction of train-time output set to zero: 0.250168
Fraction of test-time output set to zero: 0.0
Running tests with p = 0.4
Mean of input: 10.000207878477502
Mean of train-time output: 9.976910758765856
Mean of test-time output: 10.000207878477502
Fraction of train-time output set to zero: 0.401368
Fraction of test-time output set to zero: 0.0
Running tests with p = 0.7
Mean of input: 10.000207878477502
Mean of train-time output: 9.98254739313744
Mean of test-time output: 10.000207878477502
Fraction of train-time output set to zero: 0.700496
Fraction of test-time output set to zero: 0.0
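To make the scaling argument concrete, here is a minimal sketch in plain Python (standard library only, not the assignment's NumPy code) that compares vanilla and inverted dropout on a constant input. The helper names `vanilla_dropout`, `inverted_dropout`, and `mean`, as well as the seed and sample size, are illustrative assumptions.

```python
import random

def vanilla_dropout(xs, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob, without rescaling."""
    return [x if rng.random() < keep_prob else 0.0 for x in xs]

def inverted_dropout(xs, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob; divide survivors by keep_prob."""
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in xs]

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(0)   # fixed seed (assumption) so the run is repeatable
xs = [10.0] * 100_000    # constant input, so the input mean is exactly 10
keep_prob = 0.75         # probability of keeping a value

vanilla_mean = mean(vanilla_dropout(xs, keep_prob, rng))
inverted_mean = mean(inverted_dropout(xs, keep_prob, rng))

print("vanilla train-time mean: ", vanilla_mean)   # close to 10 * keep_prob
print("inverted train-time mean:", inverted_mean)  # close to 10
```

With a keep probability of 0.75, the inverted version should keep the train-time mean near 10, while the vanilla version lowers it to roughly 7.5. That drop is why vanilla dropout has to scale activations by keep_prob at test time.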
Implementing the backward pass
def dropout_backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.
    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']
    dx = None
    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase backward pass for inverted dropout   #
        #######################################################################
        # Reuse the mask from the forward pass: gradients of dropped units
        # are zeroed and the surviving ones get the same 1 / keep_prob scale.
        dx = mask * dout
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    elif mode == 'test':
        dx = dout
    return dx
dx relative error: 5.445612718272284e-11
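The `dx relative error` above comes from the assignment's gradient check. The same idea can be sketched in plain Python: fix a mask, then compare the analytic gradient `mask * dout` with a central-difference estimate. The function names, the hand-picked mask, and the input values below are assumptions made for illustration, not the assignment's checker.

```python
def forward_fixed_mask(x, mask):
    """Dropout forward pass with a given (already scaled) mask."""
    return [m * v for m, v in zip(mask, x)]

def backward_fixed_mask(dout, mask):
    """Analytic gradient of the forward pass above with respect to x."""
    return [m * g for m, g in zip(mask, dout)]

def numeric_grad(x, mask, dout, h=1e-5):
    """Central-difference gradient of sum(dout * forward(x)) with respect to x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        f_plus = sum(g * o for g, o in zip(dout, forward_fixed_mask(x_plus, mask)))
        f_minus = sum(g * o for g, o in zip(dout, forward_fixed_mask(x_minus, mask)))
        grad.append((f_plus - f_minus) / (2 * h))
    return grad

keep_prob = 0.5
mask = [1 / keep_prob, 0.0, 1 / keep_prob, 0.0]  # hypothetical mask: units 0 and 2 kept
x = [1.0, -2.0, 3.0, 0.5]
dout = [0.1, 0.2, -0.3, 0.4]

analytic = backward_fixed_mask(dout, mask)
numeric = numeric_grad(x, mask, dout)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print("analytic:", analytic)
print("max abs difference:", max_err)
```

Dropped positions get a zero gradient and kept positions get the upstream gradient scaled by 1 / keep_prob, so the two estimates should agree up to floating-point error.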
Fully connected network with dropout:
class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch/layer normalization as options. For a network with L layers,
    the architecture will be
    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax
    where batch/layer normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.
    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """
    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.
        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
          the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
          are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model. Defaults to None (no seed).
        """
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}
        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        # Initialize the parameters of every hidden layer
        in_dim = input_dim  # D
        for i, h_dim in enumerate(hidden_dims):  # (0, H1), (1, H2), ...
            self.params['W%d' % (i+1,)] = weight_scale * np.random.randn(in_dim, h_dim)
            self.params['b%d' % (i+1,)] = np.zeros((h_dim,))
            if self.normalization == 'batchnorm':
                self.params['gamma%d' % (i+1,)] = np.ones((h_dim,))  # scale starts at 1
                self.params['beta%d' % (i+1,)] = np.zeros((h_dim,))  # shift starts at 0
            in_dim = h_dim  # this layer's output width is the next layer's input width
        # Initialize the parameters of the output layer
        self.params['W%d' % (self.num_layers,)] = weight_scale * np.random.randn(in_dim, num_classes)
        self.params['b%d' % (self.num_layers,)] = np.zeros((num_classes,))
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        # When dropout is enabled, the same parameter dictionary
        # self.dropout_param is passed to every dropout layer, so that each
        # layer knows the dropout probability p and the current mode (train/test).
        self.dropout_param = {}  # dropout parameter dictionary
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed
        # When batch normalization is enabled, self.bn_params is a list that
        # tracks the running statistics (mean and variance) of each layer.
        # self.bn_params[0] belongs to the first BN layer of the forward pass,
        # self.bn_params[1] to the second, and so on.
        self.bn_params = []  # list of per-layer BN parameter dictionaries
        if self.normalization == 'batchnorm':
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
        if self.normalization == 'layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]
        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)
    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.
        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'
        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization == 'batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        fc_mix_cache = {}  # per-layer caches from the forward pass
        if self.use_dropout:  # dropout layers get their own cache dictionary
            dp_cache = {}
        # Starting from the first hidden layer, pass the data `out` through each
        # hidden layer in turn and save that layer's cache.
        out = X
        for i in range(self.num_layers - 1):  # loop over the hidden layers
            w, b = self.params['W%d' % (i+1,)], self.params['b%d' % (i+1,)]
            if self.normalization == 'batchnorm':
                gamma = self.params['gamma%d' % (i+1,)]
                beta = self.params['beta%d' % (i+1,)]
                out, fc_mix_cache[i] = affine_bn_relu_forward(out, w, b, gamma, beta, self.bn_params[i])
            else:
                out, fc_mix_cache[i] = affine_relu_forward(out, w, b)
            if self.use_dropout:
                out, dp_cache[i] = dropout_forward(out, self.dropout_param)
        # Final output layer
        w = self.params['W%d' % (self.num_layers,)]
        b = self.params['b%d' % (self.num_layers,)]
        out, out_cache = affine_forward(out, w, b)
        scores = out
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        # If test mode return early
        if mode == 'test':
            return scores
        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the   #
        # scale and shift parameters.                                              #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss, dout = softmax_loss(scores, y)
        loss += 0.5 * self.reg * np.sum(self.params['W%d' % (self.num_layers,)] ** 2)
        # Backpropagate through the output layer and store its gradients in grads
        dout, dw, db = affine_backward(dout, out_cache)
        grads['W%d' % (self.num_layers,)] = dw + self.reg * self.params['W%d' % (self.num_layers,)]
        grads['b%d' % (self.num_layers,)] = db
        # Backpropagate through each hidden layer, filling in grads and adding
        # each layer's regularization term to the loss along the way
        for i in range(self.num_layers - 1):
            ri = self.num_layers - 2 - i  # 0-based hidden layer index, last layer first
            loss += 0.5 * self.reg * np.sum(self.params['W%d' % (ri+1,)] ** 2)  # this layer's L2 term
            if self.use_dropout:
                dout = dropout_backward(dout, dp_cache[ri])
            if self.normalization == 'batchnorm':
                dout, dw, db, dgamma, dbeta = affine_bn_relu_backward(dout, fc_mix_cache[ri])
                grads['gamma%d' % (ri+1,)] = dgamma
                grads['beta%d' % (ri+1,)] = dbeta
            else:
                dout, dw, db = affine_relu_backward(dout, fc_mix_cache[ri])
            grads['W%d' % (ri+1,)] = dw + self.reg * self.params['W%d' % (ri+1,)]
            grads['b%d' % (ri+1,)] = db
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return loss, grads
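The indexing in the initialization loop and in the backward loop is easy to get wrong, so here is a small sketch in plain Python that only tracks shapes and indices. The helper `layer_shapes` and the hidden sizes `[100, 50]` are hypothetical; the naming scheme (`W1`, `b1`, ...) and the `ri = num_layers - 2 - i` formula follow the class above.

```python
def layer_shapes(input_dim, hidden_dims, num_classes):
    """Return the shape of every W/b parameter, using the W1, b1, ... naming."""
    shapes = {}
    in_dim = input_dim
    for i, h_dim in enumerate(hidden_dims):
        shapes['W%d' % (i + 1)] = (in_dim, h_dim)
        shapes['b%d' % (i + 1)] = (h_dim,)
        in_dim = h_dim  # output width of this layer feeds the next one
    num_layers = len(hidden_dims) + 1
    shapes['W%d' % num_layers] = (in_dim, num_classes)
    shapes['b%d' % num_layers] = (num_classes,)
    return shapes

hidden_dims = [100, 50]  # hypothetical hidden layer sizes
shapes = layer_shapes(3 * 32 * 32, hidden_dims, 10)
for name in sorted(shapes):
    print(name, shapes[name])

# Order in which the backward loop visits the hidden layers (0-based indices).
num_layers = len(hidden_dims) + 1
backward_order = [num_layers - 2 - i for i in range(num_layers - 1)]
print("backward order:", backward_order)
```

For two hidden layers the backward loop visits index 1 first and index 0 last, which mirrors the forward order and matches the cache keys stored during the forward pass.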
Running check with dropout = 1
Initial loss: 2.3004790897684924
W1 relative error: 1.48e-07
W2 relative error: 2.21e-05
W3 relative error: 3.53e-07
b1 relative error: 5.38e-09
b2 relative error: 2.09e-09
b3 relative error: 5.80e-11
Running check with dropout = 0.75
Initial loss: 2.2924325088330475
W1 relative error: 2.74e-08
W2 relative error: 2.98e-09
W3 relative error: 4.29e-09
b1 relative error: 7.78e-10
b2 relative error: 3.36e-10
b3 relative error: 1.65e-10
Running check with dropout = 0.5
Initial loss: 2.3042759220785896
W1 relative error: 3.11e-07
W2 relative error: 1.84e-08
W3 relative error: 5.35e-08
b1 relative error: 5.37e-09
b2 relative error: 2.99e-09
b3 relative error: 1.13e-10
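The factor of 0.5 in the L2 term is what makes the gradient of the penalty simply `reg * W`, matching the `dw + self.reg * W` lines in the backward pass. A minimal sketch in plain Python, with a hypothetical 2x2 weight matrix and helper names chosen for illustration, checks this numerically:

```python
def l2_penalty(W, reg):
    """0.5 * reg * sum of squared weights."""
    return 0.5 * reg * sum(w * w for row in W for w in row)

def l2_penalty_grad(W, reg):
    """Analytic gradient of l2_penalty with respect to W."""
    return [[reg * w for w in row] for row in W]

def numeric_grad(W, reg, h=1e-5):
    """Central-difference estimate of the same gradient."""
    grad = [[0.0] * len(row) for row in W]
    for r in range(len(W)):
        for c in range(len(W[r])):
            old = W[r][c]
            W[r][c] = old + h
            plus = l2_penalty(W, reg)
            W[r][c] = old - h
            minus = l2_penalty(W, reg)
            W[r][c] = old  # restore the original weight
            grad[r][c] = (plus - minus) / (2 * h)
    return grad

W = [[1.0, -2.0], [0.5, 3.0]]  # hypothetical weights
reg = 0.1

penalty = l2_penalty(W, reg)
analytic = l2_penalty_grad(W, reg)
numeric = numeric_grad(W, reg)
max_err = max(abs(a - n)
              for row_a, row_n in zip(analytic, numeric)
              for a, n in zip(row_a, row_n))
print("penalty:", penalty)
print("max abs difference:", max_err)
```

Without the 0.5 factor the analytic gradient would be `2 * reg * W`, and the gradient check would fail unless the backward pass were changed to match.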
Dropout can be viewed as a form of regularization
1
(Iteration 1 / 125) loss: 7.856643
(Epoch 0 / 25) train acc: 0.260000; val_acc: 0.184000
(Epoch 1 / 25) train acc: 0.404000; val_acc: 0.259000
(Epoch 2 / 25) train acc: 0.468000; val_acc: 0.248000
(Epoch 3 / 25) train acc: 0.526000; val_acc: 0.247000
(Epoch 4 / 25) train acc: 0.646000; val_acc: 0.273000
(Epoch 5 / 25) train acc: 0.686000; val_acc: 0.257000
(Epoch 6 / 25) train acc: 0.690000; val_acc: 0.260000
(Epoch 7 / 25) train acc: 0.758000; val_acc: 0.255000
(Epoch 8 / 25) train acc: 0.832000; val_acc: 0.264000
(Epoch 9 / 25) train acc: 0.856000; val_acc: 0.268000
(Epoch 10 / 25) train acc: 0.914000; val_acc: 0.289000
(Epoch 11 / 25) train acc: 0.922000; val_acc: 0.293000
(Epoch 12 / 25) train acc: 0.948000; val_acc: 0.307000
(Epoch 13 / 25) train acc: 0.960000; val_acc: 0.313000
(Epoch 14 / 25) train acc: 0.972000; val_acc: 0.311000
(Epoch 15 / 25) train acc: 0.964000; val_acc: 0.309000
(Epoch 16 / 25) train acc: 0.966000; val_acc: 0.295000
(Epoch 17 / 25) train acc: 0.984000; val_acc: 0.306000
(Epoch 18 / 25) train acc: 0.988000; val_acc: 0.332000
(Epoch 19 / 25) train acc: 0.996000; val_acc: 0.318000
(Epoch 20 / 25) train acc: 0.992000; val_acc: 0.313000
(Iteration 101 / 125) loss: 0.000961
(Epoch 21 / 25) train acc: 0.996000; val_acc: 0.311000
(Epoch 22 / 25) train acc: 0.994000; val_acc: 0.304000
(Epoch 23 / 25) train acc: 0.998000; val_acc: 0.308000
(Epoch 24 / 25) train acc: 1.000000; val_acc: 0.316000
(Epoch 25 / 25) train acc: 0.998000; val_acc: 0.320000
0.25
(Iteration 1 / 125) loss: 11.299055
(Epoch 0 / 25) train acc: 0.234000; val_acc: 0.187000
(Epoch 1 / 25) train acc: 0.382000; val_acc: 0.228000
(Epoch 2 / 25) train acc: 0.490000; val_acc: 0.247000
(Epoch 3 / 25) train acc: 0.534000; val_acc: 0.228000
(Epoch 4 / 25) train acc: 0.648000; val_acc: 0.298000
(Epoch 5 / 25) train acc: 0.676000; val_acc: 0.316000
(Epoch 6 / 25) train acc: 0.752000; val_acc: 0.285000
(Epoch 7 / 25) train acc: 0.774000; val_acc: 0.252000
(Epoch 8 / 25) train acc: 0.818000; val_acc: 0.288000
(Epoch 9 / 25) train acc: 0.844000; val_acc: 0.326000
(Epoch 10 / 25) train acc: 0.864000; val_acc: 0.311000
(Epoch 11 / 25) train acc: 0.920000; val_acc: 0.293000
(Epoch 12 / 25) train acc: 0.922000; val_acc: 0.282000
(Epoch 13 / 25) train acc: 0.960000; val_acc: 0.303000
(Epoch 14 / 25) train acc: 0.966000; val_acc: 0.290000
(Epoch 15 / 25) train acc: 0.948000; val_acc: 0.277000
(Epoch 16 / 25) train acc: 0.970000; val_acc: 0.324000
(Epoch 17 / 25) train acc: 0.950000; val_acc: 0.295000
(Epoch 18 / 25) train acc: 0.970000; val_acc: 0.316000
(Epoch 19 / 25) train acc: 0.972000; val_acc: 0.296000
(Epoch 20 / 25) train acc: 0.990000; val_acc: 0.293000
(Iteration 101 / 125) loss: 0.556808
(Epoch 21 / 25) train acc: 0.990000; val_acc: 0.303000
(Epoch 22 / 25) train acc: 0.990000; val_acc: 0.306000
(Epoch 23 / 25) train acc: 0.992000; val_acc: 0.301000
(Epoch 24 / 25) train acc: 0.994000; val_acc: 0.303000
(Epoch 25 / 25) train acc: 0.998000; val_acc: 0.289000
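Can you really tell anything from this plot? The training accuracy of the two runs is almost identical, and the validation accuracy is about the same too.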
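One way to put numbers on this is to compare the gap between training and validation accuracy directly. The sketch below uses only values copied from the two training logs above (the runs printed as `1` and `0.25`); the dictionary layout and variable names are assumptions for illustration.

```python
# Final-epoch (epoch 25) accuracies copied from the two training logs.
final_acc = {
    "1": {"train": 0.998, "val": 0.320},
    "0.25": {"train": 0.998, "val": 0.289},
}
# Best validation accuracy that appears anywhere in each log.
best_val = {"1": 0.332, "0.25": 0.326}

gaps = {}
for setting, acc in final_acc.items():
    gaps[setting] = acc["train"] - acc["val"]
    print("dropout=%s: train-val gap %.3f, best val %.3f"
          % (setting, gaps[setting], best_val[setting]))
```

Both runs end with a train-validation gap of roughly 0.7, so in this small experiment the logs do not show a clear regularization benefit for the `0.25` setting.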