CS231n assignment2 Q3 Dropout

Dropout

See Geoffrey E. Hinton et al., "Improving neural networks by preventing co-adaptation of feature detectors", arXiv 2012.

Implement the forward pass

def dropout_forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.

    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We drop each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.

    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.

    NOTE: Please implement **inverted** dropout, not the vanilla version of dropout.
    See http://cs231n.github.io/neural-networks-2/#reg for more details.

    NOTE 2: Keep in mind that p is the probability of **dropping** a neuron
    output; this might be contrary to some sources, where it is referred to
    as the probability of keeping a neuron output.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None

    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase forward pass for inverted dropout.   #
        # Store the dropout mask in the mask variable.                        #
        #######################################################################
        keep_prob = 1 - p
        mask = (np.random.rand(*x.shape) < keep_prob) / keep_prob
        # np.random.rand(*x.shape) draws a uniform [0, 1) sample with the same
        # shape as the input x (the post-activation scores). Comparing it with
        # keep_prob gives a boolean mask that randomly switches off neurons.
        # In vanilla dropout, training computes out = mask * x, which lowers the
        # expected value of the activations by a factor of keep_prob, so the
        # test-time activations would have to be scaled by keep_prob to keep the
        # distributions consistent. With inverted dropout we instead divide by
        # keep_prob already at training time (as above), so at test time the
        # dropout layer simply passes the data through unchanged.
        out = mask * x
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test phase forward pass for inverted dropout.   #
        #######################################################################
        out = x
        #######################################################################
        #                            END OF YOUR CODE                         #
        #######################################################################

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache
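The statistics below can be reproduced with a quick sanity check along the following lines (a minimal sketch; the exact cell in the assignment notebook may differ). With inverted dropout, the train-time and test-time means should both stay close to the input mean, while the fraction of zeroed activations should be roughly p.

import numpy as np

np.random.seed(231)
x = np.random.randn(500, 500) + 10  # inputs centered around 10

for p in [0.25, 0.4, 0.7]:
    out_train, _ = dropout_forward(x, {'mode': 'train', 'p': p})
    out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})

    print('Running tests with p =', p)
    print('Mean of input:', x.mean())
    print('Mean of train-time output:', out_train.mean())
    print('Mean of test-time output:', out_test.mean())
    print('Fraction of train-time output set to zero:', (out_train == 0).mean())
    print('Fraction of test-time output set to zero:', (out_test == 0).mean())
    print()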

Running tests with p = 0.25
Mean of input: 10.000207878477502
Mean of train-time output: 9.998198947788465
Mean of test-time output: 10.000207878477502
Fraction of train-time output set to zero: 0.250168
Fraction of test-time output set to zero: 0.0

Running tests with p = 0.4
Mean of input: 10.000207878477502
Mean of train-time output: 9.976910758765856
Mean of test-time output: 10.000207878477502
Fraction of train-time output set to zero: 0.401368
Fraction of test-time output set to zero: 0.0

Running tests with p = 0.7
Mean of input: 10.000207878477502
Mean of train-time output: 9.98254739313744
Mean of test-time output: 10.000207878477502
Fraction of train-time output set to zero: 0.700496
Fraction of test-time output set to zero: 0.0

Implement the backward pass

def dropout_backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.

    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase backward pass for inverted dropout   #
        #######################################################################
        dx = mask * dout
        # During backpropagation we reuse the same mask: gradients of dropped
        # units are zeroed, and kept units are scaled by 1/keep_prob (the
        # scaling is already baked into the mask).
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    elif mode == 'test':
        dx = dout
    return dx
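The relative error reported below comes from comparing the analytic gradient with a numerically computed one. A sketch of such a check, assuming the assignment's eval_numerical_gradient_array and rel_error helpers are available:

np.random.seed(231)
x = np.random.randn(10, 10) + 10
dout = np.random.randn(*x.shape)

# A fixed seed makes the dropout mask reproducible across the analytic and
# numerical passes, which is what allows the gradient check to work.
dropout_param = {'mode': 'train', 'p': 0.2, 'seed': 123}
out, cache = dropout_forward(x, dropout_param)
dx = dropout_backward(dout, cache)
dx_num = eval_numerical_gradient_array(
    lambda xx: dropout_forward(xx, dropout_param)[0], x, dout)

print('dx relative error:', rel_error(dx, dx_num))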

dx relative error: 5.445612718272284e-11

Fully connected network with dropout:

class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch/layer normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
          the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
          are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
      will make the dropout layers deterministic so we can gradient check the
      model. By default no seed is used; if one is given, it is passed on to
      the dropout layers.
        """
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        # Initialize the parameters of every hidden layer
        in_dim = input_dim #D
        for i,h_dim in enumerate(hidden_dims): #(0,H1)(1,H2)
            self.params['W%d' %(i+1,)] = weight_scale * np.random.randn(in_dim,h_dim)
            self.params['b%d' %(i+1,)] = np.zeros((h_dim,))
            if self.normalization=='batchnorm':
                self.params['gamma%d' %(i+1,)] = np.ones((h_dim,))  # initialized to ones
                self.params['beta%d' %(i+1,)] = np.zeros((h_dim,))  # initialized to zeros
            in_dim = h_dim  # this layer's output size becomes the next layer's input size
            
        # Initialize the parameters of the output layer
        self.params['W%d' %(self.num_layers,)] = weight_scale * np.random.randn(in_dim,num_classes)
        self.params['b%d' %(self.num_layers,)] = np.zeros((num_classes,))
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When dropout is enabled, we pass the same dropout parameter dictionary
        # self.dropout_param to every dropout layer, so that each layer knows the
        # drop probability p and the current mode of the network (train/test).
        self.dropout_param = {}  # dropout parameter dictionary
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # When batch normalization is enabled, we keep a list of BN parameter
        # dictionaries, self.bn_params, to track each layer's running mean and
        # standard deviation. self.bn_params[0] holds the parameters of the first
        # BN layer in the forward pass, self.bn_params[1] those of the second BN
        # layer, and so on.
        self.bn_params = []  # BN parameter dictionaries
        if self.normalization=='batchnorm':
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
        if self.normalization=='layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)


    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization=='batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        fc_mix_cache = {}  # caches for each hidden layer's forward pass
        if self.use_dropout:  # if dropout is enabled, keep a separate cache dict for it
            dp_cache = {}
        # Loop over the hidden layers, passing the activations forward and saving
        # each layer's cache
        out = X
        for i in range(self.num_layers - 1):  # loop over the hidden layers
            w,b = self.params['W%d' %(i+1,)],self.params['b%d' %(i+1,)]
            if self.normalization == 'batchnorm':
                gamma = self.params['gamma%d' %(i+1,)]
                beta = self.params['beta%d' %(i+1,)]
                out,fc_mix_cache[i] = affine_bn_relu_forward(out,w,b,gamma,beta,self.bn_params[i])
            else:
                out,fc_mix_cache[i] = affine_relu_forward(out,w,b)
            if self.use_dropout:
                out,dp_cache[i] = dropout_forward(out,self.dropout_param)
        # Final output (affine) layer
        w = self.params['W%d' %(self.num_layers,)]
        b = self.params['b%d' %(self.num_layers,)]
        out,out_cache = affine_forward(out,w,b)
        scores = out
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the scale   #
        # and shift parameters.                                                    #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss,dout = softmax_loss(scores,y)
        loss += 0.5 * self.reg * np.sum(self.params['W%d' %(self.num_layers,)] ** 2)
        # Backpropagate through the output layer and store its gradients in grads:
        dout,dw,db = affine_backward(dout,out_cache)
        grads['W%d' %(self.num_layers,)] = dw + self.reg * self.params['W%d' %(self.num_layers,)]
        grads['b%d' %(self.num_layers,)] = db
        # Backpropagate through each hidden layer, filling in grads and adding
        # each layer's regularization term to the loss
        for i in range(self.num_layers - 1):
            ri = self.num_layers - 2 - i  # index of the hidden layer, walking backwards from the last one
            loss += 0.5 * self.reg * np.sum(self.params['W%d' %(ri+1,)] ** 2)  # add this layer's L2 term to the loss
            if self.use_dropout:
                dout = dropout_backward(dout,dp_cache[ri])
            if self.normalization == 'batchnorm':
                dout,dw,db,dgamma,dbeta = affine_bn_relu_backward(dout,fc_mix_cache[ri])
                grads['gamma%d' %(ri+1,)] = dgamma
                grads['beta%d' %(ri+1,)] = dbeta
            else:
                dout,dw,db = affine_relu_backward(dout,fc_mix_cache[ri])
            grads['W%d' %(ri+1,)] = dw + self.reg * self.params['W%d' %(ri+1,)]
            grads['b%d' %(ri+1,)] = db
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
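The loss function above calls affine_bn_relu_forward / affine_bn_relu_backward, which are not part of the provided layer_utils.py. A minimal sketch of how they might be composed from the assignment's affine, batchnorm and ReLU layers (assuming affine_forward, batchnorm_forward, relu_forward and their backward counterparts behave as in layers.py):

def affine_bn_relu_forward(x, w, b, gamma, beta, bn_param):
    # affine -> batch norm -> ReLU, caching every intermediate for the backward pass
    a, fc_cache = affine_forward(x, w, b)
    a_bn, bn_cache = batchnorm_forward(a, gamma, beta, bn_param)
    out, relu_cache = relu_forward(a_bn)
    cache = (fc_cache, bn_cache, relu_cache)
    return out, cache


def affine_bn_relu_backward(dout, cache):
    # Undo the ReLU, batch norm and affine steps in reverse order
    fc_cache, bn_cache, relu_cache = cache
    da_bn = relu_backward(dout, relu_cache)
    da, dgamma, dbeta = batchnorm_backward(da_bn, bn_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db, dgamma, dbeta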

Running check with dropout = 1
Initial loss: 2.3004790897684924
W1 relative error: 1.48e-07
W2 relative error: 2.21e-05
W3 relative error: 3.53e-07
b1 relative error: 5.38e-09
b2 relative error: 2.09e-09
b3 relative error: 5.80e-11

Running check with dropout = 0.75
Initial loss: 2.2924325088330475
W1 relative error: 2.74e-08
W2 relative error: 2.98e-09
W3 relative error: 4.29e-09
b1 relative error: 7.78e-10
b2 relative error: 3.36e-10
b3 relative error: 1.65e-10

Running check with dropout = 0.5
Initial loss: 2.3042759220785896
W1 relative error: 3.11e-07
W2 relative error: 1.84e-08
W3 relative error: 5.35e-08
b1 relative error: 5.37e-09
b2 relative error: 2.99e-09
b3 relative error: 1.13e-10
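The checks above can be reproduced roughly as follows (a sketch, assuming the assignment's eval_numerical_gradient and rel_error helpers; the exact notebook cell may differ):

np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

for dropout in [1, 0.75, 0.5]:
    print('Running check with dropout =', dropout)
    model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                              weight_scale=5e-2, dtype=np.float64,
                              dropout=dropout, seed=123)

    loss, grads = model.loss(X, y)
    print('Initial loss:', loss)

    # Compare each analytic gradient against a numerical estimate
    for name in sorted(grads):
        f = lambda _: model.loss(X, y)[0]
        grad_num = eval_numerical_gradient(f, model.params[name],
                                           verbose=False, h=1e-5)
        print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))
    print()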

Dropout can be viewed as a form of regularization.
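To see this, two small networks are trained on a 500-example subset of CIFAR-10, one without dropout and one with dropout = 0.25. A sketch of the comparison, assuming small_data holds that subset and the assignment's Solver class is available (the hyperparameters are assumptions based on the outputs below):

np.random.seed(231)
solvers = {}
for dropout in [1, 0.25]:  # dropout = 1 disables dropout in this implementation
    model = FullyConnectedNet([500], dropout=dropout)
    solver = Solver(model, small_data,
                    num_epochs=25, batch_size=100,
                    update_rule='adam',
                    optim_config={'learning_rate': 5e-4},
                    verbose=True, print_every=100)
    solver.train()
    solvers[dropout] = solver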

dropout = 1 (no dropout):
(Iteration 1 / 125) loss: 7.856643
(Epoch 0 / 25) train acc: 0.260000; val_acc: 0.184000
(Epoch 1 / 25) train acc: 0.404000; val_acc: 0.259000
(Epoch 2 / 25) train acc: 0.468000; val_acc: 0.248000
(Epoch 3 / 25) train acc: 0.526000; val_acc: 0.247000
(Epoch 4 / 25) train acc: 0.646000; val_acc: 0.273000
(Epoch 5 / 25) train acc: 0.686000; val_acc: 0.257000
(Epoch 6 / 25) train acc: 0.690000; val_acc: 0.260000
(Epoch 7 / 25) train acc: 0.758000; val_acc: 0.255000
(Epoch 8 / 25) train acc: 0.832000; val_acc: 0.264000
(Epoch 9 / 25) train acc: 0.856000; val_acc: 0.268000
(Epoch 10 / 25) train acc: 0.914000; val_acc: 0.289000
(Epoch 11 / 25) train acc: 0.922000; val_acc: 0.293000
(Epoch 12 / 25) train acc: 0.948000; val_acc: 0.307000
(Epoch 13 / 25) train acc: 0.960000; val_acc: 0.313000
(Epoch 14 / 25) train acc: 0.972000; val_acc: 0.311000
(Epoch 15 / 25) train acc: 0.964000; val_acc: 0.309000
(Epoch 16 / 25) train acc: 0.966000; val_acc: 0.295000
(Epoch 17 / 25) train acc: 0.984000; val_acc: 0.306000
(Epoch 18 / 25) train acc: 0.988000; val_acc: 0.332000
(Epoch 19 / 25) train acc: 0.996000; val_acc: 0.318000
(Epoch 20 / 25) train acc: 0.992000; val_acc: 0.313000
(Iteration 101 / 125) loss: 0.000961
(Epoch 21 / 25) train acc: 0.996000; val_acc: 0.311000
(Epoch 22 / 25) train acc: 0.994000; val_acc: 0.304000
(Epoch 23 / 25) train acc: 0.998000; val_acc: 0.308000
(Epoch 24 / 25) train acc: 1.000000; val_acc: 0.316000
(Epoch 25 / 25) train acc: 0.998000; val_acc: 0.320000
dropout = 0.25:
(Iteration 1 / 125) loss: 11.299055
(Epoch 0 / 25) train acc: 0.234000; val_acc: 0.187000
(Epoch 1 / 25) train acc: 0.382000; val_acc: 0.228000
(Epoch 2 / 25) train acc: 0.490000; val_acc: 0.247000
(Epoch 3 / 25) train acc: 0.534000; val_acc: 0.228000
(Epoch 4 / 25) train acc: 0.648000; val_acc: 0.298000
(Epoch 5 / 25) train acc: 0.676000; val_acc: 0.316000
(Epoch 6 / 25) train acc: 0.752000; val_acc: 0.285000
(Epoch 7 / 25) train acc: 0.774000; val_acc: 0.252000
(Epoch 8 / 25) train acc: 0.818000; val_acc: 0.288000
(Epoch 9 / 25) train acc: 0.844000; val_acc: 0.326000
(Epoch 10 / 25) train acc: 0.864000; val_acc: 0.311000
(Epoch 11 / 25) train acc: 0.920000; val_acc: 0.293000
(Epoch 12 / 25) train acc: 0.922000; val_acc: 0.282000
(Epoch 13 / 25) train acc: 0.960000; val_acc: 0.303000
(Epoch 14 / 25) train acc: 0.966000; val_acc: 0.290000
(Epoch 15 / 25) train acc: 0.948000; val_acc: 0.277000
(Epoch 16 / 25) train acc: 0.970000; val_acc: 0.324000
(Epoch 17 / 25) train acc: 0.950000; val_acc: 0.295000
(Epoch 18 / 25) train acc: 0.970000; val_acc: 0.316000
(Epoch 19 / 25) train acc: 0.972000; val_acc: 0.296000
(Epoch 20 / 25) train acc: 0.990000; val_acc: 0.293000
(Iteration 101 / 125) loss: 0.556808
(Epoch 21 / 25) train acc: 0.990000; val_acc: 0.303000
(Epoch 22 / 25) train acc: 0.990000; val_acc: 0.306000
(Epoch 23 / 25) train acc: 0.992000; val_acc: 0.301000
(Epoch 24 / 25) train acc: 0.994000; val_acc: 0.303000
(Epoch 25 / 25) train acc: 0.998000; val_acc: 0.289000

It is hard to tell much from the resulting plot: the training accuracies of the two runs are nearly identical, and the validation accuracies are about the same as well.

Original post: https://www.cnblogs.com/bernieloveslife/p/10190741.html