CS231n assignment2 Q1 Fully-connected Neural Network

There is a saying: "You can understand many truths and still not live this life well." It fits this exercise nicely: "You can understand the principle behind every step and still not write the code well."
Still, after working through it all I got quite a lot out of it.

1. Implement the forward pass of the affine transform: f = xW + b

def affine_forward(x, w, b):
    """
    Computes the forward pass for an affine (fully-connected) layer.

    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
    examples, where each example x[i] has shape (d_1, ..., d_k). We will
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and
    then transform it to an output vector of dimension M.

    Inputs:
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
    - w: A numpy array of weights, of shape (D, M)
    - b: A numpy array of biases, of shape (M,)

    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = None
    ###########################################################################
    # TODO: Implement the affine forward pass. Store the result in out. You   #
    # will need to reshape the input into rows.                               #
    ###########################################################################
    reshaped_x = np.reshape(x,(x.shape[0],-1))
    # -1 tells NumPy to infer the remaining dimension, so this keeps N rows and flattens each example into a single row.
    out = reshaped_x.dot(w) + b
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = (x, w, b)
    return out, cache

Testing affine_forward function:
difference: 9.769849468192957e-10
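
Throughout this post, numbers like the "difference" above are relative errors. Below is a sketch of the kind of helper the notebook typically uses for that, plus a quick shape check of affine_forward on a multi-dimensional input (the shapes here are made up for illustration):

import numpy as np

def rel_error(x, y):
    """ Relative error between two arrays, as used in the CS231n notebooks. """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

# Shape check: a batch of 2 inputs, each of shape (4, 5, 6), gets flattened
# to D = 4 * 5 * 6 = 120 before the matrix multiply.
x = np.random.randn(2, 4, 5, 6)
w = np.random.randn(120, 3)
b = np.random.randn(3)
out, _ = affine_forward(x, w, b)
print(out.shape)   # (2, 3)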

Implement the affine backward pass

def affine_backward(dout, cache):
    """
    Computes the backward pass for an affine layer.

    Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases, of shape (M,)

    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (M,)
    """
    x, w, b = cache
    dx, dw, db = None, None, None
    ###########################################################################
    # TODO: Implement the affine backward pass.                               #
    ###########################################################################
    reshaped_x = np.reshape(x,(x.shape[0],-1))
    dx = np.reshape(dout.dot(w.T),x.shape) # dout is the gradient passed back from the layer above (the upstream derivative)
    dw = (reshaped_x.T).dot(dout)
    db = np.sum(dout,axis = 0)
    # For f = xW + b: df/dx = W, df/dW = x, df/db = 1; the lines above are their matrix forms.
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dw, db

Testing affine_backward function:
dx error: 5.399100368651805e-11
dw error: 9.904211865398145e-11
db error: 2.4122867568119087e-11
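
A handy way to convince yourself the analytic gradients are right is a centered-difference numeric gradient. Below is a self-contained sketch (not the notebook's own checker) applied to dw; the shapes are made up for illustration:

import numpy as np

def numeric_gradient(f, param, h=1e-5):
    """ Centered-difference numeric gradient of the scalar function f at param. """
    grad = np.zeros_like(param)
    it = np.nditer(param, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = param[idx]
        param[idx] = old + h
        fp = f(param)
        param[idx] = old - h
        fm = f(param)
        param[idx] = old
        grad[idx] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

np.random.seed(0)
x = np.random.randn(4, 3, 2)   # N = 4 examples, each flattened to D = 6
w = np.random.randn(6, 5)
b = np.random.randn(5)
dout = np.random.randn(4, 5)   # pretend upstream gradient

_, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)

# The gradient of the scalar sum(out * dout) with respect to w should match dw.
f = lambda w_: np.sum(affine_forward(x, w_, b)[0] * dout)
dw_num = numeric_gradient(f, w)
print(np.max(np.abs(dw - dw_num)))   # should be tiny, on the order of 1e-9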

Implement the ReLU forward pass

def relu_forward(x):
    """
    Computes the forward pass for a layer of rectified linear units (ReLUs).

    Input:
    - x: Inputs, of any shape

    Returns a tuple of:
    - out: Output, of the same shape as x
    - cache: x
    """
    out = None
    ###########################################################################
    # TODO: Implement the ReLU forward pass.                                  #
    ###########################################################################
    out = np.maximum(0,x)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = x
    return out, cache

Testing relu_forward function:
difference: 4.999999798022158e-08

Implement the ReLU backward pass

def relu_backward(dout, cache):
    """
    Computes the backward pass for a layer of rectified linear units (ReLUs).

    Input:
    - dout: Upstream derivatives, of any shape
    - cache: Input x, of same shape as dout

    Returns:
    - dx: Gradient with respect to x
    """
    dx, x = None, cache
    ###########################################################################
    # TODO: Implement the ReLU backward pass.                                 #
    ###########################################################################
    dx = (x>0) * dout
    # Keep the entries of dout at positions where the corresponding x is positive; everything else becomes 0.
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx

Testing relu_backward function:
dx error: 3.2756349136310288e-12
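
A tiny example of the masking that relu_backward performs: the upstream gradient is kept only where the cached input was positive.

import numpy as np

x = np.array([[-1.0, 2.0], [3.0, -4.0]])
dout = np.array([[10.0, 20.0], [30.0, 40.0]])

out, cache = relu_forward(x)
dx = relu_backward(dout, cache)
print(out)   # [[0. 2.] [3. 0.]]
print(dx)    # [[ 0. 20.] [30.  0.]]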

"Sandwich" convenience layers:

def affine_relu_forward(x, w, b):
    """
    Convenience layer that performs an affine transform followed by a ReLU

    Inputs:
    - x: Input to the affine layer
    - w, b: Weights for the affine layer

    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a, fc_cache = affine_forward(x, w, b) # affine (linear) part
    out, relu_cache = relu_forward(a)  # activation
    cache = (fc_cache, relu_cache)   # ((x, w, b), a)
    return out, cache


def affine_relu_backward(dout, cache):
    """
    Backward pass for the affine-relu convenience layer
    """
    fc_cache, relu_cache = cache # fc_cache = (x,w,b) relu_cache = a
    da = relu_backward(dout, relu_cache)  # da = (a > 0) * dout, where a = relu_cache
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db

Testing affine_relu_forward and affine_relu_backward:
dx error: 2.299579177309368e-11
dw error: 8.162011105764925e-11
db error: 7.826724021458994e-12
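
A quick usage example of the convenience layer, just to make the shapes and the cache structure concrete (all sizes here are arbitrary):

import numpy as np

np.random.seed(1)
x = np.random.randn(5, 4)      # 5 examples with 4 features each
w = np.random.randn(4, 3)
b = np.random.randn(3)
dout = np.random.randn(5, 3)

out, cache = affine_relu_forward(x, w, b)        # cache = ((x, w, b), a)
dx, dw, db = affine_relu_backward(dout, cache)
print(out.shape, dx.shape, dw.shape, db.shape)   # (5, 3) (5, 4) (4, 3) (3,)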

Loss layers:

def svm_loss(x, y):
    """
    Computes the loss and gradient for multiclass SVM classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    N = x.shape[0]
    correct_class_scores = x[np.arange(N), y] # score of the correct class for each example
    margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0) # delta = 1
    margins[np.arange(N), y] = 0 # zero out the margin of the correct class itself
    loss = np.sum(margins) / N
    num_pos = np.sum(margins > 0, axis=1)
    dx = np.zeros_like(x)
    dx[margins > 0] = 1
    # only positions with margin > 0 contribute to the gradient
    dx[np.arange(N), y] -= num_pos
    # the correct-class column is different: its gradient is minus the number of positive margins
    dx /= N
    return loss, dx


def softmax_loss(x, y):
    """
    Computes the loss and gradient for softmax classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    shifted_logits = x - np.max(x, axis=1, keepdims=True) # shift each row so its maximum is 0, for numerical stability
    Z = np.sum(np.exp(shifted_logits), axis=1, keepdims=True)
    log_probs = shifted_logits - np.log(Z)
    probs = np.exp(log_probs)
    N = x.shape[0]
    loss = -np.sum(log_probs[np.arange(N), y]) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    # Subtract 1 at the correct-class position of each row (each sample); together
    # with the 1/N factor below, this gives the gradient of the loss with respect
    # to the input score matrix.

    # In the worked example, the slice is probs[np.array([0, 1, 2]), np.array([2, 0, 1])]:
    # it pairs row indices with column indices like coordinates, so it picks out
    # probs[0, 2], probs[1, 0] and probs[2, 1]. np.arange(N) says "take every row",
    # and the integer class labels in y (0-9) line up with the column indices of
    # probs; if y held string labels such as np.array(['cat', 'dog', 'ship']), this
    # indexing would no longer work. (A short standalone demo of this indexing
    # follows the test output below.)
    dx /= N
    return loss, dx

Testing svm_loss:
loss: 8.999602749096233
dx error: 1.4021566006651672e-09

Testing softmax_loss:
loss: 2.302545844500738
dx error: 9.384673161989355e-09
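
As a standalone illustration of the probs[np.arange(N), y] indexing described in the comments above (the numbers here are a made-up 3x4 example):

import numpy as np

probs = np.array([[0.1, 0.2, 0.6, 0.1],
                  [0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1]])
y = np.array([2, 0, 1])            # correct class index for each row

# Row i, column y[i]: the probability assigned to the correct class.
print(probs[np.arange(3), y])      # [0.6 0.7 0.5]

# The same indexing also works for in-place writes, as in dx[np.arange(N), y] -= 1.
dx = probs.copy()
dx[np.arange(3), y] -= 1
print(dx[0])                       # [ 0.1  0.2 -0.4  0.1]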

A two-layer network:

class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network with ReLU nonlinearity and
    softmax loss that uses a modular layer design. We assume an input dimension
    of D, a hidden dimension of H, and perform classification over C classes.

    The architecture should be affine - relu - affine - softmax.

    Note that this class does not implement gradient descent; instead, it
    will interact with a separate Solver object that is responsible for running
    optimization.

    The learnable parameters of the model are stored in the dictionary
    self.params that maps parameter names to numpy arrays.
    """

    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
                 weight_scale=1e-3, reg=0.0):  # weight_scale: standard deviation used to initialize the weights
        """
        Initialize a new network.

        Inputs:
        - input_dim: An integer giving the size of the input
        - hidden_dim: An integer giving the size of the hidden layer
        - num_classes: An integer giving the number of classes to classify
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - reg: Scalar giving L2 regularization strength.
        """
        self.params = {}
        self.reg = reg

        ############################################################################
        # TODO: Initialize the weights and biases of the two-layer net. Weights    #
        # should be initialized from a Gaussian centered at 0.0 with               #
        # standard deviation equal to weight_scale, and biases should be           #
        # initialized to zero. All weights and biases should be stored in the      #
        # dictionary self.params, with first layer weights                         #
        # and biases using the keys 'W1' and 'b1' and second layer                 #
        # weights and biases using the keys 'W2' and 'b2'.                         #
        ############################################################################
        # np.random.randn draws from a zero-mean, unit-std Gaussian; scaling by weight_scale sets the desired standard deviation
        self.params['W1'] = weight_scale * np.random.randn(input_dim,hidden_dim) #(3072,100)
        self.params['b1'] = np.zeros((hidden_dim,)) #100
        self.params['W2'] = weight_scale * np.random.randn(hidden_dim,num_classes) #(100,10)
        self.params['b2'] = np.zeros((num_classes,)) #10
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################


    def loss(self, X, y=None):
        """
        Compute loss and gradient for a minibatch of data.

        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
          scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
          names to gradients of the loss with respect to those parameters.
        """
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the two-layer net, computing the    #
        # class scores for X and storing them in the scores variable.              #
        ############################################################################
        # forward pass
        h1_out,h1_cache = affine_relu_forward(X,self.params['W1'],self.params['b1'])
        scores,out_cache = affine_forward(h1_out,self.params['W2'],self.params['b2'])
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If y is None then we are in test mode so just return scores
        if y is None:
            return scores

        loss, grads = 0, {}
        ############################################################################
        # TODO: Implement the backward pass for the two-layer net. Store the loss  #
        # in the loss variable and gradients in the grads dictionary. Compute data #
        # loss using softmax, and make sure that grads[k] holds the gradients for  #
        # self.params[k]. Don't forget to add L2 regularization!                   #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        # backward pass: compute the loss and the gradients
        loss,dout = softmax_loss(scores,y)
        dout,dw2,db2 = affine_backward(dout,out_cache)
        loss += 0.5 * self.reg * (np.sum(self.params['W1'] ** 2) + np.sum(self.params['W2'] ** 2))
        _,dw1,db1 = affine_relu_backward(dout,h1_cache)
        dw1 += self.reg * self.params['W1']
        dw2 += self.reg * self.params['W2']
        grads['W1'],grads['b1'] = dw1,db1
        grads['W2'],grads['b2'] = dw2,db2
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads

Testing initialization ...
Testing test-time forward pass ...
Testing training loss (no regularization)
Running numeric gradient check with reg = 0.0
W1 relative error: 1.83e-08
W2 relative error: 3.12e-10
b1 relative error: 9.83e-09
b2 relative error: 4.33e-10
Running numeric gradient check with reg = 0.7
W1 relative error: 2.53e-07
W2 relative error: 2.85e-08
b1 relative error: 1.56e-08
b2 relative error: 7.76e-10
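
The relative errors above come from comparing the analytic gradients in grads against numeric gradients of the loss. A self-contained sketch of that kind of check, using plain centered differences instead of the notebook's helper (N, D, H, C below are small made-up sizes):

import numpy as np

np.random.seed(231)
N, D, H, C = 3, 5, 4, 7
model = TwoLayerNet(input_dim=D, hidden_dim=H, num_classes=C, weight_scale=1e-2)
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

loss, grads = model.loss(X, y)

h = 1e-5
for name in ['W1', 'b1', 'W2', 'b2']:
    p = model.params[name]
    num_grad = np.zeros_like(p)
    it = np.nditer(p, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = p[idx]
        p[idx] = old + h
        lp, _ = model.loss(X, y)
        p[idx] = old - h
        lm, _ = model.loss(X, y)
        p[idx] = old
        num_grad[idx] = (lp - lm) / (2 * h)
        it.iternext()
    err = np.max(np.abs(grads[name] - num_grad) /
                 np.maximum(1e-8, np.abs(grads[name]) + np.abs(num_grad)))
    print('%s relative error: %.2e' % (name, err))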

Use a Solver to train and validate

from __future__ import print_function, division
from future import standard_library
standard_library.install_aliases()
from builtins import range
from builtins import object
import os
import pickle as pickle

import numpy as np

from cs231n import optim


class Solver(object):
    """
    The Solver class trains our neural network model (for example the
    FullyConnectedNet class below) on the training split of the data, and
    periodically checks accuracy on the validation split to guard against
    overfitting.

    The class defines five methods, __init__() included, but only train()
    really matters: calling it starts the whole optimization procedure for
    the model.

    After training, the model parameters that performed best on the validation
    set are stored in model.params. In addition, the history of training loss
    values is kept in solver.loss_history, and solver.train_acc_history and
    solver.val_acc_history hold the training and validation accuracy at every
    epoch.
    ===============================
    Example usage of a Solver:

    data = {
        'X_train': # training data
        'y_train': # training labels
        'X_val': # validation data
        'y_val': # validation labels
        }   # training and validation data and labels stored as a dictionary
    model = FullyConnectedNet(hidden_size=100, reg=10)  # our network model
    solver = Solver(model, data,              # model / data
                  update_rule='sgd',          # optimization algorithm
                  optim_config={              # hyperparameters of that algorithm
                    'learning_rate': 1e-3,    # learning rate
                  },
                  lr_decay=0.95,              # learning-rate decay factor
                  num_epochs=10,              # number of passes over the training set
                  batch_size=100,             # number of images per minibatch
                  print_every=100)
    solver.train()
    ===============================
    # A model must provide two things: its parameters model.params and a loss
    # function model.loss(X, y).
    A Solver works on a model object that must conform to the following API:
    - model.params must be a dictionary mapping string parameter names to numpy
      arrays containing parameter values.
    - model.loss(X, y) must be a function that computes training-time loss and
      gradients, and test-time classification scores, with the following inputs
      and outputs:
    Inputs:
    - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k)
    - y: Array of labels, of shape (N,) giving labels for X where y[i] is the
      label for X[i].
    Returns:
    # Whether y is given distinguishes training mode from test mode.
    If y is None, run a test-time forward pass and return:
    - scores: Array of shape (N, C) giving classification scores for X where
      scores[i, c] gives the score of class c for X[i].
    If y is not None, run a training time forward and backward pass and return
    a tuple of:
    - loss: Scalar giving the loss
    - grads: Dictionary with the same keys as self.params mapping parameter
      names to gradients of the loss with respect to those parameters.
    """
    
    def __init__(self, model, data, **kwargs):
        """
        Construct a new Solver instance.

        Required arguments:
        - model: A model object conforming to the API described above
        - data: A dictionary of training and validation data containing:
          'X_train': Array, shape (N_train, d_1, ..., d_k) of training images
          'X_val': Array, shape (N_val, d_1, ..., d_k) of validation images
          'y_train': Array, shape (N_train,) of labels for training images
          'y_val': Array, shape (N_val,) of labels for validation images

        Optional arguments:
        - update_rule: The optimization algorithm; defaults to 'sgd' (a minimal
          sketch of that rule is given right after this class).
        - optim_config: Hyperparameters of the chosen optimization algorithm.
        - lr_decay: Factor by which the learning rate decays after each epoch.
        - batch_size: Size of each minibatch.
        - num_epochs: Number of passes over the training set during training.
        - verbose: Whether to print progress during training.
        - num_train_samples: Number of training samples used to check training
          accuracy; default is 1000; set to None to use entire training set.
        - num_val_samples: Number of validation samples to use to check val
          accuracy; default is None, which uses the entire validation set.
        - checkpoint_name: If not None, then save model checkpoints here every
          epoch.
        """
        self.model = model
        self.X_train = data['X_train']
        self.y_train = data['y_train']
        self.X_val = data['X_val']
        self.y_val = data['y_val']

        # Unpack keyword arguments
        self.update_rule = kwargs.pop('update_rule', 'sgd')
        self.optim_config = kwargs.pop('optim_config', {})
        self.lr_decay = kwargs.pop('lr_decay', 1.0)
        self.batch_size = kwargs.pop('batch_size', 100)
        self.num_epochs = kwargs.pop('num_epochs', 10)
        self.num_train_samples = kwargs.pop('num_train_samples', 1000)
        self.num_val_samples = kwargs.pop('num_val_samples', None)

        self.checkpoint_name = kwargs.pop('checkpoint_name', None)
        self.print_every = kwargs.pop('print_every', 10)
        self.verbose = kwargs.pop('verbose', True)

        # Throw an error if there are extra keyword arguments
        if len(kwargs) > 0:
            extra = ', '.join('"%s"' % k for k in list(kwargs.keys()))
            raise ValueError('Unrecognized arguments %s' % extra)

        # Make sure the update rule exists, then replace the string
        # name with the actual function
        if not hasattr(optim, self.update_rule):
            raise ValueError('Invalid update_rule "%s"' % self.update_rule)
        self.update_rule = getattr(optim, self.update_rule)

        self._reset()

    # _reset() is only called from the constructor __init__(); it sets up book-keeping variables.
    def _reset(self):
        """
        Set up some book-keeping variables for optimization. Don't call this
        manually.
        """
        # Set up some variables for book-keeping
        self.epoch = 0
        self.best_val_acc = 0
        self.best_params = {}
        self.loss_history = []
        self.train_acc_history = []
        self.val_acc_history = []

        # Make a deep copy of the optim_config for each parameter
        self.optim_configs = {}
        for p in self.model.params:
            d = {k: v for k, v in self.optim_config.items()}
            self.optim_configs[p] = d


    def _step(self):
        """
        One forward and one backward pass over a minibatch of training data,
        followed by a single update of the model parameters. Called by train().
        """
        # Make a minibatch of training data
        num_train = self.X_train.shape[0]
        batch_mask = np.random.choice(num_train, self.batch_size)
        X_batch = self.X_train[batch_mask]
        y_batch = self.y_train[batch_mask]

        # Compute loss and gradient
        loss, grads = self.model.loss(X_batch, y_batch)
        self.loss_history.append(loss)

        # Perform a parameter update
        for p, w in self.model.params.items():
            dw = grads[p]
            config = self.optim_configs[p]
            next_w, next_config = self.update_rule(w, dw, config)
            self.model.params[p] = next_w
            self.optim_configs[p] = next_config

    # Save a checkpoint of the model and the training state.
    def _save_checkpoint(self):
        if self.checkpoint_name is None: return
        checkpoint = {
          'model': self.model,
          'update_rule': self.update_rule,
          'lr_decay': self.lr_decay,
          'optim_config': self.optim_config,
          'batch_size': self.batch_size,
          'num_train_samples': self.num_train_samples,
          'num_val_samples': self.num_val_samples,
          'epoch': self.epoch,
          'loss_history': self.loss_history,
          'train_acc_history': self.train_acc_history,
          'val_acc_history': self.val_acc_history,
        }
        filename = '%s_epoch_%d.pkl' % (self.checkpoint_name, self.epoch)
        if self.verbose:
            print('Saving checkpoint to "%s"' % filename)
        with open(filename, 'wb') as f:
            pickle.dump(checkpoint, f)

    # check_accuracy() is only called from train().
    def check_accuracy(self, X, y, num_samples=None, batch_size=100):
        """
        Check accuracy of the model on the provided data.

        Inputs:
        - X: Array of data, of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,)
        - num_samples: If not None, subsample the data and only test the model
          on num_samples datapoints.
        - batch_size: Split X and y into batches of this size to avoid using
          too much memory.

        Returns:
        - acc: Scalar giving the fraction of instances that were correctly
          classified by the model.
        """

        # Maybe subsample the data
        N = X.shape[0]
        if num_samples is not None and N > num_samples:
            mask = np.random.choice(N, num_samples)
            N = num_samples
            X = X[mask]
            y = y[mask]

        # Compute predictions in batches
        num_batches = N // batch_size
        if N % batch_size != 0:
            num_batches += 1
        y_pred = []
        for i in range(num_batches):
            start = i * batch_size
            end = (i + 1) * batch_size
            scores = self.model.loss(X[start:end])
            y_pred.append(np.argmax(scores, axis=1))
        y_pred = np.hstack(y_pred)
        acc = np.mean(y_pred == y)

        return acc


    def train(self):
        """
        Run optimization to train the model.
        """
        num_train = self.X_train.shape[0]
        iterations_per_epoch = max(num_train // self.batch_size, 1)
        num_iterations = self.num_epochs * iterations_per_epoch

        for t in range(num_iterations):
            self._step()

            # Maybe print training loss
            if self.verbose and t % self.print_every == 0:
                print('(Iteration %d / %d) loss: %f' % (
                       t + 1, num_iterations, self.loss_history[-1]))

            # At the end of every epoch, increment the epoch counter and decay
            # the learning rate.
            epoch_end = (t + 1) % iterations_per_epoch == 0
            if epoch_end:
                self.epoch += 1
                for k in self.optim_configs:
                    self.optim_configs[k]['learning_rate'] *= self.lr_decay # decay the learning rate

            # Check train and val accuracy on the first iteration, the last
            # iteration, and at the end of each epoch.
            first_it = (t == 0)
            last_it = (t == num_iterations - 1)
            if first_it or last_it or epoch_end:
                train_acc = self.check_accuracy(self.X_train, self.y_train,
                    num_samples=self.num_train_samples)
                val_acc = self.check_accuracy(self.X_val, self.y_val,
                    num_samples=self.num_val_samples)
                self.train_acc_history.append(train_acc)
                self.val_acc_history.append(val_acc)
                self._save_checkpoint()

                if self.verbose:
                    print('(Epoch %d / %d) train acc: %f; val_acc: %f' % (
                           self.epoch, self.num_epochs, train_acc, val_acc))

                # Keep track of the best model
                if val_acc > self.best_val_acc:
                    self.best_val_acc = val_acc
                    self.best_params = {}
                    for k, v in self.model.params.items():
                        self.best_params[k] = v.copy()

        # At the end of training swap the best params into the model
        self.model.params = self.best_params
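
The Solver looks update rules up by name in cs231n/optim.py. For reference, the default 'sgd' rule is just a vanilla gradient step; a minimal sketch of it:

def sgd(w, dw, config=None):
    """
    Performs vanilla stochastic gradient descent.

    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)

    w -= config['learning_rate'] * dw
    return w, config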

Check the validation accuracy:

model = TwoLayerNet()
solver = None

##############################################################################
# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least  #
# 50% accuracy on the validation set.                                        #
##############################################################################
solver = Solver(model, data,
                  update_rule='sgd',
                  optim_config={
                    'learning_rate': 1e-3,
                  },
                  lr_decay=0.95,
                  num_epochs=10, batch_size=128,
                  print_every=100)
solver.train()
solver.best_val_acc
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################

(Iteration 1 / 3820) loss: 2.302693
(Epoch 0 / 10) train acc: 0.134000; val_acc: 0.141000
(Iteration 101 / 3820) loss: 1.692782
(Iteration 201 / 3820) loss: 1.687236
(Iteration 301 / 3820) loss: 1.749260
(Epoch 1 / 10) train acc: 0.455000; val_acc: 0.433000
(Iteration 401 / 3820) loss: 1.501709
(Iteration 501 / 3820) loss: 1.549186
(Iteration 601 / 3820) loss: 1.442813
(Iteration 701 / 3820) loss: 1.476939
(Epoch 2 / 10) train acc: 0.493000; val_acc: 0.468000
(Iteration 801 / 3820) loss: 1.287420
(Iteration 901 / 3820) loss: 1.469279
(Iteration 1001 / 3820) loss: 1.475614
(Iteration 1101 / 3820) loss: 1.295445
(Epoch 3 / 10) train acc: 0.486000; val_acc: 0.488000
(Iteration 1201 / 3820) loss: 1.312503
(Iteration 1301 / 3820) loss: 1.478785
(Iteration 1401 / 3820) loss: 1.206321
(Iteration 1501 / 3820) loss: 1.544099
(Epoch 4 / 10) train acc: 0.518000; val_acc: 0.488000
(Iteration 1601 / 3820) loss: 1.234062
(Iteration 1701 / 3820) loss: 1.336020
(Iteration 1801 / 3820) loss: 1.229858
(Iteration 1901 / 3820) loss: 1.347779
(Epoch 5 / 10) train acc: 0.569000; val_acc: 0.499000
(Iteration 2001 / 3820) loss: 1.299783
(Iteration 2101 / 3820) loss: 1.392062
(Iteration 2201 / 3820) loss: 1.277007
(Epoch 6 / 10) train acc: 0.579000; val_acc: 0.500000
(Iteration 2301 / 3820) loss: 1.442022
(Iteration 2401 / 3820) loss: 1.411056
(Iteration 2501 / 3820) loss: 1.205100
(Iteration 2601 / 3820) loss: 1.179498
(Epoch 7 / 10) train acc: 0.548000; val_acc: 0.485000
(Iteration 2701 / 3820) loss: 1.252322
(Iteration 2801 / 3820) loss: 1.113809
(Iteration 2901 / 3820) loss: 1.164096
(Iteration 3001 / 3820) loss: 1.216631
(Epoch 8 / 10) train acc: 0.584000; val_acc: 0.510000
(Iteration 3101 / 3820) loss: 1.138006
(Iteration 3201 / 3820) loss: 1.231227
(Iteration 3301 / 3820) loss: 1.005646
(Iteration 3401 / 3820) loss: 1.003769
(Epoch 9 / 10) train acc: 0.602000; val_acc: 0.516000
(Iteration 3501 / 3820) loss: 1.329801
(Iteration 3601 / 3820) loss: 1.253133
(Iteration 3701 / 3820) loss: 1.059002
(Iteration 3801 / 3820) loss: 1.080007
(Epoch 10 / 10) train acc: 0.614000; val_acc: 0.497000

0.516

A multi-layer fully-connected network:

class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch/layer normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
          the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
          are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model. (No seed by default; if given, it is passed on to the dropout layers.)
        """
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        # initialize the parameters of all hidden layers
        in_dim = input_dim #D
        for i,h_dim in enumerate(hidden_dims): #(0,H1)(1,H2)
            self.params['W%d' %(i+1,)] = weight_scale * np.random.randn(in_dim,h_dim)
            self.params['b%d' %(i+1,)] = np.zeros((h_dim,))
            if self.normalization=='batchnorm':
                self.params['gamma%d' %(i+1,)] = np.ones((h_dim,)) # scale, initialized to ones
                self.params['beta%d' %(i+1,)] = np.zeros((h_dim,)) # shift, initialized to zeros
            in_dim = h_dim # this layer's output size becomes the next layer's input size
            
        # initialize the parameters of the output layer
        self.params['W%d' %(self.num_layers,)] = weight_scale * np.random.randn(in_dim,num_classes)
        self.params['b%d' %(self.num_layers,)] = np.zeros((num_classes,))
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When dropout is enabled, every layer needs to see the same parameter dictionary
        # self.dropout_param, so each layer knows the drop probability p and the current
        # mode of the network (train / test).
        self.dropout_param = {} # dropout parameter dictionary
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # When batch normalization is enabled, we keep a list of BN parameter dictionaries,
        # self.bn_params, used to track the running mean and variance of each layer.
        # self.bn_params[0] holds the parameters for the first BN layer in the forward pass,
        # self.bn_params[1] for the second, and so on.
        self.bn_params = [] # BN parameter dictionaries
        if self.normalization=='batchnorm':
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
        if self.normalization=='layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)


    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization=='batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        fc_mix_cache = {}  # caches from the forward pass of each hidden layer
        if self.use_dropout:  # if dropout is enabled, keep its caches as well
            dp_cache = {}
        # Loop over the hidden layers, passing `out` forward and saving each layer's cache.
        # affine_bn_relu_forward / affine_bn_relu_backward below are just another "sandwich"
        # convenience layer; a sketch of them is given after this class.
        out = X
        for i in range(self.num_layers - 1):  # loop over the hidden layers
            w,b = self.params['W%d' %(i+1,)],self.params['b%d' %(i+1,)]
            if self.normalization == 'batchnorm':
                gamma = self.params['gamma%d' %(i+1,)]
                beta = self.params['beta%d' %(i+1,)]
                out,fc_mix_cache[i] = affine_bn_relu_forward(out,w,b,gamma,beta,self.bn_params[i])
            else:
                out,fc_mix_cache[i] = affine_relu_forward(out,w,b)
            if self.use_dropout:
                out,dp_cache[i] = dropout_forward(out,self.dropout_param)
        # the final (output) affine layer, no ReLU
        w = self.params['W%d' %(self.num_layers,)]
        b = self.params['b%d' %(self.num_layers,)]
        out,out_cache = affine_forward(out,w,b)
        scores = out
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the scale   #
        # and shift parameters.                                                    #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss,dout = softmax_loss(scores,y)
        loss += 0.5 * self.reg * np.sum(self.params['W%d' %(self.num_layers,)] ** 2)
        # Backpropagate through the output layer, storing its gradients in grads:
        dout,dw,db = affine_backward(dout,out_cache)
        grads['W%d' %(self.num_layers,)] = dw + self.reg * self.params['W%d' %(self.num_layers,)]
        grads['b%d' %(self.num_layers,)] = db
        # Backpropagate through each hidden layer, filling in grads and adding each
        # layer's L2 regularization term to the loss along the way.
        for i in range(self.num_layers - 1):
            ri = self.num_layers - 2 - i # index of the hidden layer, counted from the back
            loss += 0.5 * self.reg * np.sum(self.params['W%d' %(ri+1,)] ** 2) # add this layer's L2 term to the loss
            if self.use_dropout:
                dout = dropout_backward(dout,dp_cache[ri])
            if self.normalization == 'batchnorm':
                dout,dw,db,dgamma,dbeta = affine_bn_relu_backward(dout,fc_mix_cache[ri])
                grads['gamma%d' %(ri+1,)] = dgamma
                grads['beta%d' %(ri+1,)] = dbeta
            else:
                dout,dw,db = affine_relu_backward(dout,fc_mix_cache[ri])
            grads['W%d' %(ri+1,)] = dw + self.reg * self.params['W%d' %(ri+1,)]
            grads['b%d' %(ri+1,)] = db
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
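
The forward and backward passes above call affine_bn_relu_forward and affine_bn_relu_backward, which this post never shows; they are just another "sandwich" convenience layer. Here is a sketch of what they look like, assuming the batchnorm_forward and batchnorm_backward functions from the batch-normalization part of the assignment:

def affine_bn_relu_forward(x, w, b, gamma, beta, bn_param):
    """ Convenience layer: affine -> batch norm -> ReLU. """
    a, fc_cache = affine_forward(x, w, b)
    bn, bn_cache = batchnorm_forward(a, gamma, beta, bn_param)
    out, relu_cache = relu_forward(bn)
    cache = (fc_cache, bn_cache, relu_cache)
    return out, cache


def affine_bn_relu_backward(dout, cache):
    """ Backward pass for the affine -> batch norm -> ReLU layer. """
    fc_cache, bn_cache, relu_cache = cache
    dbn = relu_backward(dout, relu_cache)
    da, dgamma, dbeta = batchnorm_backward(dbn, bn_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db, dgamma, dbeta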

Initial loss and gradient checks:
Running check with reg = 0
Initial loss: 2.3004790897684924
W1 relative error: 1.48e-07
W2 relative error: 2.21e-05
W3 relative error: 3.53e-07
b1 relative error: 5.38e-09
b2 relative error: 2.09e-09
b3 relative error: 5.80e-11
Running check with reg = 3.14
Initial loss: 7.052114776533016
W1 relative error: 7.36e-09
W2 relative error: 6.87e-08
W3 relative error: 3.48e-08
b1 relative error: 1.48e-08
b2 relative error: 1.72e-09
b3 relative error: 1.80e-10

Now use a three-layer net to overfit a small dataset of 50 images; the training setup is sketched below, followed by the log:
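
(A sketch of the setup behind the log below. The post does not record the exact starting weight_scale / learning_rate, so the values here are placeholders; `data` is the CIFAR-10 dictionary used with the Solver above.)

import numpy as np

num_train = 50
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

weight_scale = 1e-2    # placeholder, tuned by hand
learning_rate = 1e-3   # placeholder, raised to 1e-2 in the second run below
model = FullyConnectedNet([100, 100],
                          weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, small_data,
                print_every=10, num_epochs=20, batch_size=25,
                update_rule='sgd',
                optim_config={'learning_rate': learning_rate},
                verbose=True)
solver.train()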

(Iteration 1 / 40) loss: 2.329128
(Epoch 0 / 20) train acc: 0.140000; val_acc: 0.120000
(Epoch 1 / 20) train acc: 0.160000; val_acc: 0.123000
(Epoch 2 / 20) train acc: 0.240000; val_acc: 0.130000
(Epoch 3 / 20) train acc: 0.340000; val_acc: 0.133000
(Epoch 4 / 20) train acc: 0.380000; val_acc: 0.131000
(Epoch 5 / 20) train acc: 0.460000; val_acc: 0.135000
(Iteration 11 / 40) loss: 2.130744
(Epoch 6 / 20) train acc: 0.420000; val_acc: 0.133000
(Epoch 7 / 20) train acc: 0.520000; val_acc: 0.149000
(Epoch 8 / 20) train acc: 0.540000; val_acc: 0.151000
(Epoch 9 / 20) train acc: 0.520000; val_acc: 0.146000
(Epoch 10 / 20) train acc: 0.500000; val_acc: 0.147000
(Iteration 21 / 40) loss: 1.984555
(Epoch 11 / 20) train acc: 0.520000; val_acc: 0.152000
(Epoch 12 / 20) train acc: 0.580000; val_acc: 0.153000
(Epoch 13 / 20) train acc: 0.560000; val_acc: 0.146000
(Epoch 14 / 20) train acc: 0.600000; val_acc: 0.142000
(Epoch 15 / 20) train acc: 0.560000; val_acc: 0.137000
(Iteration 31 / 40) loss: 1.950822
(Epoch 16 / 20) train acc: 0.520000; val_acc: 0.146000
(Epoch 17 / 20) train acc: 0.540000; val_acc: 0.143000
(Epoch 18 / 20) train acc: 0.540000; val_acc: 0.149000
(Epoch 19 / 20) train acc: 0.520000; val_acc: 0.141000
(Epoch 20 / 20) train acc: 0.540000; val_acc: 0.141000

The loss goes down only slowly, which suggests the learning rate is too small.
Setting learning_rate to 1e-2:

(Iteration 1 / 40) loss: 2.330135
(Epoch 0 / 20) train acc: 0.260000; val_acc: 0.097000
(Epoch 1 / 20) train acc: 0.280000; val_acc: 0.109000
(Epoch 2 / 20) train acc: 0.280000; val_acc: 0.129000
(Epoch 3 / 20) train acc: 0.580000; val_acc: 0.146000
(Epoch 4 / 20) train acc: 0.640000; val_acc: 0.133000
(Epoch 5 / 20) train acc: 0.620000; val_acc: 0.176000
(Iteration 11 / 40) loss: 1.567106
(Epoch 6 / 20) train acc: 0.600000; val_acc: 0.176000
(Epoch 7 / 20) train acc: 0.720000; val_acc: 0.122000
(Epoch 8 / 20) train acc: 0.880000; val_acc: 0.162000
(Epoch 9 / 20) train acc: 0.920000; val_acc: 0.160000
(Epoch 10 / 20) train acc: 0.920000; val_acc: 0.187000
(Iteration 21 / 40) loss: 0.496118
(Epoch 11 / 20) train acc: 0.980000; val_acc: 0.175000
(Epoch 12 / 20) train acc: 0.920000; val_acc: 0.156000
(Epoch 13 / 20) train acc: 0.960000; val_acc: 0.179000
(Epoch 14 / 20) train acc: 0.980000; val_acc: 0.182000
(Epoch 15 / 20) train acc: 1.000000; val_acc: 0.175000
(Iteration 31 / 40) loss: 0.076210
(Epoch 16 / 20) train acc: 1.000000; val_acc: 0.192000
(Epoch 17 / 20) train acc: 1.000000; val_acc: 0.180000
(Epoch 18 / 20) train acc: 1.000000; val_acc: 0.173000
(Epoch 19 / 20) train acc: 1.000000; val_acc: 0.178000
(Epoch 20 / 20) train acc: 1.000000; val_acc: 0.175000

The net overfits successfully, reaching 100% training accuracy.

Next, try to overfit the same 50 images with a five-layer network.
With the initial parameters:

(Iteration 1 / 40) loss: 2.302585
(Epoch 0 / 20) train acc: 0.160000; val_acc: 0.112000
(Epoch 1 / 20) train acc: 0.100000; val_acc: 0.107000
(Epoch 2 / 20) train acc: 0.100000; val_acc: 0.107000
(Epoch 3 / 20) train acc: 0.120000; val_acc: 0.105000
(Epoch 4 / 20) train acc: 0.160000; val_acc: 0.112000
(Epoch 5 / 20) train acc: 0.160000; val_acc: 0.112000
(Iteration 11 / 40) loss: 2.302211
(Epoch 6 / 20) train acc: 0.160000; val_acc: 0.112000
(Epoch 7 / 20) train acc: 0.160000; val_acc: 0.112000
(Epoch 8 / 20) train acc: 0.160000; val_acc: 0.112000
(Epoch 9 / 20) train acc: 0.160000; val_acc: 0.079000
(Epoch 10 / 20) train acc: 0.160000; val_acc: 0.112000
(Iteration 21 / 40) loss: 2.301766
(Epoch 11 / 20) train acc: 0.160000; val_acc: 0.112000
(Epoch 12 / 20) train acc: 0.160000; val_acc: 0.079000
(Epoch 13 / 20) train acc: 0.160000; val_acc: 0.079000
(Epoch 14 / 20) train acc: 0.160000; val_acc: 0.079000
(Epoch 15 / 20) train acc: 0.160000; val_acc: 0.079000
(Iteration 31 / 40) loss: 2.302234
(Epoch 16 / 20) train acc: 0.160000; val_acc: 0.079000
(Epoch 17 / 20) train acc: 0.160000; val_acc: 0.079000
(Epoch 18 / 20) train acc: 0.160000; val_acc: 0.112000
(Epoch 19 / 20) train acc: 0.160000; val_acc: 0.112000
(Epoch 20 / 20) train acc: 0.160000; val_acc: 0.079000

After adjusting weight_scale to 5e-2:
(Iteration 1 / 40) loss: 3.445131
(Epoch 0 / 20) train acc: 0.160000; val_acc: 0.099000
(Epoch 1 / 20) train acc: 0.200000; val_acc: 0.101000
(Epoch 2 / 20) train acc: 0.380000; val_acc: 0.112000
(Epoch 3 / 20) train acc: 0.500000; val_acc: 0.127000
(Epoch 4 / 20) train acc: 0.600000; val_acc: 0.144000
(Epoch 5 / 20) train acc: 0.700000; val_acc: 0.127000
(Iteration 11 / 40) loss: 1.105333
(Epoch 6 / 20) train acc: 0.700000; val_acc: 0.137000
(Epoch 7 / 20) train acc: 0.800000; val_acc: 0.137000
(Epoch 8 / 20) train acc: 0.860000; val_acc: 0.137000
(Epoch 9 / 20) train acc: 0.860000; val_acc: 0.132000
(Epoch 10 / 20) train acc: 0.900000; val_acc: 0.130000
(Iteration 21 / 40) loss: 0.608579
(Epoch 11 / 20) train acc: 0.940000; val_acc: 0.131000
(Epoch 12 / 20) train acc: 0.980000; val_acc: 0.122000
(Epoch 13 / 20) train acc: 0.980000; val_acc: 0.123000
(Epoch 14 / 20) train acc: 0.960000; val_acc: 0.130000
(Epoch 15 / 20) train acc: 0.980000; val_acc: 0.132000
(Iteration 31 / 40) loss: 0.437144
(Epoch 16 / 20) train acc: 0.980000; val_acc: 0.125000
(Epoch 17 / 20) train acc: 0.980000; val_acc: 0.123000
(Epoch 18 / 20) train acc: 0.980000; val_acc: 0.128000
(Epoch 19 / 20) train acc: 1.000000; val_acc: 0.129000
(Epoch 20 / 20) train acc: 1.000000; val_acc: 0.120000

SGD + momentum

def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the momentum update formula. Store the updated value in #
    # the next_w variable. You should also use and update the velocity v.     #
    ###########################################################################
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    config['velocity'] = v

    return next_w, config

next_w error: 8.882347033505819e-09
velocity error: 4.269287743278663e-09
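
A tiny numeric illustration of how the velocity accumulates over two updates (scalar weight, constant gradient, the default learning_rate=1e-2 and momentum=0.9):

import numpy as np

w = np.array([1.0])
dw = np.array([0.5])       # pretend the gradient stays constant
config = None
for step in range(2):
    w, config = sgd_momentum(w, dw, config)
    print(step, w, config['velocity'])
# step 0: v = 0.9 * 0        - 0.01 * 0.5 = -0.005   -> w = 0.995
# step 1: v = 0.9 * (-0.005) - 0.01 * 0.5 = -0.0095  -> w = 0.9855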

Comparing SGD with SGD + momentum (a sketch of the comparison setup, then the training logs):
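
(A sketch of the comparison loop that produces logs like the ones below; the subset size and hyperparameters follow the notebook's defaults and are assumptions here, and `data` is the same CIFAR-10 dictionary as before.)

num_train = 4000
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

solvers = {}
for update_rule in ['sgd', 'sgd_momentum']:
    print('running with', update_rule)
    model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)
    solver = Solver(model, small_data,
                    num_epochs=5, batch_size=100,
                    update_rule=update_rule,
                    optim_config={'learning_rate': 1e-2},
                    verbose=True)
    solvers[update_rule] = solver
    solver.train()
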
running with sgd
(Iteration 1 / 200) loss: 2.507323
(Epoch 0 / 5) train acc: 0.102000; val_acc: 0.092000
(Iteration 11 / 200) loss: 2.208203
(Iteration 21 / 200) loss: 2.210458
(Iteration 31 / 200) loss: 2.118780
(Epoch 1 / 5) train acc: 0.251000; val_acc: 0.225000
(Iteration 41 / 200) loss: 2.059379
(Iteration 51 / 200) loss: 2.031150
(Iteration 61 / 200) loss: 1.991460
(Iteration 71 / 200) loss: 1.889502
(Epoch 2 / 5) train acc: 0.311000; val_acc: 0.286000
(Iteration 81 / 200) loss: 1.884040
(Iteration 91 / 200) loss: 1.884515
(Iteration 101 / 200) loss: 1.923375
(Iteration 111 / 200) loss: 1.737657
(Epoch 3 / 5) train acc: 0.343000; val_acc: 0.309000
(Iteration 121 / 200) loss: 1.689422
(Iteration 131 / 200) loss: 1.709433
(Iteration 141 / 200) loss: 1.799477
(Iteration 151 / 200) loss: 1.809359
(Epoch 4 / 5) train acc: 0.415000; val_acc: 0.336000
(Iteration 161 / 200) loss: 1.599980
(Iteration 171 / 200) loss: 1.732295
(Iteration 181 / 200) loss: 1.740551
(Iteration 191 / 200) loss: 1.634729
(Epoch 5 / 5) train acc: 0.403000; val_acc: 0.354000

running with sgd_momentum
(Iteration 1 / 200) loss: 2.677090
(Epoch 0 / 5) train acc: 0.100000; val_acc: 0.092000
(Iteration 11 / 200) loss: 2.118401
(Iteration 21 / 200) loss: 2.122486
(Iteration 31 / 200) loss: 1.851282
(Epoch 1 / 5) train acc: 0.326000; val_acc: 0.287000
(Iteration 41 / 200) loss: 1.852963
(Iteration 51 / 200) loss: 1.920911
(Iteration 61 / 200) loss: 1.798175
(Iteration 71 / 200) loss: 1.714354
(Epoch 2 / 5) train acc: 0.386000; val_acc: 0.303000
(Iteration 81 / 200) loss: 1.882377
(Iteration 91 / 200) loss: 1.572796
(Iteration 101 / 200) loss: 1.854254
(Iteration 111 / 200) loss: 1.500233
(Epoch 3 / 5) train acc: 0.480000; val_acc: 0.348000
(Iteration 121 / 200) loss: 1.516018
(Iteration 131 / 200) loss: 1.592710
(Iteration 141 / 200) loss: 1.524653
(Iteration 151 / 200) loss: 1.340690
(Epoch 4 / 5) train acc: 0.478000; val_acc: 0.321000
(Iteration 161 / 200) loss: 1.297253
(Iteration 171 / 200) loss: 1.460615
(Iteration 181 / 200) loss: 1.113488
(Iteration 191 / 200) loss: 1.550920
(Epoch 5 / 5) train acc: 0.512000; val_acc: 0.327000

SGD + momentum drives the loss down noticeably faster than plain SGD.

Implement and test RMSprop:

def rmsprop(w, dw, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the RMSprop update formula, storing the next value of w #
    # in the next_w variable. Don't forget to update cache value stored in    #
    # config['cache'].                                                        #
    ###########################################################################
    config['cache'] = config['cache'] * config['decay_rate'] + (1 - config['decay_rate']) * dw * dw # exponentially decaying moving average of the squared gradients
    next_w = w - config['learning_rate'] * dw / np.sqrt(config['cache'] + config['epsilon'])
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_w, config

next_w error: 9.502645229894295e-08
cache error: 2.6477955807156126e-09

Implement and test Adam:

def adam(w, dw, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(w))
    config.setdefault('v', np.zeros_like(w))
    config.setdefault('t', 0)

    next_w = None
    ###########################################################################
    # TODO: Implement the Adam update formula, storing the next value of w in #
    # the next_w variable. Don't forget to update the m, v, and t variables   #
    # stored in config.                                                       #
    #                                                                         #
    # NOTE: In order to match the reference output, please modify t _before_  #
    # using it in any calculations.                                           #
    ###########################################################################
    config['t'] += 1  # increment t before using it in the bias correction
    m = config['m'] * config['beta1'] + (1 - config['beta1']) * dw
    v = config['v'] * config['beta2'] + (1 - config['beta2']) * dw * dw
    mb = m / (1 - config['beta1'] ** config['t'])  # bias-corrected first moment
    vb = v / (1 - config['beta2'] ** config['t'])  # bias-corrected second moment
    next_w = w - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon']) # combines momentum and RMSProp
    config['m'] = m
    config['v'] = v
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_w, config

next_w error: 0.032064274004801614
v error: 4.208314038113071e-09
m error: 4.214963193114416e-09

The next_w error is several orders of magnitude larger than the other checks; that is the symptom of pinning the iteration counter to t = 1 on every call instead of incrementing t before the bias correction.

Comparison of these optimization algorithms during training:

Now train a real network: a fully-connected net with three hidden layers, dropout, and batch normalization.

best_model = None
################################################################################
# TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might   #
# find batch/layer normalization and dropout useful. Store your best model in  #
# the best_model variable.                                                     #
################################################################################
dropout = 0.25
weight_scale = 2e-2
lr = 1e-3
hidden_dims = [1024,1024,1024]
best_model = FullyConnectedNet(hidden_dims = hidden_dims,num_classes = 10,
                          weight_scale = weight_scale,normalization='batchnorm',
                          dropout = dropout)
solver = Solver(best_model,data,num_epochs = 20,batch_size = 128,print_every = 100,
                update_rule = 'adam',verbose = True,optim_config = {'learning_rate': lr})
solver.train()
plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()
################################################################################
#                              END OF YOUR CODE                                #
################################################################################

Iteration = 49000//128*20 = 7640

(Iteration 1 / 7640) loss: 0.859818
(Epoch 0 / 20) train acc: 0.639000; val_acc: 0.464000
(Iteration 101 / 7640) loss: 1.252969
(Iteration 201 / 7640) loss: 1.065346
(Iteration 301 / 7640) loss: 0.863103
(Epoch 1 / 20) train acc: 0.620000; val_acc: 0.493000
(Iteration 401 / 7640) loss: 1.136908
(Iteration 501 / 7640) loss: 0.960427
(Iteration 601 / 7640) loss: 0.967821
(Iteration 701 / 7640) loss: 0.889575
(Epoch 2 / 20) train acc: 0.661000; val_acc: 0.511000
(Iteration 801 / 7640) loss: 0.842937
(Iteration 901 / 7640) loss: 0.948296
(Iteration 1001 / 7640) loss: 1.020871
(Iteration 1101 / 7640) loss: 1.042730
(Epoch 3 / 20) train acc: 0.668000; val_acc: 0.515000
(Iteration 1201 / 7640) loss: 0.848133
(Iteration 1301 / 7640) loss: 0.824475
(Iteration 1401 / 7640) loss: 0.901598
(Iteration 1501 / 7640) loss: 0.835679
(Epoch 4 / 20) train acc: 0.700000; val_acc: 0.488000
(Iteration 1601 / 7640) loss: 0.692962
(Iteration 1701 / 7640) loss: 0.883259
(Iteration 1801 / 7640) loss: 0.751739
(Iteration 1901 / 7640) loss: 0.834902
(Epoch 5 / 20) train acc: 0.695000; val_acc: 0.511000
(Iteration 2001 / 7640) loss: 0.840407
(Iteration 2101 / 7640) loss: 0.736310
(Iteration 2201 / 7640) loss: 0.736240
(Epoch 6 / 20) train acc: 0.725000; val_acc: 0.499000
(Iteration 2301 / 7640) loss: 0.862586
(Iteration 2401 / 7640) loss: 0.927217
(Iteration 2501 / 7640) loss: 0.755900
(Iteration 2601 / 7640) loss: 0.585035
(Epoch 7 / 20) train acc: 0.754000; val_acc: 0.516000
(Iteration 2701 / 7640) loss: 0.620836
(Iteration 2801 / 7640) loss: 0.659957
(Iteration 2901 / 7640) loss: 0.599932
(Iteration 3001 / 7640) loss: 0.609260
(Epoch 8 / 20) train acc: 0.771000; val_acc: 0.510000
(Iteration 3101 / 7640) loss: 0.783430
(Iteration 3201 / 7640) loss: 0.566388
(Iteration 3301 / 7640) loss: 0.604077
(Iteration 3401 / 7640) loss: 0.515016
(Epoch 9 / 20) train acc: 0.782000; val_acc: 0.510000
(Iteration 3501 / 7640) loss: 0.745964
(Iteration 3601 / 7640) loss: 0.862417
(Iteration 3701 / 7640) loss: 0.528430
(Iteration 3801 / 7640) loss: 0.662338
(Epoch 10 / 20) train acc: 0.765000; val_acc: 0.510000
(Iteration 3901 / 7640) loss: 0.639553
(Iteration 4001 / 7640) loss: 0.685763
(Iteration 4101 / 7640) loss: 0.748629
(Iteration 4201 / 7640) loss: 0.620021
(Epoch 11 / 20) train acc: 0.799000; val_acc: 0.507000
(Iteration 4301 / 7640) loss: 0.646508
(Iteration 4401 / 7640) loss: 0.597432
(Iteration 4501 / 7640) loss: 0.666086
(Epoch 12 / 20) train acc: 0.804000; val_acc: 0.499000
(Iteration 4601 / 7640) loss: 0.619035
(Iteration 4701 / 7640) loss: 0.685448
(Iteration 4801 / 7640) loss: 0.786623
(Iteration 4901 / 7640) loss: 0.566107
(Epoch 13 / 20) train acc: 0.815000; val_acc: 0.511000
(Iteration 5001 / 7640) loss: 0.551514
(Iteration 5101 / 7640) loss: 0.597256
(Iteration 5201 / 7640) loss: 0.643402
(Iteration 5301 / 7640) loss: 0.524270
(Epoch 14 / 20) train acc: 0.802000; val_acc: 0.501000
(Iteration 5401 / 7640) loss: 0.569950
(Iteration 5501 / 7640) loss: 0.522419
(Iteration 5601 / 7640) loss: 0.644923
(Iteration 5701 / 7640) loss: 0.513421
(Epoch 15 / 20) train acc: 0.813000; val_acc: 0.505000
(Iteration 5801 / 7640) loss: 0.489016
(Iteration 5901 / 7640) loss: 0.408196
(Iteration 6001 / 7640) loss: 0.382298
(Iteration 6101 / 7640) loss: 0.540364
(Epoch 16 / 20) train acc: 0.840000; val_acc: 0.503000
(Iteration 6201 / 7640) loss: 0.418339
(Iteration 6301 / 7640) loss: 0.578868
(Iteration 6401 / 7640) loss: 0.412187
(Epoch 17 / 20) train acc: 0.835000; val_acc: 0.504000
(Iteration 6501 / 7640) loss: 0.541283
(Iteration 6601 / 7640) loss: 0.462409
(Iteration 6701 / 7640) loss: 0.509253
(Iteration 6801 / 7640) loss: 0.505827
(Epoch 18 / 20) train acc: 0.841000; val_acc: 0.494000
(Iteration 6901 / 7640) loss: 0.476122
(Iteration 7001 / 7640) loss: 0.528972
(Iteration 7101 / 7640) loss: 0.533508
(Iteration 7201 / 7640) loss: 0.598713
(Epoch 19 / 20) train acc: 0.845000; val_acc: 0.486000
(Iteration 7301 / 7640) loss: 0.473737
(Iteration 7401 / 7640) loss: 0.443160
(Iteration 7501 / 7640) loss: 0.332309
(Iteration 7601 / 7640) loss: 0.300785
(Epoch 20 / 20) train acc: 0.858000; val_acc: 0.497000

Reference: https://blog.csdn.net/BigDataDigest/article/details/79286510

Original post: https://www.cnblogs.com/bernieloveslife/p/10187104.html