[Deep Learning] A Primer on Supervised Optimization for Deep Learning

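Introduction

This tutorial covers some important concepts of deep learning. It is a quick-start outline with three parts:

Part 1, Datasets: introduces the MNIST dataset and how to use it;
Part 2, Notation: introduces the notation used for the main concepts;
Part 3, A primer on supervised optimization: introduces some important deep learning concepts.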

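1. Datasets

1. The MNIST dataset

The dataset (mnist.pkl.gz) can be downloaded for free from CSDN Downloads: http://download.csdn.net/detail/ws_20100/9224993

The MNIST dataset consists of images of handwritten digits, with 60000 training examples and 10000 test examples. In many papers, as well as in this tutorial, the 60000 training examples are split into a training set of 50000 examples and a validation set of 10000 examples (used to select hyperparameters such as the learning rate and the model size). All digit images have been size-normalized and centered in a 28×28 pixel image. Every pixel value lies between 0 and 255, where 0 is black, 255 is white, and values in between are different shades of grey.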

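Some examples from the MNIST dataset:
[Figure: sample MNIST digits]

To make MNIST convenient to use from Python, we preprocess it. The processed dataset is a tuple of 3 sets: the training set, the validation set and the test set. Each set is a pair consisting of an array of images and an array of the corresponding labels. Each image is represented as a 784-dimensional (28×28) array whose elements lie between 0 and 1, where 0 is black and 1 is white. Each label is an integer from 0 to 9. The code to load the dataset is: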

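import gzip
import pickle

import numpy

# Load the dataset (Python 3; on Python 2 use cPickle.load(f) instead)
with gzip.open('mnist.pkl.gz', 'rb') as f:
    # encoding='latin1' is needed to read a pickle written by Python 2
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')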
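
As a small illustration of the image representation described above, the sketch below turns a raw 28×28 image with integer pixel values in 0..255 into a flat 784-element float vector in [0, 1]. The random image and the helper name `flatten_and_scale` are made up for this example; they are not part of the pickled dataset or of its preprocessing script.

```python
import numpy

def flatten_and_scale(image_uint8):
    """Turn a 28x28 uint8 image (0..255) into a 784-vector of floats in [0, 1]."""
    image = numpy.asarray(image_uint8, dtype='float64')
    return image.reshape(-1) / 255.0

# hypothetical stand-in for one raw MNIST image
rng = numpy.random.RandomState(0)
raw_image = rng.randint(0, 256, size=(28, 28)).astype('uint8')

vector = flatten_and_scale(raw_image)
print(vector.shape, vector.min(), vector.max())
```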

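When using the dataset, we usually divide it into minibatches (see the section on stochastic gradient descent below). In practice it is best to store the data in shared variables, because of how they interact with the GPU: without them, each minibatch has to be copied to the GPU when it is needed, so running on the GPU may be barely faster than the CPU, or even slower. The code below stores the data and reads a minibatch: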

def shared_dataset(data_xy):
    """ Function that loads the dataset into shared variables

    The reason we store our dataset in shared variables is to allow
    Theano to copy it into the GPU memory (when code is run on GPU).
    Since copying data into the GPU is slow, copying a minibatch every time
    is needed (the default behaviour if the data is not in a shared
    variable) would lead to a large decrease in performance.
    """
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # When storing data on the GPU it has to be stored as floats
    # therefore we will store the labels as ``floatX`` as well
    # (``shared_y`` does exactly that). But during our computations
    # we need them as ints (we use labels as index, and if they are
    # floats it doesn't make sense) therefore instead of returning
    # ``shared_y`` we will have to cast it to int. This little hack
    # lets us get around this issue
    return shared_x, T.cast(shared_y, 'int32')

test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500    # size of the minibatch

# accessing the third minibatch of the training set

data  = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]

If you are running your code on the GPU and the dataset you are using is too large to fit in memory, the code will crash. In such a case you should not store the whole dataset in a shared variable. You can, however, store a sufficiently small chunk of your data (several minibatches) in a shared variable and use that during training. Once you have gone through the chunk, update the values it stores. This way you minimize the number of data transfers between CPU memory and GPU memory.
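
The chunking idea in the paragraph above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the helper `iterate_in_chunks` and the toy arrays are hypothetical, and no GPU transfer actually happens here. The comment marks the place where a Theano program would refresh its shared variables (for example through their `set_value` method).

```python
import numpy

def iterate_in_chunks(data_x, data_y, chunk_size, batch_size):
    """Yield minibatches while holding only one chunk "on the device" at a time."""
    n_examples = data_x.shape[0]
    for chunk_start in range(0, n_examples, chunk_size):
        # In a Theano program this is where the chunk would be copied into
        # the shared variables (one CPU -> GPU transfer per chunk).
        chunk_x = data_x[chunk_start:chunk_start + chunk_size]
        chunk_y = data_y[chunk_start:chunk_start + chunk_size]
        for batch_start in range(0, chunk_x.shape[0], batch_size):
            yield (chunk_x[batch_start:batch_start + batch_size],
                   chunk_y[batch_start:batch_start + batch_size])

# toy data: 2000 examples of dimension 784, labels 0..9
rng = numpy.random.RandomState(0)
toy_x = rng.rand(2000, 784)
toy_y = rng.randint(0, 10, size=2000)

batches = list(iterate_in_chunks(toy_x, toy_y, chunk_size=1000, batch_size=500))
print(len(batches), batches[0][0].shape)
```

With a chunk of 1000 examples and minibatches of 500, only two transfers would be needed for the four minibatches, instead of one transfer per minibatch.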

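2. Notation

1. Dataset notation

We denote a dataset by $\mathcal{D}$. When the distinction matters, we write the training, validation and test sets as $\mathcal{D}_{train}$, $\mathcal{D}_{valid}$ and $\mathcal{D}_{test}$. The validation set is used for model selection and hyperparameter selection, whereas the test set is used to estimate the final generalization error and to compare different algorithms fairly.
Most of the algorithms in this tutorial deal with classification problems, so the dataset $\mathcal{D}$ is an indexed sequence of pairs $(x^{(i)}, y^{(i)})$. We use superscripts to distinguish training examples: $x^{(i)} \in \mathbb{R}^D$ is the $i$-th training example, a vector of dimension $D$. Similarly, $y^{(i)} \in \{0, ..., L\}$ is the $i$-th label, assigned to the input $x^{(i)}$. It is straightforward to extend this to labels $y^{(i)}$ of other types (for example, Gaussian for regression, or groups of multinomials for predicting multiple symbols).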

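2. Math conventions

$W$: an upper-case symbol denotes a matrix (unless specified otherwise).
$W_{ij}$: the element in row $i$ and column $j$ of matrix $W$.
$W_{i\cdot}, W_i$: row $i$ of matrix $W$.
$W_{\cdot j}$: column $j$ of matrix $W$.
$b$: a lower-case symbol denotes a vector (unless specified otherwise).
$b_i$: the $i$-th element of vector $b$.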
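
These conventions map directly onto NumPy indexing, which the tutorial code relies on. The sketch below uses arbitrary example values for `W` and `b`; note that NumPy indices start at 0.

```python
import numpy

W = numpy.array([[1., 2., 3.],
                 [4., 5., 6.]])   # a 2x3 matrix
b = numpy.array([10., 20., 30.])  # a vector

i, j = 1, 2
W_ij = W[i, j]      # element in row i, column j
W_i_row = W[i]      # row i, same as W[i, :]
W_j_col = W[:, j]   # column j
b_i = b[i]          # i-th element of b

print(W_ij, W_i_row, W_j_col, b_i)
```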

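3. List of symbols and acronyms

$D$: the dimension of the input vectors.
$D_h^{(i)}$: the number of hidden units in the $i$-th layer.
$f_{\theta}(x), f(x)$: the classification function associated with a model $P(Y|x,\theta)$, defined as $\mathrm{argmax}_k P(Y=k|x,\theta)$. Note that the subscript $\theta$ is usually dropped.
$L$: the number of labels.
$\mathcal{L}(\theta, \mathcal{D})$: the log-likelihood of $\mathcal{D}$ under the model with parameters $\theta$.
$\ell(\theta, \mathcal{D})$: the empirical loss of the prediction function $f$ with parameters $\theta$ on the dataset $\mathcal{D}$.
NLL: negative log-likelihood.
$\theta$: the set of all parameters of a given model.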

4. Python namespaces

The tutorial code usually uses the following namespaces:

import theano
import theano.tensor as T
import numpy

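3. A primer on supervised optimization

What makes deep learning exciting is largely the use of unsupervised learning in deep networks. Supervised learning still plays a very important role, however: the usefulness of unsupervised pre-training is usually judged by the performance the network reaches after supervised fine-tuning. This section reviews the basics of supervised learning for classification models, and covers the minibatch stochastic gradient descent algorithm that is used to fine-tune most of the models.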

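1. Learning a classifier

1.) Zero-one loss

The models presented in this deep learning tutorial are mostly used for classification. The objective in training a classifier is to minimize the number of errors (the zero-one loss) on unseen examples. If $f: \mathbb{R}^D \rightarrow \{0, ..., L\}$ is the prediction function, then this loss can be written as:

$$\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}$$

where either $\mathcal{D}$ is the training set (during training) or $\mathcal{D} \cap \mathcal{D}_{train} = \emptyset$ (to avoid biasing the evaluation of the test error). $I$ is the indicator function, defined as:

$$I_x = \begin{cases} 1 & \text{if } x \text{ is True} \\ 0 & \text{otherwise} \end{cases}$$

In this tutorial, $f$ is defined as:

$$f(x) = \mathrm{argmax}_k P(Y=k | x, \theta)$$

In Python, using Theano, this can be written as: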

# zero_one_loss is a Theano variable representing a symbolic
# expression of the zero one loss ; to get the actual value this
# symbolic expression has to be compiled into a Theano function (see
# the Theano tutorial for more details)
# axis=1 takes the argmax over classes separately for each example (row)
zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x, axis=1), y))
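
To see the zero-one loss on concrete numbers, here is a minimal NumPy sketch of the same computation. The probability matrix and labels are invented for illustration, and it assumes `p_y_given_x` holds one row of class probabilities per example.

```python
import numpy

def zero_one_loss(p_y_given_x, y):
    """Count misclassified examples (one row of class probabilities each)."""
    predictions = numpy.argmax(p_y_given_x, axis=1)
    return int(numpy.sum(predictions != y))

# made-up predictions for 4 examples and 3 classes
p_y_given_x = numpy.array([[0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1],
                           [0.3, 0.3, 0.4],
                           [0.6, 0.3, 0.1]])
y = numpy.array([0, 1, 1, 2])

n_errors = zero_one_loss(p_y_given_x, y)
print(n_errors)
```

The predicted classes are 0, 1, 2, 0, so the last two examples are misclassified.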

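2.) Negative log-likelihood loss

Since the zero-one loss is not differentiable, optimizing it directly for large models (thousands or millions of parameters) is computationally prohibitive. We therefore maximize the log-likelihood of our classifier given all the labels in a training set:

$$\mathcal{L}(\theta, \mathcal{D}) = \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$$

The likelihood of the correct class is not the same as the number of correct predictions, but from the point of view of a randomly initialized classifier they are quite similar. Remember, though, that the likelihood and the zero-one loss are different objectives: you should see that they are correlated on the validation set, but sometimes one will rise while the other falls, and sometimes they move together.
Since we usually speak of minimizing a loss function, learning attempts to minimize the negative log-likelihood (NLL), defined as:

$$NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$$

The NLL of our classifier is a differentiable surrogate for the zero-one loss, and we use the gradient of this function over the training data as the supervised learning signal for a deep learning classifier.
It can be computed with the following code: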

# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)-1].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector.  Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
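
The indexing trick described in the comments works the same way in NumPy. The sketch below, with invented probabilities, shows that it picks out the log-probability of the correct label in each row.

```python
import numpy

def negative_log_likelihood(p_y_given_x, y):
    """Sum of -log P(Y = y_i | x_i) over the rows of p_y_given_x."""
    log_p = numpy.log(p_y_given_x)
    # row indices [0, 1, ..., n-1] paired with the label of each row
    return -numpy.sum(log_p[numpy.arange(y.shape[0]), y])

# made-up probabilities for 2 examples and 3 classes
p_y_given_x = numpy.array([[0.5, 0.25, 0.25],
                           [0.1, 0.8, 0.1]])
y = numpy.array([0, 1])

nll = negative_log_likelihood(p_y_given_x, y)
print(nll)  # equals -(log 0.5 + log 0.8)
```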

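2. Stochastic gradient descent

What is ordinary gradient descent? It is a simple algorithm that repeatedly takes small steps downhill on the error surface defined by a loss function of some parameters. For this purpose, we consider the training data to be rolled into the loss function. The pseudocode of the algorithm is: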

# GRADIENT DESCENT

while True:
    loss = f(params)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
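
For a runnable version of this loop, the sketch below applies gradient descent to a simple quadratic loss whose minimum is known. The loss, the learning rate and the stopping condition (a nearly zero gradient, or a step limit) are assumptions chosen for the example, not part of the pseudocode above.

```python
import numpy

def gradient_descent(grad_fn, params, learning_rate=0.1,
                     tolerance=1e-8, max_steps=10000):
    """Plain gradient descent on a loss whose gradient is grad_fn(params)."""
    params = numpy.array(params, dtype='float64')
    for _ in range(max_steps):
        gradient = grad_fn(params)
        params -= learning_rate * gradient
        # assumed stopping condition: the gradient is nearly zero
        if numpy.linalg.norm(gradient) < tolerance:
            break
    return params

# loss(params) = sum((params - target) ** 2), so gradient = 2 * (params - target)
target = numpy.array([3.0, -1.0])
solution = gradient_descent(lambda p: 2.0 * (p - target), params=[0.0, 0.0])
print(solution)
```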

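Stochastic gradient descent (SGD) follows the same principles as ordinary gradient descent, but proceeds more quickly because it estimates the gradient from only a few examples at a time instead of the entire training set. In its purest form, we estimate the gradient from a single example at a time.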

# STOCHASTIC GRADIENT DESCENT
for (x_i,y_i) in training_set:
                            # imagine an infinite generator
                            # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params

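The variant we use for deep learning is a further twist on stochastic gradient descent that uses so-called "minibatches". Minibatch SGD (MSGD) works exactly like SGD, except that each gradient estimate uses more than one training example. This reduces the variance of the gradient estimate and often makes better use of the hierarchical memory organization of modern computers.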

for (x_batch,y_batch) in train_batches:
                            # imagine an infinite generator
                            # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ... # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
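
As a concrete instance of the minibatch loop, here is a minimal NumPy sketch that fits a linear model by MSGD on synthetic data. Everything in it is an assumption made for illustration: the noise-free regression task, the mean squared error loss, the hyperparameter values, and a fixed number of epochs as the stopping condition.

```python
import numpy

rng = numpy.random.RandomState(0)

# synthetic, noise-free regression data: y = x . true_w
true_w = numpy.array([2.0, -3.0, 0.5])
data_x = rng.randn(1000, 3)
data_y = data_x.dot(true_w)

def minibatch_sgd(data_x, data_y, batch_size=20, learning_rate=0.05, n_epochs=20):
    """Minimize the mean squared error of a linear model with MSGD."""
    params = numpy.zeros(data_x.shape[1])
    n_examples = data_x.shape[0]
    for epoch in range(n_epochs):
        order = rng.permutation(n_examples)  # reshuffle every epoch
        for start in range(0, n_examples, batch_size):
            idx = order[start:start + batch_size]
            x_batch, y_batch = data_x[idx], data_y[idx]
            # gradient of the mean squared error on this minibatch
            residual = x_batch.dot(params) - y_batch
            gradient = 2.0 * x_batch.T.dot(residual) / len(idx)
            params -= learning_rate * gradient
    return params

learned_w = minibatch_sgd(data_x, data_y)
print(learned_w)
```

Each update uses only 20 examples, yet the parameters approach `true_w` because every minibatch gradient points, on average, in the direction of the full gradient.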

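Choosing the minibatch size $B$ involves a tradeoff. The reduction in variance and the use of SIMD instructions help most when $B$ goes from 1 to 2, but the marginal improvement fades quickly as $B$ grows. With large $B$, time is spent reducing the variance of the gradient estimate that would be better spent on additional gradient steps. The optimal $B$ depends on the model, the dataset and the hardware, and can be anywhere from 1 to a few hundred. In this tutorial we set $B = 20$, but keep in mind that this choice is nearly arbitrary.

If you train for a fixed number of epochs, the minibatch size becomes important because it controls how many parameter updates are performed. Training for 10 epochs with a minibatch size of 1 clearly gives different results from training for 10 epochs with a minibatch size of 20. Keep this in mind when changing the minibatch size.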
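
The effect is easy to quantify. Assuming every epoch visits each training example exactly once, the small helper below (a hypothetical name, for illustration only) counts the parameter updates for the 50000-example training split used in this tutorial.

```python
import math

def n_parameter_updates(n_examples, batch_size, n_epochs):
    """Number of MSGD updates when every epoch visits each example once."""
    updates_per_epoch = int(math.ceil(n_examples / float(batch_size)))
    return updates_per_epoch * n_epochs

# 50000 training examples, 10 epochs
updates_b1 = n_parameter_updates(50000, batch_size=1, n_epochs=10)
updates_b20 = n_parameter_updates(50000, batch_size=20, n_epochs=10)
print(updates_b1, updates_b20)  # 500000 25000
```

With the same 10 epochs, a minibatch size of 1 performs twenty times as many updates as a minibatch size of 20.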

The code blocks above are all pseudocode. A real implementation in Theano looks like this:

# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch,y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:  # placeholder for a real stopping test
        break

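3. Regularization

There is more to machine learning than optimization. When we train a model from data, the goal is for it to do well on new examples, not just on the ones it has already seen. The MSGD training loop above does not take this into account, and may overfit the training examples. One way to combat overfitting is regularization. There are several regularization techniques; here we cover only ℓ1/ℓ2 regularization and early stopping.

1.) ℓ1 and ℓ2 regularization

ℓ1 and ℓ2 regularization add an extra term to the loss function that penalizes certain parameter configurations. Formally, if our loss function is: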

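$$NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)$$

then the regularized loss is:

$$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda R(\theta)$$

or, in our case:

$$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda \|\theta\|_p^p$$

where

$$\|\theta\|_p = \left( \sum_{j=0}^{|\theta|} |\theta_j|^p \right)^{\frac{1}{p}}$$

is the $\ell_p$ norm of $\theta$. $\lambda$ is a hyperparameter that controls the relative importance of the regularization term. The most commonly used values of $p$ are 1 and 2, hence the names $\ell_1$ and $\ell_2$ regularization. If $p = 2$, the regularizer is also called weight decay.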
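In principle, adding a regularization term to the loss encourages smoother network mappings (penalizing large parameter values reduces the amount of nonlinearity the network models). More intuitively, the two terms ($NLL$ and $R(\theta)$) correspond to fitting the data well ($NLL$) and having a "simple" or "smooth" solution ($R(\theta)$). Minimizing the weighted sum of the two therefore seeks a trade-off between fitting the training data and generalizing. Following Occam's razor, the minimization should find the simplest solution (as measured by our simplicity criterion) that fits the training data.
Note that a "simple" solution is not guaranteed to generalize well. Empirically, this kind of regularization has been found to help generalization in neural networks, especially on small datasets. The code below shows how to compute, in Python, a loss that includes both an $\ell_1$ regularization term (weighted by $\lambda_1$) and an $\ell_2$ regularization term (weighted by $\lambda_2$):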

# symbolic Theano variable that represents the L1 regularization term
L1  = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)

# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
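
The same computation can be checked by hand with NumPy. In the sketch below the NLL value, the parameter matrix and the two coefficients are arbitrary numbers chosen for illustration.

```python
import numpy

def regularized_loss(nll, param, lambda_1, lambda_2):
    """NLL plus weighted L1 and squared-L2 penalties on param."""
    l1 = numpy.sum(numpy.abs(param))   # sum of absolute values
    l2_sqr = numpy.sum(param ** 2)     # sum of squares
    return nll + lambda_1 * l1 + lambda_2 * l2_sqr

param = numpy.array([[1.0, -2.0],
                     [0.5, 0.0]])
loss = regularized_loss(nll=2.0, param=param, lambda_1=0.1, lambda_2=0.01)
print(loss)
```

Here the L1 term is 3.5 and the squared L2 term is 5.25, so the loss is 2.0 + 0.35 + 0.0525 = 2.4025.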

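2.) Early stopping

Early stopping combats overfitting by monitoring the model's performance on a validation set. The validation examples are never used for gradient descent (the training phase), but they are not part of the test set either. They are treated as representative of future test examples, and because they are not part of the test set we can use them during training to monitor progress. If the model's performance on the validation set stops improving, or even degrades as training continues, training stops early.
There are many possible stopping criteria; here we use a strategy based on a geometrically increasing amount of patience.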

import copy
import time

# early-stopping parameters
patience = 5000  # look at this many examples regardless
patience_increase = 2     # wait this much longer when a new best is
                              # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience // 2)
                              # go through this many
                              # minibatches before checking the network
                              # on the validation set; in this case we
                              # check every epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.time()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in range(n_train_batches):

        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:

                    patience = max(patience, iter * patience_increase)
                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization

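If we run out of training minibatches before the stopping condition is met (that is, before running out of patience), we simply go back to the beginning of the training set and repeat.

Note: validation_frequency should always be smaller than patience. The code should check the model's performance at least twice before running out of patience. This is why we use the formula validation_frequency = min( value, patience/2.)

Note: when deciding whether to increase patience, using a test of statistical significance instead of a simple comparison might improve the algorithm.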
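
To watch the patience mechanism in isolation, the sketch below replays a fixed sequence of validation losses instead of training a model. It is a simplification under stated assumptions: the validation loss is checked at every iteration (a validation frequency of 1), the loss curve is invented, and the function name `run_early_stopping` and its small initial patience are hypothetical choices for this example.

```python
def run_early_stopping(validation_losses, patience=10, patience_increase=2,
                       improvement_threshold=0.995):
    """Replay one validation loss per iteration.

    Returns (best_loss, best_iter, last_iter).
    """
    best_loss = float('inf')
    best_iter = None
    last_iter = None
    for it, loss in enumerate(validation_losses):
        last_iter = it
        if loss < best_loss:
            # extend patience only for a significant relative improvement
            if loss < best_loss * improvement_threshold:
                patience = max(patience, it * patience_increase)
            best_loss, best_iter = loss, it
        if patience <= it:
            break
    return best_loss, best_iter, last_iter

# invented validation curve: improves for 8 checks, then plateaus
losses = [1.0, 0.8, 0.6, 0.5, 0.45, 0.42, 0.41, 0.40] + [0.40] * 50
best_loss, best_iter, last_iter = run_early_stopping(losses)
print(best_loss, best_iter, last_iter)
```

The last significant improvement happens at iteration 7, which raises the patience to 14. No further improvement follows, so the loop stops at iteration 14 and reports the loss observed at iteration 7 as the best one.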