Classifying MNIST Handwritten Digits with Logistic Regression

Introduction

In this section we use Theano to build the most basic of classifiers: logistic regression.
The full code can be downloaded for free from my CSDN downloads: http://download.csdn.net/detail/ws_20100/9222263
Let us start with the model.


The Model

Logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix W and a bias vector b. The classifier projects input vectors onto a set of hyperplanes, each of which corresponds to a class. The distance from an input to a hyperplane reflects the probability that the input belongs to the corresponding class.
Mathematically, the probability that an input vector x belongs to class i (a value of the stochastic variable Y) is written as:

P(Y=i \mid x, W, b) = \mathrm{softmax}_i(Wx + b) = \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}
The model's prediction y_pred is the class whose probability is maximal, defined as:
y_{pred} = \operatorname{argmax}_i \, P(Y=i \mid x, W, b)
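
To make the formula concrete, here is a tiny NumPy sketch of the same computation (illustrative only; the toy W, b and x below are made-up values, not part of the tutorial code):

import numpy as np

# toy model: 3 input features, 2 classes
W = np.array([[0.1, -0.2],
              [0.0,  0.3],
              [0.5,  0.1]])      # shape (n_in, n_out), one column per class
b = np.array([0.0, 0.1])
x = np.array([1.0, 2.0, 3.0])

scores = x.dot(W) + b                        # one score per class
p = np.exp(scores) / np.exp(scores).sum()    # softmax: probabilities summing to 1
print(p)           # class-membership probabilities P(Y=i|x,W,b)
print(p.argmax())  # y_pred: index of the most probable class
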
The Theano code for building the model is as follows:

# initialize with 0 the weights W as a matrix of shape (n_in, n_out)
self.W = theano.shared(
    value=numpy.zeros(
        (n_in, n_out),
        dtype=theano.config.floatX
    ),
    name='W',
    borrow=True
)
# initialize the biases b as a vector of n_out 0s
self.b = theano.shared(
    value=numpy.zeros(
        (n_out,),
        dtype=theano.config.floatX
    ),
    name='b',
    borrow=True
)

# symbolic expression for computing the matrix of class-membership
# probabilities
# Where:
# W is a matrix where column-k represent the separation hyperplane for
# class-k
# x is a matrix where row-j  represents input training sample-j
# b is a vector where element-k represent the free parameter of
# hyperplane-k
self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

# symbolic description of how to compute prediction as class whose
# probability is maximal
self.y_pred = T.argmax(self.p_y_given_x, axis=1)

Since the parameters of the model must keep a persistent state throughout training, we allocate shared variables for W and b. This declares them as symbolic Theano variables and, at the same time, initializes their contents. The dot and softmax operators are then used to compute the vector P(Y|x,W,b). The result p_y_given_x is a symbolic variable of vector type. To get the actual model prediction, we use the T.argmax operator, which returns the index at which p_y_given_x is maximal (i.e., the class with maximum probability).
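
As a quick illustration of that persistent state (a sketch, assuming a classifier instance has been constructed with n_in=784 and n_out=10 as later in this tutorial), the contents of a shared variable can be read back from Python at any time:

print(classifier.W.get_value().shape)  # (784, 10): one column per class
print(classifier.b.get_value())        # all zeros before training
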
Note that the model defined so far cannot actually do anything useful yet, since all of its parameters are still at their initial values. The next sections explain how to learn the optimal parameters.


Defining a Loss Function

Learning optimal model parameters involves minimizing a loss function. In the case of multi-class classification, it is very natural to use the negative log-likelihood as the loss. This is equivalent to maximizing the likelihood of the data set D under the model parametrized by θ. Let us first define the likelihood L and the loss ℓ:

\mathcal{L}(\theta=\{W,b\}, \mathcal{D}) = \sum_{i=0}^{|\mathcal{D}|} \log\big(P(Y=y^{(i)} \mid x^{(i)}, W, b)\big)
\ell(\theta=\{W,b\}, \mathcal{D}) = -\mathcal{L}(\theta=\{W,b\}, \mathcal{D})
In optimization theory, the simplest method for minimizing an arbitrary non-linear function is gradient descent [1].
Here we use minibatch stochastic gradient descent (MSGD). The Theano code that defines the loss is as follows:

# y.shape[0] is (symbolically) the number of rows in y, i.e.,
# number of examples (call it n) in the minibatch
# T.arange(y.shape[0]) is a symbolic vector which will contain
# [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of
# Log-Probabilities (call it LP) with one row per example and
# one column per class LP[T.arange(y.shape[0]),y] is a vector
# v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ...,
# LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is
# the mean (across minibatch examples) of the elements in v,
# i.e., the mean log-likelihood across the minibatch.
return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])

Even though the loss is formally defined as the sum of individual error terms over the data set, the code uses the mean (T.mean) instead, so that the learning rate is less dependent on the minibatch size.
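
The LP[T.arange(y.shape[0]), y] indexing trick in the snippet above is easier to see in plain NumPy (an illustrative sketch, separate from the tutorial code):

import numpy as np

# fake log-probabilities for a minibatch of 3 examples over 4 classes
LP = np.log(np.array([[0.10, 0.20, 0.30, 0.40],
                      [0.70, 0.10, 0.10, 0.10],
                      [0.25, 0.25, 0.25, 0.25]]))
y = np.array([3, 0, 1])  # correct class for each example

# advanced indexing picks LP[0, y[0]], LP[1, y[1]], LP[2, y[2]]
v = LP[np.arange(y.shape[0]), y]
nll = -v.mean()          # mean negative log-likelihood over the minibatch
print(v)    # [log 0.4, log 0.7, log 0.25]
print(nll)
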


Creating a LogisticRegression Class

We now define a LogisticRegression class that encapsulates all the features of logistic regression described so far. The code below is thoroughly commented.

class LogisticRegression(object):
    """Multi-class Logistic Regression Class

    The logistic regression is fully described by a weight matrix :math:`W`
    and bias vector :math:`b`. Classification is done by projecting data
    points onto a set of hyperplanes, the distance to which is used to
    determine a class membership probability.
    """

    def __init__(self, input, n_in, n_out):
        """ Initialize the parameters of the logistic regression

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
                      architecture (one minibatch)

        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
                     which the datapoints lie

        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
                      which the labels lie

        """
        # start-snippet-1
        # initialize with 0 the weights W as a matrix of shape (n_in, n_out)
        self.W = theano.shared(
            value=numpy.zeros(
                (n_in, n_out),
                dtype=theano.config.floatX
            ),
            name='W',
            borrow=True
        )
        # initialize the biases b as a vector of n_out 0s
        self.b = theano.shared(
            value=numpy.zeros(
                (n_out,),
                dtype=theano.config.floatX
            ),
            name='b',
            borrow=True
        )

        # symbolic expression for computing the matrix of class-membership
        # probabilities
        # Where:
        # W is a matrix where column-k represent the separation hyperplane for
        # class-k
        # x is a matrix where row-j  represents input training sample-j
        # b is a vector where element-k represent the free parameter of
        # hyperplane-k
        self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

        # symbolic description of how to compute prediction as class whose
        # probability is maximal
        self.y_pred = T.argmax(self.p_y_given_x, axis=1)
        # end-snippet-1

        # parameters of the model
        self.params = [self.W, self.b]

        # keep track of model input
        self.input = input

    def negative_log_likelihood(self, y):
        """Return the mean of the negative log-likelihood of the prediction
        of this model under a given target distribution.

        .. math::

            \frac{1}{|\mathcal{D}|} \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) =
            \frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|}
                \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
            \ell (\theta=\{W,b\}, \mathcal{D})

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label

        Note: we use the mean instead of the sum so that
              the learning rate is less dependent on the batch size
        """
        # start-snippet-2
        # y.shape[0] is (symbolically) the number of rows in y, i.e.,
        # number of examples (call it n) in the minibatch
        # T.arange(y.shape[0]) is a symbolic vector which will contain
        # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of
        # Log-Probabilities (call it LP) with one row per example and
        # one column per class LP[T.arange(y.shape[0]),y] is a vector
        # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ...,
        # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is
        # the mean (across minibatch examples) of the elements in v,
        # i.e., the mean log-likelihood across the minibatch.
        return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
        # end-snippet-2

    def errors(self, y):
        """Return a float representing the number of errors in the minibatch
        over the total number of examples of the minibatch ; zero one
        loss over the size of the minibatch

        :type y: theano.tensor.TensorType
        :param y: corresponds to a vector that gives for each example the
                  correct label
        """

        # check if y has same dimension of y_pred
        if y.ndim != self.y_pred.ndim:
            raise TypeError(
                'y should have the same shape as self.y_pred',
                ('y', y.type, 'y_pred', self.y_pred.type)
            )
        # check if y is of the correct datatype
        if y.dtype.startswith('int'):
            # the T.neq operator returns a vector of 0s and 1s, where 1
            # represents a mistake in prediction
            return T.mean(T.neq(self.y_pred, y))
        else:
            raise NotImplementedError()

How is a LogisticRegression instance created? See the following code:

# generate symbolic variables for input (x and y represent a
# minibatch)
x = T.matrix('x')  # data, presented as rasterized images
y = T.ivector('y')  # labels, presented as 1D vector of [int] labels

# construct the logistic regression class
# Each MNIST image has size 28*28
classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10)

The code above first defines symbolic variables for the input x and for the corresponding labels y. Note that x and y are defined outside the LogisticRegression object. Since the class takes the input as a parameter of its __init__ function, this is useful when you want to chain such instances to form a deep network: the output of one layer can serve as the input of the next. (We do not build a multi-layer network here, but the code is reusable in one; see the hypothetical sketch below.)
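
As a purely hypothetical illustration of such chaining (the HiddenLayer class here is an assumption, not defined in this tutorial; the MLP tutorial at deeplearning.net defines one like it):

# hypothetical: a hidden layer whose symbolic output feeds the classifier
hidden = HiddenLayer(input=x, n_in=28 * 28, n_out=500)
classifier = LogisticRegression(input=hidden.output, n_in=500, n_out=10)
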
Finally, we define a (symbolic) cost variable to be minimized, using the instance method classifier.negative_log_likelihood:

# the cost we minimize during training is the negative log likelihood of
# the model in symbolic format
cost = classifier.negative_log_likelihood(y)

Note that the definition of cost takes an implicit symbolic input x, because the symbolic variables of classifier were defined in terms of x at initialization.


Learning the Model

To implement MSGD in most programming languages (C/C++, Matlab, Python), one would start by manually deriving the expressions for the gradients of the loss with respect to the parameters: ∂ℓ/∂W and ∂ℓ/∂b. For complex models this can get fairly non-trivial, as the expressions for ∂ℓ/∂θ can become quite complicated, especially when taking numerical stability into account.
With Theano, this work is greatly simplified. Theano performs automatic differentiation and applies certain mathematical transforms to improve numerical stability.
To obtain the gradients ∂ℓ/∂W and ∂ℓ/∂b in Theano, all that is needed is the following:

g_W = T.grad(cost=cost, wrt=classifier.W)
g_b = T.grad(cost=cost, wrt=classifier.b)
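
As a quick aside, T.grad differentiates any scalar expression with respect to its inputs; a minimal standalone sketch (separate from the tutorial code):

import theano
import theano.tensor as T

a = T.dscalar('a')
ga = T.grad(a ** 2, wrt=a)    # symbolic gradient of a**2, i.e. 2*a
f = theano.function([a], ga)
print(f(4.0))                 # prints 8.0
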

g_W and g_b are symbolic variables that can be used in further computations. The function train_model, which performs one step of gradient descent, can then be defined as follows:

# specify how to update the parameters of the model as a list of
# (variable, update expression) pairs.
updates = [(classifier.W, classifier.W - learning_rate * g_W),
           (classifier.b, classifier.b - learning_rate * g_b)]

# compiling a Theano function `train_model` that returns the cost, but in
# the same time updates the parameter of the model based on the rules
# defined in `updates`
train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

updates is a list of pairs. In each pair, the first element is the symbolic variable to be updated and the second element is the symbolic expression for computing its new value. Similarly, givens is a dictionary whose keys are symbolic variables and whose values specify their replacements. The function train_model is defined such that:

  • the input is the minibatch index index which, together with the batch size (which is not an input, since it is fixed), defines x with its corresponding labels y;
  • the return value is the cost/loss associated with the x, y defined by the index;
  • on every function call, it first replaces x and y with the slices of the training set specified by index; it then evaluates the cost associated with that minibatch and applies the operations defined by the updates list.

Each time train_model(index) is called, it computes and returns the cost of the corresponding minibatch while also performing a step of MSGD. The entire learning algorithm thus consists of looping over all examples in the dataset, considering all the examples in one minibatch at a time, and repeatedly calling the train_model function; a minimal sketch of that loop follows.
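
A minimal sketch of the outer training loop (assuming n_epochs and n_train_batches have been computed from the dataset; the full downloadable code additionally stops early based on validation error):

for epoch in range(n_epochs):
    for minibatch_index in range(n_train_batches):
        # one MSGD step on this minibatch; returns its average cost
        minibatch_avg_cost = train_model(minibatch_index)
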


Testing the Model

When testing the model, we are interested in the number of misclassified examples. The LogisticRegression class therefore has an extra instance method that builds the symbolic graph for the zero-one loss, i.e., the fraction of misclassified examples in each minibatch. The code is as follows:

def errors(self, y):
    """Return a float representing the number of errors in the minibatch
    over the total number of examples of the minibatch ; zero one
    loss over the size of the minibatch

    :type y: theano.tensor.TensorType
    :param y: corresponds to a vector that gives for each example the
              correct label
    """

    # check if y has same dimension of y_pred
    if y.ndim != self.y_pred.ndim:
        raise TypeError(
            'y should have the same shape as self.y_pred',
            ('y', y.type, 'y_pred', self.y_pred.type)
        )
    # check if y is of the correct datatype
    if y.dtype.startswith('int'):
        # the T.neq operator returns a vector of 0s and 1s, where 1
        # represents a mistake in prediction
        return T.mean(T.neq(self.y_pred, y))
    else:
        raise NotImplementedError()

We then create the functions test_model and validate_model, which we can call to retrieve this misclassification figure. As you will see shortly, validate_model is key to the early-stopping mechanism. Both functions take a minibatch index and compute, for the examples in the corresponding minibatch, the fraction that are misclassified. The only difference between them is that test_model draws its minibatches from the test set, while validate_model draws them from the validation set.

# compiling a Theano function that computes the mistakes that are made by
# the model on a minibatch
test_model = theano.function(
    inputs=[index],
    outputs=classifier.errors(y),
    givens={
        x: test_set_x[index * batch_size: (index + 1) * batch_size],
        y: test_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

validate_model = theano.function(
    inputs=[index],
    outputs=classifier.errors(y),
    givens={
        x: valid_set_x[index * batch_size: (index + 1) * batch_size],
        y: valid_set_y[index * batch_size: (index + 1) * batch_size]
    }
)
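
A typical use (an illustrative sketch, assuming n_valid_batches has been computed from the validation set size and batch_size) averages the per-minibatch errors to obtain the error over the whole validation set:

validation_losses = [validate_model(i) for i in range(n_valid_batches)]
this_validation_loss = numpy.mean(validation_losses)
print('validation error %f %%' % (this_validation_loss * 100.))
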

Putting It All Together

The final code is as follows. [Download: http://download.csdn.net/detail/ws_20100/9222263]
The user can learn to classify MNIST digits with SGD logistic regression by typing:

python code/logistic_sgd.py

The output should look something like this:

...
epoch 72, minibatch 83/83, validation error 7.510417 %
     epoch 72, minibatch 83/83, test error of best model 7.510417 %
epoch 73, minibatch 83/83, validation error 7.500000 %
     epoch 73, minibatch 83/83, test error of best model 7.489583 %
Optimization complete with best validation score of 7.500000 %,with test performance 7.489583 %
The code run for 74 epochs, with 1.936983 epochs/sec

On an Intel Core 2 Duo CPU E8400 @ 3.00 GHz, the code runs at roughly 1.936 epochs/sec and reaches a test error of 7.489% after 75 epochs. On a GPU, it runs at roughly 10.0 epochs/sec.


Making Predictions with a Trained Model

Once training has reached the lowest validation error, we can reload the saved model and use it to predict labels for new data. The predict function shows how:

import cPickle  # the tutorial targets Python 2; on Python 3 use pickle

import theano

# Note: load_data is defined earlier in logistic_sgd.py, where this
# function lives.


def predict():
    """
    An example of how to load a trained model and use it
    to predict labels.
    """

    # load the saved model
    classifier = cPickle.load(open('best_model.pkl'))

    # compile a predictor function
    predict_model = theano.function(
        inputs=[classifier.input],
        outputs=classifier.y_pred)

    # We can test it on some examples from the test set
    dataset='mnist.pkl.gz'
    datasets = load_data(dataset)
    test_set_x, test_set_y = datasets[2]
    test_set_x = test_set_x.get_value()

    predicted_values = predict_model(test_set_x[:10])
    print("Predicted values for the first 10 examples in test set:")
    print(predicted_values)
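
For this load to work, the training code must have pickled the classifier whenever a new best validation error was found. A minimal sketch of that save step (an assumption about where it sits; the downloadable code performs it inside its early-stopping logic):

# inside the training loop, after a new best validation error
with open('best_model.pkl', 'wb') as f:
    cPickle.dump(classifier, f)
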

References

Theano deep learning tutorial: http://deeplearning.net/tutorial/logreg.html


[Footnote]
[1] For smaller datasets and simpler models, more sophisticated descent algorithms can be more effective. The logistic_cg.py code demonstrates how to use SciPy's conjugate gradient solver for the logistic regression task; it can be downloaded for free from my CSDN downloads: http://download.csdn.net/detail/ws_20100/9223959
