Python ML: Recognizing Images with Artificial Neural Networks

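Deep learning can be viewed as a set of algorithms for efficiently training artificial neural networks with many layers.

The main topics are:

Understanding multilayer neural networks
Training a neural network for image classification
Implementing the backpropagation algorithm
Debugging a neural network implementation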

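I. Modeling complex functions with artificial neural networks

1. Single-layer neural networks

Before looking at multilayer network architectures, let us briefly revisit the concepts behind a single-layer neural network, namely the adaptive linear neuron (Adaline) algorithm.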

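We implemented the Adaline algorithm for binary classification and used gradient descent to learn the model's weight coefficients. In every epoch (pass over the training set), we update the weight vector w with the following rule:

$$w := w + \Delta w, \quad \Delta w = -\eta \nabla J(w)$$

In other words, we compute the gradient on the whole training set and update the weights by taking a step in the direction opposite to the gradient ∇J(w). To find the optimal weights, we minimize an objective function defined as the sum of squared errors (SSE) cost function J(w). In addition, we multiply the gradient by a carefully chosen factor, the learning rate η, which balances the speed of learning against the risk of overshooting the global minimum of the cost function.

In gradient descent optimization, we update all weights simultaneously after each epoch. The partial derivative with respect to each weight w_j in the weight vector w is:

$$\frac{\partial}{\partial w_j} J(w) = -\sum_i \left(y^{(i)} - a^{(i)}\right) x_j^{(i)}$$

Here y^{(i)} is the target class label of the i-th sample, a^{(i)} is the activation of the neuron (a linear function in Adaline), and x_j^{(i)} is the j-th feature of that sample.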
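The batch gradient descent update described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `adaline_gd`, the toy data, and the hyperparameter values are assumptions and do not come from the original post.

```python
import numpy as np

def adaline_gd(X, y, eta=0.01, n_iter=50):
    """Batch gradient descent for Adaline (hypothetical helper for illustration)."""
    w = np.zeros(X.shape[1] + 1)        # w[0] is the bias weight
    costs = []
    for _ in range(n_iter):
        output = X.dot(w[1:]) + w[0]    # linear activation
        errors = y - output
        # the gradient of the SSE cost is -X^T (y - output); step against it
        w[1:] += eta * X.T.dot(errors)
        w[0] += eta * errors.sum()
        costs.append(0.5 * (errors ** 2).sum())
    return w, costs

# toy data: two linearly separable classes labelled -1 and 1
X_toy = np.array([[0.0], [1.0], [3.0], [4.0]])
y_toy = np.array([-1.0, -1.0, 1.0, 1.0])
w_toy, costs_toy = adaline_gd(X_toy, y_toy, eta=0.01, n_iter=200)
pred_toy = np.where(X_toy.dot(w_toy[1:]) + w_toy[0] >= 0.0, 1, -1)
```

All weights are updated together from one gradient computed on the full training set, which is what distinguishes this from the minibatch learning used later for the MLP.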

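2. Introducing the multilayer neural network architecture

Next we will see how to connect multiple single neurons into a multilayer feedforward neural network. This special type of network is also called a multilayer perceptron (MLP). A three-layer MLP consists of one input layer, one hidden layer, and one output layer. The units in the hidden layer are fully connected to the input layer, and the units in the output layer are fully connected to the hidden layer. If a network has more than one hidden layer, we call it a deep artificial neural network.

We denote the i-th activation unit in the l-th layer as a_i^{(l)}. The units a_0^{(1)} and a_0^{(2)} are the bias units, which we set to 1. The activation of the units in the input layer is simply the input plus the bias unit:

$$a^{(1)} = \left[a_0^{(1)}, a_1^{(1)}, \dots, a_m^{(1)}\right]^T = \left[1, x_1^{(i)}, \dots, x_m^{(i)}\right]^T$$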

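3. Activating a neural network via forward propagation

We will use forward propagation to compute the output of a multilayer perceptron (MLP) model. To see how forward propagation fits into learning an MLP model, we can summarize the learning procedure in three simple steps:

Starting at the input layer, propagate the patterns of the training data forward through the network to generate an output
Based on the network's output, compute the error to be minimized using a cost function
Backpropagate the error, compute its derivative with respect to every weight in the network, and update the model

To solve complex problems such as image classification, the MLP needs nonlinear activation functions, for example the sigmoid activation function used in logistic regression.

Recall that the sigmoid function is an S-shaped curve that maps the net input z to a value in the open interval (0, 1). The curve crosses the y-axis at 0.5, that is, the output is 0.5 when z = 0.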
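A forward pass through a network with one input, one hidden, and one output layer might look like the sketch below. The function names, layer sizes, and random weights are assumptions chosen for illustration, and the arrays use one row per sample, which differs from the column-per-sample layout of the class implemented later in this post.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    """Forward pass of an input-hidden-output MLP (one row per sample)."""
    a1 = np.hstack([np.ones((X.shape[0], 1)), X])       # add bias unit a_0 = 1
    z2 = a1.dot(W1.T)                                    # net input of the hidden layer
    a2 = np.hstack([np.ones((X.shape[0], 1)), sigmoid(z2)])
    z3 = a2.dot(W2.T)                                    # net input of the output layer
    return sigmoid(z3)                                   # output activations

rng = np.random.RandomState(0)
X_demo = rng.rand(4, 3)                     # 4 samples, 3 features
W1_demo = rng.uniform(-1, 1, (5, 3 + 1))    # 5 hidden units, +1 column for the bias
W2_demo = rng.uniform(-1, 1, (2, 5 + 1))    # 2 output units, +1 column for the bias
out_demo = forward(X_demo, W1_demo, W2_demo)
```

Every output lies strictly between 0 and 1, and `sigmoid(0.0)` evaluates to 0.5, matching the description of the S-shaped curve.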

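II. Classifying handwritten digits

Neural network theory can be quite involved, so two additional references are recommended.

We will train our first multilayer neural network to recognize handwritten digits from the MNIST dataset (short for Modified National Institute of Standards and Technology). MNIST is a popular benchmark dataset for machine learning algorithms.

1. Obtaining the MNIST dataset

The MNIST dataset is available for download and consists of the following four parts:

Training set images: train-images-idx3-ubyte.gz (9.45 MB, 44.8 MB unzipped, 60,000 samples)
Training set labels: train-labels-idx1-ubyte.gz (29 KB, 59 KB unzipped, 60,000 labels)
Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.5 MB unzipped, 10,000 samples)
Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, 10,000 labels)

We read the MNIST dataset into NumPy arrays in order to train and test the MLP model: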

import os
import struct
import numpy as np
import gzip
 
def load_mnist(path, kind='train'):
    """Load MNIST data from path"""
    labels_path = os.path.join(path, 
                               '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, 
                               '%s-images-idx3-ubyte.gz' % kind)
        
    with gzip.open(labels_path, 'rb') as lbpath:
        lbpath.read(8)
        buffer = lbpath.read()
        labels = np.frombuffer(buffer, dtype=np.uint8)

    with gzip.open(images_path, 'rb') as imgpath:
        imgpath.read(16)
        buffer = imgpath.read()
        images = np.frombuffer(buffer, 
                               dtype=np.uint8).reshape(
            len(labels), 784).astype(np.float64)
 
    return images, labels

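The load_mnist function returns two arrays. The first is an n × m NumPy array (the images), where n is the number of samples and m the number of features. The training and test sets contain 60,000 and 10,000 samples respectively. MNIST images are 28 × 28 pixels, and each pixel is represented by a grayscale intensity value; load_mnist flattens every image into a row vector with 28 × 28 = 784 entries. The second array (the labels) holds the target variable, the class label (an integer from 0 to 9) of each handwritten digit.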
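load_mnist skips the file headers with `read(8)` and `read(16)` and assumes 784 pixels per image. As an alternative sketch, the header of an image file can be parsed with `struct` so that the dimensions are read from the file itself. The helper name `read_idx_images` and the synthetic in-memory file are assumptions for illustration; the header layout (magic number 2051, then image count, rows, and columns as big-endian 32-bit integers) is that of the IDX3 image files.

```python
import gzip
import io
import struct
import numpy as np

def read_idx_images(fileobj):
    """Read an IDX3 image file object and return an (n, rows*cols) uint8 array."""
    # header: magic number, image count, rows, cols as big-endian unsigned 32-bit ints
    magic, n, rows, cols = struct.unpack('>IIII', fileobj.read(16))
    if magic != 2051:
        raise ValueError('unexpected magic number: %d' % magic)
    data = np.frombuffer(fileobj.read(), dtype=np.uint8)
    return data.reshape(n, rows * cols)

# synthetic stand-in for a real file: 2 images of 28x28 pixels
pixels = (np.arange(2 * 28 * 28) % 256).astype(np.uint8)
raw = struct.pack('>IIII', 2051, 2, 28, 28) + pixels.tobytes()
with gzip.open(io.BytesIO(gzip.compress(raw)), 'rb') as f:
    imgs = read_idx_images(f)
```

Checking the magic number gives an early error if a label file is passed where an image file is expected.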

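We read the 60,000 training instances and 10,000 test instances from the directory ./data/mnist/. Because load_mnist opens the files with gzip.open, this directory must contain the downloaded .gz files; they do not need to be unzipped first.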

X_train, y_train = load_mnist('./data/mnist/', kind='train')
print('Rows: %d, columns: %d' % (X_train.shape[0], X_train.shape[1]))

Rows: 60000, columns: 784

X_test,y_test = load_mnist('./data/mnist/', kind='t10k')
print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1]))

Rows: 10000, columns: 784

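To get an idea of what the MNIST images look like, we reshape the 784-pixel vectors in the feature matrix back into 28 × 28 images and display one example of each digit 0-9 with matplotlib's imshow function: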

import matplotlib.pyplot as plt

fig,ax=plt.subplots(nrows=2,ncols=5,sharex=True,sharey=True)

ax=ax.flatten()

for i in range(10):
    img=X_train[y_train==i][0].reshape(28,28)
    ax[i].imshow(img,cmap='Greys',interpolation='nearest')
    
ax[0].set_xticks([])
ax[0].set_yticks([])

plt.tight_layout()

plt.show()

Let us also plot several examples of the same digit to see how much these handwriting samples differ from one another:

fig,ax=plt.subplots(nrows=5,ncols=5,
                   sharex=True,sharey=True)

ax=ax.flatten()

for i in range(25):
    img=X_train[y_train==7][i].reshape(28,28)
    ax[i].imshow(img,cmap='Greys',interpolation='nearest')
    
ax[0].set_xticks([])
ax[0].set_yticks([])

plt.tight_layout()

plt.show()

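Optionally, we can save the MNIST image data and the corresponding labels as CSV files so that programs without support for the original byte format can use them.

Once the MNIST data has been loaded into NumPy arrays, the following code saves it as CSV files: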

np.savetxt('train_imgs.csv',X_train,fmt='%i',delimiter=',')

np.savetxt('train_labels.csv',y_train,fmt='%i',delimiter=',')

np.savetxt('test_imgs.csv',X_test,fmt='%i',delimiter=',')

np.savetxt('test_labels.csv',y_test,fmt='%i',delimiter=',')

Previously saved CSV files can be loaded again with NumPy's genfromtxt function:

X_train=np.genfromtxt('train_imgs.csv',dtype=int,delimiter=',')

y_train=np.genfromtxt('train_labels.csv',dtype=int,delimiter=',')

X_test=np.genfromtxt('test_imgs.csv',dtype=int,delimiter=',')

y_test=np.genfromtxt('test_labels.csv',dtype=int,delimiter=',')

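However, loading MNIST from CSV takes considerably longer, so it is advisable to stick with the original byte format whenever possible.

2. Implementing a multilayer perceptron

We now implement an MLP with one input layer, one hidden layer, and one output layer, and use it to classify the images in the MNIST dataset.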

import numpy as np
from scipy.special import expit
import sys


class NeuralNetMLP(object):
    """ Feedforward neural network / Multi-layer perceptron classifier.

    Parameters
    ------------
    n_output : int
        Number of output units, should be equal to the
        number of unique class labels.
    n_features : int
        Number of features (dimensions) in the target dataset.
        Should be equal to the number of columns in the X array.
    n_hidden : int (default: 30)
        Number of hidden units.
    l1 : float (default: 0.0)
        Lambda value for L1-regularization.
        No regularization if l1=0.0 (default)
    l2 : float (default: 0.0)
        Lambda value for L2-regularization.
        No regularization if l2=0.0 (default)
    epochs : int (default: 500)
        Number of passes over the training set.
    eta : float (default: 0.001)
        Learning rate.
    alpha : float (default: 0.0)
        Momentum constant. Factor multiplied with the
        gradient of the previous epoch t-1 to improve
        learning speed
        w(t) := w(t) - (grad(t) + alpha*grad(t-1))
    decrease_const : float (default: 0.0)
        Decrease constant. Shrinks the learning rate
        after each epoch via eta / (1 + epoch*decrease_const)
    shuffle : bool (default: True)
        Shuffles training data every epoch if True to prevent cycles.
    minibatches : int (default: 1)
        Divides training data into k minibatches for efficiency.
        Normal gradient descent learning if k=1 (default).
    random_state : int (default: None)
        Set random state for shuffling and initializing the weights.

    Attributes
    -----------
    cost_ : list
      Logistic cost of each minibatch, appended in training order.

    """
    def __init__(self, n_output, n_features, n_hidden=30,
                 l1=0.0, l2=0.0, epochs=500, eta=0.001,
                 alpha=0.0, decrease_const=0.0, shuffle=True,
                 minibatches=1, random_state=None):

        np.random.seed(random_state)
        self.n_output = n_output
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.w1, self.w2 = self._initialize_weights()
        self.l1 = l1
        self.l2 = l2
        self.epochs = epochs
        self.eta = eta
        self.alpha = alpha
        self.decrease_const = decrease_const
        self.shuffle = shuffle
        self.minibatches = minibatches

    def _encode_labels(self, y, k):
        """Encode labels into one-hot representation

        Parameters
        ------------
        y : array, shape = [n_samples]
            Target values.

        Returns
        -----------
        onehot : array, shape = (n_labels, n_samples)

        """
        onehot = np.zeros((k, y.shape[0]))
        for idx, val in enumerate(y):
            onehot[val, idx] = 1.0
        return onehot

    def _initialize_weights(self):
        """Initialize weights with small random numbers."""
        w1 = np.random.uniform(-1.0, 1.0,
                               size=self.n_hidden*(self.n_features + 1))
        w1 = w1.reshape(self.n_hidden, self.n_features + 1)
        w2 = np.random.uniform(-1.0, 1.0,
                               size=self.n_output*(self.n_hidden + 1))
        w2 = w2.reshape(self.n_output, self.n_hidden + 1)
        return w1, w2

    def _sigmoid(self, z):
        """Compute logistic function (sigmoid)

        Uses scipy.special.expit to avoid overflow
        error for very small input values z.

        """
        # return 1.0 / (1.0 + np.exp(-z))
        return expit(z)

    def _sigmoid_gradient(self, z):
        """Compute gradient of the logistic function"""
        sg = self._sigmoid(z)
        return sg * (1.0 - sg)

    def _add_bias_unit(self, X, how='column'):
        """Add bias unit (column or row of 1s) to array at index 0"""
        if how == 'column':
            X_new = np.ones((X.shape[0], X.shape[1] + 1))
            X_new[:, 1:] = X
        elif how == 'row':
            X_new = np.ones((X.shape[0] + 1, X.shape[1]))
            X_new[1:, :] = X
        else:
            raise AttributeError('`how` must be `column` or `row`')
        return X_new

    def _feedforward(self, X, w1, w2):
        """Compute feedforward step

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.
        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.
        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ----------
        a1 : array, shape = [n_samples, n_features+1]
            Input values with bias unit.
        z2 : array, shape = [n_hidden, n_samples]
            Net input of hidden layer.
        a2 : array, shape = [n_hidden+1, n_samples]
            Activation of hidden layer.
        z3 : array, shape = [n_output_units, n_samples]
            Net input of output layer.
        a3 : array, shape = [n_output_units, n_samples]
            Activation of output layer.

        """
        a1 = self._add_bias_unit(X, how='column')
        z2 = w1.dot(a1.T)
        a2 = self._sigmoid(z2)
        a2 = self._add_bias_unit(a2, how='row')
        z3 = w2.dot(a2)
        a3 = self._sigmoid(z3)
        return a1, z2, a2, z3, a3

    def _L2_reg(self, lambda_, w1, w2):
        """Compute L2-regularization cost"""
        return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2) +
                                np.sum(w2[:, 1:] ** 2))

    def _L1_reg(self, lambda_, w1, w2):
        """Compute L1-regularization cost"""
        return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum() +
                                np.abs(w2[:, 1:]).sum())

    def _get_cost(self, y_enc, output, w1, w2):
        """Compute cost function.

        Parameters
        ----------
        y_enc : array, shape = (n_labels, n_samples)
            one-hot encoded class labels.
        output : array, shape = [n_output_units, n_samples]
            Activation of the output layer (feedforward)
        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.
        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ---------
        cost : float
            Regularized cost.

        """
        term1 = -y_enc * (np.log(output))
        term2 = (1.0 - y_enc) * np.log(1.0 - output)
        cost = np.sum(term1 - term2)
        L1_term = self._L1_reg(self.l1, w1, w2)
        L2_term = self._L2_reg(self.l2, w1, w2)
        cost = cost + L1_term + L2_term
        return cost

    def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2):
        """ Compute gradient step using backpropagation.

        Parameters
        ------------
        a1 : array, shape = [n_samples, n_features+1]
            Input values with bias unit.
        a2 : array, shape = [n_hidden+1, n_samples]
            Activation of hidden layer.
        a3 : array, shape = [n_output_units, n_samples]
            Activation of output layer.
        z2 : array, shape = [n_hidden, n_samples]
            Net input of hidden layer.
        y_enc : array, shape = (n_labels, n_samples)
            one-hot encoded class labels.
        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.
        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ---------
        grad1 : array, shape = [n_hidden_units, n_features]
            Gradient of the weight matrix w1.
        grad2 : array, shape = [n_output_units, n_hidden_units]
            Gradient of the weight matrix w2.

        """
        # backpropagation
        sigma3 = a3 - y_enc
        z2 = self._add_bias_unit(z2, how='row')
        sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2)
        sigma2 = sigma2[1:, :]
        grad1 = sigma2.dot(a1)
        grad2 = sigma3.dot(a2.T)

        # regularize
        grad1[:, 1:] += self.l2 * w1[:, 1:]
        grad1[:, 1:] += self.l1 * np.sign(w1[:, 1:])
        grad2[:, 1:] += self.l2 * w2[:, 1:]
        grad2[:, 1:] += self.l1 * np.sign(w2[:, 1:])

        return grad1, grad2

    def predict(self, X):
        """Predict class labels

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.

        Returns:
        ----------
        y_pred : array, shape = [n_samples]
            Predicted class labels.

        """
        if len(X.shape) != 2:
            raise AttributeError('X must be a [n_samples, n_features] array.\n'
                                 'Use X[:,None] for 1-feature classification,'
                                 '\nor X[[i]] for 1-sample classification')

        a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2)
        y_pred = np.argmax(z3, axis=0)
        return y_pred

    def fit(self, X, y, print_progress=False):
        """ Learn weights from training data.

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.
        y : array, shape = [n_samples]
            Target class labels.
        print_progress : bool (default: False)
            Prints progress as the number of epochs
            to stderr.

        Returns:
        ----------
        self

        """
        self.cost_ = []
        X_data, y_data = X.copy(), y.copy()
        y_enc = self._encode_labels(y, self.n_output)

        delta_w1_prev = np.zeros(self.w1.shape)
        delta_w2_prev = np.zeros(self.w2.shape)

        for i in range(self.epochs):

            # adaptive learning rate
            self.eta /= (1 + self.decrease_const*i)

            if print_progress:
                sys.stderr.write('\rEpoch: %d/%d' % (i+1, self.epochs))
                sys.stderr.flush()

            if self.shuffle:
                idx = np.random.permutation(y_data.shape[0])
                X_data, y_enc = X_data[idx], y_enc[:, idx]

            mini = np.array_split(range(y_data.shape[0]), self.minibatches)
            for idx in mini:

                # feedforward
                a1, z2, a2, z3, a3 = self._feedforward(X_data[idx],
                                                       self.w1,
                                                       self.w2)
                cost = self._get_cost(y_enc=y_enc[:, idx],
                                      output=a3,
                                      w1=self.w1,
                                      w2=self.w2)
                self.cost_.append(cost)

                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(a1=a1, a2=a2,
                                                  a3=a3, z2=z2,
                                                  y_enc=y_enc[:, idx],
                                                  w1=self.w1,
                                                  w2=self.w2)

                delta_w1, delta_w2 = self.eta * grad1, self.eta * grad2
                self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev))
                delta_w1_prev, delta_w2_prev = delta_w1, delta_w2

        return self

Now we initialize a 784-50-10 MLP, a neural network with 784 input units (n_features), 50 hidden units (n_hidden), and 10 output units (n_output):

nn=NeuralNetMLP(n_output=10,n_features=X_train.shape[1],
               n_hidden=50,
               l2=0.1,l1=0.0,
               epochs=1000,
               eta=0.001,
               decrease_const=0.00001,shuffle=True,
               minibatches=50,random_state=1)

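Some of the parameters deserve an explanation:

l2: the λ parameter for L2 regularization, which reduces the degree of overfitting; likewise, l1 is the λ parameter for L1 regularization
epochs: the number of passes over the training set
eta: the learning rate η
alpha: the momentum parameter, which adds a fraction of the previous weight update to the current one to speed up learning
decrease_const: the decrease constant d of the adaptive learning rate, which shrinks η as the number of epochs grows in order to help convergence
shuffle: whether to shuffle the training set before every epoch, which prevents the algorithm from getting stuck in cycles
minibatches: the number k of minibatches the training set is split into in every epoch; to speed up learning, the gradient is computed for each minibatch separately rather than on the whole training set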
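To make decrease_const and minibatches more concrete, the sketch below replays the two relevant lines of fit() outside the class, using the values passed to the model above. One detail worth noting: fit() divides self.eta in place on every epoch, so the reduction compounds across epochs instead of being recomputed from the initial learning rate each time. The variable names here are illustrative only.

```python
import numpy as np

eta, d = 0.001, 0.00001           # eta and decrease_const used for the model above
etas = []
for epoch in range(1000):
    eta /= (1 + d * epoch)        # same in-place update as in fit()
    etas.append(eta)

# index split used for minibatch learning: 60000 samples into 50 batches
batches = np.array_split(np.arange(60000), 50)
batch_sizes = [len(b) for b in batches]
```

With these settings every minibatch holds 1200 samples, and after 1000 epochs the learning rate has shrunk by more than a factor of 100.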

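Next we train the MLP on the 60,000 samples of the MNIST training set, which fit() reshuffles before every epoch. Running the following code takes roughly 10 to 30 minutes: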

nn.fit(X_train,y_train,print_progress=True)

Epoch: 1000/1000

<__main__.NeuralNetMLP at 0x26f90c33d30>

As the plot below shows, the cost curve is visibly noisy. This is because we trained the network with minibatch learning, a variant of stochastic gradient descent.

plt.plot(range(len(nn.cost_)),nn.cost_)

plt.ylim([0,2000])
plt.ylabel('Cost')
plt.xlabel('Epochs*50')

plt.tight_layout()

plt.show()

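The plot above already suggests that the optimizer converges after roughly 800 epochs (40000/50 = 800). To obtain a smoother version of the cost curve, we average the cost over the minibatches of each epoch: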

batches=np.array_split(range(len(nn.cost_)),1000)

cost_ary=np.array(nn.cost_)
cost_avgs=[np.mean(cost_ary[i]) for i in batches]

plt.plot(range(len(cost_avgs)),cost_avgs,color='red')

plt.ylim([0,2000])
plt.ylabel('Cost')
plt.xlabel('Epochs')

plt.tight_layout()

plt.show()

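The plot above shows clearly that the training algorithm converges shortly after the 800th epoch.

We now evaluate the model by computing its prediction accuracy: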

y_train_pred=nn.predict(X_train)

acc=np.sum(y_train==y_train_pred,axis=0)/X_train.shape[0]

print('Training accuracy:%.2f%%'%(acc*100))

Training accuracy:97.65%

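The model classifies most of the training digits correctly, but how well does it generalize to data it has not seen before? Let us compute the accuracy on the 10,000 images of the test set: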

y_test_pred=nn.predict(X_test)

acc=np.sum(y_test==y_test_pred,axis=0)/X_test.shape[0]

print('Testing accuracy:%.2f%%'%(acc*100))

Testing accuracy:95.83%

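Because the accuracy on the training set and on the test set differ only slightly, we can conclude that the model overfits the training data only slightly. To fine-tune the model further, we could change the number of hidden units, the regularization parameters, the learning rate, or the decrease constant.

Now let us look at some of the images the MLP struggles with: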

miscl_img=X_test[y_test!=y_test_pred][:25]
correct_lab=y_test[y_test!=y_test_pred][:25]

miscl_lab=y_test_pred[y_test!=y_test_pred][:25]

fig,ax=plt.subplots(nrows=5,ncols=5,sharex=True,sharey=True)

ax=ax.flatten()

for i in range(25):
    img=miscl_img[i].reshape(28,28)
    ax[i].imshow(img,cmap='Greys',interpolation='nearest')
    
    ax[i].set_title('%d) t:%d p:%d'%(i+1,correct_lab[i],miscl_lab[i]))
    
ax[0].set_xticks([])
ax[0].set_yticks([])

plt.tight_layout()

plt.show()

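The code produces a 5 × 5 matrix of subplots. In each subplot title, the first number is the plot index, the second is the true class label (t), and the third is the predicted class label (p). Some of these images are hard to classify correctly even for a human.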
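Beyond inspecting individual images, the same (true label, predicted label) pairs can be tallied to see which digits get confused with which. The sketch below is an assumption-laden illustration: `confusion_matrix` is a hypothetical helper written here in plain NumPy, and the short toy arrays stand in for y_test and y_test_pred.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """Count matrix with true labels as rows and predicted labels as columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # accumulates repeated index pairs correctly
    return cm

# toy labels standing in for y_test and y_test_pred
y_true_demo = np.array([0, 1, 2, 2, 7, 9, 9])
y_pred_demo = np.array([0, 1, 2, 7, 7, 4, 9])
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
```

Off-diagonal entries count the misclassifications; in the toy data, one 2 is predicted as 7 and one 9 as 4.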

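III. Training an artificial neural network

We now dig deeper into some of the concepts involved, such as the logistic cost function used to learn the weights and the backpropagation algorithm.

1. The logistic cost function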
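The unregularized part of the cost that `_get_cost` computes in the implementation above sums the logistic cost over all output units and samples. The sketch below isolates that computation; the function name `logistic_cost` and the toy activations are assumptions for illustration.

```python
import numpy as np

def logistic_cost(y_enc, output):
    """Unregularized logistic cost summed over all output units and samples."""
    term1 = -y_enc * np.log(output)
    term2 = (1.0 - y_enc) * np.log(1.0 - output)
    return np.sum(term1 - term2)

# one-hot targets for 3 classes and 2 samples, shape (n_labels, n_samples)
y_enc_demo = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.0, 0.0]])
good = np.array([[0.9, 0.1], [0.1, 0.8], [0.1, 0.1]])   # confident and correct
bad = np.array([[0.2, 0.7], [0.7, 0.2], [0.1, 0.1]])    # confident and wrong
cost_good = logistic_cost(y_enc_demo, good)
cost_bad = logistic_cost(y_enc_demo, bad)
```

Confidently wrong activations are penalized more heavily than confidently correct ones, so `cost_bad` exceeds `cost_good`.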

2. Training neural networks via backpropagation

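IV. Developing an intuition for backpropagation

In essence, backpropagation is simply a computationally efficient way of computing the derivatives of a complex cost function. Our goal is to use those derivatives to learn the weight coefficients that parameterize a multilayer artificial neural network.

For the use of automatic differentiation in machine learning, see the following resource: Automatic differentiation

Automatic differentiation comes in two modes: forward mode and reverse mode. Backpropagation is a special case of reverse-mode automatic differentiation.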
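Forward mode is the easier of the two to show in a few lines. The sketch below uses dual numbers, which carry a derivative alongside every value; it is a minimal illustration rather than a usable autodiff library, and the `Dual` class and `sigmoid_dual` function are hypothetical names introduced here.

```python
import math

class Dual:
    """Number of the form value + deriv * eps, with eps**2 == 0."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def sigmoid_dual(z):
    """Sigmoid that propagates the derivative alongside the value."""
    s = 1.0 / (1.0 + math.exp(-z.value))
    return Dual(s, s * (1.0 - s) * z.deriv)   # chain rule

# d/dw of sigmoid(w * x + b) at w=0.5, x=2.0, b=-1.0
w = Dual(0.5, 1.0)          # seed: the derivative of w with respect to itself is 1
out = sigmoid_dual(w * 2.0 + (-1.0))
```

Forward mode needs one such pass per input variable, which is why reverse mode is preferred for networks with many weights and a single scalar cost.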

V. Debugging neural networks with gradient checking
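Gradient checking compares the gradients obtained analytically (via backpropagation) with a numerical approximation obtained by nudging each weight by a small epsilon. Before the full class, here is the idea on a toy cost function. The helper names and the toy cost are assumptions for illustration; the central-difference formula and the relative error mirror what `_gradient_checking` in the class below computes.

```python
import numpy as np

def numerical_gradient(f, w, epsilon=1e-5):
    """Central-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = epsilon
        grad[j] = (f(w + step) - f(w - step)) / (2.0 * epsilon)
    return grad

def relative_error(num_grad, grad):
    """Normalized difference between numerical and analytical gradients."""
    return (np.linalg.norm(num_grad - grad) /
            (np.linalg.norm(num_grad) + np.linalg.norm(grad)))

# toy cost J(w) = sum(w**2) with known analytical gradient 2*w
w_demo = np.array([0.5, -1.5, 2.0])
cost_fn = lambda w: np.sum(w ** 2)
err_ok = relative_error(numerical_gradient(cost_fn, w_demo), 2.0 * w_demo)
err_bad = relative_error(numerical_gradient(cost_fn, w_demo), 3.0 * w_demo)  # deliberately wrong
```

A correct analytical gradient yields a relative error far below the 1e-7 threshold used in the class, while the deliberately wrong one lands well above 1e-4. Because every weight requires two extra cost evaluations, the check is only practical for debugging on small networks and a handful of samples.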

import numpy as np
from scipy.special import expit
import sys


class MLPGradientCheck(object):
    """ Feedforward neural network / Multi-layer perceptron classifier.

    Parameters
    ------------
    n_output : int
        Number of output units, should be equal to the
        number of unique class labels.
    n_features : int
        Number of features (dimensions) in the target dataset.
        Should be equal to the number of columns in the X array.
    n_hidden : int (default: 30)
        Number of hidden units.
    l1 : float (default: 0.0)
        Lambda value for L1-regularization.
        No regularization if l1=0.0 (default)
    l2 : float (default: 0.0)
        Lambda value for L2-regularization.
        No regularization if l2=0.0 (default)
    epochs : int (default: 500)
        Number of passes over the training set.
    eta : float (default: 0.001)
        Learning rate.
    alpha : float (default: 0.0)
        Momentum constant. Factor multiplied with the
        gradient of the previous epoch t-1 to improve
        learning speed
        w(t) := w(t) - (grad(t) + alpha*grad(t-1))
    decrease_const : float (default: 0.0)
        Decrease constant. Shrinks the learning rate
        after each epoch via eta / (1 + epoch*decrease_const)
    shuffle : bool (default: True)
        Shuffles training data every epoch if True to prevent cycles.
    minibatches : int (default: 1)
        Divides training data into k minibatches for efficiency.
        Normal gradient descent learning if k=1 (default).
    random_state : int (default: None)
        Set random state for shuffling and initializing the weights.

    Attributes
    -----------
    cost_ : list
        Logistic cost of each minibatch, appended in training order.

    """
    def __init__(self, n_output, n_features, n_hidden=30,
                 l1=0.0, l2=0.0, epochs=500, eta=0.001,
                 alpha=0.0, decrease_const=0.0, shuffle=True,
                 minibatches=1, random_state=None):

        np.random.seed(random_state)
        self.n_output = n_output
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.w1, self.w2 = self._initialize_weights()
        self.l1 = l1
        self.l2 = l2
        self.epochs = epochs
        self.eta = eta
        self.alpha = alpha
        self.decrease_const = decrease_const
        self.shuffle = shuffle
        self.minibatches = minibatches

    def _encode_labels(self, y, k):
        """Encode labels into one-hot representation

        Parameters
        ------------
        y : array, shape = [n_samples]
            Target values.

        Returns
        -----------
        onehot : array, shape = (n_labels, n_samples)

        """
        onehot = np.zeros((k, y.shape[0]))
        for idx, val in enumerate(y):
            onehot[val, idx] = 1.0
        return onehot

    def _initialize_weights(self):
        """Initialize weights with small random numbers."""
        w1 = np.random.uniform(-1.0, 1.0,
                               size=self.n_hidden*(self.n_features + 1))
        w1 = w1.reshape(self.n_hidden, self.n_features + 1)
        w2 = np.random.uniform(-1.0, 1.0,
                               size=self.n_output*(self.n_hidden + 1))
        w2 = w2.reshape(self.n_output, self.n_hidden + 1)
        return w1, w2

    def _sigmoid(self, z):
        """Compute logistic function (sigmoid)

        Uses scipy.special.expit to avoid overflow
        error for very small input values z.

        """
        # return 1.0 / (1.0 + np.exp(-z))
        return expit(z)

    def _sigmoid_gradient(self, z):
        """Compute gradient of the logistic function"""
        sg = self._sigmoid(z)
        return sg * (1.0 - sg)

    def _add_bias_unit(self, X, how='column'):
        """Add bias unit (column or row of 1s) to array at index 0"""
        if how == 'column':
            X_new = np.ones((X.shape[0], X.shape[1] + 1))
            X_new[:, 1:] = X
        elif how == 'row':
            X_new = np.ones((X.shape[0]+1, X.shape[1]))
            X_new[1:, :] = X
        else:
            raise AttributeError('`how` must be `column` or `row`')
        return X_new

    def _feedforward(self, X, w1, w2):
        """Compute feedforward step

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.
        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.
        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ----------
        a1 : array, shape = [n_samples, n_features+1]
            Input values with bias unit.
        z2 : array, shape = [n_hidden, n_samples]
            Net input of hidden layer.
        a2 : array, shape = [n_hidden+1, n_samples]
            Activation of hidden layer.
        z3 : array, shape = [n_output_units, n_samples]
            Net input of output layer.
        a3 : array, shape = [n_output_units, n_samples]
            Activation of output layer.

        """
        a1 = self._add_bias_unit(X, how='column')
        z2 = w1.dot(a1.T)
        a2 = self._sigmoid(z2)
        a2 = self._add_bias_unit(a2, how='row')
        z3 = w2.dot(a2)
        a3 = self._sigmoid(z3)
        return a1, z2, a2, z3, a3

    def _L2_reg(self, lambda_, w1, w2):
        """Compute L2-regularization cost"""
        return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2) +
                                np.sum(w2[:, 1:] ** 2))

    def _L1_reg(self, lambda_, w1, w2):
        """Compute L1-regularization cost"""
        return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum() +
                                np.abs(w2[:, 1:]).sum())

    def _get_cost(self, y_enc, output, w1, w2):
        """Compute cost function.

        Parameters
        ----------
        y_enc : array, shape = (n_labels, n_samples)
            one-hot encoded class labels.
        output : array, shape = [n_output_units, n_samples]
            Activation of the output layer (feedforward)
        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.
        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ---------
        cost : float
            Regularized cost.

        """
        term1 = -y_enc * (np.log(output))
        term2 = (1.0 - y_enc) * np.log(1.0 - output)
        cost = np.sum(term1 - term2)
        L1_term = self._L1_reg(self.l1, w1, w2)
        L2_term = self._L2_reg(self.l2, w1, w2)
        cost = cost + L1_term + L2_term
        return cost

    def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2):
        """ Compute gradient step using backpropagation.

        Parameters
        ------------
        a1 : array, shape = [n_samples, n_features+1]
            Input values with bias unit.
        a2 : array, shape = [n_hidden+1, n_samples]
            Activation of hidden layer.
        a3 : array, shape = [n_output_units, n_samples]
            Activation of output layer.
        z2 : array, shape = [n_hidden, n_samples]
            Net input of hidden layer.
        y_enc : array, shape = (n_labels, n_samples)
            one-hot encoded class labels.
        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.
        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ---------
        grad1 : array, shape = [n_hidden_units, n_features]
            Gradient of the weight matrix w1.
        grad2 : array, shape = [n_output_units, n_hidden_units]
            Gradient of the weight matrix w2.

        """
        # backpropagation
        sigma3 = a3 - y_enc
        z2 = self._add_bias_unit(z2, how='row')
        sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2)
        sigma2 = sigma2[1:, :]
        grad1 = sigma2.dot(a1)
        grad2 = sigma3.dot(a2.T)

        # regularize
        grad1[:, 1:] += self.l2 * w1[:, 1:]
        grad1[:, 1:] += self.l1 * np.sign(w1[:, 1:])
        grad2[:, 1:] += self.l2 * w2[:, 1:]
        grad2[:, 1:] += self.l1 * np.sign(w2[:, 1:])

        return grad1, grad2

    def _gradient_checking(self, X, y_enc, w1, w2, epsilon, grad1, grad2):
        """ Apply gradient checking (for debugging only)

        Returns
        ---------
        relative_error : float
          Relative error between the numerically
          approximated gradients and the backpropagated gradients.

        """
        num_grad1 = np.zeros(np.shape(w1))
        epsilon_ary1 = np.zeros(np.shape(w1))
        for i in range(w1.shape[0]):
            for j in range(w1.shape[1]):
                epsilon_ary1[i, j] = epsilon
                a1, z2, a2, z3, a3 = self._feedforward(X,
                                                       w1 - epsilon_ary1, w2)
                cost1 = self._get_cost(y_enc, a3, w1-epsilon_ary1, w2)
                a1, z2, a2, z3, a3 = self._feedforward(X,
                                                       w1 + epsilon_ary1, w2)
                cost2 = self._get_cost(y_enc, a3, w1 + epsilon_ary1, w2)
                num_grad1[i, j] = (cost2 - cost1) / (2.0 * epsilon)
                epsilon_ary1[i, j] = 0

        num_grad2 = np.zeros(np.shape(w2))
        epsilon_ary2 = np.zeros(np.shape(w2))
        for i in range(w2.shape[0]):
            for j in range(w2.shape[1]):
                epsilon_ary2[i, j] = epsilon
                a1, z2, a2, z3, a3 = self._feedforward(X, w1,
                                                       w2 - epsilon_ary2)
                cost1 = self._get_cost(y_enc, a3, w1, w2 - epsilon_ary2)
                a1, z2, a2, z3, a3 = self._feedforward(X, w1,
                                                       w2 + epsilon_ary2)
                cost2 = self._get_cost(y_enc, a3, w1, w2 + epsilon_ary2)
                num_grad2[i, j] = (cost2 - cost1) / (2.0 * epsilon)
                epsilon_ary2[i, j] = 0

        num_grad = np.hstack((num_grad1.flatten(), num_grad2.flatten()))
        grad = np.hstack((grad1.flatten(), grad2.flatten()))
        norm1 = np.linalg.norm(num_grad - grad)
        norm2 = np.linalg.norm(num_grad)
        norm3 = np.linalg.norm(grad)
        relative_error = norm1 / (norm2 + norm3)
        return relative_error

    def predict(self, X):
        """Predict class labels

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.

        Returns:
        ----------
        y_pred : array, shape = [n_samples]
            Predicted class labels.

        """
        if len(X.shape) != 2:
            raise AttributeError('X must be a [n_samples, n_features] array.\n'
                                 'Use X[:,None] for 1-feature classification,'
                                 '\nor X[[i]] for 1-sample classification')

        a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2)
        y_pred = np.argmax(z3, axis=0)
        return y_pred

    def fit(self, X, y, print_progress=False):
        """ Learn weights from training data.

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.
        y : array, shape = [n_samples]
            Target class labels.
        print_progress : bool (default: False)
            Prints progress as the number of epochs
            to stderr.

        Returns:
        ----------
        self

        """
        self.cost_ = []
        X_data, y_data = X.copy(), y.copy()
        y_enc = self._encode_labels(y, self.n_output)

        delta_w1_prev = np.zeros(self.w1.shape)
        delta_w2_prev = np.zeros(self.w2.shape)

        for i in range(self.epochs):

            # adaptive learning rate
            self.eta /= (1 + self.decrease_const*i)

            if print_progress:
                sys.stderr.write('\rEpoch: %d/%d' % (i+1, self.epochs))
                sys.stderr.flush()

            if self.shuffle:
                idx = np.random.permutation(y_data.shape[0])
                X_data, y_enc = X_data[idx], y_enc[:, idx]

            mini = np.array_split(range(y_data.shape[0]), self.minibatches)
            for idx in mini:

                # feedforward
                a1, z2, a2, z3, a3 = self._feedforward(X_data[idx],
                                                       self.w1,
                                                       self.w2)
                cost = self._get_cost(y_enc=y_enc[:, idx],
                                      output=a3,
                                      w1=self.w1,
                                      w2=self.w2)
                self.cost_.append(cost)

                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(a1=a1, a2=a2,
                                                  a3=a3, z2=z2,
                                                  y_enc=y_enc[:, idx],
                                                  w1=self.w1,
                                                  w2=self.w2)

                # start gradient checking
                grad_diff = self._gradient_checking(X=X_data[idx],
                                                    y_enc=y_enc[:, idx],
                                                    w1=self.w1,
                                                    w2=self.w2,
                                                    epsilon=1e-5,
                                                    grad1=grad1,
                                                    grad2=grad2)


                if grad_diff <= 1e-7:
                    print('Ok: %s' % grad_diff)
                elif grad_diff <= 1e-4:
                    print('Warning: %s' % grad_diff)
                else:
                    print('PROBLEM: %s' % grad_diff)

                # update weights; [alpha * delta_w_prev] for momentum learning
                delta_w1, delta_w2 = self.eta * grad1, self.eta * grad2
                self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev))
                delta_w1_prev, delta_w2_prev = delta_w1, delta_w2

        return self

nn_check = MLPGradientCheck(n_output=10, 
                            n_features=X_train.shape[1], 
                            n_hidden=10, 
                            l2=0.0, 
                            l1=0.0, 
                            epochs=10, 
                            eta=0.001,
                            alpha=0.0,
                            decrease_const=0.0,
                            minibatches=1, 
                            shuffle=False,
                            random_state=1)

nn_check.fit(X_train[:5], y_train[:5], print_progress=False)

Ok: 2.547178604857207e-10
Ok: 3.1055945830621735e-10
Ok: 2.381983188561259e-10
Ok: 3.036785198608931e-10
Ok: 3.368345140751822e-10
Ok: 3.5875948399767924e-10
Ok: 2.198885901516091e-10
Ok: 2.337387773884223e-10
Ok: 3.287636609571923e-10
Ok: 2.1363722298076513e-10


<__main__.MLPGradientCheck at 0x21496926ba8>

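VI. Convergence in neural networks

VII. Other neural network architectures

For more on neural networks and algorithms for deep learning, the following link is recommended: Yoshua_Bengio

1. Convolutional neural networks

Convolutional neural networks (CNNs) have become increasingly popular in computer vision thanks to their excellent performance on image recognition tasks. The key idea behind a CNN is to build many layers of feature detectors that take the spatial arrangement of pixels in an input image into account. For more on convolutional neural networks, the work of Yann LeCun is recommended reading.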
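A single feature detector of this kind can be sketched as a small kernel slid across the image. The example below is an illustration under stated assumptions: `conv2d_valid` is a hypothetical helper, the kernel is hand-picked rather than learned, and the operation is the unflipped sliding product (cross-correlation) without padding.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image without padding (cross-correlation)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            # each output value depends only on a local patch of pixels
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# 6x6 image with a vertical edge between columns 2 and 3
image_demo = np.zeros((6, 6))
image_demo[:, 3:] = 1.0
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)   # responds to a left-to-right intensity increase
feature_map = conv2d_valid(image_demo, edge_kernel)
```

The resulting feature map is nonzero only where the patch straddles the edge, which shows how the detector reacts to the spatial arrangement of pixels rather than to individual pixel values.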

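2. Recurrent neural networks

A recurrent neural network (RNN) can be thought of as a feedforward neural network with feedback loops through time, trained with backpropagation through time. In an RNN, neurons fire only for a limited amount of time before they are (temporarily) deactivated. In turn, these neurons activate other neurons that fire at a later point in time. In essence, we can think of an RNN as an MLP with an additional time variable. Thanks to this time component and its dynamic structure, the network can use not only the current input but also the inputs it received earlier.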
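The feedback loop can be made concrete with a plain recurrent layer, in which the hidden state at each time step is computed from the current input and the previous hidden state. This is a minimal sketch: the function name, the tanh activation, the layer sizes, and the random weights are all assumptions for illustration.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a plain recurrent layer over a sequence and return all hidden states."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        # the new state depends on the current input and the previous state
        h = np.tanh(W_xh.dot(x_t) + W_hh.dot(h) + b_h)
        states.append(h)
    return np.array(states)

rng = np.random.RandomState(1)
seq = rng.rand(5, 3)                       # 5 time steps, 3 features each
W_xh = rng.uniform(-0.5, 0.5, (4, 3))      # input -> hidden
W_hh = rng.uniform(-0.5, 0.5, (4, 4))      # hidden -> hidden (the feedback loop)
states = rnn_forward(seq, W_xh, W_hh, np.zeros(4))
```

Because `W_hh` is applied again at every step, gradients flowing back through time are multiplied by it repeatedly, which is the source of the vanishing and exploding gradient problems.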

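Although RNNs have achieved remarkable results in speech recognition, language translation, and handwriting recognition, this architecture is quite hard to train. The reason is that we cannot simply backpropagate the error layer by layer; we also have to account for the time dimension, which amplifies the vanishing and exploding gradient problems. In 1997, Juergen Schmidhuber and his co-workers proposed the long short-term memory (LSTM) unit to address this problem.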