CheeseZH: Stanford University: Machine Learning Ex4:Training Neural Network(Backpropagation Algorithm)

1. Feedforward and cost function;

$J( heta)= frac{1}{m}sum_{i=1}^{m}sum_{k=1}^{K}[-y_{k}^{(i)}log((h_{ heta}(x^{(i)}))_{k})-(1-y_{k}^{(i)})log(1-(h_{ heta}(x^{(i)}))_{k})]$

2.Regularized cost function:

$J( heta) = frac{1}{m}sum_{i=1}^{m}sum_{k=1}^{K}[-y_{k}^{(i)}log((h_{ heta}(x^{(i)}))_{k})-(1-y_{k}^{(i)})log(1-(h_{ heta}(x^{(i)}))_{k})]+frac{lambda}{2m}[sum_{j=1}^{25}sum_{k=1}^{400}(Theta_{j,k}^{(1)})^{2}+sum_{j=1}^{10}sum_{k=1}^{25}(Theta_{j,k}^{(2)})^{2}]$

3.Sigmoid gradient

The gradient for the sigmoid function can be computed as:

$g'(z)=frac{d}{dz}g(z)=g(z)(1-g(z))$

where:

$sigmoid(z)=g(z)=frac{1}{1+e^{-z}}$

4.Random initialization

randInitializeWeights.m

 1 function W = randInitializeWeights(L_in, L_out)
 2 %RANDINITIALIZEWEIGHTS Randomly initialize the weights of a layer with L_in
 3 %incoming connections and L_out outgoing connections
 4 %   W = RANDINITIALIZEWEIGHTS(L_in, L_out) randomly initializes the weights 
 5 %   of a layer with L_in incoming connections and L_out outgoing 
 6 %   connections. 
 7 %
 8 %   Note that W should be set to a matrix of size(L_out, 1 + L_in) as
 9 %   the column row of W handles the "bias" terms
10 %
11 
12 % You need to return the following variables correctly 
13 W = zeros(L_out, 1 + L_in);
14 
15 % ====================== YOUR CODE HERE ======================
16 % Instructions: Initialize W randomly so that we break the symmetry while
17 %               training the neural network.
18 %
19 % Note: The first row of W corresponds to the parameters for the bias units
20 %
21 epsilon_init = 0.12;
22 W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
23 
24 % =========================================================================
25 
26 end

View Code

5.Backpropagation(using a for-loop for t=1:m and place steps 1-4 below inside the for-loop), with the t^th iteration perfoming the calculation on the t^th training example(x^(t),y^(t)).Step 5 will divide the accumulated gradients by m to obtain the gradients for the neural network cost function.

(1) Set the input layer's values(a⁽¹⁾) to the t-th training example x^(t). Perform a feedforward pass, computing the activations(z⁽²⁾,a⁽²⁾,z⁽³⁾,a⁽³⁾) for layers 2 and 3.

(2) For each output unit k in layer 3(the output layer), set :

$delta_{k}^{(3)}=(a_{k}^{(3)}-y_{k})$

where y_k = 1 or 0.

(3)For the hidden layer l=2, set:

$delta^{(2)}=(Theta^{(2)})^{T}delta^{(3)}.*g'(z^{(2)})$

(4) Accumulate the gradient from this example using the following formula. Note that you should skip or remove δ₀(2).

$Delta^{(l)}=Delta^{(l)}+delta^{(l+1)}(a^{ ext{(l)}})^{T}$

(5) Obtain the(unregularized) gradient for the neural network cost function by dividing the accumulated gradients by 1/m:

$frac{partial}{partialTheta_{ij}^{(l)}}J(Theta)=D_{ij}^{(l)}=frac{1}{m}Delta_{ij}^{(l)}$

nnCostFunction.m

  1 function [J grad] = nnCostFunction(nn_params, ...
  2                                    input_layer_size, ...
  3                                    hidden_layer_size, ...
  4                                    num_labels, ...
  5                                    X, y, lambda)
  6 %NNCOSTFUNCTION Implements the neural network cost function for a two layer
  7 %neural network which performs classification
  8 %   [J grad] = NNCOSTFUNCTON(nn_params, hidden_layer_size, num_labels, ...
  9 %   X, y, lambda) computes the cost and gradient of the neural network. The
 10 %   parameters for the neural network are "unrolled" into the vector
 11 %   nn_params and need to be converted back into the weight matrices. 
 12 % 
 13 %   The returned parameter grad should be a "unrolled" vector of the
 14 %   partial derivatives of the neural network.
 15 %
 16 
 17 % Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
 18 % for our 2 layer neural network
 19 Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
 20                  hidden_layer_size, (input_layer_size + 1));
 21 
 22 Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
 23                  num_labels, (hidden_layer_size + 1));
 24 
 25 % Setup some useful variables
 26 m = size(X, 1);
 27          
 28 % You need to return the following variables correctly 
 29 J = 0;
 30 Theta1_grad = zeros(size(Theta1));
 31 Theta2_grad = zeros(size(Theta2));
 32 
 33 % ====================== YOUR CODE HERE ======================
 34 % Instructions: You should complete the code by working through the
 35 %               following parts.
 36 %
 37 % Part 1: Feedforward the neural network and return the cost in the
 38 %         variable J. After implementing Part 1, you can verify that your
 39 %         cost function computation is correct by verifying the cost
 40 %         computed in ex4.m
 41 %
 42 % Part 2: Implement the backpropagation algorithm to compute the gradients
 43 %         Theta1_grad and Theta2_grad. You should return the partial derivatives of
 44 %         the cost function with respect to Theta1 and Theta2 in Theta1_grad and
 45 %         Theta2_grad, respectively. After implementing Part 2, you can check
 46 %         that your implementation is correct by running checkNNGradients
 47 %
 48 %         Note: The vector y passed into the function is a vector of labels
 49 %               containing values from 1..K. You need to map this vector into a 
 50 %               binary vector of 1's and 0's to be used with the neural network
 51 %               cost function.
 52 %
 53 %         Hint: We recommend implementing backpropagation using a for-loop
 54 %               over the training examples if you are implementing it for the 
 55 %               first time.
 56 %
 57 % Part 3: Implement regularization with the cost function and gradients.
 58 %
 59 %         Hint: You can implement this around the code for
 60 %               backpropagation. That is, you can compute the gradients for
 61 %               the regularization separately and then add them to Theta1_grad
 62 %               and Theta2_grad from Part 2.
 63 %
 64 
 65 %Part 1
 66 %Theta1 has size 25*401
 67 %Theta2 has size 10*26
 68 %y hase size 5000*1
 69 K = num_labels;
 70 Y = eye(K)(y,:); %[5000 10]
 71 a1 = [ones(m,1),X];%[5000 401]
 72 a2 = sigmoid(a1*Theta1'); %[5000 25]
 73 a2 = [ones(m,1),a2];%[5000 26]
 74 h = sigmoid(a2*Theta2');%[5000 10]
 75 
 76 costPositive = -Y.*log(h);
 77 costNegtive = (1-Y).*log(1-h); 
 78 cost = costPositive - costNegtive;
 79 J = (1/m)*sum(cost(:));
 80 %Regularized
 81 Theta1Filtered = Theta1(:,2:end); %[25 400]
 82 Theta2Filtered = Theta2(:,2:end); %[10 25]
 83 reg = (lambda/(2*m))*(sumsq(Theta1Filtered(:))+sumsq(Theta2Filtered(:)));
 84 J = J + reg;
 85 
 86 
 87 %Part 2
 88 Delta1 = 0;
 89 Delta2 = 0;
 90 for t=1:m,
 91   %step 1
 92   a1 = [1 X(t,:)]; %[1 401]
 93   z2 = a1*Theta1'; %[1 25]
 94   a2 = [1 sigmoid(z2)];%[1 26]
 95   z3 = a2*Theta2'; %[1 10]
 96   a3 = sigmoid(z3); %[1 10]
 97   %step 2
 98   yt = Y(t,:);%[1 10]
 99   d3 = a3-yt; %[1 10]
100   %step 3
101   %   [1 10]  [10 25]           [1 25]
102   d2 = (d3*Theta2Filtered).*sigmoidGradient(z2); %[1 25]
103   %step 4
104   Delta1 = Delta1 + (d2'*a1);%[25 401]
105   Delta2 = Delta2 + (d3'*a2);%[10 26]  
106 end;
107 
108 %step 5
109 Theta1_grad = (1/m)*Delta1;
110 Theta2_grad = (1/m)*Delta2;
111 
112 %Part 3
113 Theta1_grad(:,2:end) = Theta1_grad(:,2:end) + ((lambda/m)*Theta1Filtered);
114 Theta2_grad(:,2:end) = Theta2_grad(:,2:end) + ((lambda/m)*Theta2Filtered);
115 
116 % -------------------------------------------------------------
117 
118 % =========================================================================
119 
120 % Unroll gradients
121 grad = [Theta1_grad(:) ; Theta2_grad(:)];
122 
123 
124 end

View Code

6.Gradient checking

Let

$heta^{(i+)}= heta+left[egin{array}{c}0\0\vdots\epsilon\vdots\0end{array} ight]$

and

$heta^{(i-)}= heta-left[egin{array}{c}0\0\vdots\epsilon\vdots\0end{array} ight]$

for each i, that:

$f_{i}( heta) hickapproxfrac{J( heta^{(i+)})-J( heta^{(i-)})}{2epsilon}$

computeNumericalGradient.m

 1 function numgrad = computeNumericalGradient(J, theta)
 2 %COMPUTENUMERICALGRADIENT Computes the gradient using "finite differences"
 3 %and gives us a numerical estimate of the gradient.
 4 %   numgrad = COMPUTENUMERICALGRADIENT(J, theta) computes the numerical
 5 %   gradient of the function J around theta. Calling y = J(theta) should
 6 %   return the function value at theta.
 7 
 8 % Notes: The following code implements numerical gradient checking, and 
 9 %        returns the numerical gradient.It sets numgrad(i) to (a numerical 
10 %        approximation of) the partial derivative of J with respect to the 
11 %        i-th input argument, evaluated at theta. (i.e., numgrad(i) should 
12 %        be the (approximately) the partial derivative of J with respect 
13 %        to theta(i).)
14 %                
15 
16 numgrad = zeros(size(theta));
17 perturb = zeros(size(theta));
18 e = 1e-4;
19 for p = 1:numel(theta)
20     % Set perturbation vector
21     perturb(p) = e;
22     loss1 = J(theta - perturb);
23     loss2 = J(theta + perturb);
24     % Compute Numerical Gradient
25     numgrad(p) = (loss2 - loss1) / (2*e);
26     perturb(p) = 0;
27 end
28 
29 end

View Code

7.Regularized Neural Networks

for j=0:

$frac{partial}{partialTheta_{ij}^{(l)}}J(Theta)=D_{ij}^{(l)}=frac{1}{m}Delta_{ij}^{(l)}$

for j>=1:

$frac{partial}{partialTheta_{ij}^{(l)}}J(Theta)=D_{ij}^{(l)}=frac{1}{m}Delta_{ij}^{(l)}+frac{lambda}{m}Theta_{ij}^{(l)}$

别人的代码：

https://github.com/jcgillespie/Coursera-Machine-Learning/tree/master/ex4