Feature Normalization, Feature Mapping, and Regularization


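Feature Normalization (Feature Scaling)

Overview

When the attributes of a dataset span very different ranges of values, gradient descent needs a very small learning rate and many iterations to converge to the optimum. Feature normalization therefore serves two main purposes:

It speeds up the convergence of gradient descent to the optimum;
It may improve accuracy.

Common types

Min-Max Normalization

Suitable for data that is naturally confined to a bounded range.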

\[ x_i^{(j)} = \frac{x_i^{(j)} - \min(x_i)}{\max(x_i) - \min(x_i)} \]
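
The formula can be sketched in Octave/MATLAB as follows. This is a minimal illustration rather than code from the original exercise: the data values and the names `x_min`, `x_max`, `x_range`, and `X_scaled` are made up for the example, and it assumes no feature column is constant (otherwise the range is zero).

```matlab
% Toy data: each row is an example, each column is a feature (made-up values).
X = [1 200; 2 400; 3 600; 5 1000];

x_min = min(X);              % per-column minimum, 1 x n
x_max = max(X);              % per-column maximum, 1 x n
x_range = x_max - x_min;     % assumes no constant column, so x_range > 0

% bsxfun applies the 1 x n row vectors to every row of X.
X_scaled = bsxfun(@rdivide, bsxfun(@minus, X, x_min), x_range);

disp(X_scaled);              % every column now lies in [0, 1]
```

As with the mean and standard deviation below, `x_min` and `x_range` would need to be kept if new examples are to be scaled the same way.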
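Mean-Variance Normalization (Cepstral Mean and Variance Normalization, CMVN; also known as z-score standardization)

Suitable for data whose distribution has no clear bounds.

\[ x_i^{(j)} = \frac{x_i^{(j)} - \mu_i}{\sigma_i} \]

where

\[ \left\{ \begin{aligned} \mu_i &= \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)} \\ \sigma_i &= \sqrt{\frac{\sum_{j=1}^{m}(x_i^{(j)}-\mu_i)^2}{m}} \end{aligned} \right. \]

After applying mean-variance normalization, record \(\mu_i\) and \(\sigma_i\) so that later inputs can be scaled with the same values. An implementation of mean-variance normalization is shown below.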

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X 
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

% Note that X is a matrix where each column is a feature and each row
% is an example, so the normalization is done separately per column.

% Column-wise mean and standard deviation: both are 1 x n row vectors.
% Note: std(X, 0, 1) divides by m-1; use std(X, 1, 1) to divide by m
% as in the formula above.
mu = mean(X, 1);
sigma = std(X, 0, 1);

% Subtract each feature's mean and divide by its standard deviation.
% bsxfun applies the row vectors to every row, for any number of features.
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);

end

Usage

[X, mu, sigma] = featureNormalize(X);
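
The reason for recording `mu` and `sigma` shows up at prediction time: a new example has to be scaled with the training statistics, not with statistics computed from the new data. The sketch below illustrates this under stated assumptions. The data values and the names `X_train`, `x_new`, and `x_new_norm` are hypothetical, and `mu` and `sigma` are computed inline so the snippet runs on its own.

```matlab
% Made-up training data: each row is an example, each column is a feature.
X_train = [2104 3; 1600 3; 2400 3; 1416 2];

% Statistics of the training set (what featureNormalize returns).
mu = mean(X_train, 1);
sigma = std(X_train, 0, 1);

% A new example is scaled with the stored training mu and sigma.
x_new = [1650 3];
x_new_norm = (x_new - mu) ./ sigma;

disp(x_new_norm);
```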

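Feature Mapping

Feature mapping constructs more complex attributes for nonlinear regression. A loop expands the original input matrix into the terms of a polynomial expansion, which yields a target function that is more complex, and a better fit, than the one plain linear regression produces.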

function out = mapFeature(X1, X2)
% MAPFEATURE Feature mapping function to polynomial features
%
%   MAPFEATURE(X1, X2) maps the two input features to all polynomial
%   terms up to degree 6, as used in the regularization exercise.
%
%   Returns a new feature array with more features, comprising
%   1, X1, X2, X1.^2, X1.*X2, X2.^2, X1.^3, etc.
%
%   Inputs X1, X2 must be the same size
%

degree = 6;
out = ones(size(X1(:,1)));
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)).*(X2.^j);
    end
end

end
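
To make the expansion concrete, the same loop can be run with a smaller degree on a few made-up values. This is only an illustration: the inputs and the choice of `degree = 2` are assumptions made for readability, not part of the original exercise.

```matlab
% Two made-up feature columns with three examples each.
X1 = [1; 2; 3];
X2 = [4; 5; 6];

degree = 2;                  % smaller than the 6 used above, for readability
out = ones(size(X1));        % first column is the constant term
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (X1.^(i-j)) .* (X2.^j);
    end
end

% Columns, in order: 1, X1, X2, X1.^2, X1.*X2, X2.^2
disp(out);
```

The number of output columns is (degree+1)*(degree+2)/2, which gives 6 here and 28 for the degree of 6 used in `mapFeature`.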

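Overfitting and Regularization

Any dataset may contain anomalous samples: they are real data, but they do not follow the pattern shared by most of the other samples. For example, in the area-versus-price problem, one house may be very small yet expensive, or another very large yet cheap. As another example, suppose we want to judge whether a watermelon is good, and the available attributes are color, stem, texture, and shape. Intuitively, shape should have little influence on whether a melon is good, but particular samples can make the fitted parameters depend on shape too heavily.

If the model learns these anomalous samples too closely, the resulting target function is forced to pass through or near them, producing a model with a small empirical (training) error but a large generalization error. This phenomenon is called overfitting. Overfitting is an incorrect fit that does not reflect the general pattern, and regularization is the usual technique for avoiding it.

Regularization for linear regression

For linear regression, we introduce a penalty coefficient \(\lambda\) and a penalty term \(\lambda \sum_{j=1}^{n} \theta_j^2\). The sum starts at \(j=1\), so the intercept \(\theta_0\) is not penalized.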

Regularization with gradient descent

\[ J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m \left(h_\theta(x^{(i)})-y^{(i)}\right)^2 + \lambda\sum_{j=1}^n \theta_j^2\right] \]
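
For reference, differentiating this cost gives the gradient descent update below. The learning rate \(\alpha\) and the convention \(x_0^{(i)} = 1\) are assumptions that do not appear in the text above; this is the standard textbook form, not something derived in this post.

\[ \begin{aligned} \theta_0 &:= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m \left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)} \\ \theta_j &:= \theta_j\left(1-\alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^m \left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}, \quad j = 1,\dots,n \end{aligned} \]

The factor \(1-\alpha\frac{\lambda}{m}\) shrinks each \(\theta_j\) slightly on every step, which is how the penalty term keeps the parameters small.
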
Regularization with the normal equation

\[ \Theta = \text{pinv}\left(X^TX + \lambda \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}\right) X^T Y \]
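
A minimal Octave/MATLAB sketch of this formula follows. The data, the value of `lambda`, and the names `L`, `theta`, and `theta_ols` are hypothetical and chosen only for illustration. `L` is the identity matrix with its top-left entry set to zero, so the intercept is left unpenalized.

```matlab
% Made-up data lying exactly on y = 1 + 2*x.
x = [1; 2; 3; 4];
y = [3; 5; 7; 9];
X = [ones(size(x)) x];       % prepend the intercept column

lambda = 1;                  % hypothetical penalty coefficient
L = eye(size(X, 2));
L(1, 1) = 0;                 % do not penalize the intercept theta_0

% Regularized normal equation.
theta = pinv(X' * X + lambda * L) * X' * y;

% With lambda = 0 the formula reduces to ordinary least squares.
theta_ols = pinv(X' * X) * X' * y;

disp([theta theta_ols]);
```

On this data the unregularized slope is 2, while the regularized slope is smaller, which is the shrinkage effect the penalty is meant to produce.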

Regularization for classification

For classification problems, regularize by adding the term \(\frac{\lambda}{2m}\sum_{j=1}^{n} \theta_j^2\) to the cost function \(J(\theta)\).
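
Written out for logistic regression, the regularized cost and its gradient take the standard form below. This assumes the sigmoid hypothesis \(h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}\) and the convention \(x_0^{(i)} = 1\), neither of which is stated explicitly in the text.

\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \]

\[ \begin{aligned} \frac{\partial J}{\partial \theta_0} &= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)} \\ \frac{\partial J}{\partial \theta_j} &= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j, \quad j = 1,\dots,n \end{aligned} \]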


Example code:

function [J, grad] = costFunction(theta, x, y, lambda)
%COSTFUNCTION Compute cost and gradient for logistic regression with regularization
%   [J, grad] = COSTFUNCTION(theta, x, y, lambda) computes the cost of using
%   theta as the parameter for regularized logistic regression and the
%   gradient of the cost w.r.t. the parameters. theta must be a column vector.

% Initialize some useful values
m = length(y); % number of training examples

% Predicted probabilities, m x 1.
% sigmoid is assumed to be defined elsewhere (it is not shown in this post).
h = sigmoid(x * theta);

% Exclude the intercept theta(1) from the penalty, matching the sums
% over j = 1..n in the formulas above.
theta_reg = [0; theta(2:end)];

% Cross-entropy cost plus the L2 penalty
J = -1/m * sum(y .* log(h) + (1 - y) .* log(1 - h)) + lambda / (2*m) * sum(theta_reg .^ 2);

% Partial derivatives of the cost w.r.t. each parameter in theta
grad = 1/m * x' * (h - y) + lambda / m * theta_reg;

end
