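Machine Learning, Week 3: Logistic Regression (Classification), Decision Boundary, Logistic Regression Cost Function, Multiclass Classification, and Regularization (for Logistic and Linear Regression)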

Classification

It's not a good idea to use linear regression for classification problems.

We can use the logistic regression algorithm, which is a classification algorithm.

To get \(0 \le h_\theta(x) \le 1\), we only need to use the sigmoid function (also called the logistic function):

\[ \large h_\theta(x) = g(\theta^T x), \quad \text{where } g(z) = \frac{1}{1+e^{-z}} \]

The meaning of \(h_\theta(x)\) is: \(h_\theta(x)\) = estimated probability that \(y = 1\) on input \(x\).

Note: when \(z = 0\), \(g(z)\) is exactly 0.5; \(g(z) \ge 0.5\) exactly when \(z \ge 0\).
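The behaviour of the sigmoid is easy to check numerically. The following is a minimal Octave sketch; the variable names `sigmoid`, `g0`, and `gs` are chosen here for illustration and do not come from the course material.

```octave
% Sigmoid (logistic) function, applied element-wise to scalars, vectors, or matrices.
sigmoid = @(z) 1 ./ (1 + exp(-z));

g0 = sigmoid(0);                 % value at z = 0
gs = sigmoid([-10, -1, 1, 10]);  % close to 0 for very negative z, close to 1 for very positive z

fprintf('g(0) = %.1f\n', g0);
disp(gs);
```

Because the output always lies strictly between 0 and 1, it can be read as a probability.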

Decision Boundary

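\(h_\theta(x) = P(y=1 \mid x; \theta)\) (\(P\) denotes the predicted probability)

In the example from the lecture: if \(h_\theta(x) \ge 0.5\), predict \(y=1\); otherwise predict \(y=0\).

Suppose \(\theta = \begin{bmatrix}-3\\ 1\\ 1 \end{bmatrix}\); then \(h_\theta(x)=g(-3+x_1+x_2)\).

Since \(g(z) \ge 0.5\) exactly when \(z \ge 0\): "\(y=1\)" \(\Leftrightarrow\) "\(h_\theta(x) \ge 0.5\)" \(\Leftrightarrow\) "\(\theta^Tx \ge 0\)" \(\Leftrightarrow\) "\(-3+x_1+x_2 \ge 0\)"

This gives "\(y=1\)" \(\Leftrightarrow\) "\(x_1+x_2 \ge 3\)".

How \(x_1+x_2\) compares with \(3\) determines the value of \(y\); the line \(x_1+x_2=3\) is the decision boundary.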
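The example with \(\theta = (-3, 1, 1)\) can be checked numerically. This is a minimal Octave sketch: the four sample points in `X` are made up for illustration, and the names `Xb`, `h`, and `pred` are assumptions of this sketch.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-3; 1; 1];               % parameters from the example

% Each row is one input [x_1, x_2]; the points are made up.
X = [1 1; 2 1; 3 3; 0 3];
Xb = [ones(size(X, 1), 1), X];    % prepend x_0 = 1
h = sigmoid(Xb * theta);          % estimated P(y = 1 | x; theta)
pred = h >= 0.5;                  % equivalent to x_1 + x_2 >= 3

disp([X, h, pred]);
```

Points lying exactly on the line \(x_1 + x_2 = 3\) give \(h_\theta(x) = 0.5\) and are predicted as \(y = 1\) under the \(\ge\) rule.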

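Extending to a non-linear decision boundary:

We can also have: predict "\(y=1\)" if \(-1+x_1^2+x_2^2 \ge 0\) (\(\theta = \begin{bmatrix}-1\\ 0\\ 0\\ 1\\ 1 \end{bmatrix},\; x = \begin{bmatrix}x_0\\ x_1\\ x_2\\ x_3\\ x_4 \end{bmatrix} = \begin{bmatrix}1\\ x_1\\ x_2\\ x_1^2\\ x_2^2 \end{bmatrix}\)). Here the decision boundary is the unit circle \(x_1^2+x_2^2=1\).

Different choices of \(\theta\) and different constructions of the feature vector \(x\) give decision boundaries of many shapes.

The decision boundary is determined by the choice of the parameters \(\theta\), not directly by the training set.

We use the training set to fit the parameters \(\theta\).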
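The circular boundary can be reproduced with the same machinery once the squared features are built. This is a minimal Octave sketch: `mapFeatures` is a hypothetical helper defined here, and the sample points are made up.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-1; 0; 0; 1; 1];         % parameters from the example

% Hypothetical helper: builds [1, x_1, x_2, x_1^2, x_2^2] for column vectors x1, x2.
mapFeatures = @(x1, x2) [ones(size(x1)), x1, x2, x1 .^ 2, x2 .^ 2];

x1 = [0; 0.5; 1; 2];              % made-up points
x2 = [0; 0.5; 0; 0];
h = sigmoid(mapFeatures(x1, x2) * theta);
pred = h >= 0.5;                  % true exactly when x_1^2 + x_2^2 >= 1

disp([x1, x2, h, pred]);
```

The model is still linear in \(\theta\); the boundary is non-linear only in the original inputs \(x_1, x_2\).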

Cost Function

\[ J(\theta) = \frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)}) \]

In linear regression, the cost used earlier was \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\).

But that choice is not general: once the hypothesis function \(h_\theta(x)\) is no longer a linear function (here it is a sigmoid), using \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\) makes \(J(\theta)\) non-convex with many local optima, rather than the convex function we want.

Logistic Regression Cost Function

\[ Cost(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1-h_\theta(x)) & \text{if } y = 0 \end{cases} \]

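When \(h_\theta(x)=y\), \(Cost(h_\theta(x),y)=0\).

When \(y=1\) and \(h_\theta(x) \rightarrow 0\), \(Cost \rightarrow \infty\); this is the case \(\theta^Tx \rightarrow -\infty\).

When \(y=0\) and \(h_\theta(x) \rightarrow 1\), \(Cost \rightarrow \infty\); this is the case \(\theta^Tx \rightarrow +\infty\).

This ensures that adjusting \(\theta\) to reduce the cost moves \(h_\theta(x)\) toward \(y\), that is, the predictions agree better with the actual labels.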

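The \(Cost\) function above can also be written as:

\[ Cost(h_\theta(x),y) = -y\cdot \log(h_\theta(x))-(1-y)\cdot \log(1-h_\theta(x)) \]

This is equivalent to the cases form: for \(y=1\) the second term vanishes, and for \(y=0\) the first term vanishes.

Therefore:

\[ \begin{align} J(\theta) &= \frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)}) \\ &= -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\cdot \log(h_\theta(x^{(i)}))+(1-y^{(i)})\cdot \log(1-h_\theta(x^{(i)}))\right] \end{align} \]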
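The equivalence of the two forms of the cost can be checked numerically. This is a minimal Octave sketch: the values in `h` and `y` are made up, and the names `cCases` and `cCompact` are chosen here for illustration.

```octave
h = [0.1; 0.5; 0.9; 0.99];   % made-up hypothesis outputs, strictly between 0 and 1
y = [1; 0; 1; 0];            % made-up labels

% Piecewise (cases) form, one example at a time.
cCases = zeros(size(h));
for i = 1:numel(h)
  if y(i) == 1
    cCases(i) = -log(h(i));
  else
    cCases(i) = -log(1 - h(i));
  end
end

% Compact single-expression form.
cCompact = -y .* log(h) - (1 - y) .* log(1 - h);

disp([h, y, cCases, cCompact]);
```

The cost is small when the prediction is confident and correct (third row) and large when it is confident and wrong (fourth row).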

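The general form of the gradient descent algorithm is still the same as for linear regression (although it is of course different once \(h_\theta(x)\) is expanded):

\[ \begin{align} &\text{Repeat}\ \{ \\ &\qquad \theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \\ &\} \end{align} \]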
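The update rule can be sketched as a vectorized loop. This is a minimal Octave sketch, not the course's reference implementation: the toy data set, the learning rate `alpha = 0.1`, the iteration count, and the helper name `costJ` are all assumptions made for illustration, and the gradient uses the \(\frac{1}{m}\) factor that comes from differentiating \(J(\theta)\).

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));
% J(theta) for logistic regression, as defined in the notes.
costJ = @(theta, X, y) -mean(y .* log(sigmoid(X * theta)) + (1 - y) .* log(1 - sigmoid(X * theta)));

% Made-up 1-D data set; the first column is x_0 = 1.
X = [1 0; 1 1; 1 2; 1 3; 1 4; 1 5];
y = [0; 0; 1; 0; 1; 1];           % not linearly separable, so theta stays finite
m = numel(y);

alpha = 0.1;                      % learning rate (assumed value)
theta = zeros(2, 1);
J0 = costJ(theta, X, y);          % cost before training

for iter = 1:2000
  grad = (1 / m) * X' * (sigmoid(X * theta) - y);
  theta = theta - alpha * grad;   % all theta_j are updated simultaneously
end

J1 = costJ(theta, X, y);          % cost after training
fprintf('J before: %.4f, J after: %.4f\n', J0, J1);
```

Computing the whole gradient vector first and then updating `theta` in one statement is what makes the update simultaneous.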

Other Optimization Algorithms

  • Conjugate gradient
  • BFGS (Broyden–Fletcher–Goldfarb–Shanno algorithm)
  • L-BFGS (limited-memory BFGS)

Advantages:

  • No need to manually pick \(\alpha\)
  • Often faster than gradient descent

Disadvantages:

  • More complex

Implementing these yourself is not recommended, but you can simply call a library:

%{
% Template for a cost function: returns the cost in 'jVal' and the partial derivatives in 'gradient'
function [jVal, gradient] = costFunction(theta)
	jVal = [code to compute J(theta)];
	gradient = zeros(n+1,1);
	gradient(1) = [code to compute ∂[J(theta)]/∂[theta(0)]];
	gradient(2) = [code to compute ∂[J(theta)]/∂[theta(1)]];
	...
	gradient(n+1) = [code to compute ∂[J(theta)]/∂[theta(n)]];      % indices in Octave start from 1
end
%}

% 'GradObj' 'on' tells fminunc that costFunction also returns the gradient; 'MaxIter' takes a number, not a string
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);   % two parameters here; use zeros(n+1,1) in general
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
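To see the `fminunc` call pattern run end to end, the placeholders need a concrete objective. This is a minimal Octave sketch under an assumption: the quadratic \(J(\theta) = (\theta_1-5)^2 + (\theta_2-5)^2\) is chosen here purely for illustration and is not the logistic regression cost. The leading `1;` is the usual Octave idiom for defining a function inside a script file.

```octave
1;  % marks this file as a script, so the function below can be defined inline (Octave)

function [jVal, gradient] = costFunction(theta)
  % Illustrative objective: J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);   % dJ/dtheta_1
  gradient(2) = 2 * (theta(2) - 5);   % dJ/dtheta_2
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

disp(optTheta);      % should be close to [5; 5], the minimizer of this quadratic
disp(functionVal);   % should be close to 0
```

No learning rate appears anywhere: the optimizer chooses its own step sizes from the returned cost and gradient.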

Multiclass Classification:

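Use the one-vs-all (one-vs-rest) idea.

For each class, split the data into two groups, "this class" and "all the remaining classes together", then fit a classifier for this class with the binary classification method from the earlier lectures.

(A classifier is simply a hypothesis.)

This yields \(n\) classifiers, where \(n\) is the total number of classes and \(y\) is the class label:

\[ h_\theta^{(i)}(x) = P(y=i \mid x; \theta) \qquad (i=1,2,3,\dots,n) \]

In other words, given \(x\) and \(\theta\), \(h_\theta^{(i)}(x)\) gives the probability that the class is \(i\).

Then, for a new input \(x\), the prediction is the class \(i\) that maximizes the classifier output: \(\max_i h_\theta^{(i)}(x)\)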
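The prediction step of one-vs-all can be sketched as follows. This is a minimal Octave sketch: the matrix `Theta` holds hypothetical, already-fitted parameters for \(n = 3\) classes (one row per classifier), and the input `x` is made up.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical fitted parameters: row i is theta^{(i)} for class i.
Theta = [ 1 -2 -2;    % class 1: points near the origin
         -3  2  0;    % class 2: large x_1
         -3  0  2];   % class 3: large x_2

x = [1; 3; 0.5];                    % new input with x_0 = 1 prepended
probs = sigmoid(Theta * x);         % h^{(i)}(x) for i = 1, 2, 3
[pMax, predictedClass] = max(probs);

disp(probs');
fprintf('predicted class: %d\n', predictedClass);
```

The \(n\) outputs come from separately trained binary classifiers, so they need not sum to 1; only their ranking is used.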

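Regularization

Regularization addresses the problem of overfitting; another term for this problem is high variance.

It is caused by too many variables (features) combined with too little training data.

If we have too many features, the learned hypothesis may fit the training set very well (\(J(\theta) \approx 0\)), but fail to generalize to new examples.

generalize: how well a hypothesis applies even to new examples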

Options for addressing overfitting:

  • Reduce the number of features:
    • Manually select which features to keep
    • Model selection algorithm
  • Regularization:
    • Keep all features, but reduce the magnitude/values of the parameters \(\theta_j\)
    • Works well when there are many features, each of which contributes a bit to predicting \(y\)

Regularized Linear Regression

The idea behind regularization:

Small values for parameters \(\theta_0, \theta_1, \dots, \theta_n\):

  • "Simpler" hypothesis
  • Less prone to overfitting

That is, make some overly influential \(\theta_j\) very small. For example, with \(\theta_3, \theta_4 \approx 0\): \(\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4 \approx \theta_0 + \theta_1x + \theta_2x^2\)

Gradient Descent

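However, regularization is not applied inside \(h_\theta(x)\); it is applied in the cost function \(J(\theta)\):

\[ \large J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right] \]

Note that the added term (called the regularization term) starts from \(j=1\): it shrinks every parameter except \(\theta_0\). \(\lambda\) is called the regularization parameter and controls the trade-off between two different goals.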

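The two \(\sum\) terms in this cost function represent two different goals:

  • Fit the training data well
  • Keep the parameters small

Smaller parameter values give a simpler hypothesis, which helps avoid overfitting.

Note: \(\lambda\) must not be too large. Otherwise \(\theta_1,\dots,\theta_n \approx 0\), so \(h_\theta(x) \approx \theta_0\) and the model fails to fit even the training set: the bias is too high, which is underfitting.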
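The regularized cost for linear regression can be evaluated directly. This is a minimal Octave sketch: the helper name `costReg` and the three data points are assumptions made for illustration.

```octave
% J(theta) = 1/(2m) * [ sum((X*theta - y).^2) + lambda * sum(theta_j^2 for j >= 1) ]
% theta(2:end) skips theta_0, which is stored at index 1 in Octave.
costReg = @(theta, X, y, lambda) ...
  (sum((X * theta - y) .^ 2) + lambda * sum(theta(2:end) .^ 2)) / (2 * numel(y));

X = [1 1; 1 2; 1 3];     % made-up data; the first column is x_0 = 1
y = [2; 4; 6];
theta = [0; 2];          % fits the data exactly: h(x) = 2x

J_fit = costReg(theta, X, y, 0);   % no regularization: only the data-fit term
J_reg = costReg(theta, X, y, 3);   % lambda = 3: penalty 3 * 2^2 = 12, divided by 2m = 6

fprintf('J without penalty: %g, J with lambda = 3: %g\n', J_fit, J_reg);
```

Even a perfect fit pays a price for a large \(\theta_1\), which is how \(\lambda\) trades data fit against parameter size.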

\[ \begin{align} &\text{repeat until convergence}\ \{ \\ &\qquad \theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j} := \theta_{j} - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1,2,\dots,n) \\ &\} \end{align} \]

Equivalently, collecting the \(\theta_j\) terms:

\[ \begin{align} &\text{repeat until convergence}\ \{ \\ &\qquad \theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j} := \theta_{j}\left(1-\alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} \qquad (j = 1,2,\dots,n) \\ &\} \end{align} \]
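The two ways of writing the regularized update give the same step, which can be confirmed on a single iteration. This is a minimal Octave sketch for linear regression (\(h_\theta(x) = \theta^Tx\)): the data, the starting `theta`, and the values of `alpha` and `lambda` are made up.

```octave
X = [1 1; 1 2; 1 3];     % made-up data; the first column is x_0 = 1
y = [2; 4; 6];
m = numel(y);
theta = [0.5; 1];        % arbitrary starting point
alpha = 0.1;
lambda = 1;

grad = (1 / m) * X' * (X * theta - y);     % gradient of the unregularized cost
reg = (lambda / m) * [0; theta(2:end)];    % regularization part; theta_0 is not penalized

% Form 1: subtract alpha times (gradient + regularization term).
thetaA = theta - alpha * (grad + reg);

% Form 2: shrink theta_j (j >= 1) by (1 - alpha*lambda/m), then take the usual gradient step.
shrink = [1; (1 - alpha * lambda / m) * ones(numel(theta) - 1, 1)];
thetaB = theta .* shrink - alpha * grad;

disp([thetaA, thetaB]);
```

The factor \(1 - \alpha\frac{\lambda}{m}\) is slightly less than 1 for typical settings, so each iteration shrinks \(\theta_j\) a little before applying the ordinary update.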

Normal Equation

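Review: the earlier normal equation was \(\theta = (X^TX)^{-1}X^Ty\)

With regularization it becomes \(\theta = \left(X^TX+\lambda \small{\begin{bmatrix}0 \\ &1 \\ &&1\\ &&&\ddots\\ &&&&1 \end{bmatrix}}\right)^{-1}X^Ty, \quad \text{if } \lambda \gt 0\), where the matrix is \((n+1)\times(n+1)\).

When \(X^TX\) is non-invertible (singular/degenerate), pinv() in Octave can still be used to take the pseudo-inverse.

However, as long as \(\lambda\) is strictly greater than 0, it can be shown that the sum of the two matrices inside the parentheses is invertible.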
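The regularized normal equation and the invertibility remark can both be tried on small matrices. This is a minimal Octave sketch: the data sets `X`, `y`, and `X2` are made up, with `X2` deliberately containing a duplicated feature column so that its \(X^TX\) is singular.

```octave
X = [1 1; 1 2; 1 3; 1 4];   % made-up data; the first column is x_0 = 1
y = [2; 4; 6; 8];
lambda = 1;

n = size(X, 2) - 1;
L = eye(n + 1);
L(1, 1) = 0;                % top-left 0: theta_0 is not regularized

thetaReg = (X' * X + lambda * L) \ (X' * y);   % regularized solution
thetaPlain = pinv(X' * X) * X' * y;            % unregularized solution via pseudo-inverse
disp([thetaPlain, thetaReg]);

% Duplicated feature column: X2' * X2 is singular, but adding lambda * L3 restores full rank.
X2 = [1 1 1; 1 2 2; 1 3 3];
L3 = eye(3);
L3(1, 1) = 0;
rankPlain = rank(X2' * X2);
rankReg = rank(X2' * X2 + lambda * L3);
fprintf('rank without lambda: %d, with lambda: %d\n', rankPlain, rankReg);
```

Solving with the backslash operator avoids forming an explicit inverse; the regularized slope comes out smaller than the unregularized one because \(\theta_1\) is penalized.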

Regularized Logistic Regression

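Review: \(J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\log h_\theta(x^{(i)})+(1-y^{(i)})\,\log(1-h_\theta(x^{(i)}))\right]\)

The approach is the same as for linear regression: add a regularization term \(\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\) at the end of the expression (the sum runs over the \(n\) parameters \(\theta_1,\dots,\theta_n\), not over the \(m\) examples):

\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\,\log h_\theta(x^{(i)})+(1-y^{(i)})\,\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \]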

Gradient Descent (the general form is the same as for linear regression; the only difference is again \(h_\theta(x^{(i)})\)):

\[ \begin{align} &\text{repeat until convergence}\ \{ \\ &\qquad \theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j} := \theta_{j} - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad (j = 1,2,\dots,n) \\ &\} \end{align} \]

In Octave the same code template as before still works. When computing \(\frac{\partial J(\theta)}{\partial \theta_j}\;(\small j=1,2,\dots,n)\), remember to add the partial derivative of the regularization term, \(\frac{\lambda}{m}\theta_j\).

%{
% Template for a cost function: returns the cost in 'jVal' and the partial derivatives in 'gradient'
function [jVal, gradient] = costFunction(theta)
	jVal = [code to compute J(theta), including the regularization term];
	gradient = zeros(n+1,1);
	gradient(1) = [code to compute ∂[J(theta)]/∂[theta(0)]];     % no regularization for theta(0)
	gradient(2) = [code to compute ∂[J(theta)]/∂[theta(1)]];     % includes (lambda/m)*theta(1)
	...
	gradient(n+1) = [code to compute ∂[J(theta)]/∂[theta(n)]];      % indices in Octave start from 1
end
%}

% 'GradObj' 'on' tells fminunc that costFunction also returns the gradient; 'MaxIter' takes a number, not a string
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);   % two parameters here; use zeros(n+1,1) in general
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
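
One possible way to fill in the template for regularized logistic regression is sketched below. This is a minimal Octave sketch, not the course's reference solution: the function name `costFunctionReg`, its extra arguments `X`, `y`, `lambda`, and the toy data are assumptions made for illustration. The leading `1;` is the usual Octave idiom for defining a function inside a script file.

```octave
1;  % marks this file as a script, so the function below can be defined inline (Octave)

function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  m = numel(y);
  h = 1 ./ (1 + exp(-X * theta));       % sigmoid hypothesis
  reg = [0; theta(2:end)];              % theta_0 (index 1) is not penalized
  jVal = -mean(y .* log(h) + (1 - y) .* log(1 - h)) + lambda / (2 * m) * sum(reg .^ 2);
  gradient = (1 / m) * X' * (h - y) + (lambda / m) * reg;
end

X = [1 0; 1 1; 1 2; 1 3; 1 4; 1 5];     % made-up data; the first column is x_0 = 1
y = [0; 0; 1; 0; 1; 1];
lambda = 1;

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
% The anonymous function fixes X, y, lambda so that fminunc sees a function of theta only.
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options);

disp(optTheta);
disp(functionVal);
```

Wrapping the call in an anonymous function is one common way to pass the data and \(\lambda\) through `fminunc`, which only supplies `theta`.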

Original post: https://www.cnblogs.com/khunkin/p/10199384.html