03 Logistic Regression

Binary Classification

Definitions

  1. Sigmoid function (logistic function); a minimal Octave sketch follows this list.
    \[ h_\theta(x) = g(\theta^T x) \]
    \[ z = \theta^T x \]
    \[ 0 \le g(z) = \frac{1}{1 + e^{-z}} \le 1 \]

  2. \( h_\theta(x) \) is the probability that the output is 1.

  3. \( h_\theta(x) = P(y = 1 \mid x; \theta) \)

  4. \( P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1 \)

  5. Taking 0.5 as the decision boundary, \( h_\theta(x) = 0.5 \iff \theta^T x = 0 \).
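
A minimal Octave sketch of the sigmoid hypothesis and the 0.5 decision rule (the values of theta and x here are illustrative, not from the original notes):

sigmoid = @(z) 1 ./ (1 + exp(-z));   % logistic function, elementwise

theta = [-3; 1; 1];                  % illustrative parameters
x = [1; 2; 2];                       % feature vector with x(1) = 1; theta' * x = 1

h = sigmoid(theta' * x);             % h ~= 0.73 = P(y = 1 | x; theta)
prediction = (h >= 0.5);             % equivalent to checking theta' * x >= 0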

Cost Function

  • \[ J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) \]
  • \[ \mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1 \]
  • \[ \mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0 \]
  • \[ \mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x)) \] (checked numerically in the sketch after this list)
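
As a quick check that the combined expression reproduces the two piecewise cases, a small Octave sketch (the value h = 0.8 is illustrative):

sigmoid = @(z) 1 ./ (1 + exp(-z));
cost = @(h, y) -y .* log(h) - (1 - y) .* log(1 - h);   % combined per-example cost

cost(0.8, 1)   % -log(0.8) ~= 0.2231, matches the y = 1 branch
cost(0.8, 0)   % -log(0.2) ~= 1.6094, matches the y = 0 branch

% Vectorized over a design matrix X (rows are examples) and label vector y:
% J = (1 / m) * sum(cost(sigmoid(X * theta), y));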

Algorithm

\(\begin{align*} & \text{Repeat} \; \lbrace \newline & \; \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \newline & \rbrace \end{align*}\)

  • Although this update rule looks identical to the one for linear regression, the two differ in the definition of \( h_\theta(x) \).
  • Feature scaling also speeds up convergence for logistic regression (see the loop sketch below).
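
A minimal batch gradient descent loop for this update rule, vectorized over a design matrix X whose first column is all ones (the data, learning rate, and iteration count are illustrative):

sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 1; 1 2; 1 3; 1 4];     % m x n design matrix, first column is the bias term
y = [0; 0; 1; 1];             % m x 1 labels
[m, n] = size(X);

theta = zeros(n, 1);
alpha = 0.1;                  % learning rate
for iter = 1:1000
    h = sigmoid(X * theta);                        % m x 1 predictions
    theta = theta - (alpha / m) * (X' * (h - y));  % simultaneous update of all theta_j
end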

  1. Algorithms that can be used to compute \( \theta \):

    • Gradient descent
    • Conjugate gradient
    • BFGS (a variable metric / quasi-Newton method)
    • L-BFGS (limited-memory variable metric method)
    • Properties of the latter three algorithms:

    Advantages:
    a. No need to manually pick \( \alpha \)
    b. Often faster than gradient descent
    Disadvantages:
    More complex

  2. Using Octave's optimization routines (a filled-in costFunction example follows the skeleton)

% exitFlag: 1 means the algorithm converged
% fminunc requires theta to have dimension >= 2
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] ...
    = fminunc(@costFunction, initialTheta, options);

% costFunction must return both the cost and its gradient:
function [jVal, gradient] = costFunction(theta)
    jVal = ...              % value of the cost function J(theta)
    gradient = zeros(n, 1); % gradient vector, one entry per parameter

    gradient(1) = ...       % partial derivative of J w.r.t. theta(1)
    ...
    gradient(n) = ...       % partial derivative of J w.r.t. theta(n)
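
One way the skeleton might be filled in for logistic regression, using the cost and gradient formulas above. The function name logisticCost and the idea of passing X and y through an anonymous-function wrapper are illustrative choices, not from the original notes:

% X: m x n design matrix (first column all ones), y: m x 1 labels
function [jVal, gradient] = logisticCost(theta, X, y)
    m = length(y);
    h = 1 ./ (1 + exp(-(X * theta)));                            % sigmoid hypothesis
    jVal = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));  % J(theta)
    gradient = (1 / m) * (X' * (h - y));                         % n x 1 gradient
end

% Hook it into fminunc by capturing X and y:
% [optTheta, fVal, flag] = fminunc(@(t) logisticCost(t, X, y), zeros(size(X, 2), 1), options);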

Multi-class Classification

One-vs-all (one-vs-rest)

  • Train a logistic regression classifier \( h_\theta^{(i)}(x) \) for each class \( i \) to predict the probability that \( y = i \).
  • On a new input \( x \), predict the class \( i \) that maximizes \( h_\theta^{(i)}(x) \), i.e. choose \( \max\limits_i h_\theta^{(i)}(x) \) (see the sketch below).
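
A hedged Octave sketch of one-vs-all prediction. The matrix allTheta (row i holds the parameters of the classifier trained on the binary labels y == i) and the feature values are illustrative; training each row is just the binary procedure above:

sigmoid = @(z) 1 ./ (1 + exp(-z));

allTheta = [ 1 -2  0;          % 3 classes x 3 parameters (bias first)
            -1  2 -1;
            -2  0  2];
x = [1; 1; 2];                 % feature vector with x(1) = 1

probs = sigmoid(allTheta * x); % entry i = h_theta^(i)(x) = P(y = i | x)
[~, prediction] = max(probs);  % predicted class = argmax_i h_theta^(i)(x)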
Original post: https://www.cnblogs.com/QQ-1615160629/p/03-Logistic-Regression.html