Logistic Regression

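Logistic Regression is used to solve classification problems. So how can a regression solve a classification problem?

It ties a sample's features to the probability that the sample belongs to the class of interest, and a probability is just a number.

According to the usage statistics shown in the original figure here, Logistic Regression is the most widely used algorithm.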

I. Understanding Logistic Regression

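Logistic regression can be viewed as either a regression algorithm or a classification algorithm. It is usually used for classification, and by itself it handles only binary classification: it computes a probability $\hat{p} = f(x)$ and predicts class 1 when $\hat{p} \ge 0.5$ and class 0 otherwise.

Linear regression predicts $\hat{y} = \theta^T x_b$, whose range is $(-\infty, +\infty)$, whereas a probability must lie in [0, 1]. So we make one change and pass the linear output through the sigmoid function: $\hat{p} = \sigma(\theta^T x_b)$, where $\sigma(t) = 1 / (1 + e^{-t})$.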

Next, plot this function:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

x = np.linspace(-10, 10, 500)
y = sigmoid(x)
plt.plot(x, y)
plt.show()
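One practical caveat about the `sigmoid` above: for very negative `t`, `np.exp(-t)` overflows and NumPy emits a warning, even though the result still rounds to 0. The sketch below shows one common way to avoid that. `stable_sigmoid` is a hypothetical helper name introduced here for illustration; the original post does not define it.

```python
import numpy as np

def stable_sigmoid(t):
    """Sigmoid that never exponentiates a large positive number."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    out = np.empty_like(t)
    pos = t >= 0
    # For t >= 0, exp(-t) is at most 1, so the textbook form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-t[pos]))
    # For t < 0, rewrite as e^t / (1 + e^t); exp(t) is below 1 here.
    e = np.exp(t[~pos])
    out[~pos] = e / (1.0 + e)
    return out

probs = stable_sigmoid([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(probs)
```

The two branches are algebraically the same function, so the curve plotted above is unchanged; only the order of operations differs.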

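So the question becomes:

Given a sample dataset X with labels y, how do we find the parameters $\theta$ such that this model reproduces, as closely as possible, the class labels y for the samples in X?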

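II. The Loss Function of Logistic Regression

The loss is the average cross-entropy between the labels y and the predicted probabilities $\hat{p} = \sigma(X_b \theta)$. This function is convex. It has no closed-form solution, so it has to be minimized iteratively, and here we use gradient descent.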
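Written out, the loss takes the form below. The original figure is not available, so this is a reconstruction that matches the `J` function in the implementation later in this post, with m the number of samples and $x_b^{(i)}$ the i-th sample with a leading 1 for the intercept.

```latex
\hat{p}^{(i)} = \sigma\left(\theta^{T} x_b^{(i)}\right), \qquad
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{p}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right]
```

When $y^{(i)} = 1$ only the first term is active, and the cost grows without bound as $\hat{p}^{(i)} \to 0$. When $y^{(i)} = 0$ the second term penalizes $\hat{p}^{(i)} \to 1$ in the same way.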

III. The Gradient of the Logistic Regression Loss Function

For a loss function this involved, we start with the derivative of the sigmoid function.
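The figures for this derivation are not available. The following is a reconstruction that agrees with the `dj` function in the implementation later in this post, using $\hat{p}^{(i)} = \sigma(\theta^T x_b^{(i)})$, m samples, and $x_{b,j}^{(i)}$ for the j-th component of the i-th sample.

```latex
\sigma(t) = \frac{1}{1 + e^{-t}}
\quad\Longrightarrow\quad
\sigma'(t) = \frac{e^{-t}}{\left(1 + e^{-t}\right)^{2}} = \sigma(t)\left(1 - \sigma(t)\right)

\frac{\partial J(\theta)}{\partial \theta_j}
= \frac{1}{m} \sum_{i=1}^{m} \left( \hat{p}^{(i)} - y^{(i)} \right) x_{b,j}^{(i)},
\qquad
\nabla J(\theta) = \frac{1}{m} X_b^{T} \left( \sigma(X_b \theta) - y \right)
```

The factor $\sigma(1 - \sigma)$ from the sigmoid derivative cancels against the denominators produced by differentiating the logarithms, which is why the final gradient is so compact.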

IV. Implementing Logistic Regression in Code

import numpy as np
from play_ML.multi_linear_regression.metrics import accuracy_score

class LogisticRegression(object):

    def __init__(self):
        """Initialize the logistic regression model."""
        self.coef_ = None
        self.interception_ = None
        self._theta = None

    def _sigmoid(self, t):
        return 1. / (1. + np.exp(-t))

    def fit(self, x_train, y_train, eta=0.01, n_iters=1e4):
        assert x_train.shape[0] == y_train.shape[0], "the size of x_train must be equal to the size of y_train"

        def J(theta, X_b, y):
            y_hat = self._sigmoid(X_b.dot(theta))
            try:
                return -np.sum(y*np.log(y_hat) + (1-y)*np.log(1-y_hat)) / len(y)
            except Exception:
                return float('inf')

        def dj(theta, X_b, y):
            return X_b.T.dot(self._sigmoid(X_b.dot(theta)) - y) / len(X_b)

        def gradient_descent(X_b, y, init_theta, eta, n_iters=1e4, epsilon=1e-8):

            theta = init_theta
            i_iter = 0

            while i_iter < n_iters:
                gradient = dj(theta, X_b, y)
                last_theta = theta
                theta = last_theta - eta * gradient

                if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
                    break

                i_iter += 1

            return theta

        X_b = np.hstack([np.ones((len(x_train), 1)), x_train])
        init_theta = np.zeros(X_b.shape[1])
        self._theta = gradient_descent(X_b, y_train, init_theta, eta, n_iters=n_iters)
        self.interception_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def predict_prob(self, x_predict):
        assert self.interception_ is not None and self.coef_ is not None, "must fit before predict"
        assert x_predict.shape[1] == len(self.coef_), "the feature number must be equal to x_train"

        X = np.hstack([np.ones((len(x_predict), 1)), x_predict])
        return self._sigmoid(X.dot(self._theta))

    def predict(self, x_predict):
        assert self.interception_ is not None and self.coef_ is not None, "must fit before predict"
        assert x_predict.shape[1] == len(self.coef_), "the feature number must be equal to x_train"

        prob = self.predict_prob(x_predict)
        return np.array(prob >= 0.5, dtype='int')

    def score(self, x_test, y_test):
        y_predict = self.predict(x_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "Logistic Regression"
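A quick way to gain confidence in the gradient used by `dj` in the class above is to compare it with a finite-difference estimate of the loss. The sketch below does that on random data. The helper names (`loss`, `analytic_grad`, `numeric_grad`) and the random test data are assumptions made for this illustration, not part of the original class.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(theta, X_b, y):
    # Average cross-entropy, the same quantity as J in the class above.
    p = sigmoid(X_b.dot(theta))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(theta, X_b, y):
    # Same expression as dj in the class above.
    return X_b.T.dot(sigmoid(X_b.dot(theta)) - y) / len(X_b)

def numeric_grad(theta, X_b, y, h=1e-6):
    # Central differences, one coordinate at a time.
    grad = np.empty_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (loss(theta + step, X_b, y) - loss(theta - step, X_b, y)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X_b = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = (rng.random(50) < 0.5).astype(float)
theta = rng.normal(size=3)

max_gap = np.max(np.abs(analytic_grad(theta, X_b, y) - numeric_grad(theta, X_b, y)))
print(max_gap)
```

If the analytic formula were wrong, for example with a flipped sign, the gap would be on the order of the gradient itself rather than close to zero.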

Let's test it:

if __name__ == '__main__':
    import numpy as np
    from sklearn import datasets
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split

    iris = datasets.load_iris()
    x = iris.data
    y = iris.target
    x = x[y<2, :2]
    y = y[y<2]

    plt.scatter(x[y==0, 0], x[y==0, 1], color='red')
    plt.scatter(x[y==1, 0], x[y==1, 1], color='blue')
    plt.show()

    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)
    log_reg = LogisticRegression()
    log_reg.fit(x_train, y_train)
    print(log_reg.score(x_test, y_test))
    print(log_reg.predict_prob(x_test))
    print(y_test)
    print(log_reg.coef_)
    print(log_reg.interception_)

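Output:

1.0
# This shows that Logistic Regression performs well on the simple iris dataset. More complex datasets will be tested later.
# Next come the probabilities predicted by the model, followed by the true labels. A threshold of 0.5 splits the samples into two classes; comparing with the true labels shows how far the predictions are from reality.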
[0.93292947 0.98717455 0.15541379 0.01786837 0.03909442 0.01972689
 0.05214631 0.99683149 0.98092348 0.75469962 0.0473811  0.00362352
 0.27122595 0.03909442 0.84902103 0.80627393 0.83574223 0.33477608
 0.06921637 0.21582553 0.0240109  0.1836441  0.98092348 0.98947619
 0.08342411]
[1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0]
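# Logistic regression is adapted from linear regression, so we likewise get coefficients and an intercept, just as in linear regression. What is their geometric meaning? They define the decision boundary.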
[ 3.01749692 -5.03046934]
-0.6827383698993108

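V. Decision Boundary

1. The Decision Boundary of Logistic Regression

First, recall how Logistic Regression classifies: it computes $\hat{p} = \sigma(\theta^T x_b)$ and predicts 1 when $\hat{p} \ge 0.5$ and 0 otherwise.

Now look at the sigmoid function again: $\sigma(t) \ge 0.5$ exactly when $t \ge 0$. The prediction therefore flips where $\theta^T x_b = 0$, and that set of points is the decision boundary. With two features it is the straight line $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$, which the code below solves for $x_2$.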

def x2(x1):
    return (-log_reg.coef_[0] * x1 - log_reg.interception_) / log_reg.coef_[1]

x1_plot = np.linspace(4, 8, 1000)
x2_plot= x2(x1_plot)
plt.scatter(x[y==0, 0], x[y==0, 1], color='red')
plt.scatter(x[y==1, 0], x[y==1, 1], color='blue')
plt.plot(x1_plot, x2_plot)
plt.show()
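As a numeric check of the boundary line, the sketch below plugs in the `coef_` and `interception_` values printed in the test output earlier and confirms that points on the line get probability 0.5. The variable names and the 0.5 offset are assumptions chosen for this illustration.

```python
import numpy as np

# Coefficients and intercept copied from the printed output of the fit above.
coef = np.array([3.01749692, -5.03046934])
intercept = -0.6827383698993108

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def x2_on_boundary(x1):
    # Solve intercept + coef[0]*x1 + coef[1]*x2 = 0 for x2.
    return (-coef[0] * x1 - intercept) / coef[1]

x1 = np.linspace(4, 8, 5)
x2 = x2_on_boundary(x1)
on_line = sigmoid(intercept + coef[0] * x1 + coef[1] * x2)
below_line = sigmoid(intercept + coef[0] * x1 + coef[1] * (x2 - 0.5))

print(on_line)      # probabilities for points on the boundary
print(below_line)   # probabilities for points 0.5 below the boundary
```

Because the second coefficient is negative, moving below the line raises the linear score, so the region under the line is assigned to class 1.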

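The test in the implementation section reported 100% accuracy, so why does a red point still sit below the decision boundary? That accuracy was measured on the test set, while this plot shows every sample, and the stray point belongs to the training set. Plotting only the test set shows every point on the correct side of the boundary.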

Now wrap the decision-boundary plotting into a function:

def plot_decision_boundary(model, axis):
    x0, x1 = np.meshgrid(np.linspace(axis[0], axis[1], int((axis[1] - axis[0])*100)).reshape(1, -1),
                         np.linspace(axis[2], axis[3], int((axis[3] - axis[2])*100)).reshape(1, -1),)
    x_new = np.c_[x0.ravel(), x1.ravel()]
    y_predict = model.predict(x_new)
    zz = y_predict.reshape(x0.shape)
    
    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])
    
    plt.contourf(x0, x1, zz, cmap=custom_cmap)

plot_decision_boundary(log_reg, axis=[4, 7.5, 1.5, 4.5])
plt.scatter(x[y==0, 0], x[y==0, 1], color='red')
plt.scatter(x[y==1, 0], x[y==1, 1], color='blue')
plt.show()

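Clearly this decision boundary is a straight line. How do we draw an irregular decision boundary? That is covered below.

2. The Decision Boundary of KNN

Case one: KNN on two classes: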

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(x_train, y_train)

knn_clf.score(x_test, y_test)
# 1.0
plot_decision_boundary(knn_clf, axis=[4, 7.5, 1.5, 4.5])
plt.scatter(x[y==0, 0], x[y==0, 1], color='red')
plt.scatter(x[y==1, 0], x[y==1, 1], color='blue')
plt.show()

Case two: KNN on three classes:

from sklearn.neighbors import KNeighborsClassifier

knn_clf_all = KNeighborsClassifier()
knn_clf_all.fit(iris.data[:,:2], iris.target)
#KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#          metric_params=None, n_jobs=None, n_neighbors=5, p=2,
#          weights='uniform')

plot_decision_boundary(knn_clf_all, axis=[4, 8, 1.5, 4.5])
plt.scatter(iris.data[iris.target==0, 0], iris.data[iris.target==0, 1], color='red')
plt.scatter(iris.data[iris.target==1, 0], iris.data[iris.target==1, 1], color='blue')
plt.scatter(iris.data[iris.target==2, 0], iris.data[iris.target==2, 1], color='green')
plt.show()

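The plot above shows that KNN is overfitting here. KNN was instantiated with the default k=5; the smaller k is, the more complex the model, and the larger k is, the simpler it becomes. Complexity shows up as an irregular, jagged decision boundary, so let's increase k and try again.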

from sklearn.neighbors import KNeighborsClassifier

knn_clf_all = KNeighborsClassifier(n_neighbors=50)
knn_clf_all.fit(iris.data[:,:2], iris.target)

plot_decision_boundary(knn_clf_all, axis=[4, 8, 1.5, 4.5])
plt.scatter(iris.data[iris.target==0, 0], iris.data[iris.target==0, 1], color='red')
plt.scatter(iris.data[iris.target==1, 0], iris.data[iris.target==1, 1], color='blue')
plt.scatter(iris.data[iris.target==2, 0], iris.data[iris.target==2, 1], color='green')
plt.show()

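VII. Using Polynomial Features in Logistic Regression

With a straight-line boundary, logistic regression can only handle binary problems that a line can separate. For data like that generated below, no single straight line can split the two classes, yet a circle or an ellipse clearly can. When we moved from linear regression to polynomial regression, the idea was to add polynomial terms to the training data. The same idea carries over to logistic regression.

First, generate the data we need: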

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
x = np.random.normal(0, 1, size=(200, 2))
y = np.array(x[:,0] ** 2 + x[:,1] ** 2 < 1.5, dtype='int')

plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

For this sample set, first see how plain logistic regression performs:

log_reg = LogisticRegression()
log_reg.fit(x, y)
plot_decision_boundary(log_reg, axis=[-4, 4, -4, 4])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

Clearly many samples are misclassified. Next, add polynomial terms to the logistic regression.
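To see why polynomial terms help, note that the true boundary $x_1^2 + x_2^2 = 1.5$ is a straight line once the features are $x_1^2$ and $x_2^2$. The sketch below builds those squared features by hand and fits a bare-bones gradient-descent logistic regression on them. The helper functions, the learning rate, and the iteration count are assumptions for this illustration, not the class from section IV.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.1, n_iters=5000):
    """Plain batch gradient descent on the cross-entropy loss."""
    X_b = np.hstack([np.ones((len(X), 1)), X])
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_iters):
        theta -= eta * X_b.T.dot(sigmoid(X_b.dot(theta)) - y) / len(X_b)
    return theta

def accuracy(theta, X, y):
    X_b = np.hstack([np.ones((len(X), 1)), X])
    return np.mean((sigmoid(X_b.dot(theta)) >= 0.5) == y)

# Same data recipe as in the post: label 1 inside a circle of radius sqrt(1.5).
np.random.seed(666)
x = np.random.normal(0, 1, size=(200, 2))
y = np.array(x[:, 0] ** 2 + x[:, 1] ** 2 < 1.5, dtype='int')

# Hand-made degree-2 features (squares only; the cross term is omitted for brevity).
x_sq = x ** 2

lin_acc = accuracy(fit_logistic(x, y), x, y)
poly_acc = accuracy(fit_logistic(x_sq, y), x_sq, y)
print(lin_acc, poly_acc)
```

`PolynomialFeatures` generates these squared columns, plus the cross term and lower-degree terms, automatically, which is what the pipeline in this section relies on.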

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomiaLogisticRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
        ('log_reg', LogisticRegression())
    ])

poly_log_reg = PolynomiaLogisticRegression(degree=2)
poly_log_reg.fit(x, y)
poly_log_reg.score(x, y)
# 0.94999999999999996
plot_decision_boundary(poly_log_reg, axis=[-4, 4, -4, 4])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

Next, increase degree and try again:

poly_log_reg2 = PolynomiaLogisticRegression(degree=20)
poly_log_reg2.fit(x, y)

plot_decision_boundary(poly_log_reg2, axis=[-4, 4, -4, 4])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

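A boundary shaped like this is a sign of overfitting: the larger degree is, the more complex the model and the more easily it overfits. Next we tackle this overfitting with model regularization.

VIII. Using Regularization in Logistic Regression

In practice, few problems can be handled with a straight line alone, whether for classification or regression, so regularization is essential. The L1 and L2 regularization covered earlier under model generalization adds a weighted penalty to the loss: $J(\theta) + \alpha L_1$ or $J(\theta) + \alpha L_2$.

sklearn regularizes logistic regression in a different form, $C \cdot J(\theta) + L_1$ or $C \cdot J(\theta) + L_2$: the weight C multiplies the loss rather than the penalty, so a smaller C means stronger regularization.

We start with a straight-line Logistic Regression and look at how sklearn's logistic regression applies regularization. As before, first generate the sample data: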

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
x = np.random.normal(0, 1, size=(200, 2))
y = np.array(x[:,0] ** 2 + x[:,1] < 1.5, dtype='int')
# Add some noise.
for i in range(20):
    y[np.random.randint(200)] = 1
    
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x_train, x_test, y_train, y_test = train_test_split(x, y)
log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#          intercept_scaling=1, max_iter=100, multi_class='warn',
#         n_jobs=None, penalty='l2', random_state=None, solver='warn',
#          tol=0.0001, verbose=0, warm_start=False)

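The output shows that the default is C=1.0. This C is the C in the regularized form of logistic regression mentioned at the start of this section. penalty='l2' means that sklearn uses L2 regularization by default.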
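The effect of C can be seen directly in the size of the fitted coefficients. The following sketch fits sklearn's `LogisticRegression` with three values of C on the same kind of data and records the norm of `coef_`. The specific C values and the omission of the label noise are assumptions made to keep the illustration short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(666)
x = np.random.normal(0, 1, size=(200, 2))
y = np.array(x[:, 0] ** 2 + x[:, 1] < 1.5, dtype='int')

norms = {}
for C in [0.01, 1.0, 100.0]:
    # Default penalty is 'l2'; only the regularization weight C changes.
    clf = LogisticRegression(C=C)
    clf.fit(x, y)
    norms[C] = np.linalg.norm(clf.coef_)

print(norms)
```

A small C puts more relative weight on the penalty term, so the coefficients are pulled toward zero. A large C lets the loss dominate and the coefficients grow.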

log_reg.score(x_train, y_train)
# 0.7933333333333333
log_reg.score(x_test, y_test)
# 0.7933333333333333
plot_decision_boundary(log_reg, axis=[-4, 4, -4, 4])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

Now use polynomial Logistic Regression:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomiaLogisticRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
        ('log_reg', LogisticRegression())
    ])

poly_log_reg = PolynomiaLogisticRegression(degree=2)
poly_log_reg.fit(x_train, y_train)
poly_log_reg.score(x_train, y_train)
# 0.9133333333333333
poly_log_reg.score(x_test, y_test)
# 0.94
plot_decision_boundary(poly_log_reg, axis=[-4, 4, -4, 4])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()


poly_log_reg2 = PolynomiaLogisticRegression(degree=20)
poly_log_reg2.fit(x_train, y_train)

poly_log_reg2.score(x_train, y_train)
# 0.94
poly_log_reg2.score(x_test, y_test)
# 0.92
plot_decision_boundary(poly_log_reg2, axis=[-4, 4, -4, 4])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

1. Logistic Regression with L2 Regularization

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

# Pass in a new parameter C
def PolynomiaLogisticRegression(degree, C):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
        ('log_reg', LogisticRegression(C=C))
    ])

poly_log_reg3 = PolynomiaLogisticRegression(degree=20, C=0.1)
poly_log_reg3.fit(x_train, y_train)

poly_log_reg3.score(x_train, y_train)
# 0.8533333333333334
poly_log_reg3.score(x_test, y_test)
# 0.92
plot_decision_boundary(poly_log_reg3, axis=[-4, 4, -4, 4])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

2. Logistic Regression with L1 Regularization

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomiaLogisticRegression(degree, C, penalty='l2'):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
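        # liblinear supports both 'l1' and 'l2'; the default solver in newer sklearn (lbfgs) rejects 'l1'
        ('log_reg', LogisticRegression(C=C, penalty=penalty, solver='liblinear'))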
    ])

poly_log_reg4 = PolynomiaLogisticRegression(degree=20, C=0.1, penalty='l1')
poly_log_reg4.fit(x_train, y_train)

poly_log_reg4.score(x_train, y_train)
# 0.8266666666666667
poly_log_reg4.score(x_test, y_test)
# 0.9
plot_decision_boundary(poly_log_reg4, axis=[-4, 4, -4, 4])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.show()

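The examples above give a rough picture of the regularization that sklearn already provides. In a real problem, though, we do not know good values for the hyperparameters degree, C, and penalty in advance, so we need a grid search to tune them.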
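A minimal sketch of such a search with sklearn's `GridSearchCV` might look like the following. The candidate values in `param_grid`, the 3-fold setting, and the noise-free data are assumptions for illustration only, not tuned or recommended choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

np.random.seed(666)
x = np.random.normal(0, 1, size=(200, 2))
y = np.array(x[:, 0] ** 2 + x[:, 1] < 1.5, dtype='int')
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)

pipe = Pipeline([
    ('poly', PolynomialFeatures()),
    ('std_scale', StandardScaler()),
    # liblinear supports both 'l1' and 'l2' penalties.
    ('log_reg', LogisticRegression(solver='liblinear')),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'poly__degree': [1, 2, 3],
    'log_reg__C': [0.1, 1.0, 10.0],
    'log_reg__penalty': ['l1', 'l2'],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(x_train, y_train)

print(grid.best_params_)
test_score = grid.score(x_test, y_test)
print(test_score)
```

The search cross-validates every combination on the training split only, then refits the best one, so the held-out test score remains an honest estimate.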

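IX. Solving Multiclass Problems with Logistic Regression

At the start we said that logistic regression handles only binary classification, but with a small modification it can handle multiclass problems too. The modification is not specific to logistic regression: it is a general technique that applies to almost any binary classifier.

1. OvR

OvR (One vs Rest), which some sources also call One vs All (OvA). Take the four-class problem in the original figure: a single logistic-regression line obviously cannot separate four classes. But if we pick any one class and treat all the others as a second class, we get a binary problem. Doing this once for each class gives n binary classifiers for n classes, that is, C(n, 1) = n classifications, and the final prediction is the class whose classifier gives the highest score. That is how the multiclass task gets done.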
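The OvR procedure described above can be sketched by hand in a few lines. The example uses sklearn's binary `LogisticRegression` as the base classifier on the full iris dataset; the variable names and the choice of the `liblinear` solver are assumptions for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=666)

classes = np.unique(y_train)

# One binary classifier per class: "this class" (1) versus "everything else" (0).
binary_models = []
for c in classes:
    clf = LogisticRegression(solver='liblinear')
    clf.fit(x_train, (y_train == c).astype(int))
    binary_models.append(clf)

# Column k holds the probability that each test sample belongs to classes[k].
scores = np.column_stack([m.predict_proba(x_test)[:, 1] for m in binary_models])
y_pred = classes[np.argmax(scores, axis=1)]
ovr_acc = np.mean(y_pred == y_test)
print(len(binary_models), ovr_acc)
```

Each classifier is trained on all samples, so the cost grows linearly with the number of classes.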

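2. OvO

OvO (One vs One): from the n classes, pick two and run a binary classification on just those, then repeat for every pair, which gives C(n, 2) binary classifications. The predicted class is the one that wins the most pairwise contests. A handwritten-digit task, for example, needs C(10, 2) = 45 classifiers. Compared with OvR, which is O(n) in the number of classifiers, OvO is O(n²) and so takes longer, but it is usually more accurate, because each classifier is trained on two real classes without mixing in information from the others.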
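The pairwise voting described above can also be written out by hand. This is a sketch under the same assumptions as any illustrative example here: sklearn's binary `LogisticRegression` with the `liblinear` solver as the base classifier, the full iris dataset, and ties broken by taking the first class with the most votes.

```python
from itertools import combinations
from math import comb

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=666)

classes = np.unique(y_train)
pairs = list(combinations(range(len(classes)), 2))

# One vote counter per (sample, class).
votes = np.zeros((len(x_test), len(classes)), dtype=int)
for a, b in pairs:
    # Train only on the samples of the two classes in this pair.
    mask = (y_train == classes[a]) | (y_train == classes[b])
    clf = LogisticRegression(solver='liblinear')
    clf.fit(x_train[mask], (y_train[mask] == classes[b]).astype(int))
    wins_b = clf.predict(x_test) == 1
    votes[wins_b, b] += 1
    votes[~wins_b, a] += 1

y_pred = classes[np.argmax(votes, axis=1)]
ovo_acc = np.mean(y_pred == y_test)
print(len(pairs), comb(10, 2), ovo_acc)
```

With three iris classes there are only three pairs, but the count grows quadratically: ten digit classes already need 45 classifiers, as `comb(10, 2)` confirms.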

3. Implementing OvR and OvO with Logistic Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def plot_decision_boundary(model, axis):
    x0, x1 = np.meshgrid(np.linspace(axis[0], axis[1], int((axis[1] - axis[0])*100)).reshape(1, -1),
                         np.linspace(axis[2], axis[3], int((axis[3] - axis[2])*100)).reshape(1, -1),)
    x_new = np.c_[x0.ravel(), x1.ravel()]
    y_predict = model.predict(x_new)
    zz = y_predict.reshape(x0.shape)
    
    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#EF9A9A', '#FFF59D', '#90CAF9'])
    
    plt.contourf(x0, x1, zz, cmap=custom_cmap)

iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)

log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#          intercept_scaling=1, max_iter=100, multi_class='warn',
#          n_jobs=None, penalty='l2', random_state=None, solver='warn',
#          tol=0.0001, verbose=0, warm_start=False)

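In the output, multi_class='warn'. The multi_class parameter selects the multiclass strategy, and according to the official documentation the default is OvR. The value 'warn' appears to be a placeholder in this sklearn version: it falls back to OvR while warning that the default will change in a later release. Either way, we pass the parameter explicitly. For the meaning of each parameter, see the official documentation for sklearn.linear_model.LogisticRegression.

The solver parameter exists because sklearn does not simply use gradient descent, and different multiclass strategies need different solvers, so we pass a suitable solver for each. Note that multi_class='multinomial' is not pairwise OvO: it trains a single softmax (multinomial) model over all classes. A true OvO wrapper appears in part 4.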

log_reg = LogisticRegression(multi_class='ovr',solver='liblinear')
log_reg.fit(x_train, y_train)
log_reg.score(x_test, y_test)
# 0.6578947368421053
plot_decision_boundary(log_reg, axis=[4, 8.5, 1.5, 4.5])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.scatter(x[y==2, 0], x[y==2, 1])
plt.show()

log_reg2 = LogisticRegression(multi_class='multinomial', solver='newton-cg')
log_reg2.fit(x_train, y_train)
log_reg2.score(x_test, y_test)
# 0.7894736842105263
plot_decision_boundary(log_reg2, axis=[4, 8.5, 1.5, 4.5])
plt.scatter(x[y==0, 0], x[y==0, 1])
plt.scatter(x[y==1, 0], x[y==1, 1])
plt.scatter(x[y==2, 0], x[y==2, 1])
plt.show()

Neither example achieves high accuracy. The iris dataset has 4 features, and only the first two were used, to make visualization easier. Below we use all of the features.

iris = datasets.load_iris()
x = iris.data[:, :]
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)

# OvR
log_reg3 = LogisticRegression(multi_class='ovr', solver='liblinear')
log_reg3.fit(x_train, y_train)
log_reg3.score(x_test, y_test)
# 0.9473684210526315

# multinomial (softmax) regression: one joint model, not pairwise OvO
log_reg4 = LogisticRegression(multi_class='multinomial', solver='newton-cg')
log_reg4.fit(x_train, y_train)
log_reg4.score(x_test, y_test)
# 1.0

4. OvR and OvO in sklearn

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()

from sklearn.multiclass import OneVsRestClassifier

ovr = OneVsRestClassifier(log_reg)
ovr.fit(x_train, y_train)
# OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, 
# dual=False, fit_intercept=True,
#          intercept_scaling=1, max_iter=100, multi_class='ovr',
#          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
#          tol=0.0001, verbose=0, warm_start=False),
#          n_jobs=None)
ovr.score(x_test, y_test)
# 0.9473684210526315

from sklearn.multiclass import OneVsOneClassifier

ovo = OneVsOneClassifier(log_reg)
ovo.fit(x_train, y_train)
# OneVsOneClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, 
# dual=False, fit_intercept=True,
#          intercept_scaling=1, max_iter=100, multi_class='multinomial',
#          n_jobs=None, penalty='l2', random_state=None, solver='newton-cg',
#          tol=0.0001, verbose=0, warm_start=False),
#          n_jobs=None)

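A Closing Note

That's all for this section.

Toxic chicken soup: So what if you're poor, kid? Even poor, hold your chest high and let people see that you are not only poor but also short.

So what if you're short? Lift your head and let them know that you are not only short but also ugly.

Being ugly doesn't matter. Let your words and manners show everyone that you also lack depth.

Lacking depth is no reason to give up. Start studying now. Once you have read enough books, you will discover that you are also slow-witted.

Keep going~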