Machine Learning in Action: Logistic Regression

Classification with logistic regression

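Key terms: best-fit curve; classification with a logistic regression model; best-fit parameters (w and b); the sigmoid function (S-shaped function); optimization methods; gradient ascent (descent); stochastic gradient descent; and the parameter update rule (gradient, step size).
Finding the best parameters with gradient ascent.
Pseudocode:
set every regression coefficient to 1
for i in steps:
    compute the gradient over the whole data set
    update the regression coefficients: w = w + alpha * gradient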
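The update rule in the pseudocode follows from maximizing the log-likelihood. A brief derivation is sketched below. The symbols are introduced here for illustration and are not in the original: $x_i$ is the $i$-th sample with a leading 1 (so the bias b is folded into $w$ as its first component), $y_i \in \{0, 1\}$ is its label, $X$ stacks the samples as rows, and $h_i$ is the predicted probability.

```latex
\begin{aligned}
h_i &= \sigma(w^\top x_i), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},\\
\ell(w) &= \sum_{i=1}^{m} \bigl[\, y_i \log h_i + (1 - y_i) \log(1 - h_i) \,\bigr],\\
\nabla_w \ell(w) &= \sum_{i=1}^{m} (y_i - h_i)\, x_i = X^\top (y - h),\\
w &\leftarrow w + \alpha\, X^\top (y - h).
\end{aligned}
```

The last line is the step `w = w + alpha * gradient`. Because the log-likelihood is being maximized, the step is added (ascent) rather than subtracted (descent).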

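#Implementation of the gradient ascent algorithm
import numpy as np
def loadDataSet():
    dataMat=[]#feature rows
    labelMat=[]#class labels
    with open('logRegres_testSet.txt') as fr:#open the training set; the file is closed automatically
        for line in fr:#read line by line
            lineArr=line.strip().split()#split on whitespace
            #keep the two features (not the class column) and prepend a constant 1.0 to each row
            dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])
            labelMat.append(int(lineArr[2]))#class label
    return dataMat,labelMat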

#Sigmoid (S-shaped) function
def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))

#Gradient ascent
def gradAscent(dataMatIn,classLabels):
    dataMatrix=np.asmatrix(dataMatIn)#list -> matrix (the np.mat alias was removed in NumPy 2.0)
    labelMat=np.asmatrix(classLabels).transpose()#list -> matrix, transposed into a column
    m,n=np.shape(dataMatrix)#number of rows and columns
    alpha=0.001#step size
    maxCycles=500#number of iterations
    weights=np.ones((n,1))#initial coefficients (all ones)
    for k in range(maxCycles):
        h=sigmoid(dataMatrix*weights)#sigmoid of every sample's score (m x 1)
        error=(labelMat-h)#labels minus predictions
        weights=weights+alpha*dataMatrix.transpose()*error#update the coefficients
    return weights

#Plot the decision boundary
def plotBestFit(wei):
    import matplotlib.pyplot as plt
    weights=wei.getA()#getA() returns the matrix as a plain ndarray
    dataMat,labelMat=loadDataSet()#features and labels
    dataArr=np.array(dataMat)#features as an ndarray
    n=np.shape(dataArr)[0]#number of rows
    xcord1=[];ycord1=[]
    xcord2=[];ycord2=[]
    for i in range(n):#loop over the samples
        if int(labelMat[i])==1:
            #class 1: collect the two features in xcord1/ycord1
            xcord1.append(dataArr[i,1])
            ycord1.append(dataArr[i,2])
        else:#class 0: collect the two features in xcord2/ycord2
            xcord2.append(dataArr[i,1])
            ycord2.append(dataArr[i,2])
    fig=plt.figure()
    #scatter plot of the training points
    ax=fig.add_subplot(111)
    ax.scatter(xcord1,ycord1,s=30,c='red',marker='s')
    ax.scatter(xcord2,ycord2,s=30,c='green')
    x=np.arange(-3.0,3.0,0.1)#points from -3 to 3 in steps of 0.1
    y=(-weights[0]-weights[1]*x)/weights[2]#boundary: w0+w1*x+w2*y=0, solved for y
    ax.plot(x,y)#draw the decision boundary
    plt.xlabel('X1');plt.ylabel('X2')#the vertical axis is the second feature
    plt.show()

A note on the gradient ascent (descent) algorithm above: every update of the regression coefficients requires a pass over the entire data set.
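To make the "whole data set per update" point concrete, here is a minimal sketch of the same batch update written with plain ndarrays. The helper name `grad_ascent` and the synthetic two-cluster data are assumptions for illustration; they stand in for `gradAscent` and the testSet file and are not part of the original code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ascent(X, y, alpha=0.001, max_cycles=500):
    """Batch gradient ascent on plain ndarrays (hypothetical helper)."""
    weights = np.ones(X.shape[1])
    for _ in range(max_cycles):
        h = sigmoid(X @ weights)                   # predictions for all m samples
        weights = weights + alpha * X.T @ (y - h)  # one update uses every sample
    return weights

# Synthetic, well-separated data (an assumption; not the testSet file).
rng = np.random.default_rng(0)
pos = rng.normal(loc=-3.0, scale=1.0, size=(100, 2))  # class 1
neg = rng.normal(loc=3.0, scale=1.0, size=(100, 2))   # class 0
X = np.hstack([np.ones((200, 1)), np.vstack([pos, neg])])  # leading column of 1s
y = np.concatenate([np.ones(100), np.zeros(100)])

weights = grad_ascent(X, y)
accuracy = np.mean((sigmoid(X @ weights) > 0.5) == (y == 1))
print(weights, accuracy)
```

Each iteration costs one matrix-vector product over all m rows, which is why the stochastic variants become attractive as the data set grows.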
In the console:

data,label=loadDataSet()

weights=gradAscent(data,label)

weights
Out[4]: 
matrix([[ 4.12414349],
        [ 0.48007329],
        [-0.6168482 ]])

plotBestFit(weights)

Resulting plot:
[Figure: logistic regression decision boundary]

Training the algorithm: stochastic gradient ascent (SGD)

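The previous gradient algorithm can be improved by updating the regression coefficients with only one sample at a time. This is stochastic gradient ascent.
Pseudocode:
set every regression coefficient to 1
for each sample in the data set:
    compute the gradient of that sample
    update the regression coefficients: w = w + alpha * gradient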

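#Stochastic gradient ascent
def stocGradAscent0(dataMatrix,classLabels):
    m,n=np.shape(dataMatrix)
    alpha=0.001
    weights=np.ones(n)
    for i in range(m):
        h=sigmoid(sum(dataMatrix[i]*weights))
        error=classLabels[i]-h
        weights=weights+alpha * error * dataMatrix[i]
    #convert weights to a column matrix here so that plotBestFit can use it directly
    temp=np.asmatrix(weights)
    weights=temp.transpose()
    return weights
#In stochastic gradient ascent h and error are scalars, whereas in the batch version they are vectors
#Here weights is a 1-D array of length 3, so it is converted with np.asmatrix() and transposed before plotting
#gradAscent (GD) already returns weights as a 3x1 matrix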

Console session:

data,labels=loadDataSet()

weights=stocGradAscent0(np.array(data),labels)

weights
Out[82]: 
matrix([[ 0.963951  ],
        [ 0.9826866 ],
        [ 0.49153886]])

plotBestFit(weights)

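[Figure: SGD result]
The fit is worse than before. The earlier result came from 500 iterations over the full data set, whereas this SGD run makes only a single pass over the rows. The step size matters too: SGD is meant to use a step size of 0.01 rather than the 0.001 used for GD (the listing above sets alpha=0.001), and SGD with 0.001 does even worse.
The next step is to improve SGD.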

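#Improved SGD algorithm
def stocGradAscent1(dataMatrix,classLabels,numIter=150):
    m,n=np.shape(dataMatrix)
    weights=np.ones(n)
    for j in range(numIter):#number of passes over the data
        #a range object does not support del in Python 3, so build a list
        dataIndex=list(range(m))
        for i in range(m):
            alpha=4/(1.0+j+i)+0.01
            #pick a random position among the rows not yet used in this pass
            randIndex=int(np.random.uniform(0,len(dataIndex)))
            sampleIndex=dataIndex[randIndex]#map the position to the actual row number
            h=sigmoid(sum(dataMatrix[sampleIndex]*weights))
            error=classLabels[sampleIndex]-h
            weights=weights+alpha*error*dataMatrix[sampleIndex]
            del(dataIndex[randIndex])#remove it so the row is not reused in this pass
    temp=np.asmatrix(weights)
    weights=temp.transpose()
    return weights
#Two improvements: 1. alpha changes at every update; it shrinks as j and i grow but never reaches 0
#2. the coefficients are updated from randomly chosen samples, one at a time, and each chosen sample is removed before the next update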
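The two changes can be looked at in isolation. The sketch below evaluates the step-size formula from `stocGradAscent1` and shows sampling without replacement through a random permutation. The names `step_size` and `visit_order` are hypothetical, and the permutation is a swapped-in equivalent of the pick-then-delete loop, not the author's code.

```python
import numpy as np

def step_size(j, i):
    """Step size from stocGradAscent1 for pass j and inner step i."""
    return 4 / (1.0 + j + i) + 0.01

# The step shrinks as j and i grow, but the +0.01 term keeps it above 0.01.
print(step_size(0, 0))      # first update of the first pass: about 4.01
print(step_size(149, 99))   # late update: 4/249 + 0.01

# Because i restarts at 0 on every pass, the step jumps back up at the
# start of each pass, so the decrease is not strictly monotonic.
print(step_size(1, 0) > step_size(0, 99))

# Sampling without replacement: a permutation visits every row exactly once.
rng = np.random.default_rng(0)
m = 5
visit_order = rng.permutation(m)
print(sorted(visit_order.tolist()))
```

Visiting rows in random order helps reduce the periodic fluctuations that can appear when samples are always processed in the same sequence.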

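The result:
[Figure: improved SGD after 150 iterations]

Above is the result after 150 iterations; below is the result after 500 iterations.
[Figure: improved SGD after 500 iterations]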

Example: predicting the mortality of horses with colic

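Data preprocessing. When feature values are missing, the options are:
fill the missing value with the mean of the available values of that feature; fill it with a special value; discard samples that have missing values; fill it with the mean from similar samples; or predict it with another machine learning algorithm.
If the class label is missing, the row should be discarded, because it is hard to decide which class to fill in.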
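A minimal sketch of three of these options with NumPy follows. It assumes missing entries are marked as NaN, and the small `features` and `labels` arrays are made up for illustration; the horse colic files may encode missing values differently.

```python
import numpy as np

# Hypothetical toy data: rows are samples, NaN marks a missing feature value.
features = np.array([[1.0, np.nan, 3.0],
                     [4.0, 5.0, np.nan],
                     [7.0, 8.0, 9.0]])
labels = np.array([1.0, np.nan, 0.0])  # NaN marks a missing class label

# Discard samples whose class label is missing.
has_label = ~np.isnan(labels)
features, labels = features[has_label], labels[has_label]

# Option 1: fill a missing feature with the mean of that column's available values.
col_means = np.nanmean(features, axis=0)
mean_filled = np.where(np.isnan(features), col_means, features)

# Option 2: fill with a special value such as 0.
zero_filled = np.nan_to_num(features, nan=0.0)

print(mean_filled)
print(zero_filled)
```

For logistic regression, 0 is a convenient special value: in the update `weights + alpha*error*x`, a feature equal to 0 leaves its own coefficient unchanged.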
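With the data cleaned, it can be used for prediction. The regression coefficients are obtained from the training set as before, so all that remains is to multiply each test sample by the coefficients and pass the result through the sigmoid function to decide whether the horse dies, and from that to compute the mortality rate.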

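#Use the algorithms above to predict the mortality of horses with colic
#The coefficients come from training, so a test sample is multiplied by them and passed through the sigmoid
def classifyVector(inX,weights):
    prob=sigmoid(sum(inX*weights))#sigmoid of the weighted sum
    if prob>0.5:
        return 1.0
    else:
        return 0.0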
    
def colicTest():
    trainingSet=[]
    trainingLabels=[]
    with open('horseColicTraining.txt') as frTrain:#training set
        for line in frTrain:#iterate over every row
            currLine=line.strip().split('\t')
            lineArr=[]
            #each row holds 22 values; the last one is the class label
            for i in range(21):#the 21 features (0 to 20) go into lineArr
                lineArr.append(float(currLine[i]))
            trainingSet.append(lineArr)#then into trainingSet
            trainingLabels.append(float(currLine[21]))#class label
    #fit the regression coefficients on the training set
    trainWeights=stocGradAscent1(np.array(trainingSet),trainingLabels,500)
    errorCount=0#number of misclassified samples
    numTestVec=0.0#number of test samples seen so far
    with open('horseColicTest.txt') as frTest:#test set
        for line in frTest:#read every row of the test set
            numTestVec+=1.0
            currLine=line.strip().split('\t')#split the row on tabs
            lineArr=[]
            for i in range(21):#only the first 21 values are features
                lineArr.append(float(currLine[i]))
            if int(classifyVector(np.array(lineArr),trainWeights))!=int(currLine[21]):
                errorCount+=1#prediction differs from the true label
    errorRate=(float(errorCount)/numTestVec)#error rate
    print("the error rate of this test is: %f" %errorRate)
    return errorRate
def multiTest():#average the error rate over several runs
    numTests=10
    errorSum=0.0
    for k in range(numTests):
        errorSum+=colicTest()
    print("after %d iterations the average error rate is: %f" %(numTests,errorSum/float(numTests)))

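Note that after opening a file in Python, the data usually has to be parsed into shape. Typically each line is split, and two empty lists are filled with the feature columns and the class column respectively, so the data ends up in two parts.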
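The same split into features and labels can also be done in one step with `np.loadtxt`. A minimal sketch, assuming a purely numeric, tab-delimited file with the class label in the last column; the in-memory `sample` text is made up and stands in for a real file:

```python
import io
import numpy as np

# Hypothetical two-row file: two feature columns followed by the class label.
sample = io.StringIO("1.0\t2.0\t0\n3.0\t4.0\t1\n")

data = np.loadtxt(sample, delimiter="\t")  # parse every value as float
features = data[:, :-1]                    # all columns except the last
labels = data[:, -1]                       # last column is the class

print(features)
print(labels)
```

Passing a file name instead of the `StringIO` object works the same way. The explicit loop remains useful when rows need per-line cleaning.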
Result:

multiTest()
__main__:24: RuntimeWarning: overflow encountered in exp
the error rate of this test is: 0.373134
the error rate of this test is: 0.283582
the error rate of this test is: 0.313433
the error rate of this test is: 0.238806
the error rate of this test is: 0.417910
the error rate of this test is: 0.402985
the error rate of this test is: 0.358209
the error rate of this test is: 0.432836
the error rate of this test is: 0.343284
the error rate of this test is: 0.313433
after 10 iterations the average error rate is: 0.347761
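The `RuntimeWarning: overflow encountered in exp` in the output comes from `np.exp(-inX)` when `inX` is a large negative number. The returned value is still 0.0, so the classification is unaffected, but the warning can be avoided. The sketch below is one common way to do that; the name `stable_sigmoid` is hypothetical and it is not the author's `sigmoid`.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that never exponentiates a large positive number."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) is at most 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x)).
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

# Values 0, 0.5 and 1, with no overflow warning.
print(stable_sigmoid([-1000.0, 0.0, 1000.0]))
```

The function always returns an array, so a caller that passes a scalar would index the result with `[0]`.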

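Summary: logistic regression is a classification algorithm that feeds the parameters and features into the sigmoid function to produce a class. The main task is finding the regression coefficients, which is done with optimization methods. The most common is gradient ascent (descent), in its batch (GD) and stochastic (SGD) forms. SGD is an online learning algorithm; processing all the data at once is, by contrast, called batch processing.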

invictus maneo!