ML: k-Nearest Neighbors Algorithm

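In this section:

The k-nearest neighbors classification algorithm
Parsing and importing data from a text file
Creating scatter plots with Matplotlib
Normalizing numeric values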

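I. Overview of the k-Nearest Neighbors Algorithm

Put simply, the k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distances between its feature values and those of samples whose classes are already known.

k-nearest neighbors algorithm

Pros: high accuracy, insensitive to outliers, no assumptions about the input data

Cons: computationally expensive, high memory requirements

Works with: numeric and nominal values

As an example, we use kNN to classify movies as romance or action: given the number of fight scenes and kissing scenes in a movie, is it a romance movie or an action movie?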

from IPython.display import Image
Image(filename="./data/2_1.png",width=500)

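First we need to know how many fight scenes and kissing scenes the unknown movie contains. In the figure, the "?" marks the position of the unknown movie according to these two counts.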

Movie title	Fight scenes	Kissing scenes	Movie type
California Man	3	104	Romance
He’s Not Really into Dudes	2	100	Romance
Beautiful Woman	1	81	Romance
Kevin Longblade	101	10	Action
Robo Slayer 3000	99	5	Action
Amped II	98	2	Action
?	18	90	Unknown

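Even without knowing which type the unknown movie belongs to, we can work it out. First, compute the distance between the unknown movie and every other movie in the sample set.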

Movie title	Distance to the unknown movie
California Man	20.5
He’s Not Really into Dudes	18.9
Beautiful Woman	19.2
Kevin Longblade	115.3
Robo Slayer 3000	117.4
Amped II	118.9
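The values in this table are consistent with the ordinary Euclidean distance in the (fight scenes, kissing scenes) plane. As a worked check for California Man, with the symbols x_A and x_B introduced here only for illustration:

```latex
d(x_A, x_B) = \sqrt{(x_{A,0} - x_{B,0})^2 + (x_{A,1} - x_{B,1})^2}

d = \sqrt{(18 - 3)^2 + (90 - 104)^2} = \sqrt{225 + 196} = \sqrt{421} \approx 20.5
```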

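Now that we have the distances between the unknown movie and every movie in the sample set, we sort them in ascending order and pick the k nearest movies. With k=3, the three nearest movies are, in order, He’s Not Really into Dudes, Beautiful Woman, and California Man. The kNN algorithm decides the type of the unknown movie from the types of these three movies; since all three are romance movies, we conclude that the unknown movie is a romance movie.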

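General workflow of the k-nearest neighbors algorithm

Collect the data: any method can be used
Prepare the data: numeric values are required for the distance computation
Analyze the data: any method can be used
Train the algorithm: this step does not apply to kNN
Test the algorithm: compute the error rate
Use the algorithm: first supply the sample data and the structured output, then run kNN to determine which class each input belongs to, and finally carry out any follow-up processing on the computed classes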

1. Preparation: importing data with Python

import numpy as np
import operator

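def createDataSet():
    # Each row is [fight scenes, kissing scenes] for one movie in the table above
    dataset=np.array([[3,104],[2,100],[1,81],[101,10],[99,5],[98,2]])
    # Labels in row order: "爱情片" = romance movie, "动作片" = action movie
    labels=["爱情片","爱情片","爱情片","动作片","动作片","动作片"]
    return dataset,labels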

dataset,labels=createDataSet()

dataset

array([[  3, 104],
       [  2, 100],
       [  1,  81],
       [101,  10],
       [ 99,   5],
       [ 98,   2]])

labels

['爱情片', '爱情片', '爱情片', '动作片', '动作片', '动作片']

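The vector labels holds the label of each data point; the number of elements in labels equals the number of rows in the dataset matrix. In the plot below, the red points are romance movies (爱情片) and the blue points are action movies (动作片).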

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

plt.plot([3,2,1],[104,100,81],"ro",[101,99,98],[10,5,2],"b^")

[<matplotlib.lines.Line2D at 0x2075b9f8358>,
 <matplotlib.lines.Line2D at 0x2075b9f8470>]

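2. Implementing the kNN classification algorithm

For each point in the data set whose class is unknown, perform the following steps in order:

Compute the distance between the current point and every point in the data set of known classes
Sort the distances in ascending order
Select the k points closest to the current point
Determine how often each class occurs among those k points
Return the most frequent class among the k points as the predicted class of the current point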

def classMovieTest(X,dataset,labels,k):
    """
    :param X: input vector to classify
    :param dataset: training sample set
    :param labels: label vector
    :param k: number of nearest neighbors to use
    :return: predicted class label; distances to the known samples
    """

    # Compute the Euclidean distance from X to every row of dataset
    datasetSize=dataset.shape[0]
    datasetMat=np.tile(X,(datasetSize,1))-dataset
    sqdatasetMat=datasetMat**2
    sqDistances=sqdatasetMat.sum(axis=1)
    distances=sqDistances**0.5
    # Indices that sort the distances in ascending order
    sortDistIndicies=distances.argsort()
    classcount={}
    # Count the labels of the k nearest points
    for i in range(k):
        voteLabel=labels[sortDistIndicies[i]]
        classcount[voteLabel]=classcount.get(voteLabel,0)+1

    # Sort the labels by vote count, most frequent first
    sortClasscount=sorted(classcount.items(),key=operator.itemgetter(1),reverse=True)
    return sortClasscount[0][0],distances

Predict the class of the input X=[18,90]; the result should agree with the analysis above.

classMovieTest([18,90],dataset,labels,3)

('爱情片', array([ 20.51828453,  18.86796226,  19.23538406, 115.27792503,
        117.41379817, 118.92854998]))
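The same computation can be written more compactly with NumPy broadcasting and collections.Counter. The following is only a sketch of an alternative, not the implementation used in the rest of this section; the name knn_classify and the English label strings are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def knn_classify(x, dataset, labels, k):
    """Hypothetical variant of classMovieTest that relies on broadcasting."""
    x = np.asarray(x, dtype=float)
    # Broadcasting subtracts x from every row, so np.tile is not needed.
    diffs = dataset - x
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    votes = Counter(labels[i] for i in nearest)
    # most_common(1) returns [(label, count)] for the most frequent label.
    return votes.most_common(1)[0][0], distances


# Same six movies as above, with English labels for readability.
dataset = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
labels = ["romance", "romance", "romance", "action", "action", "action"]
label, distances = knn_classify([18, 90], dataset, labels, k=3)
print(label)
print(np.round(distances, 1))
```

For the input [18, 90] this sketch also predicts a romance movie, and the distances match the values printed above.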

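II. Using kNN to Improve Matches on a Dating Site

Three types of people:

People the user does not like
People of average appeal
People of great appeal

1. Prepare the data: parsing data from a text file

The data are stored in the text file datingTestSet2.txt, one sample per line, 1000 lines in total. Each sample has the following three features:

Number of frequent flier miles earned per year
Percentage of time spent playing video games
Liters of ice cream consumed per week

We create a function named fileTmatrix to handle the input format. It takes a file name string as input and returns the training sample matrix and the class label vector.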

def fileTmatrix(filename):
    """
    :param filename: name of the data file
    :return: training data matrix; class label vector
    """
    # Read all lines; the with block closes the file afterwards
    with open(filename) as fr:
        arrayLines=fr.readlines()

    # Number of lines in the file
    numberLines=len(arrayLines)

    # Create the NumPy matrix to return
    datasetMat=np.zeros((numberLines,3))
    classLabelVector=[]
    index=0

    # Parse each line: three tab-separated features, then an integer label
    for line in arrayLines:
        line=line.strip()
        listFromLine=line.split("\t")
        datasetMat[index,:]=listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index+=1
    return datasetMat,classLabelVector

dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")

dataMat

array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
       [1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
       [2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
       ...,
       [2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
       [4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
       [4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])

dataLabels[0:20]

[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
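Because every line of the file is purely numeric and tab-separated, np.loadtxt can do the same parsing in one call. This is a sketch under that assumption; load_dating_data is a hypothetical name, and the two sample rows are copied from the output above and written to a temporary file so that the example runs on its own.

```python
import os
import tempfile

import numpy as np


def load_dating_data(filename):
    """Hypothetical alternative to fileTmatrix built on np.loadtxt."""
    # Each row: three tab-separated features followed by an integer label.
    raw = np.loadtxt(filename, delimiter="\t", ndmin=2)
    features = raw[:, :3]
    labels = raw[:, -1].astype(int).tolist()
    return features, labels


# Two rows in the same layout as the data file, for illustration only.
sample = "40920\t8.326976\t0.953952\t3\n14488\t7.153469\t1.673904\t2\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

features, labels = load_dating_data(path)
os.remove(path)
print(features.shape, labels)
```

The explicit loop in fileTmatrix is easier to adapt when a file mixes text and numbers; np.loadtxt is shorter when all columns are numeric.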

2. Analyze the data: creating scatter plots with Matplotlib

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

plt.plot(dataMat[:,1],dataMat[:,2],"bo")

plt.xlabel("Percentage of Time Spent Playing Video Games")
plt.ylabel("Liters of ice cream consumed per week")

plt.show()

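Matplotlib's scatter function supports customizing the markers of a scatter plot. Here the size and the color of each point are both derived from its class label.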

fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(dataMat[:,1],dataMat[:,2],15.0*np.array(dataLabels),15.0*np.array(dataLabels))

<matplotlib.collections.PathCollection at 0x2075c05ea58>

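Plotting the first and second columns of dataMat (frequent flier miles and percentage of time spent playing video games) instead gives a much better result: the figure clearly shows three distinct class regions, and people with different interests fall into different regions.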

fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(dataMat[:,0],dataMat[:,1],15.0*np.array(dataLabels),15.0*np.array(dataLabels))

<matplotlib.collections.PathCollection at 0x2075d1d50b8>

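3. Prepare the data: normalizing numeric values

The frequent flier miles take values thousands of times larger than the other two features, so without rescaling they would dominate the distance. We therefore convert feature values of arbitrary range into values in the interval from 0 to 1:

newValue=(oldValue-min)/(max-min)

The function Norm rescales the numeric feature values to the interval from 0 to 1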

def Norm(dataset):
    """
    :param dataset: data set
    :return: normalized data set; range (max - min) of each column; minimum of each column
    """

    # Argument 0 takes the minimum and maximum along each column
    minVal=dataset.min(0)
    maxVal=dataset.max(0)
    ranges=maxVal-minVal
    m=dataset.shape[0]
    # Subtract the column minimum from every row
    normDataset=dataset-np.tile(minVal,(m,1))

    # Divide element-wise by the column range
    normDataset=normDataset/np.tile(ranges,(m,1))
    return normDataset,ranges,minVal

normMat,ranges,minVal=Norm(dataMat)

normMat

array([[0.44832535, 0.39805139, 0.56233353],
       [0.15873259, 0.34195467, 0.98724416],
       [0.28542943, 0.06892523, 0.47449629],
       ...,
       [0.29115949, 0.50910294, 0.51079493],
       [0.52711097, 0.43665451, 0.4290048 ],
       [0.47940793, 0.3768091 , 0.78571804]])

ranges

array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00])

minVal

array([0.      , 0.      , 0.001156])
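The formula newValue=(oldValue-min)/(max-min) can also be applied with NumPy broadcasting instead of np.tile. The sketch below uses a small made-up array so the result is easy to check by hand; min_max_normalize is a hypothetical name, not part of the original code.

```python
import numpy as np


def min_max_normalize(dataset):
    """Hypothetical broadcasting version of Norm."""
    min_val = dataset.min(axis=0)            # column-wise minimum
    ranges = dataset.max(axis=0) - min_val   # column-wise max - min
    # Broadcasting applies the per-column values to every row.
    return (dataset - min_val) / ranges, ranges, min_val


# Made-up data: column 0 spans 0..100, column 1 spans 10..30.
data = np.array([[0.0, 10.0], [50.0, 20.0], [100.0, 30.0]])
norm, ranges, min_val = min_max_normalize(data)
print(norm)

# A new sample must be scaled with the same min_val and ranges.
query = np.array([25.0, 15.0])
print((query - min_val) / ranges)
```

Each column of the result runs from 0 to 1, and the query maps to [0.25, 0.25]. Note that a column whose values are all equal has a range of 0 and would need special handling before the division.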

4. Test the algorithm: verifying the classifier as a complete program

def classMovieTest(X,dataset,labels,k):
    """
    :param X: input vector to classify
    :param dataset: training sample set
    :param labels: label vector
    :param k: number of nearest neighbors to use
    :return: predicted class label
    """

    # Compute the Euclidean distance from X to every row of dataset
    datasetSize=dataset.shape[0]
    datasetMat=np.tile(X,(datasetSize,1))-dataset
    sqdatasetMat=datasetMat**2
    sqDistances=sqdatasetMat.sum(axis=1)
    distances=sqDistances**0.5
    # Indices that sort the distances in ascending order
    sortDistIndicies=distances.argsort()
    classcount={}
    # Count the labels of the k nearest points
    for i in range(k):
        voteLabel=labels[sortDistIndicies[i]]
        classcount[voteLabel]=classcount.get(voteLabel,0)+1

    # Sort the labels by vote count, most frequent first
    sortClasscount=sorted(classcount.items(),key=operator.itemgetter(1),reverse=True)
    return sortClasscount[0][0]

def classTest():
    # Fraction of the data held out for testing
    haRatio=0.10
    dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
    normMat,ranges,minvals=Norm(dataMat)
    m=normMat.shape[0]
    numTestVecs=int(m*haRatio)
    errorcount=0.0

    # The first numTestVecs rows are test samples; the remaining rows form the reference set
    for i in range(numTestVecs):
        classifierResult=classMovieTest(normMat[i,:],normMat[numTestVecs:m,:],dataLabels[numTestVecs:m],3)
        print("The classifier came back with:%d,The real answer is:%d"%(classifierResult,dataLabels[i]))
        if (classifierResult!=dataLabels[i]):
            errorcount+=1.0
    # Despite its label, this line prints the number of errors, not a rate
    print("The total error rate is:%d"%errorcount)
    print("The total error rate is:%f"%(errorcount/numTestVecs))

classTest()

The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:1
The total error rate is:5
The total error rate is:0.050000

What if we use the entire data set as the reference (training) set: does the accuracy improve?

def classTest2():
    dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
    normMat,ranges,minvals=Norm(dataMat)
    m=normMat.shape[0]
    errorcount=0.0
    
    for i in range(m):
        classifierResult2=classMovieTest(normMat[i,:],normMat[:,:],dataLabels[:],3)
        
        if (classifierResult2!=dataLabels[i]):
            errorcount+=1.0
    print("The total error rate:",(errorcount/m))

classTest2()

The total error rate: 0.027

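The reported error rate drops from 5% to 2.7%. Note, however, that every sample classified in classTest2 is itself a row of the reference set, so it always finds itself at distance 0 among its nearest neighbors. The 2.7% figure is therefore an optimistic estimate and is not directly comparable with the 5% measured on held-out samples.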
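A common way to use all of the data while keeping the classified sample out of its own reference set is leave-one-out evaluation. The sketch below is illustrative only: leave_one_out_error and the tiny data set are made up here and are not part of the original code or data.

```python
from collections import Counter

import numpy as np


def leave_one_out_error(data, labels, k=3):
    """Hypothetical leave-one-out error estimate for kNN."""
    labels = np.asarray(labels)
    n = data.shape[0]
    errors = 0
    for i in range(n):
        # Drop sample i so that it cannot vote for itself.
        mask = np.arange(n) != i
        distances = np.sqrt(((data[mask] - data[i]) ** 2).sum(axis=1))
        nearest = np.argsort(distances)[:k]
        votes = Counter(labels[mask][nearest].tolist())
        if votes.most_common(1)[0][0] != labels[i]:
            errors += 1
    return errors / n


# Made-up data: two tight clusters, with one point in the first
# cluster carrying the label of the second cluster.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                 [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1]])
labels = [1, 1, 1, 2, 2, 2, 2, 2]
print(leave_one_out_error(data, labels, k=3))
```

On this toy data only the oddly labeled point is misclassified, giving an error rate of 1/8. On the real data the same loop could be run on normMat and dataLabels.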

5. Use the algorithm: building a complete, usable system

def classifyPerson():
    resultList=["not at all","in small doses","in large doses"]
    percentTats=float(input("Percentage of time spent playing video games: "))
    ffMiles=float(input("Frequent flier miles earned per year: "))
    iceCream=float(input("Liters of ice cream consumed per week: "))
    datingDataMat,datingLabels=fileTmatrix("./data/datingTestSet2.txt")
    normMat,ranges,minvals=Norm(datingDataMat)
    # Feature order must match the columns of the data file
    inArr=np.array([ffMiles,percentTats,iceCream])
    # Scale the new sample with the training minimum and range before classifying
    classifierResult=classMovieTest((inArr-minvals)/ranges,normMat,datingLabels,3)
    print("You will probably like this person:",resultList[classifierResult-1])

classifyPerson()

Percentage of time spent playing video games: 10
Frequent flier miles earned per year: 10000
Liters of ice cream consumed per week: 0.5


You will probably like this person: in small doses

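III. Handwriting Recognition System

We build a system that recognizes the digits 0 through 9. The images have already been processed to the same color and size: black-and-white images 32 pixels wide by 32 pixels high.

1. Prepare the data: converting images into test vectors

The directory trainingDigits contains roughly 2000 examples, about 200 samples per digit; the directory testDigits contains roughly 900 test samples.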

from IPython.display import Image

Image(filename="./data/2_2.png",width=500)

Image(filename="./data/2_3.png",width=500)

Image(filename="./data/2_4.png",width=500)

We convert each 32×32 binary image matrix into a 1×1024 vector. First we write a function, imgTvector, that converts an image into a vector.

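def imgTvector(filename):
    # 1 x 1024 row vector that will hold the flattened 32 x 32 image
    returnVect=np.zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr=fr.readline()
            # Copy the 32 characters of row i into positions 32*i .. 32*i+31
            for j in range(32):
                returnVect[0,32*i+j]=int(lineStr[j])
    return returnVect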

testVector=imgTvector("./data/digits/testDigits/0_13.txt")

testVector[0,0:31]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
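The row-by-row flattening can also be expressed as a reshape. The sketch below assumes the same file layout as above (32 lines of 32 characters, each "0" or "1"); img_to_vector and the generated test image are hypothetical and exist only so that the example is self-contained.

```python
import os
import tempfile

import numpy as np


def img_to_vector(filename, size=32):
    """Hypothetical variant of imgTvector that flattens with reshape."""
    with open(filename) as fr:
        # Keep the first `size` lines and the first `size` characters of each.
        rows = [[int(ch) for ch in fr.readline()[:size]] for _ in range(size)]
    # Flatten the size x size grid row by row into a 1 x size*size vector.
    return np.array(rows, dtype=float).reshape(1, size * size)


# Made-up 32x32 image: all zeros except four ones in the first row.
lines = ["0" * 14 + "1" * 4 + "0" * 14] + ["0" * 32] * 31
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(lines) + "\n")
    path = tmp.name

vec = img_to_vector(path)
os.remove(path)
print(vec.shape, vec[0, 14:18])
```

The result has shape (1, 1024), and the four ones of the first image row end up at positions 14 to 17 of the vector.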

2. Test the algorithm: recognizing handwritten digits with kNN

from os import listdir

def handwritingClassTest():
    hwLabels=[]
    # Build the training matrix: one 1024-element row per file in trainingDigits
    trainingFileList=listdir("./data/digits/trainingDigits/")
    m=len(trainingFileList)
    trainingMat=np.zeros((m,1024))
    for i in range(m):
        fileNameStr=trainingFileList[i]
        # File names look like 0_13.txt; the digit before "_" is the class label
        fileStr=fileNameStr.split(".")[0]
        classNumStr=int(fileStr.split("_")[0])
        hwLabels.append(classNumStr)
        trainingMat[i,:]=imgTvector("./data/digits/trainingDigits/%s"%fileNameStr)
    # Classify every file in testDigits and count the mistakes
    testFileList=listdir("./data/digits/testDigits/")
    errorCount=0.0
    mTest=len(testFileList)
    for i in range(mTest):
        fileNameStr=testFileList[i]
        fileStr=fileNameStr.split(".")[0]
        classNumStr=int(fileStr.split("_")[0])
        vectorUnderTest=imgTvector("./data/digits/testDigits/%s"%fileNameStr)
        classifierResult=classMovieTest(vectorUnderTest,trainingMat,hwLabels,3)
        print("The classifier came back with:%d,The real answer is:%d"%(classifierResult,classNumStr))
        if (classifierResult!=classNumStr):
            errorCount+=1.0
    print("The total number of errors is:%d"%errorCount)
    print("The total error rate is:%f"%(errorCount/float(mTest)))

handwritingClassTest()

The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
.
.
.
The classifier came back with:9,The real answer is:9
The classifier came back with:9,The real answer is:9
The classifier came back with:9,The real answer is:9
The total number of errors is:10
The total error rate is:0.010571

On this handwritten digit data set, the k-nearest neighbors algorithm has an error rate of about 1.1%.
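handwritingClassTest relies on the file naming convention, in which the true digit is the part of the file name before the underscore (as in 0_13.txt). A small helper makes that step explicit. This is a sketch: digit_from_filename is a hypothetical name, and 9_45.txt is an invented file name used only as a second example.

```python
import os


def digit_from_filename(file_name):
    """Hypothetical helper: '0_13.txt' -> 0."""
    # Drop the extension, then take the part before the underscore.
    stem, _ext = os.path.splitext(file_name)
    return int(stem.split("_")[0])


print(digit_from_filename("0_13.txt"))
print(digit_from_filename("9_45.txt"))  # invented file name
```

The two calls print 0 and 9. A file name without a leading digit raises ValueError, which is a useful early signal that a stray file has ended up in the data directory.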