Machine Learning in Action - Study Notes - Chapter 14

1. Copy the code to F:\studio\MachineLearningInAction\ch14

2. Start IPython

3. In IPython, change the working directory to F:\studio\MachineLearningInAction\ch14

In [17]: cd F:\studio\MachineLearningInAction\ch14
F:\studio\MachineLearningInAction\ch14

4. Create a new file svdRec.py in the working directory and add the following code:

from numpy import *
from numpy import linalg as la

def loadExData():
    # a small example ratings matrix (rows are users, columns are items)
    return [[0, 0, 0, 2, 2],
           [0, 0, 0, 3, 3],
           [0, 0, 0, 1, 1],
           [1, 1, 1, 0, 0],
           [2, 2, 2, 0, 0],
           [5, 5, 5, 0, 0],
           [1, 1, 1, 0, 0]]

5. Perform the SVD and verify the factorization:

In [18]: import svdRec

In [19]: Data=svdRec.loadExData()

In [20]: U,Sigma,VT=linalg.svd(Data)

In [21]: Sigma
Out[21]:
array([  9.64365076e+00,   5.29150262e+00,   9.11145502e-16,
         1.40456183e-16,   3.09084552e-17])

In [22]: Sig2=mat([[Sigma[0],0],[0,Sigma[2]]])

In [23]: Sig2
Out[23]:
matrix([[  9.64365076e+00,   0.00000000e+00],
        [  0.00000000e+00,   9.11145502e-16]])

(Note: In [22] mistakenly paired Sigma[0] with the near-zero Sigma[2]; In [24] corrects this by using the two largest singular values, Sigma[0] and Sigma[1].)

In [24]: Sig2=mat([[Sigma[0],0],[0,Sigma[1]]])

In [25]: Sig2
Out[25]:
matrix([[ 9.64365076,  0.        ],
        [ 0.        ,  5.29150262]])

In [26]: U[:,:2]*Sig2*VT[:2,:]
Out[26]:
matrix([[ -1.36157966e-16,  -8.59140046e-16,  -8.59140046e-16,
           2.00000000e+00,   2.00000000e+00],
        [  7.22982080e-16,  -3.61491040e-16,  -3.61491040e-16,
           3.00000000e+00,   3.00000000e+00],
        [  2.40994027e-16,  -1.20497013e-16,  -1.20497013e-16,
           1.00000000e+00,   1.00000000e+00],
        [  1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          -8.60707644e-18,  -8.60707644e-18],
        [  2.00000000e+00,   2.00000000e+00,   2.00000000e+00,
          -1.72141529e-17,  -1.72141529e-17],
        [  5.00000000e+00,   5.00000000e+00,   5.00000000e+00,
          -1.39716789e-16,  -1.39716789e-16],
        [  1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          -8.60707644e-18,  -8.60707644e-18]])

As the output shows, U[:,:2]*Sig2*VT[:2,:] is a very good approximation of the original Data matrix: the reconstructed entries differ from the originals only by terms on the order of 1e-16.
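Rather than eyeballing the 1e-16 residuals, the approximation can be verified programmatically; a small self-contained sketch:

```python
import numpy as np

# the example matrix from loadExData()
data = np.array([[0, 0, 0, 2, 2],
                 [0, 0, 0, 3, 3],
                 [0, 0, 0, 1, 1],
                 [1, 1, 1, 0, 0],
                 [2, 2, 2, 0, 0],
                 [5, 5, 5, 0, 0],
                 [1, 1, 1, 0, 0]])

U, Sigma, VT = np.linalg.svd(data)
# rebuild the matrix from only the two largest singular values
approx = U[:, :2].dot(np.diag(Sigma[:2])).dot(VT[:2, :])
print(np.allclose(approx, data))  # True: the matrix has rank 2, so nothing is lost
```

The check passes because the example matrix has rank 2 by construction, so the remaining singular values are numerically zero.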

6. Add the following code to svdRec.py:

def ecludSim(inA,inB):
    # similarity based on Euclidean distance, mapped into (0, 1]
    return 1.0/(1.0 + la.norm(inA - inB))

def pearsSim(inA,inB):
    # Pearson correlation, rescaled from [-1, 1] to [0, 1]
    if len(inA) < 3 : return 1.0
    return 0.5+0.5*corrcoef(inA, inB, rowvar = 0)[0][1]

def cosSim(inA,inB):
    # cosine of the angle between the vectors, rescaled to [0, 1]
    num = float(inA.T*inB)
    denom = la.norm(inA)*la.norm(inB)
    return 0.5+0.5*(num/denom)

The code above defines three different similarity measures (Euclidean-distance-based, Pearson correlation, and cosine); each is rescaled so the result lies between 0 and 1.
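A quick sanity check of the three measures on columns of the first example matrix; a self-contained sketch (the functions are repeated here so the snippet runs on its own):

```python
import numpy as np
from numpy import linalg as la

def ecludSim(inA, inB):
    # similarity based on Euclidean distance, mapped into (0, 1]
    return 1.0 / (1.0 + la.norm(inA - inB))

def pearsSim(inA, inB):
    # Pearson correlation, rescaled from [-1, 1] to [0, 1]
    if len(inA) < 3: return 1.0
    return 0.5 + 0.5 * np.corrcoef(inA, inB, rowvar=0)[0][1]

def cosSim(inA, inB):
    # cosine of the angle between the vectors, rescaled to [0, 1]
    num = float(inA.T * inB)
    denom = la.norm(inA) * la.norm(inB)
    return 0.5 + 0.5 * (num / denom)

myMat = np.mat([[0, 0, 0, 2, 2],
                [0, 0, 0, 3, 3],
                [0, 0, 0, 1, 1],
                [1, 1, 1, 0, 0],
                [2, 2, 2, 0, 0],
                [5, 5, 5, 0, 0],
                [1, 1, 1, 0, 0]])

print(ecludSim(myMat[:, 0], myMat[:, 4]))  # ~0.13: these columns are far apart
print(cosSim(myMat[:, 0], myMat[:, 0]))    # 1.0: identical columns
print(pearsSim(myMat[:, 0], myMat[:, 4]))
```

Note how the Euclidean measure is sensitive to magnitude, while the cosine measure depends only on direction.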

7. Generate recommendations with the naive similarity-based method
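A gap in these notes: the transcript below calls svdRec.recommend, but the recommend and standEst functions were never listed. For completeness, here is a sketch matching the book's versions (the print uses function-call syntax so it also runs under Python 3):

```python
from numpy import mat, shape, nonzero, logical_and
from numpy import linalg as la

def cosSim(inA, inB):
    # cosine similarity rescaled to [0, 1], as defined in svdRec.py
    num = float(inA.T * inB)
    denom = la.norm(inA) * la.norm(inB)
    return 0.5 + 0.5 * (num / denom)

def standEst(dataMat, user, simMeas, item):
    # estimate user's rating for item from the items the user has already rated
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0: continue
        # users who rated both item and j
        overLap = nonzero(logical_and(dataMat[:, item].A > 0,
                                      dataMat[:, j].A > 0))[0]
        if len(overLap) == 0: similarity = 0
        else: similarity = simMeas(dataMat[overLap, item], dataMat[overLap, j])
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else: return ratSimTotal / simTotal

def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst):
    # score every unrated item for the user and return the top N
    unratedItems = nonzero(dataMat[user, :].A == 0)[1]
    if len(unratedItems) == 0: return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        itemScores.append((item, estimatedScore))
    return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]

myMat = mat([[4, 4, 0, 2, 2],
             [4, 0, 0, 3, 3],
             [4, 0, 0, 1, 1],
             [1, 1, 1, 2, 0],
             [2, 2, 2, 0, 0],
             [5, 5, 5, 0, 0],
             [1, 1, 1, 0, 0]])
print(recommend(myMat, 2))  # item 2 scores 2.5, item 1 about 2.02, matching Out[50]
```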

In [44]: reload(svdRec)
Out[44]: <module 'svdRec' from 'svdRec.py'>

In [45]: myMat=mat(svdRec.loadExData())

In [46]: myMat
Out[46]:
matrix([[0, 0, 0, 2, 2],
        [0, 0, 0, 3, 3],
        [0, 0, 0, 1, 1],
        [1, 1, 1, 0, 0],
        [2, 2, 2, 0, 0],
        [5, 5, 5, 0, 0],
        [1, 1, 1, 0, 0]])

In [47]: myMat[0,1]=myMat[0,0]=myMat[1,0]=myMat[2,0]=4

In [48]: myMat[3,3]=2

In [49]: myMat
Out[49]:
matrix([[4, 4, 0, 2, 2],
        [4, 0, 0, 3, 3],
        [4, 0, 0, 1, 1],
        [1, 1, 1, 2, 0],
        [2, 2, 2, 0, 0],
        [5, 5, 5, 0, 0],
        [1, 1, 1, 0, 0]])

In [50]: svdRec.recommend(myMat,2)
the 1 and 0 similarity is: 1.000000
the 1 and 3 similarity is: 0.928746
the 1 and 4 similarity is: 1.000000
the 2 and 0 similarity is: 1.000000
the 2 and 3 similarity is: 1.000000
the 2 and 4 similarity is: 0.000000
Out[50]: [(2, 2.5), (1, 2.0243290220056256)]


In [53]: svdRec.recommend(myMat,2,simMeas=svdRec.ecludSim)
the 1 and 0 similarity is: 1.000000
the 1 and 3 similarity is: 0.309017
the 1 and 4 similarity is: 0.333333
the 2 and 0 similarity is: 1.000000
the 2 and 3 similarity is: 0.500000
the 2 and 4 similarity is: 0.000000
Out[53]: [(2, 3.0), (1, 2.8266504712098603)]

In [54]: svdRec.recommend(myMat,2,simMeas=svdRec.pearsSim)
the 1 and 0 similarity is: 1.000000
the 1 and 3 similarity is: 1.000000
the 1 and 4 similarity is: 1.000000
the 2 and 0 similarity is: 1.000000
the 2 and 3 similarity is: 1.000000
the 2 and 4 similarity is: 0.000000
Out[54]: [(2, 2.5), (1, 2.0)]

8. Use the SVD to improve the recommendations

Add the following code to svdRec.py:

def loadExData2():
    # a larger, sparser example ratings matrix (rows are users, columns are items)
    return [[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
           [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
           [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
           [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
           [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
           [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
           [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
           [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
           [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
           [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
           [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]

The matrix above is fairly sparse. Now work out how many singular values the SVD needs to keep in order to retain most of the matrix's energy:

In [57]: reload(svdRec)
Out[57]: <module 'svdRec' from 'svdRec.py'>

In [58]: U,Sigma,VT=la.svd(mat(svdRec.loadExData2()))

In [59]: Sigma
Out[59]:
array([ 15.77075346,  11.40670395,  11.03044558,   4.84639758,
         3.09292055,   2.58097379,   1.00413543,   0.72817072,
         0.43800353,   0.22082113,   0.07367823])

In [60]: mat(svdRec.loadExData2())
Out[60]:
matrix([[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
        [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
        [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
        [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
        [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
        [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
        [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
        [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
        [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
        [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
        [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]])

In [61]: Sig2=Sigma**2

In [62]: Sig2
Out[62]:
array([  2.48716665e+02,   1.30112895e+02,   1.21670730e+02,
         2.34875695e+01,   9.56615756e+00,   6.66142570e+00,
         1.00828796e+00,   5.30232598e-01,   1.91847092e-01,
         4.87619735e-02,   5.42848136e-03])

In [63]: sum(Sig2)
Out[63]: 541.99999999999955

In [64]: sum(Sig2)*0.9
Out[64]: 487.79999999999961

In [65]: sum(Sig2[:2])
Out[65]: 378.82955951135784

In [66]: sum(Sig2[:3])
Out[66]: 500.50028912757921
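
The last two results answer the question: the top two singular values hold only 378.8 of the energy, but the top three hold 500.5, which exceeds the 90% threshold of 487.8, so three dimensions suffice. A small sketch that automates this rule (numSVs is a hypothetical helper, not from the book):

```python
import numpy as np

def numSVs(sigma, energy=0.9):
    # smallest k such that the k largest singular values hold `energy`
    # of the total squared energy
    sig2 = np.asarray(sigma) ** 2
    return int(np.searchsorted(np.cumsum(sig2), energy * sig2.sum()) + 1)

data = [[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
        [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
        [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
        [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
        [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
        [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
        [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
        [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
        [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
        [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
        [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]
sigma = np.linalg.svd(np.array(data), compute_uv=False)
print(numSVs(sigma))  # 3, as derived by hand above
```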

9. Estimate ratings in the SVD-reduced space:

Add the following code to svdRec.py:

def svdEst(dataMat, user, simMeas, item):
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    U,Sigma,VT = la.svd(dataMat)
    Sig4 = mat(eye(4)*Sigma[:4]) #arrange Sig4 into a diagonal matrix
    xformedItems = dataMat.T * U[:,:4] * Sig4.I  #create transformed items
    for j in range(n):
        userRating = dataMat[user,j]
        if userRating == 0 or j==item: continue
        similarity = simMeas(xformedItems[item,:].T,
                             xformedItems[j,:].T)
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else: return ratSimTotal/simTotal

This estimates a rating for item by measuring similarity between items in a 4-dimensional space built from the SVD. (Note the hard-coded four singular values, even though the energy analysis in step 8 showed that three already capture 90% of the energy.)

10. Test the results

In [69]: myMat=mat(svdRec.loadExData2())

In [70]: svdRec.recommend(myMat,1,estMethod=svdRec.svdEst)
the 0 and 3 similarity is: 0.490950
the 0 and 5 similarity is: 0.484274
the 0 and 10 similarity is: 0.512755
the 1 and 3 similarity is: 0.491294
the 1 and 5 similarity is: 0.481516
the 1 and 10 similarity is: 0.509709
the 2 and 3 similarity is: 0.491573
the 2 and 5 similarity is: 0.482346
the 2 and 10 similarity is: 0.510584
the 4 and 3 similarity is: 0.450495
the 4 and 5 similarity is: 0.506795
the 4 and 10 similarity is: 0.512896
the 6 and 3 similarity is: 0.743699
the 6 and 5 similarity is: 0.468366
the 6 and 10 similarity is: 0.439465
the 7 and 3 similarity is: 0.482175
the 7 and 5 similarity is: 0.494716
the 7 and 10 similarity is: 0.524970
the 8 and 3 similarity is: 0.491307
the 8 and 5 similarity is: 0.491228
the 8 and 10 similarity is: 0.520290
the 9 and 3 similarity is: 0.522379
the 9 and 5 similarity is: 0.496130
the 9 and 10 similarity is: 0.493617
Out[70]: [(4, 3.3447149384692283), (7, 3.3294020724526971), (9, 3.3281008763900695)]

In [71]: svdRec.recommend(myMat,1,estMethod=svdRec.svdEst,simMeas=svdRec.pearsSim)
the 0 and 3 similarity is: 0.341942
the 0 and 5 similarity is: 0.124132
the 0 and 10 similarity is: 0.116698
the 1 and 3 similarity is: 0.345560
the 1 and 5 similarity is: 0.126456
the 1 and 10 similarity is: 0.118892
the 2 and 3 similarity is: 0.345149
the 2 and 5 similarity is: 0.126190
the 2 and 10 similarity is: 0.118640
the 4 and 3 similarity is: 0.450126
the 4 and 5 similarity is: 0.528504
the 4 and 10 similarity is: 0.544647
the 6 and 3 similarity is: 0.923822
the 6 and 5 similarity is: 0.724840
the 6 and 10 similarity is: 0.710896
the 7 and 3 similarity is: 0.319482
the 7 and 5 similarity is: 0.118324
the 7 and 10 similarity is: 0.113370
the 8 and 3 similarity is: 0.334910
the 8 and 5 similarity is: 0.119673
the 8 and 10 similarity is: 0.112497
the 9 and 3 similarity is: 0.566918
the 9 and 5 similarity is: 0.590049
the 9 and 10 similarity is: 0.602380
Out[71]: [(4, 3.3469521867021736), (9, 3.3353796573274703), (6, 3.307193027813037)]
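
The first run above can be reproduced end-to-end with this self-contained sketch of the pieces involved (the per-pair similarity prints are omitted for brevity):

```python
from numpy import mat, shape, eye, nonzero
from numpy import linalg as la

def cosSim(inA, inB):
    # cosine similarity rescaled to [0, 1], as defined in svdRec.py
    num = float(inA.T * inB)
    denom = la.norm(inA) * la.norm(inB)
    return 0.5 + 0.5 * (num / denom)

def svdEst(dataMat, user, simMeas, item):
    # estimate a rating for item by comparing items in the 4-d SVD space
    n = shape(dataMat)[1]
    simTotal = 0.0; ratSimTotal = 0.0
    U, Sigma, VT = la.svd(dataMat)
    Sig4 = mat(eye(4) * Sigma[:4])                # top 4 singular values as a diagonal matrix
    xformedItems = dataMat.T * U[:, :4] * Sig4.I  # items mapped into the reduced space
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0 or j == item: continue
        similarity = simMeas(xformedItems[item, :].T, xformedItems[j, :].T)
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0: return 0
    else: return ratSimTotal / simTotal

def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=svdEst):
    # score every item the user has not rated and return the top N
    unratedItems = nonzero(dataMat[user, :].A == 0)[1]
    if len(unratedItems) == 0: return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        itemScores.append((item, estMethod(dataMat, user, simMeas, item)))
    return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]

myMat = mat([[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
             [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
             [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
             [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
             [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
             [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
             [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
             [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
             [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
             [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
             [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]])
print(recommend(myMat, 1))  # items 4, 7, 9 with scores near 3.34/3.33, matching Out[70]
```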

Original post: https://www.cnblogs.com/littlesuccess/p/5096559.html