Similarity metrics (Updated Aug 8th)

Here is a link that explains the difference between cosine similarity and cosine pairwise distance:

https://stackoverflow.com/questions/35281691/scikit-cosine-similarity-vs-pairwise-distances

So the code in the first tutorial may be wrong: it mixes up distances and similarities.

https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systems-in-Python/index.html

Here are some simple tests:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Construct the user-venue matrix: rows are users, columns are venues,
# and each entry is a check-in frequency.
# A plain ndarray is used here because np.matrix is no longer recommended.
mat = np.array(
    [[2, 3, 0, 0, 0, 0, 5, 0, 1, 0],
     [20,30,0, 0, 0, 0, 50,0, 10,0],
     [1, 7, 0, 0, 0, 0, 2, 0, 8, 0],
     [2, 3, 0, 0, 0, 0, 0, 0, 1, 0],
     [4, 6, 0, 0, 7, 0, 0, 0, 2, 0]])
# Cosine distance and cosine similarity between every pair of users
user_dis = pairwise_distances(mat, metric='cosine')
user_sim = cosine_similarity(mat)
user_dis
Out[3]: 
array([[ 0.        ,  0.        ,  0.39561935,  0.40085531,  0.56244658],
       [ 0.        ,  0.        ,  0.39561935,  0.40085531,  0.56244658],
       [ 0.39561935,  0.39561935,  0.        ,  0.23729486,  0.44299892],
       [ 0.40085531,  0.40085531,  0.23729486,  0.        ,  0.26970326],
       [ 0.56244658,  0.56244658,  0.44299892,  0.26970326,  0.        ]])
user_sim
Out[4]: 
array([[ 1.        ,  1.        ,  0.60438065,  0.59914469,  0.43755342],
       [ 1.        ,  1.        ,  0.60438065,  0.59914469,  0.43755342],
       [ 0.60438065,  0.60438065,  1.        ,  0.76270514,  0.55700108],
       [ 0.59914469,  0.59914469,  0.76270514,  1.        ,  0.73029674],
       [ 0.43755342,  0.43755342,  0.55700108,  0.73029674,  1.        ]])

We can see that for the most similar (here, proportional) rows the cosine distance is 0 and the cosine similarity is 1. In general, cosine distance = 1 - cosine similarity.
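The relationship between the two matrices can be checked directly. This is a minimal sketch that rebuilds the same matrix and compares the two results; the variable name is_complement is only illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Same user-venue matrix as in the test above
mat = np.array(
    [[2, 3, 0, 0, 0, 0, 5, 0, 1, 0],
     [20, 30, 0, 0, 0, 0, 50, 0, 10, 0],
     [1, 7, 0, 0, 0, 0, 2, 0, 8, 0],
     [2, 3, 0, 0, 0, 0, 0, 0, 1, 0],
     [4, 6, 0, 0, 7, 0, 0, 0, 2, 0]])

user_dis = pairwise_distances(mat, metric='cosine')
user_sim = cosine_similarity(mat)

# Cosine distance should be the complement of cosine similarity
# (up to floating-point rounding)
is_complement = np.allclose(user_dis, 1.0 - user_sim)
print(is_complement)
```

So a routine that expects similarities but is fed the output of pairwise_distances will rank users in the reverse order.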

To avoid confusion, we will use the cosine_similarity function from now on.

The artificial matrix also shows that cosine_similarity handles some situations well. Take usr[0] and usr[1]: they have the same taste, except that usr[1]'s check-in frequencies are 10 times those of usr[0]. Cosine similarity gives them a similarity of 1, because it depends only on the direction of the two vectors and not on their magnitude. This is consistent with human intuition.
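To see why the scale does not matter, here is a small sketch that computes cosine similarity by hand as the dot product divided by the product of the vector lengths. The helper cosine_sim is a hypothetical name written for this illustration, not part of scikit-learn.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

usr0 = [2, 3, 0, 0, 0, 0, 5, 0, 1, 0]
usr1 = [10 * x for x in usr0]  # same taste, 10 times the frequency

sim_01 = cosine_sim(usr0, usr1)
print(round(sim_01, 6))
```

Multiplying a vector by 10 multiplies both the dot product and the product of the lengths by 10, so the factor cancels.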

As for the similarity between usr[0] and the remaining users:

usr[2]≈usr[3]>usr[4]

usr[2] has visited every place that usr[0] has visited; the only difference is in the frequencies. usr[3] never visited place[6], but usr[3]'s frequencies at the other places are exactly the same as usr[0]'s.
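The ordering above can be reproduced with plain NumPy. This is only a sketch under the same artificial matrix; the names sim_to_usr0, candidates and ranked are illustrative.

```python
import numpy as np

# Same user-venue matrix as in the test above
mat = np.array(
    [[2, 3, 0, 0, 0, 0, 5, 0, 1, 0],
     [20, 30, 0, 0, 0, 0, 50, 0, 10, 0],
     [1, 7, 0, 0, 0, 0, 2, 0, 8, 0],
     [2, 3, 0, 0, 0, 0, 0, 0, 1, 0],
     [4, 6, 0, 0, 7, 0, 0, 0, 2, 0]], dtype=float)

# Cosine similarity of every user to usr[0]:
# dot product divided by the product of the row norms
norms = np.linalg.norm(mat, axis=1)
sim_to_usr0 = mat @ mat[0] / (norms * norms[0])

# Rank the remaining users from most to least similar to usr[0]
candidates = [2, 3, 4]
ranked = sorted(candidates, key=lambda i: sim_to_usr0[i], reverse=True)
print(ranked)
```

The values for usr[2] and usr[3] are about 0.604 and 0.599, which is why they are treated as roughly equal.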

I think this result is quite reasonable, so cosine_similarity seems to reflect the relationships between users well.

Original post: https://www.cnblogs.com/fassy/p/7307131.html