推荐系统（五）

(原创文章，转载请注明出处！)

协同过滤算法（Collaborative Filtering）基于的根据是，类似的人会喜好类似的物品。如果将用户进行聚类，当来了新用户，将新用户归到相应的类，用类的评分结果来形成新用户的推荐结果。在文章 Collaborative Filtering Based on Iterative Principal Component Analysis 中给出了一种使用主成分分解技术对评分数据进行分解，然后对用户进行聚类的方法，算法的流程如下：

假设有M个用户对n个物品进行了评价，形成了 M x n的评分矩阵A。

1. 对矩阵A中的缺失值a_ij（用户没有对相应的物品进行评价）进行填充：使用i行和j列的平均值。

2. 对矩阵A进行均值normalization ：将每个评价减去其对应的列（物品）平均值。

3. 对矩阵A进行SVD分解：A = U∑V^T，选择前k个最大的奇异值，通过U_k∑_kV_k^T = A^'，得到原矩阵A的近似矩阵A^'

4. 将矩阵A中最初的缺失值，使用A^'中相应位置的值替代

5. 重复2、3、4直到收敛。收敛条件：相邻两次（l-1与l）迭代的缺失值的估计值的平方和之差与第(l-1)次的缺失值的估计值的平方和的比值小于某个常数（比如：10^-10）

（对最终的评分还需要进行均值normalization的逆操作，即：加上物品平均值）

通过以上的过程就将已有用户缺失的评分值预测出来。此外还通过上述的步骤的到主成分：U_k∑_k，一个M x k的矩阵，可以将这个矩阵作为训练数据使用K均值聚类来对用户进行聚类（也可以采用其他的聚类方法，如：RRC）。

聚类的类别数量选择条件：|S_k- S_k+1| / S₂≤ ε , ε是给定的常数，比如：0.001

S_k是k类的情况下，聚类完成后，各个样本点到各自的聚类中心的距离平方和

S_k+1是k+1类的情况下，聚类完成后，各个样本点到各自的聚类中心的距离平方和

S₂是2类的情况下，聚类完成后，各个样本点到各自的聚类中心的距离平方和

当来了一个新用户：

1. 选择最后的m-1行，形成一个（m-1） x n的子矩阵；（m < M）

2. 对一个新用户，将其对n个物品的评价向量添加到子矩阵的最后一行，形成m x n矩阵B

3. 对矩阵B，使用上述的迭代算法，将最后一行中的缺失值填上

4. 然后，计算出最后一行的PC

5. 计算最后一行PC到各个聚类中心的距离，决定新用户属于哪个类

6. 利用所选择类的平均值，来确实最后一行的缺失值，并形成最终的推荐结果

算法实现：

总体上来看，整个算法需要做的事情有：

1. 通过迭代的SVD分解，将已存在的用户对物品的评价缺失值，给补充上，并且计算出主成分

2. 通过一个聚类算法（K均值聚类），以原始评分数据的主成分为输入，进行用户聚类

3. 对新用户，使用迭代的SVD分解将新用户的缺失评分补充上，并计算新用户评分矩阵的主成分

4. 通过计算新用户的主成分与各聚类中心的距离，将新用户归类，使用类的评均值，作为新用户缺失评分的最终估计值

5. 寻找新用户缺失评分的Top-n，作为对新用户的推荐结果

R代码如下：

  1 ## normalize a vector with mean normanization ( x-u )
  2 ## substract column mean from each element
  3 ## Args :
  4 ##     x - a matrix
  5 ## Returns :
  6 ##     a list contains, mean of each colum, 
  7 ##                      normalized x
  8 meanNormalization <- function(x)
  9 {   
 10     meanOfcol <- numeric(dim(x)[2])
 11     sdOfcol <- numeric(dim(x)[2])
 12     for (i in 1:dim(x)[2]) {
 13         t <- x[,i]
 14         idx <- which(t != 0)  
 15         if (length(idx) < 1) {
 16             meanOfcol[i] <- NA
 17             next
 18         }
 19         meanOfcol[i] <- mean(t[idx])
 20         x[idx,i] <- t[idx] - mean(t[idx]) # mean normalization
 21     }
 22     
 23     return ( list(meanOfcol = meanOfcol, xNormalized=x) )
 24 }
 25 ## Args :
 26 ##     x  -  a vector
 27 ##     u  -  a real number, mean value
 28 ## Returns : 
 29 ##     a vector
 30 meanNormalizationInverse <- function(x, u)
 31 {
 32     return (x + u)
 33 }
 34 
 35 ## iterate with SVD to get the Principle Component of users and fill the missing rating
 36 ## Args :
 37 ##      x  -  a matrix, contain all rating reslut. 
 38 ##            Each row is the rating made by  one user, each colum is the rating of one item.
 39 ##            If a movie hasn't been rated by a user, the corresponding postion in the matrix is NA.
 40 ##      pc_threshold  -  threshold to choose the principal components
 41 ##      iterate_threshold  -  svd iteration threshold
 42 ## Returns :
 43 ##      a list, contains : a matrix, principle component of users;
 44 ##                         a matrix, contains all rating between all users and all items, no missing value .
 45 iterateSVD <- function( x, pc_threshold = 0.9, iterate_threshold = 10e-10 )
 46 {    
 47     A <- x
 48     ## fill the missing value with the average of ith row and jth colum
 49     A[which(is.na(A))] <- 0
 50     meanOfColumn <- colMeans(A)
 51     meanOfRow <- rowMeans(A)
 52     idx <- which(is.na(x))
 53     for (i in idx) {
 54         i <- i %% dim(x)[1]   # remainder
 55         j <- j %/% dim(x)[1] + 1
 56         if (i == 0) {
 57             i <- 12
 58         }
 59         A[i,j] <- mean( c(meanOfRow[i], meanOfColumn[j]) )
 60     }
 61     
 62     ## normalize the data by column
 63     normlizedResult <- meanNormalization( A )
 64     A <-  normlizedResult$xNormalized 
 65     
 66     ## iterate util convergence
 67     lastSquareSumLast <- 0.001 * iterate_threshold  # initialized by a small enough value
 68     which( TRUE ) {
 69         # svd decomposition
 70         svd_A <- svd(A)
 71         
 72         # find the top-k singular value
 73         numTopSV <- 0
 74         for(sv in svd_A$d) {
 75             numTopSV <- numTopSV + 1
 76             if ( (sum(svd_A$d[1:numTopSV]) / sum(svd_A$d)) >= pc_threshold ) {
 77                 break
 78             }
 79         }
 80         
 81         # approximation of A
 82         A_appro <- svd_A$u[,1:numTopSV] %*% diag(svd_A$d[1:numTopSV]) %*% t(svd_A$v[,1:numTopSV])
 83         
 84         # filling the missing value with the corresponding value in A_appro
 85         A[which(is.na(x))] <-  A_appro[which(is.na(x))]
 86         
 87         # checking convergence, or not
 88         if ( ( (sum(A_appro[which(is.na(x))]^2) - lastSquareSumLast) / lastSquareSumLast )
 89              < iterate_threshold ) {
 90             break
 91         }
 92         
 93         lastSquareSumLast <- sum(A_appro[which(is.na(x))]^2)
 94     }
 95     
 96     ## calculate the principle component of user
 97     pcOfUsers <- A %*% svd_A$v[,1:numTopSV]
 98     
 99     normalizedA <- A
100     
101     ## reverse the normalization 
102     for (i in 1:dim(A)[2]) {
103         A[,i] <- meanNormalizationInverse(A[,i], normlizedResult$meanOfcol)
104     }
105     
106     return ( list( pcOfUsers = pcOfUsers, fullRatingResult = A) )
107 }
108 
109 ## cluster the users with k-means cluster method
110 ## Args :
111 ##     pc  -  a matrix, principle component of users, each row is one user
112 ##     clusterThreshold  -  a constant, which is used to choose the 
113 ##                          best user cluster number
114 ## Returns :
115 ##     a list, contains : user k-means cluster object
116 ##                        user cluster number
117 clusterUser <- function(pc, clusterThreshold = 0.001)
118 {
119     clusterNum <- 2
120     uc2 <- kmeans(x, centers = clusterNum)
121     lastKM <- uc2
122     
123     ## to make the best choice of cluster number
124     ## iterate with increasing cluster number until convergence
125     while( TRUE ) {
126         clusterNum <- clusterNum + 1
127         uci <- kmeans(x, centers = clusterNum)
128         
129         # check convergence 
130         if ( ( abs(lastKM$tot.withinss - uci$tot.withinss) / uc2$tot.withinss )
131              >=  clusterThreshold )   {
132             break
133         }
134         lastKM <- uci
135     }
136     
137     return ( list(userCluster = uci, clusterNum = clusterNum) )
138 }
139 
140 ## get the recomendation list to new user
141 ## Args :
142 ##     ur  -  a vector, rating of items from a new user
143 ##     x  -  a matrix, contain all rating reslut. 
144 ##           Each row is the rating made by  one user, each colum is the rating of one item.
145 ##           If a movie hasn't been rated by a user, the corresponding postion in the matrix is NA.
146 ##     m  -  a integer, number of old users that will be used to fill the
147 ##                      missing value of vector ur with iterate svd method
148 ##     topn  -  a integer, how many items will be recommended to the new user.
149 ## Returns :
150 ##     a list, contains recommendation result
151 recommendToNewUser <- function(ur, x, m, topn)
152 {
153     iterSVD_x <- iterateSVD(x)
154     cluster_x <- clusterUser( iterSVD_x$pcOfUsers )
155     
156     A <- iterSVD_x$fullRatingResult    
157     B <- rbind(A[(dim(A)[1] - m + 1) : dim(A)[1],], ur)
158     
159     iterSVD_B <- iterateSVD(B)
160     pc_ur <- iterSVD_B$pcOfUsers[m+1, ] # the principle component of new user
161     
162     ## choose the cluster of new user
163     clusterCenter <- cluster_x$userCluster$centers
164     distToCC <- numeric(cluster_x$clusterNum)
165     for (i in 1:cluster_x$clusterNum) {
166         distToCC[1] <- sqrt( sum( (pc_ur - clusterCenter[i, ]) ^2 ) )
167     }
168     # pick the minimum one
169     clusterOfNU <- which( distToCC == min(distToCC) )
170     
171     ## calculate the average value to fill the missing value in ur
172     idx <- which( cluster_x$cluster == clusterOfNU)
173     ratingOfNUCluster <- A[idx,]
174     idx_missing <- which(is.na(ur))
175     for (j in idx_missing) {
176         ur[j] <- mean( ratingOfNUCluster[,j] )
177     }
178     
179     ## pick the top-n recommendation items
180     pickTopn <- numeric( length(ur) )
181     pickTopn <- ur[idx_missing]
182     topnIdx <- apply( matrix(pickTopn,nrow=1), MARGIN=1, 
183                      FUN=function(x) head(  order(x, decreasing=TRUE, na.last=TRUE), topn )  )
184     topnIdx <- as.vector(topnIdx)
185     recommendList <- list(ratingResult = pickTopn[topnIdx], topnIndex = topnIdx)
186     return( recommendList )
187 }