推荐系统（二）

(原创文章，转载请注明出处！)

一、问题

实现对电影的推荐，数据集中有约1600部电影，有约900个用户对这些电影进行了评价。设每个电影有10个特征，根据推荐系统（一）描述的算法，每个用户也相应的有10的参数，那么总的参数个数 ≈ 1600 * 10 + 900 * 10 ≈ 16000 + 9000 ≈25000 个。电影评价的数据集可以访问链接：http://grouplens.org/datasets/movielens/

二、代码实现

用R编程实现推荐系统（一）描述的算法。在实现中使用R提供的一个求解最优化的函数，optim。推荐算法的优化目标是无约束最优化问题，使用optim时，选择使用共轭梯度算法求解无约束最优化问题。optim的参数需要两个函数：目标函数和目标函数的一阶导函数。

1. 目标函数的代码如下：

1 JThetaXLamda <- function(thetaX) {
2     theta <- matrix(thetaX[1:(nUser*nMovieFeature)], nrow=nUser, ncol=nMovieFeature, byrow=TRUE)  
3     X     <- matrix(thetaX[(nUser*nMovieFeature+1):length(thetaX)], nrow=nMovie, ncol=nMovieFeature, byrow=TRUE)
4     
5     return (  (1/2) * ( sum( (tcrossprod(X, theta) * R - Y)^2 )  
6                         + lamda * ( sum(theta^2) + sum(X^2) )  ) )
7 }

2. 目标函数的一阶导函数代码如下：

 1 JThetaXGrLamda <- function(thetaX) {
 2     theta <- matrix(thetaX[1:(nUser*nMovieFeature)], nrow=nUser, ncol=nMovieFeature, byrow=TRUE)  # 943 is the number of user,  each row is one user's parameter
 3     X     <- matrix(thetaX[(nUser*nMovieFeature+1):length(thetaX)], nrow=nMovie, ncol=nMovieFeature, byrow=TRUE)  # 1682 is the number of movie
 4      
 5     # gradient of theta
 6     grTheta <- matrix(numeric(),ncol=nMovieFeature)
 7     for (j in 1:nUser) { # jth user
 8         idx <- which(R[,j]==1)  # which movies are rated by jth user
 9         tmpX <- X[idx,] 
10         tmpY <- matrix(Y[idx,j], nrow=1)
11         grTheta <- rbind( grTheta, alpha * ( (tcrossprod(theta[j,], tmpX) - tmpY) %*% tmpX 
12                                                        + lamda * theta[j,] )  )
13     }
14 
15     # gradient of x
16     # calculating gradient of each movie only need the users who rate it.
17     grX <- matrix(numeric(),ncol=nMovieFeature)
18     for (i in 1:nMovie) { # ith movie
19         idx <- which(R[i,]==1)  # exclude the user who haven't rated movie i
20         tmpTheta <- theta[idx,] 
21         tmpY <- matrix(Y[i,idx], nrow=1) # tmpY is a vector including rating credit of movie i
22         grX <- rbind( grX, alpha * ( (tcrossprod(X[i,], tmpTheta) - tmpY) %*% tmpTheta 
23                                                + lamda * X[i,]) )
24     }
25     
26     return (   c( t(grTheta), t(grX) )   )
27 }

3. 训练：

上面两个函数使用的变量：nUser、nMovie、nMovieFeature、Y、R是自由变量，在训练代码中定义和赋值。

推荐系统（一）中还提到对数据需要做进行均值正规化（Mean Normalization）, 代码如下：

 1 ## normalizae the training data
 2 ## Args :
 3 ##     Y  -  a matrix, rating of movies by users 
 4 ##     R  -  a matrix, flags, which movies are rated by user
 5 ## Returns :
 6 ##     a list contains two elements: Ynormalized - a matrix, normalized Y
 7 ##                                   Ymean - a vector contains mean of each movie
 8 craEx8RS_meanNormalization <- function(Y, R)
 9 {
10     numOfMovies <- dim(Y)[1]
11     l <- list( Ynormalized = Y, Ymean = numeric(length = numOfMovies) )
12     for (i in 1:numOfMovies) {
13         idx <- which(R[i,] == 1)
14         meanOfOneMovie <- mean(Y[i,idx])
15         l$Ynormalized[i,] <- Y[i,] - meanOfOneMovie
16         l$Ymean[i] <- meanOfOneMovie
17     }
18     return( l )
19 }

训练代码如下：

 1 ## do the training
 2 Y <- read.table(file="Y.txt")
 3 R <- read.table(file="R.txt")
 4 Y <- as.matrix(Y)
 5 R <- as.matrix(R)
 6 l <- meanNormalization(Y, R)  # normalize the data
 7 Y <- l$Ynormalized
 8 nMovie <- dim(Y)[1]
 9 nUser  <- dim(Y)[2]
10 nMovieFeature <- 10 
11 lamda <- 0.01
12 alpha <- 0.01
13 
14 
15 ## random initialize theta, X
16 theta <- matrix(runif(nUser*nMovieFeature,min=0,max=0.1), nrow=nUser, ncol=nMovieFeature)  # each row is parameter of one user
17 X     <- matrix(runif(nMovie*nMovieFeature,min=0,max=0.1), nrow=nMovie, ncol=nMovieFeature)  # each row is parameter of one movie
18 # convert the two initialize value matrix to vector
19 thetaX <- c( t(theta), t(X) ) 
20 # using conjugate gradient algorithm to train the parameters
21 optimCG <- optim( par=thetaX, fn=JThetaX, gr = JThetaXGr, method = "CG", control = list(maxit=500000, abstol=1e-6) )
22 optimCG$par # display the training reslut, i.e. best value has been found for parameters

4. 预测：
通过训练得到参数值后，就可以通过推荐系统（一）中提到的计算公式: θ^T*x + 平均值, 来计算每个用户对还没有评价电影的评分。通过比较计算得到评分，就能找到用户偏好的电影推荐给他。代码如下：

 1 ## Find the top-n un-rating movies of the jth user
 2 ## Args :
 3 ##     topn  -  a positive integer
 4 ##     theta_j  -  a vector, contains theta parameters of jth user
 5 ##     X  -  a matrix, contains X paramters
 6 ##     Y  -  a matrix, rating of movies by users 
 7 ##     R  -  a matrix, flags, which movies are rated by user
 8 ##     j  -  a interger, jth user
 9 ##     Ymean  -  a vector, contains mean of all movies
10 ##     isAddMean  -  flag, is the mean added to rating credit, or not?
11 ##                   TRUE  -  add the mean to rating credit
12 ##                   FALSE  -  the mean will not be added to the rating credit
13 ## Returns :
14 ##     a list, contains the index of the top-n un-rating movies of jth user and the rating result
15 findTopN <- function(topn, theta_j, X, R, j, Ymean=NULL, isAddMean=FALSE)
16 {
17     idx <- which(R[,j]==0)    
18     Yvec <- rep(-1,length(R[,j]))
19       
20     if (isAddMean == TRUE) {
21         if (is.null(Ymean)) {
22             cat("Error: Ymean is NULL!!!
")
23         }
24         Yvec[idx] <- tcrossprod( X[idx,], matrix(theta_j, nrow=1) )   +   as.matrix( Ymean[idx], ncol=1)
25     } else {
26         Yvec[idx] <- tcrossprod( X[idx,], matrix(theta_j, nrow=1) )   
27     }
28     
29     sortTating <- sort(Yvec, method = "sh", decreasing = TRUE, index.return = TRUE)
30     objList <- list(ratingResult = Yvec, topnIndex = sortTating$ix[1:topn])
31     return( objList )
32 }

（寻找N个数中最大的K个数的方法，可参考链接：http://zhangzhibiao02005.blog.163.com/blog/static/3736782020112123353695/）