YouTube Deep Learning Recommender System Paper

1. Problem Formulation

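The recommendation task is modeled as an extreme multiclass classification problem: at time $t$, for user $U$ with context $C$, predict which video $i$ in the corpus $V$ will be watched, where each individual video is treated as its own class. Formally:

$$P(w_t = i \mid U, C) = \frac{e^{v_i u}}{\sum_{j \in V} e^{v_j u}}$$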

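The formula above is clearly a softmax classifier. The vector $u \in \mathbb{R}^N$ is a high-dimensional embedding of the <user, context> pair, and each $v_j \in \mathbb{R}^N$ is the embedding of video $j$. The DNN's job is therefore to learn the user embedding $u$ from user and context information; in other words, the DNN fits the function $u = f_{DNN}(user_{info}, context_{info})$.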
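As a small illustration of the softmax over inner products $v_j u$, the sketch below computes the class probabilities for a toy corpus. The function name `softmax_probs` and the two-dimensional vectors are made up for the example; a real model would use far larger embeddings and millions of videos.

```python
import math

def softmax_probs(u, video_vectors):
    """Return P(w_t = i | U, C) for every video i, given the user vector u."""
    # One logit per video: the inner product of its embedding with u.
    logits = [sum(a * b for a, b in zip(v, u)) for v in video_vectors]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data (hypothetical): one user vector and three video embeddings.
u = [1.0, 0.0]
V = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
probs = softmax_probs(u, V)
```

The video whose embedding points in the same direction as `u` receives the highest probability, which is the property the serving stage relies on.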

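A classification problem at this scale has at least several million classes, so training in practice uses negative sampling. (For a word2vec-based explanation, see https://www.cnblogs.com/ljygoodgoodstudydaydayup/p/10839983.html)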

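2. Model Architecture

2.1 Candidate Generation Stage

The model is a DNN with three hidden layers. Its input is a single vector formed by concatenating the user's watch history, search history, demographic information, and other context features. Its output differs between online serving and offline training.

During offline training, the output layer is a softmax that produces the probability given by the formula above. Online, the user vector is instead used directly to retrieve relevant videos, and the main concern there is performance.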
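The forward pass of such a three-hidden-layer network can be sketched as follows. This is only an illustration: the layer sizes, the random initialization, and the helper names (`dense`, `init_layer`, `user_tower`) are assumptions for the example, not values or names from the paper.

```python
import random

def relu(x):
    return [max(0.0, a) for a in x]

def dense(x, W, b):
    # W has shape [out_dim][in_dim]; returns W @ x + b.
    return [sum(w * a for w, a in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_layer(in_dim, out_dim, rng):
    # Small random weights; a stand-in for trained parameters.
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return W, b

def user_tower(x, layers):
    """Map the concatenated input vector x to the user embedding u."""
    h = x
    for W, b in layers:
        h = relu(dense(h, W, b))
    return h  # output of the last hidden layer

rng = random.Random(0)
input_dim = 12            # hypothetical size of the concatenated input
hidden_dims = [16, 8, 4]  # three hidden layers (sizes are illustrative)

layers = []
prev = input_dim
for d in hidden_dims:
    layers.append(init_layer(prev, d, rng))
    prev = d

x = [rng.random() for _ in range(input_dim)]
u = user_tower(x, layers)
```

During training a softmax layer would sit on top of `u`; at serving time only `u` is needed.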

Sample selection:

Training examples are generated from all YouTube watches (even those embedded on other sites) rather than just watches on the recommendations we produce. Otherwise, it would be very difficult for new content to surface and the recommender would be overly biased towards exploitation. If users are discovering videos through means other than our recommendations, we want to be able to quickly propagate this discovery to others via collaborative filtering. Another key insight that improved live metrics was to generate a fixed number of training examples per user, effectively weighting our users equally in the loss function.
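One simple way to approximate "a fixed number of training examples per user" is to cap how many watches each user contributes. The sketch below does this with random subsampling; the cap value, the `(user_id, video_id)` tuple layout, and the function name are assumptions for illustration, and the paper does not describe its exact procedure.

```python
import random
from collections import defaultdict

def cap_examples_per_user(watches, max_per_user, seed=0):
    """Keep at most max_per_user (user_id, video_id) examples for each user."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user_id, video_id in watches:
        by_user[user_id].append(video_id)

    examples = []
    for user_id, videos in by_user.items():
        if len(videos) > max_per_user:
            # Heavy users are subsampled so they cannot dominate the loss.
            videos = rng.sample(videos, max_per_user)
        examples.extend((user_id, v) for v in videos)
    return examples

# Toy log (hypothetical): one heavy user and one light user.
watches = [("heavy", i) for i in range(10)] + [("light", 100), ("light", 101)]
examples = cap_examples_per_user(watches, max_per_user=3)
counts = {
    user: sum(1 for u_id, _ in examples if u_id == user)
    for user in ("heavy", "light")
}
```

Without the cap, the heavy user would supply five times as many examples as the light user in this toy log.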

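Positive examples: a user completing a video is a positive example

Main features:

Search history: each past search query is tokenized, and the token embedding vectors are combined by a weighted average, giving a summary of the user's overall search history
Demographic information: gender, age, region, and so on
Other context information: device, logged-in state, and so on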
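The feature construction above can be sketched as averaging token embeddings and concatenating the result with the remaining features. Uniform weights are used here as a simplifying assumption, and the embedding table, feature encodings, and names are hypothetical.

```python
def average_embeddings(tokens, table, dim):
    """Average the embeddings of known tokens (uniform weights assumed)."""
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return [0.0] * dim  # no search history: fall back to a zero vector
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Hypothetical 2-dimensional token embedding table.
token_table = {
    "cat": [1.0, 0.0],
    "video": [0.0, 1.0],
    "funny": [1.0, 1.0],
}

search_vec = average_embeddings(["funny", "cat", "video"], token_table, dim=2)
demographics = [1.0, 0.25]  # e.g. an encoded gender flag and a normalized age
context = [0.0, 1.0]        # e.g. a device flag and a logged-in flag

# The network input is the concatenation of all feature groups.
x = search_vec + demographics + context
```

Averaging turns a variable-length token list into a fixed-width vector, which is what a feed-forward network needs as input.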

2.2 Ranking Stage

3. Questions and Answers

3.1 How is the problem of having too many candidate classes handled?

A: Negative sampling, with importance weighting used to calibrate for the sampling. [To efficiently train such a model with millions of classes, we rely on a technique to sample negative classes from the background distribution (“candidate sampling”) and then correct for this sampling via importance weighting.]
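A minimal sketch of this idea follows. It uses one common form of the correction, subtracting $\log Q(i)$ from each sampled logit, where $Q$ is the sampling distribution. Whether the paper uses exactly this form is not stated, so treat it as an assumption; the uniform `q` and all names are illustrative.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sampled_softmax_loss(u, video_vectors, positive, q, num_negatives, rng):
    """Cross-entropy over the positive class plus a few sampled negatives.

    q[i] is the (assumed known) probability of drawing class i as a negative.
    """
    # Draw negatives from q, excluding the positive class itself.
    candidates = [i for i in range(len(video_vectors)) if i != positive]
    negatives = rng.choices(
        candidates, weights=[q[i] for i in candidates], k=num_negatives
    )
    ids = [positive] + negatives

    # Corrected logit: v_i . u - log Q(i), compensating for the sampling.
    logits = [dot(video_vectors[i], u) - math.log(q[i]) for i in ids]

    # Numerically stable log-sum-exp over the sampled subset only.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[0]  # negative log-likelihood of the positive

rng = random.Random(0)
videos = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(1000)]
q = [1.0 / len(videos)] * len(videos)  # uniform background distribution
u = videos[7]                          # toy user vector close to video 7
loss = sampled_softmax_loss(u, videos, positive=7, q=q, num_negatives=20, rng=rng)
```

Each training step touches only 21 of the 1000 classes here, which is the point of sampling: the cost no longer grows with the size of the corpus.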

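3.2 How are the user vector and the video vector generated?

A: The user embedding is the output of the network's last hidden layer, and the video embeddings are the weights of the softmax layer. The "video vectors" arrow in the architecture diagram denotes storing these video embeddings in an ANN index for online retrieval. The softmax layer here means a dense layer followed by a softmax activation. Suppose the last hidden layer has 100 dimensions (the user embedding) and the output layer has 2 million nodes (one per video). The fully connected weight matrix then has shape [100, 2,000,000], and the weights connecting the hidden layer to any single output node have shape [100, 1]; that column is the embedding of the corresponding video. The score behind a video's probability is $u \cdot v$, the inner product of two 100-dimensional vectors, which is why the user and video embeddings can live in the same space.

At serving time, one user vector (k x 1) is multiplied against the m video vectors (m x k), and the N videos with the largest inner products are that user's top-N items. For a different user, the same m vectors are scored against the new user vector and the top-N is taken again. There are many industry approaches to this nearest-neighbor search, for example approximate nearest neighbor (ANN) libraries such as Faiss.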
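The serving step can be illustrated with an exact, brute-force top-N search over inner products. This is a sketch on toy data with hypothetical names; a production system would swap the linear scan for an approximate nearest-neighbor index.

```python
import heapq

def top_n_videos(u, video_vectors, n):
    """Return the indices of the n videos with the largest inner product with u."""
    scores = [
        (sum(a * b for a, b in zip(v, u)), i)
        for i, v in enumerate(video_vectors)
    ]
    return [i for _, i in heapq.nlargest(n, scores)]

# The same m video vectors are reused for every user.
video_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [-1.0, 0.0],
]

user_a = [1.0, 0.2]
user_b = [0.0, 1.0]
top_a = top_n_videos(user_a, video_vectors, n=2)
top_b = top_n_videos(user_b, video_vectors, n=2)
```

Only the user vector changes between requests, so the video side can be indexed once offline and queried many times.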