[Repost] Why word2vec uses two embeddings for each word

from: https://blog.csdn.net/weixin_42279926/article/details/106403211

Related Stack Overflow thread: https://stackoverflow.com/questions/29381505/why-does-word2vec-use-2-representations-for-each-word

Question 1: Why does training use two embedding representations per word?
This draws on the Stack Overflow question "Why does word2vec use 2 representations for each word?", in which one answerer, HediBY, cites a footnote from the paper word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method to give some intuition. The footnote is reproduced below:

Throughout this note, we assume that the words and the contexts come from distinct vocabularies, so that, for example, the vector associated with the word dog will be different from the vector associated with the context dog. This assumption follows the literature, where it is not motivated. One motivation for making this assumption is the following: consider the case where both the word dog and the context dog share the same vector v. Words hardly appear in the contexts of themselves, and so the model should assign a low probability to p(dog|dog), which entails assigning a low value to v · v which is impossible.

The footnote rests on an assumption: when dog acts as the center word, the words it associates with (call them A) are not the same as the words B it associates with when dog acts as a context (window) word, i.e. A and B differ. This is not hard to accept, because the dot product is being used to express a conditional probability, and p(A|dog) and p(dog|B) are completely different things. As a toy example, suppose a box contains red, white and blue balls in quantities 1, 2 and 3, and we draw two balls without replacement; treat red as dog, with white and blue as its associated words. Then p(white|red) = 2/5 and p(blue|red) = 3/5, while p(red|white) = 1/5 and p(red|blue) = 1/5. If every word had only one vector and we expressed the conditional probability with a dot product v · u, the score for p(white|red) and the score for p(red|white) would necessarily be identical, because the dot product is symmetric; that contradicts the example. Based on this assumption, word2vec therefore uses two embeddings per word, one for the word as a center word and one for the word as a context word, and the two represent two different distributions.
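To make the "two different distributions" concrete, here is a minimal sketch (my own, not the original word2vec C code) of one skip-gram negative-sampling update in NumPy; the vocabulary size, dimensions, word indices and negative samples are all made up for illustration. V holds center-word vectors and U holds context-word vectors, so the score for "context c near center w" is U[c] · V[w], which is not symmetric in the two words:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 10   # toy vocabulary size (made up for this sketch)
EMBED_DIM = 8     # embedding dimension
LR = 0.05         # learning rate

# Two separate embedding matrices:
# V[w] is used when w is the center word, U[w] when w appears as a context word.
V = rng.normal(scale=0.1, size=(VOCAB_SIZE, EMBED_DIM))  # center-word vectors
U = rng.normal(scale=0.1, size=(VOCAB_SIZE, EMBED_DIM))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives):
    """One skip-gram negative-sampling update for a (center, context) pair.

    The score for "context appears near center" is U[context] . V[center],
    so swapping the roles of the two words gives a different score --
    exactly the asymmetry discussed above.
    """
    v = V[center].copy()          # current center vector
    grad_v = np.zeros_like(v)

    # positive pair: push sigmoid(U[context] . v) toward 1
    pos = sigmoid(U[context] @ v)
    grad_v += (pos - 1.0) * U[context]
    U[context] -= LR * (pos - 1.0) * v

    # sampled negative words: push sigmoid(U[neg] . v) toward 0
    for neg in negatives:
        p = sigmoid(U[neg] @ v)
        grad_v += p * U[neg]
        U[neg] -= LR * p * v

    V[center] -= LR * grad_v

# toy usage: word 3 as center, word 7 as context, two sampled negatives
sgns_step(center=3, context=7, negatives=[1, 5])
print(sigmoid(U[7] @ V[3]))   # score for "7 in the context of 3"
print(sigmoid(U[3] @ V[7]))   # score for "3 in the context of 7" -- different
```

If U and V were forced to be the same matrix, the two printed scores at the end would coincide, which is exactly the symmetry problem illustrated by the ball example above.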

Beyond that, the footnote offers another motivation. In its example, p(dog|dog) should be low (I did think of text like "aaah aaah"-style repetition, but that can arguably be treated as a single interjection, i.e. one token). Since the probability should be low, the conditional probability expressed by v · v must be given a very low value. But if we use one embedding matrix instead of two, the only way to achieve that is for every dimension of dog's vector to become small, because we are taking the dot product of the vector with itself and the loss will naturally drive its magnitude down. That in turn seriously distorts the conditional probabilities between dog and its genuine context words, which are described by the same kind of dot product. This is another important reason for using two embedding matrices.
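A quick way to see why the footnote calls this "impossible" (my own restatement, assuming the sigmoid scoring used in negative sampling): if dog-as-word and dog-as-context share a single vector v, then

σ(v · v) = σ(||v||^2) ≥ σ(0) = 1/2,

so the model can never score the pair (dog, dog) below 1/2, and the only way to push the score toward that bound is to drive ||v|| toward 0, which simultaneously shrinks every other dot product v · u that dog takes part in.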

Now let's come back to HediBY's own explanation, quoted below:

IMHO, the real reason why you use different representations is because you manipulate entities of different nature. “dog” as a context is not to be considered the same as “dog” as a center word because they are not. You basicly manipulate big matrices of occurences (word,context), trying to maximize the probability of these pairs that actually happen. Theoreticaly you could use as contexts bigrams, trying to maximize for instance the probability of (word=“for”, context=“to maximize”), and you would assign a vector representation to “to maximize”. We don’t do this because there would be too many representations to compute, and we would have a reeeeeally sparse matrix, but I think the idea is here : the fact that we use “1-grams” as context is just a particular case of all the kinds of context we could use.

This answer makes two points. First, dog as a center word and dog as a context word are simply not the same thing, which agrees with the footnote above. Second, it raises the question of why we use unigrams rather than bigrams as contexts: bigram contexts would require far too many representations and produce an extremely sparse matrix, and unigram contexts are just one special case of all the kinds of context we could use. That is a separate topic, outside what I want to discuss here, so I won't dig into it further.

Finally, another answerer, dust, brought up the same paper and pointed out that "we can also use only one vector to represent a word", a point that comes from Stanford CS 224n. Whether that actually works, and whether it offers any improvement, is left for interested readers to try for themselves.
————————————————
Copyright notice: this article is an original post by CSDN blogger "Sheep_0913", released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/weixin_42279926/article/details/106403211

/* One should feel small before the universe, before beauty, before wisdom; but among other people, one should be conscious of one's own dignity. */
Repost source: https://www.cnblogs.com/Arborday/p/14869252.html