CS224d Study Notes: Part 1

Please credit the source when reposting: http://www.cnblogs.com/NeighborhoodGuo/p/4751759.html

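Working through Stanford's CS224d open course was hugely rewarding. The course is tightly paced, the material is detailed, and it builds from simple to advanced. It gave me not only an overview of the various neural network models but also a fairly deep understanding of the principles behind NLP.

The course has three parts. Part 1 covers the basic principles of NLP and the fundamentals of DL. The DL fundamentals are also covered in UFLDL, but UFLDL explains them mostly through image-processing applications, whereas CS224d focuses on NLP applications. Part 2 concentrates on several DL models that work well for NLP. Part 3 covers one more DL model that works well for NLP; most of the remaining time goes to guest lectures by engineers from well-known companies on the practical issues of applying DL to NLP, which is very advanced material.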

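Part 1 summary

Representing words

As everyone knows, the basic unit of language is the sentence and the basic unit of a sentence is the word, so one of the most fundamental questions in NLP is: how do we represent a word?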

One-hot vector

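The simplest representation is the one-hot vector. Given a vocabulary of size |V|, each word is a |V| * 1 vector in which exactly one entry is one and all the others are zero. The drawbacks are obvious: 1. The vector length follows the vocabulary size, so every new word forces the vector to grow, which is a hassle (impossible to keep up to date). 2. It is too subjective. 3. With this many words, it takes human labor to create and adapt, which is a frightening thought. 4. Worst of all, it is hard to compute the similarity between words.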
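To make the last drawback concrete, here is a minimal sketch of one-hot vectors. The three-word vocabulary and the helper name `one_hot` are made up for illustration. The dot product of any two different one-hot vectors is zero, so the representation carries no notion of similarity.

```python
import numpy as np

# Toy vocabulary (hypothetical); a real |V| is in the tens of thousands or more
vocab = ["hotel", "motel", "cat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for `word`."""
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

hotel, motel = one_hot("hotel"), one_hot("motel")

# Dot product of two different words: always zero, however related they are
similarity = float(hotel @ motel)
print(similarity)
```

Adding a new word means every existing vector gains a dimension, which is the "impossible to keep up to date" problem from the list above.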

co-occurrence matrix

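A co-occurrence matrix (CM) is another way to represent words. Rows and columns are both indexed by word ID, and the value in each cell is determined by the neighbor. What is a neighbor? It is explained below. Each entry of the CM counts how many times another word appears within a given neighbor of a word. The CM has serious problems too: first, the matrix grows as the vocabulary grows; second, the dimensionality is very high (equal to the vocabulary size), so it needs a lot of storage; third, it is very sparse, which hurts later classification. What can be done? There are always more solutions than problems. Enter SVD, a powerful method from linear algebra that captures most of the useful information in a matrix using far fewer dimensions.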

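There are two main ways to define the neighbor: word-document and word-window. Word-document builds the CM using each document as the basic unit; word-window builds the CM using the words within a fixed distance of each word as the basic unit.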
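The word-window variant followed by SVD can be sketched as below. The three-sentence corpus, the window size of 1, and the choice of k = 2 dimensions are assumptions made for illustration, not settings taken from the course.

```python
import numpy as np

# Toy corpus (hypothetical); each sentence is tokenised by whitespace
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
word_to_id = {w: i for i, w in enumerate(vocab)}

def cooccurrence(sentences, window=1):
    """Count how often each word appears within `window` positions of another."""
    M = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, center in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    M[word_to_id[center], word_to_id[sent[j]]] += 1
    return M

M = cooccurrence(sentences, window=1)

# Keep only the top-k singular directions as dense word vectors
U, s, Vt = np.linalg.svd(M)
k = 2
embeddings = U[:, :k] * s[:k]
print(embeddings.shape)
```

Each row of `embeddings` is a k-dimensional vector for one word, in place of a sparse |V|-dimensional row of counts.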

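Improvements to the CM: some words contribute nothing to understanding meaning, and we can hack them away. These are mostly function words (such as the, a, an, of). Also, so far the value in each cell has simply been a count of neighbors, but reality is surely different: the closer a word is to a given word, the stronger the correlation. So the CM should be built with a ramped window, which amounts to multiplying each count by a weight that shrinks with distance (the weight is a bell-shaped function).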
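Both improvements can be sketched together. The stop-word set comes from the examples above; the linear ramp `(window - d + 1) / window` is an assumption chosen for simplicity (the text describes a bell-shaped weight), and `ramped_counts` is a hypothetical helper name.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of"}  # function words dropped before counting

def ramped_counts(tokens, window=3):
    """Distance-weighted co-occurrence counts after removing function words."""
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for d in range(1, window + 1):
            # Linear ramp (assumed): weight 1 for adjacent words, less further out
            weight = (window - d + 1) / window
            for j in (i - d, i + d):
                if 0 <= j < len(tokens):
                    counts[(center, tokens[j])] += weight
    return counts

counts = ramped_counts("the cat sat on the mat".split(), window=2)
print(counts[("cat", "sat")], counts[("cat", "on")])
```

Note that this sketch removes function words before measuring distance, so the remaining words move closer together; measuring distance on the original sentence is an equally reasonable choice.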

word2vec

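word2vec differs from the previous methods. Building a CM is a one-shot process (cast once and done), whereas this model is built up gradually, one word at a time. First set the initial parameters, then use them to predict the context of each incoming word. If the predicted context is close to the ground truth, leave the parameters alone; if not, penalize the parameters (a bit like training a puppy: food when it gets things right, a couple of smacks when it gets things wrong).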

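The core idea of word2vec is captured by two formulas, the skip-gram objective and its softmax probability, where T is the corpus length, c is the window size, v_c is the input vector of the center word, and u_o is the output vector of a context word:

$$J(\theta) = \frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

$$p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{|V|} \exp(u_w^{\top} v_c)}$$

Implementations tweak the details to improve performance and accuracy, but that is a story for later.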

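In the first formula, the goal is to maximize the probability of the neighbors that actually appear around the center word, ideally driving it to 1.

The second formula is how the probability on the far right of the first formula is computed. Each word has two representations: an input word vector and an output word vector.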
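As a rough sketch of that probability: with an input vector for the center word and an output vector for every word, the probability of each candidate context word is a softmax over dot products. The sizes, the random initialisation, and the function name `context_probabilities` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 3                                  # toy vocabulary size and vector dimension
input_vectors = rng.normal(size=(V, D))      # one "input" vector per word
output_vectors = rng.normal(size=(V, D))     # one "output" vector per word

def context_probabilities(center_id):
    """Softmax over the dot products of every output vector with the center's input vector."""
    scores = output_vectors @ input_vectors[center_id]
    scores = scores - scores.max()           # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = context_probabilities(center_id=2)

# Training pushes this loss down, i.e. pushes the true context word's probability up
loss = -np.log(probs[4])
print(probs.sum(), loss)
```

The denominator touches all V output vectors, which is the cost that negative sampling is designed to avoid.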

Close companions: Continuous Bag of Words Model (CBOW) and Skip-Gram Model

CBOW and skip-gram are very similar. The difference is that CBOW predicts the center word from its neighbors, whereas skip-gram predicts the neighbor words from the center word.
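The difference shows up in how training examples are formed. The sketch below only generates (input, target) examples for each model; the helper names and the toy sentence are hypothetical.

```python
def window(tokens, i, C):
    """Words within C positions of index i, excluding the word at i itself."""
    return [tokens[j] for j in range(max(0, i - C), min(len(tokens), i + C + 1)) if j != i]

def skipgram_pairs(tokens, C):
    """Skip-gram: one (center, neighbor) pair for every neighbor of every center word."""
    return [(center, ctx) for i, center in enumerate(tokens) for ctx in window(tokens, i, C)]

def cbow_pairs(tokens, C):
    """CBOW: one (neighbors, center) example per position."""
    return [(window(tokens, i, C), center) for i, center in enumerate(tokens)]

sentence = "the quick brown fox".split()
sg = skipgram_pairs(sentence, C=1)
cb = cbow_pairs(sentence, C=1)
print(sg[:3])
print(cb[1])
```

Skip-gram produces many single-word targets per center word, while CBOW produces one target per position with the whole window as input.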

The exact formulas are at the end of lecture note 1 from the course.

Negative Sampling

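Both CBOW and skip-gram rely on softmax. The softmax denominator sums over every possible outcome, and here the number of words is very large, so the computation is expensive. Negative sampling is used to optimize the algorithms above.

P_n(w) is the negative sampling distribution. The goal is to make the score of the correct word large and the scores of randomly sampled words paired with the center word small.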
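One common choice for P_n(w), used in the original word2vec work, is the unigram distribution raised to the 3/4 power. The sketch below assumes that choice; the word counts are invented, and `noise_distribution` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical corpus counts
counts = {"the": 100, "cat": 10, "sat": 5, "zebra": 1}
words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

def noise_distribution(freq, power=0.75):
    """Unigram counts raised to `power`, renormalised to sum to 1."""
    weighted = freq ** power
    return weighted / weighted.sum()

P_n = noise_distribution(freq)
unigram = freq / freq.sum()

# Draw K = 10 negative word indices for one (center, context) pair
rng = np.random.default_rng(0)
negatives = rng.choice(len(words), size=10, p=P_n)
print(dict(zip(words, P_n.round(3))))
```

Compared with the raw unigram distribution, the 3/4 power makes rare words somewhat more likely to be drawn and very frequent words somewhat less likely.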

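GloVe

From the discussion above, there are two ways to define a word's neighbor: word-document and word-window.

Word-document works well for extracting a word's topic (good semantic performance).

Word-window works well for extracting a word's grammatical role (good syntactic performance).

GloVe aims to combine the two, so that the resulting word representation captures both semantic and syntactic information well.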

These two papers explain it in detail: http://nlp.stanford.edu/pubs/glove.pdf http://www.aclweb.org/anthology/P12-1092

Code for CBOW, skip-gram, and negative sampling:

# Implement your skip-gram and CBOW models here
import random

import numpy as np

# Interface to the dataset for negative sampling
dataset = type('dummy', (), {})()

def dummySampleTokenIdx():
    return random.randint(0, 4)

def getRandomContext(C):
    tokens = ["a", "b", "c", "d", "e"]
    return tokens[random.randint(0, 4)], [tokens[random.randint(0, 4)] for i in range(2 * C)]

dataset.sampleTokenIdx = dummySampleTokenIdx
dataset.getRandomContext = getRandomContext

def softmaxCostAndGradient(predicted, target, outputVectors):
    """ Softmax cost function for word2vec models """
    ###################################################################
    # Implement the cost and gradients for one predicted word vector  #
    # and one target word vector as a building block for word2vec     #
    # models, assuming the softmax prediction function and cross      #
    # entropy loss.                                                   #
    # Inputs:                                                         #
    #   - predicted: numpy ndarray, predicted word vector (hat{r} in  #
    #           the written component)                                #
    #   - target: integer, the index of the target word               #
    #   - outputVectors: "output" vectors for all tokens              #
    # Outputs:                                                        #
    #   - cost: cross entropy cost for the softmax word prediction    #
    #   - gradPred: the gradient with respect to the predicted word   #
    #           vector                                                #
    #   - grad: the gradient with respect to all the other word       #
    #           vectors                                               #
    # We will not provide starter code for this function, but feel    #
    # free to reference the code you previously wrote for this        #
    # assignment!                                                     #
    ###################################################################

    ### YOUR CODE HERE
    # Score of every word against the predicted vector, shape (V,);
    # subtracting the max keeps np.exp from overflowing
    scores = np.dot(outputVectors, predicted)
    exp_scores = np.exp(scores - np.max(scores))
    probs = exp_scores / np.sum(exp_scores)

    # Cross entropy against the one-hot target (a scalar)
    cost = -np.log(probs[target])

    # Gradient of the cost with respect to the scores: probs - one_hot(target)
    delta = probs.copy()
    delta[target] -= 1.0

    gradPred = np.dot(outputVectors.T, delta)   # shape (D,)
    grad = np.outer(delta, predicted)           # shape (V, D)
    ### END YOUR CODE

    return cost, gradPred, grad

def sigmoid(x):
    """ Element-wise logistic function (written in an earlier part of the assignment) """
    return 1.0 / (1.0 + np.exp(-x))

def negSamplingCostAndGradient(predicted, target, outputVectors, K=10):
    """ Negative sampling cost function for word2vec models """
    ###################################################################
    # Implement the cost and gradients for one predicted word vector  #
    # and one target word vector as a building block for word2vec     #
    # models, using the negative sampling technique. K is the sample  #
    # size. You might want to use dataset.sampleTokenIdx() to sample  #
    # a random word index.                                            #
    # Input/Output Specifications: same as softmaxCostAndGradient     #
    # We will not provide starter code for this function, but feel    #
    # free to reference the code you previously wrote for this        #
    # assignment!                                                     #
    ###################################################################

    ### YOUR CODE HERE
    # Draw K negative samples (duplicates, or the target itself, may occur)
    neg_indexes = [dataset.sampleTokenIdx() for k in range(K)]

    # sigmoid(u_w . r) for every word w, shape (V,)
    r_W = np.dot(predicted, outputVectors.T)
    sigmoid_all = sigmoid(r_W)

    # 1 - sigmoid(x) equals sigmoid(-x), so this is the usual objective
    cost = -np.log(sigmoid_all[target]) - np.sum(np.log(1 - sigmoid_all[neg_indexes]))

    gradPred = -outputVectors[target, :] * (1 - sigmoid_all[target])
    gradPred += np.dot(sigmoid_all[neg_indexes], outputVectors[neg_indexes, :])

    grad = np.zeros(np.shape(outputVectors))
    grad[target, :] = -predicted * (1 - sigmoid_all[target])

    # Accumulate in a loop so that repeated negative samples are counted each time
    for neg_index in neg_indexes:
        grad[neg_index, :] += predicted * sigmoid_all[neg_index]
    ### END YOUR CODE

    return cost, gradPred, grad

def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors, word2vecCostAndGradient = softmaxCostAndGradient):
    """ Skip-gram model in word2vec """
    ###################################################################
    # Implement the skip-gram model in this function.                 #
    # Inputs:                                                         #
    #   - currentWord: a string of the current center word            #
    #   - C: integer, context size                                    #
    #   - contextWords: list of no more than 2*C strings, the context #
    #             words                                               #
    #   - tokens: a dictionary that maps words to their indices in    #
    #             the word vector list                                #
    #   - inputVectors: "input" word vectors for all tokens           #
    #   - outputVectors: "output" word vectors for all tokens         #
    #   - word2vecCostAndGradient: the cost and gradient function for #
    #             a prediction vector given the target word vectors,  #
    #             could be one of the two cost functions you          #
    #             implemented above                                   #
    # Outputs:                                                        #
    #   - cost: the cost function value for the skip-gram model       #
    #   - grad: the gradient with respect to the word vectors         #
    # We will not provide starter code for this function, but feel    #
    # free to reference the code you previously wrote for this        #
    # assignment!                                                     #
    ###################################################################

    ### YOUR CODE HERE
    # inputVectors VxD
    # outputVectors VxD

    # cost float
    # gradIn VxD
    # gradOut VxD
    cost = 0.0
    center = tokens[currentWord]
    predicted = inputVectors[center]
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)
    # The center word predicts each context word in turn; costs and gradients add up
    for contextWord in contextWords:
        target = tokens[contextWord]
        contextCost, contextGradPred, contextGrad = word2vecCostAndGradient(predicted, target, outputVectors)
        cost += contextCost
        gradIn[center, :] += contextGradPred
        gradOut += contextGrad
    ### END YOUR CODE

    return cost, gradIn, gradOut

def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors, word2vecCostAndGradient = softmaxCostAndGradient):
    """ CBOW model in word2vec """
    ###################################################################
    # Implement the continuous bag-of-words model in this function.   #
    # Input/Output specifications: same as the skip-gram model        #
    # We will not provide starter code for this function, but feel    #
    # free to reference the code you previously wrote for this        #
    # assignment!                                                     #
    ###################################################################

    ### YOUR CODE HERE
    # The center word is the prediction target; the context words are the input.
    # Integer indices are required: indexing with floats fails in current NumPy.
    target = tokens[currentWord]
    context_indices = [tokens[w] for w in contextWords]

    # The hidden vector h is the mean of the context words' input vectors
    h = np.mean(inputVectors[context_indices], axis=0)

    cost, gradH, gradOut = word2vecCostAndGradient(h, target, outputVectors)

    # Each context word contributes 1/len(context) of h, so it receives that
    # share of the gradient; repeated context words accumulate
    gradIn = np.zeros(inputVectors.shape)
    for context_index in context_indices:
        gradIn[context_index] += gradH / len(context_indices)
    ### END YOUR CODE

    return cost, gradIn, gradOut

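Evaluating word vectors

Intrinsic evaluation is a quick, simple way to evaluate a VSM. Instead of evaluating inside the full system, it evaluates only a subtask. It is fast and helps you understand the system, but it does not tell you whether the vectors will perform well in a real system. The first kind of intrinsic evaluation is syntactic, which has relatively few problems. The second is semantic, which suffers from polysemy and from outdated corpus data; both affect the results. GloVe word vectors are the model with the best intrinsic evaluation results so far, and asymmetric context, which uses only the words in the left window, performs worse. More training time and more data both help the results considerably.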
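One widely used intrinsic subtask is the word analogy test ("a is to b as c is to ?"), scored by cosine similarity. The sketch below assumes that test; the three-dimensional vectors are hand-made so the arithmetic works out exactly and do not come from any trained model.

```python
import numpy as np

# Hand-made toy vectors (hypothetical, not trained)
vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.0, 0.0, -1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the three inputs."""
    query = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

answer = analogy("man", "woman", "king")
print(answer)
```

An evaluation run repeats this over a list of analogy questions and reports the fraction answered correctly, separately for the syntactic and semantic question sets.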

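Extrinsic evaluation puts the VSM into a real task. It takes longer, and if the results are poor it is unclear whether the problem lies in the VSM, in another module, or in their interaction. A simple check is to replace the subsystem with another one: if accuracy improves, keep the replacement!

The polysemy problem

What if a word has several meanings? If you simply treat it as one mean vector, you are effectively adding together the vectors of two different meanings, which is clearly inaccurate. The Notes explain the solution in detail; it is quoted here: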

1. Gather fixed size context windows of all occurrences of the word (for instance, 5 before and 5 after)
2. Each context is represented by a weighted average of the context words’ vectors (using idf-weighting)
3. Apply spherical k-means to cluster these context representations.
4. Finally, each word occurrence is re-labeled to its associated cluster and is used to train the word representation for that cluster.

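In short: use k-means to cluster similar contexts, assign a sense of the word to each centroid, relabel each context with the sense it belongs to, and then train with the ordinary method from before. This solves the polysemy problem.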
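Steps 2 and 3 can be sketched as follows. The two-dimensional word vectors, the uniform idf weights, the four contexts of the word "bank", and the farthest-point initialisation are all assumptions made so the example stays small and deterministic.

```python
import numpy as np

# Hypothetical pre-trained vectors and idf weights for the context words
word_vec = {
    "river": np.array([1.0, 0.1]), "water": np.array([0.9, 0.2]),
    "money": np.array([0.1, 1.0]), "loan":  np.array([0.2, 0.9]),
}
idf = {w: 1.0 for w in word_vec}

def context_vector(context):
    """Step 2: idf-weighted average of the context words' vectors."""
    weights = np.array([idf[w] for w in context])
    vecs = np.array([word_vec[w] for w in context])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

def spherical_kmeans(X, k, n_iter=20):
    """Step 3: k-means on unit vectors, assigning points by cosine similarity."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Deterministic farthest-point initialisation (an assumption of this sketch)
    centroids = [X[0]]
    for _ in range(1, k):
        sims = np.max(X @ np.array(centroids).T, axis=1)
        centroids.append(X[np.argmin(sims)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            total = X[labels == j].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:                      # leave empty clusters unchanged
                centroids[j] = total / norm
    return np.argmax(X @ centroids.T, axis=1), centroids

# Four occurrences of "bank", each with its surrounding context words
contexts = [["river", "water"], ["money", "loan"], ["water", "water"], ["loan", "money"]]
X = np.array([context_vector(c) for c in contexts])
labels, _ = spherical_kmeans(X, k=2)

# Step 4 would relabel each occurrence, e.g. "bank_0" or "bank_1", before retraining
relabeled = ["bank_%d" % label for label in labels]
print(relabeled)
```

After relabeling, each sense is an ordinary vocabulary entry, so the usual training procedure learns one vector per sense.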

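NN basics

Basics of NN: link

Basics of BP: link

Both are covered in great detail on Stanford's public site, so I will not repeat them here.

The paper Practical recommendations for gradient-based training of deep architectures is excellent. It gives practical methods and advice for tuning the various parameters of deep architectures.

It first covers the various non-linear functions,

then gradient checking, parameter initialization, learning rate tuning, and ways to prevent overfitting.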
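A gradient check compares an analytic gradient with a centred finite difference, one parameter at a time. The sketch below is a minimal version under assumed settings: the step size, the tolerance, and the name `gradcheck` are illustrative choices. It uses `np.nditer` to visit every index of the parameter array and a lambda to define the function under test.

```python
import numpy as np

def gradcheck(f, x, h=1e-4, tol=1e-5):
    """f(x) returns (cost, grad); return True if grad matches finite differences."""
    _, grad = f(x)
    it = np.nditer(x, flags=["multi_index"], op_flags=["readwrite"])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        f_plus, _ = f(x)
        x[ix] = old - h
        f_minus, _ = f(x)
        x[ix] = old                           # restore the parameter
        numeric = (f_plus - f_minus) / (2 * h)
        rel_err = abs(numeric - grad[ix]) / max(1.0, abs(numeric), abs(grad[ix]))
        if rel_err > tol:
            return False
        it.iternext()
    return True

# f(x) = sum(x^2) has gradient 2x; the second lambda returns a deliberately wrong gradient
quad = lambda x: (np.sum(x ** 2), 2 * x)
wrong = lambda x: (np.sum(x ** 2), 3 * x)

x0 = np.array([[1.0, -2.0], [0.5, 3.0]])
ok = gradcheck(quad, x0.copy())
caught = not gradcheck(wrong, x0.copy())
print(ok, caught)
```

The same check can be pointed at any function that returns a cost and its gradient for an array of parameters.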

Homework notes

How to use nditer:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.nditer.html?highlight=nditer#numpy.nditer

http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html#arrays-nditer

Using lambda expressions in Python:

http://blog.sina.com.cn/s/blog_6163bdeb01018046.html

http://www.cnpythoner.com/post/97.html

------------>>>>>>>> Part 2