CS224n assignment1 Q2 Neural Network Basics

(a) Derive the derivative of the sigmoid function

y = 1/(1+exp(-x))

Answer:
y' = exp(-x) / (1 + exp(-x))^2
   = [1 / (1 + exp(-x))] * [exp(-x) / (1 + exp(-x))]
   = y * (1 - y)
The derivative of the sigmoid has a very concise form, which is one reason the sigmoid function is so widely used.
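
A quick numerical check of this identity (a standalone sketch, separate from the assignment code further below):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = sigmoid(x)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # centered difference
analytic = y * (1 - y)                                  # the claimed y' = y(1 - y)
print(np.max(np.abs(numeric - analytic)))               # ~1e-10, so the formula checks out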


(b) Derive the gradient when cross-entropy is used as the loss function, where the label y is a one-hot vector

Answer:

For a score vector f with predicted probabilities p = softmax(f), the gradient of the cross-entropy loss with respect to f is dJ/df = p - y: each entry equals the predicted probability of that class, except that the correct class's entry has 1 subtracted from it. For example, if the probability vector is p = [0.2, 0.3, 0.5] and the second class is the correct one, the gradient with respect to the scores is df = [0.2, -0.7, 0.5].
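
A quick numerical check of this result (a standalone sketch; the helper names are illustrative, not from the assignment code):

import numpy as np

def softmax(f):
    e = np.exp(f - np.max(f))           # shift for numerical stability
    return e / np.sum(e)

def cross_entropy(f, y):
    return -np.sum(y * np.log(softmax(f)))

y = np.array([0.0, 1.0, 0.0])           # one-hot: class 2 is correct
f = np.array([0.3, 0.7, 1.2])           # an arbitrary score vector

analytic = softmax(f) - y               # the claimed gradient p - y

h = 1e-6
numeric = np.zeros_like(f)
for i in range(len(f)):
    e_i = np.zeros_like(f); e_i[i] = h
    numeric[i] = (cross_entropy(f + e_i, y) - cross_entropy(f - e_i, y)) / (2 * h)

print(np.max(np.abs(numeric - analytic)))   # tiny, confirming dJ/df = p - y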


(c) Find the gradient of a one-hidden-layer neural network with respect to its input x



That is, find dJ/dx, the gradient of the cross-entropy cost with respect to the input x.

Answer:
Let z1 = x W1 + b1, h = sigmoid(z1), z2 = h W2 + b2, and y_hat = softmax(z2).
Working backwards with the chain rule:
    dJ/dz2 = y_hat - y                          (from part (b))
    dJ/dz1 = (dJ/dz2) W2^T * h * (1 - h)        (from part (a); * is element-wise)
    dJ/dx  = (dJ/dz1) W1^T
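
A small numerical check of this expression (a standalone sketch with made-up dimensions; the helper names are illustrative and not from the assignment's starter code):

import numpy as np

np.random.seed(0)
Dx, H, Dy = 4, 3, 5
x = np.random.randn(1, Dx)
W1, b1 = np.random.randn(Dx, H), np.random.randn(1, H)
W2, b2 = np.random.randn(H, Dy), np.random.randn(1, Dy)
y = np.zeros((1, Dy)); y[0, 2] = 1                 # one-hot target

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cost(x):
    h = sigmoid(x.dot(W1) + b1)
    y_hat = softmax(h.dot(W2) + b2)
    return -np.sum(y * np.log(y_hat))

# analytic gradient from the chain rule above
h = sigmoid(x.dot(W1) + b1)
y_hat = softmax(h.dot(W2) + b2)
d2 = y_hat - y
d1 = d2.dot(W2.T) * h * (1 - h)
dx = d1.dot(W1.T)

# centered-difference check, one input coordinate at a time
eps = 1e-6
num = np.zeros_like(x)
for i in range(Dx):
    e = np.zeros_like(x); e[0, i] = eps
    num[0, i] = (cost(x + e) - cost(x - e)) / (2 * eps)

print(np.max(np.abs(num - dx)))   # should be tiny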


(d) How many parameters does the network in the previous part have?

The input is Dx-dimensional, the output is Dy-dimensional, and there are H hidden units.

Answer:
The input is Dx-dimensional and the hidden layer has H units, so W1 has Dx*H parameters and b1 has H parameters.
Likewise, W2 has H*Dy parameters and b2 has Dy parameters.
In total the network has (Dx + 1)*H + (H + 1)*Dy parameters.
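
For example (the concrete dimensions below are just an illustration, similar to the sizes the assignment's sanity checks use):

Dx, H, Dy = 10, 5, 10                    # example dimensions
n_params = (Dx + 1) * H + (H + 1) * Dy   # W1 + b1 + W2 + b2
print(n_params)                          # 55 + 60 = 115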


(e) Implement sigmoid and its gradient

import numpy as np


def sigmoid(x):
    """
    Compute the sigmoid function for the input here.

    Arguments:
    x -- A scalar or numpy array.

    Return:
    s -- sigmoid(x)
    """
    s = 1/(1+np.exp(-x))
    return s


def sigmoid_grad(s):
    """
    Compute the gradient for the sigmoid function here. Note that
    for this implementation, the input s should be the sigmoid
    function value of your original input x.

    Arguments:
    s -- A scalar or numpy array.

    Return:
    ds -- Your computed gradient.
    """
    ds = s*(1-s)
    return ds
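
A quick usage check (the input values are just illustrative, not prescribed by the assignment):

x = np.array([[1, 2], [-1, -2]])
s = sigmoid(x)
print(s)                 # [[0.73106 0.88080], [0.26894 0.11920]] (rounded)
print(sigmoid_grad(s))   # [[0.19661 0.10499], [0.19661 0.10499]] (rounded)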

(f) Implement a gradient checker

Compute the gradient numerically, straight from the definition:
    f'(x) ≈ (f(x+h) - f(x)) / h
The implementation below actually uses the centered difference, f'(x) ≈ (f(x+h) - f(x-h)) / (2h), which has a smaller asymptotic error.

import random

import numpy as np


def gradcheck_naive(f, x):
    """ Gradient check for a function f.

    Arguments:
    f -- a function that takes a single argument and outputs the
         cost and its gradients
    x -- the point (numpy array) to check the gradient at
    """

    rndstate = random.getstate()
    random.setstate(rndstate)
    fx, grad = f(x) # Evaluate function value at original point
    h = 1e-4        # Do not change this!

    # Iterate over all indexes ix in x to check the gradient.
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index

        # Try modifying x[ix] with h defined above to compute numerical
        # gradients (numgrad).

        # Use the centered difference of the gradient.
        # It has smaller asymptotic error than forward / backward difference
        # methods. If you are curious, check out here:
        # https://math.stackexchange.com/questions/2326181/when-to-use-forward-or-central-difference-approximations

        # Make sure you call random.setstate(rndstate)
        # before calling f(x) each time. This will make it possible
        # to test cost functions with built in randomness later.

        old_val = x[ix]
        x[ix] = old_val - h
        random.setstate(rndstate)
        (fxh1,_) = f(x)

        x[ix] = old_val + h
        random.setstate(rndstate)
        (fxh2,_) = f(x)

        numgrad = (fxh2-fxh1)/(2*h)
        x[ix] = old_val

        # Compare gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print ("Gradient check failed.")
            print ("First gradient error found at index %s" % str(ix))
            print ("Your gradient: %f 	 Numerical gradient: %f" % (
                grad[ix], numgrad))
            return

        it.iternext() # Step to next dimension

    print ("Gradient check passed!")
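
A minimal sanity check in the spirit of the one that ships with the assignment: f(x) = sum(x^2) has gradient 2x, so the checker should pass on a scalar, a vector and a matrix (treat the exact shapes as an assumption):

quad = lambda x: (np.sum(x ** 2), x * 2)      # returns (cost, gradient)

gradcheck_naive(quad, np.array(123.456))      # scalar test
gradcheck_naive(quad, np.random.randn(3,))    # 1-D test
gradcheck_naive(quad, np.random.randn(4, 5))  # 2-D test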

(g) Implement a network with a single hidden layer and a sigmoid activation.

import numpy as np

# sigmoid and sigmoid_grad are the functions from part (e); softmax is the
# implementation from Q1 of the assignment.


def forward_backward_prop(X, labels, params, dimensions):
    """
    Forward and backward propagation for a two-layer sigmoidal network

    Compute the forward propagation and for the cross entropy cost,
    the backward propagation for the gradients for all parameters.

    Notice the gradients computed here are different from the gradients in
    the assignment sheet: they are w.r.t. weights, not inputs.

    Arguments:
    X -- M x Dx matrix, where each row is a training example x.
    labels -- M x Dy matrix, where each row is a one-hot vector.
    params -- Model parameters, these are unpacked for you.
    dimensions -- A tuple of input dimension, number of hidden units
                  and output dimension
    """

    ### Unpack network parameters (do not modify)
    ofs = 0
    Dx, H, Dy = (dimensions[0], dimensions[1], dimensions[2])

    W1 = np.reshape(params[ofs:ofs+ Dx * H], (Dx, H))
    ofs += Dx * H
    b1 = np.reshape(params[ofs:ofs + H], (1, H))
    ofs += H
    W2 = np.reshape(params[ofs:ofs + H * Dy], (H, Dy))
    ofs += H * Dy
    b2 = np.reshape(params[ofs:ofs + Dy], (1, Dy))

    # Note: compute cost based on `sum` not `mean`.
    ### YOUR CODE HERE: forward propagation
    h = sigmoid(X.dot(W1) + b1)
    y_hat = softmax(h.dot(W2) + b2)
    ### END YOUR CODE

    ### YOUR CODE HERE: backward propagation
    cost = -np.sum(np.log(y_hat[labels == 1]))  # sum over examples, per the note above

    d3 = y_hat - labels
    gradW2 = np.dot(h.T,d3)
    gradb2 = np.sum(d3,axis = 0,keepdims=True)

    dh = np.dot(d3,W2.T)
    grad_h = sigmoid_grad(h) * dh 

    gradW1 = np.dot(X.T,grad_h)
    gradb1 = np.sum(grad_h,axis = 0)
    ### END YOUR CODE

    ### Stack gradients (do not modify)
    grad = np.concatenate((gradW1.flatten(), gradb1.flatten(),
        gradW2.flatten(), gradb2.flatten()))

    return cost, grad
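
Finally, the backward pass can be verified against the numerical gradient using gradcheck_naive from part (f). The sizes below (20 examples, dimensions [10, 5, 10]) mirror a typical sanity-check setup and are an assumption, not something fixed by the text above:

N, dimensions = 20, [10, 5, 10]
data = np.random.randn(N, dimensions[0])                  # each row is a training example
labels = np.zeros((N, dimensions[2]))
for i in range(N):
    labels[i, np.random.randint(0, dimensions[2])] = 1    # random one-hot rows

param_size = ((dimensions[0] + 1) * dimensions[1] +
              (dimensions[1] + 1) * dimensions[2])        # (Dx+1)H + (H+1)Dy, as in (d)
params = np.random.randn(param_size)

gradcheck_naive(lambda p: forward_backward_prop(data, labels, p, dimensions),
                params)                                   # should print "Gradient check passed!"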
Original post: https://www.cnblogs.com/bernieloveslife/p/10240733.html