Andrew Ng Deep Learning Notes: Course 2 Week 2 Quiz

1. Question 1

Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?

$a^{[8]\{3\}(7)}$

$a^{[3]\{8\}(7)}$  √

$a^{[8]\{7\}(3)}$

$a^{[3]\{7\}(8)}$


2. Question 2

Which of these statements about mini-batch gradient descent do you agree with?

Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.

You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).

One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.  √
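
To make the distinction concrete, here is a minimal NumPy sketch (not from the course assignments) of one epoch of mini-batch gradient descent; `forward_backward` and the `params` dictionary are hypothetical placeholders for whatever model you are training.

```python
import numpy as np

def run_one_epoch(X, Y, params, forward_backward, learning_rate=0.01,
                  batch_size=64, seed=0):
    """One epoch of mini-batch gradient descent.

    X: (n_x, m) inputs, Y: (1, m) labels.
    forward_backward(X_batch, Y_batch, params) -> (cost, grads) is a
    hypothetical helper that does the vectorized pass over one mini-batch.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                  # shuffle the examples each epoch
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]

    # Explicit loop over mini-batches: each iteration is cheap because it only
    # touches batch_size examples, while the computation *within* a mini-batch
    # stays vectorized across its columns.
    for start in range(0, m, batch_size):
        X_batch = X_shuf[:, start:start + batch_size]
        Y_batch = Y_shuf[:, start:start + batch_size]
        cost, grads = forward_backward(X_batch, Y_batch, params)
        for key in params:                     # one gradient step per mini-batch
            params[key] -= learning_rate * grads[key]
    return params
```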


3. Question 3

Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.

If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.  √

If the mini-batch size is 1, you end up having to process the entire training set before making any progress.

If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.  √


4. Question 4

Suppose your learning algorithm’s cost $J$, plotted as a function of the number of iterations, looks like this:

Which of the following do you agree with?

Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.

If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.

Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.

If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.  √


5. Question 5

Suppose the temperature in Casablanca over the first three days of January is the same:

Jan 1st: $\theta_1 = 10^{\circ}C$

Jan 2nd: $\theta_2 = 10^{\circ}C$

(We used Fahrenheit in lecture, so will use Celsius here in honor of the metric world.)

Say you use an exponentially weighted average with $\beta = 0.5$ to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{corrected}$ is the value you compute with bias correction, what are these values? (You should be able to do this without a calculator; remember what bias correction is doing.)

$v_2 = 10$, $v_2^{corrected} = 10$

$v_2 = 7.5$, $v_2^{corrected} = 10$  √

$v_2 = 7.5$, $v_2^{corrected} = 7.5$

$v_2 = 10$, $v_2^{corrected} = 7.5$
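
A quick sanity check of the arithmetic, as a small sketch: with β = 0.5 and both days at 10°C, the uncorrected average starts low, and dividing by 1 − β^t removes the startup bias.

```python
beta = 0.5
thetas = [10.0, 10.0]              # Jan 1st and Jan 2nd temperatures (degrees C)

v = 0.0
for theta in thetas:               # v_t = beta * v_{t-1} + (1 - beta) * theta_t
    v = beta * v + (1 - beta) * theta

v_corrected = v / (1 - beta ** len(thetas))   # divide by (1 - beta^t) with t = 2
print(v, v_corrected)              # 7.5 10.0
```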


6. Question 6

Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.

$\alpha = \frac{1}{\sqrt{t}}\,\alpha_0$

$\alpha = \frac{1}{1 + 2t}\,\alpha_0$

$\alpha = e^t \alpha_0$  √

$\alpha = 0.95^t \alpha_0$
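
For intuition, a small sketch (with a hypothetical α₀ = 0.2) that prints the four schedules over the first few epochs; the $e^t$ scheme grows instead of decaying.

```python
import numpy as np

alpha0 = 0.2                                # hypothetical initial learning rate
for t in range(1, 6):                       # epoch number t = 1..5
    inv_sqrt   = alpha0 / np.sqrt(t)        # 1/sqrt(t) decay
    inv_linear = alpha0 / (1 + 2 * t)       # 1/(1 + 2t) decay
    exp_decay  = alpha0 * 0.95 ** t         # exponential decay
    exp_growth = alpha0 * np.e ** t         # e^t: explodes, so not a decay scheme
    print(t, round(inv_sqrt, 4), round(inv_linear, 4),
          round(exp_decay, 4), round(exp_growth, 4))
```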


7. Question 7

You use an exponentially weighted average on the London temperature dataset, tracking the temperature with $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. The red line below was computed using $\beta = 0.9$. What would happen to your red curve as you vary $\beta$? (Check the two that apply)

Decreasing $\beta$ will shift the red line slightly to the right.

Increasing $\beta$ will shift the red line slightly to the right.  √

Decreasing $\beta$ will create more oscillation within the red line.  √

Increasing $\beta$ will create more oscillation within the red line.
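
A minimal sketch with synthetic temperatures (not the actual London data) showing how β controls the behavior: a larger β averages over roughly 1/(1−β) days, so the curve is smoother but lags behind (shifts right), while a smaller β tracks the noisy daily readings more closely.

```python
import numpy as np

def ewa(thetas, beta):
    """Exponentially weighted average: v_t = beta*v_{t-1} + (1-beta)*theta_t."""
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

rng = np.random.default_rng(0)
temps = 10 + 5 * np.sin(np.linspace(0, 6, 365)) + rng.normal(0, 2, 365)

smooth_slow = ewa(temps, beta=0.98)   # ~50-day average: smoother, lags more
smooth_fast = ewa(temps, beta=0.5)    # ~2-day average: noisier, tracks quickly
print(temps[:3].round(2), smooth_fast[:3].round(2), smooth_slow[:3].round(2))
```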


8. Question 8

Consider this figure:

These plots were generated with gradient descent, with gradient descent with momentum ($\beta = 0.5$), and with gradient descent with momentum ($\beta = 0.9$). Which curve corresponds to which algorithm?

(1) is gradient descent with momentum (small $\beta$). (2) is gradient descent. (3) is gradient descent with momentum (large $\beta$)

(1) is gradient descent. (2) is gradient descent with momentum (large $\beta$). (3) is gradient descent with momentum (small $\beta$)

(1) is gradient descent. (2) is gradient descent with momentum (small $\beta$). (3) is gradient descent with momentum (large $\beta$)  √

(1) is gradient descent with momentum (small $\beta$), (2) is gradient descent with momentum (small $\beta$), (3) is gradient descent
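
For reference, a minimal sketch of the momentum update rule (generic parameter and gradient dictionaries, hypothetical names): with β = 0 it reduces to plain gradient descent, and a larger β damps the oscillations more strongly, which is why the larger-β curve is the smoothest.

```python
def momentum_step(params, grads, velocities, learning_rate=0.01, beta=0.9):
    """One gradient-descent-with-momentum update.

    velocities holds v_dW / v_db for each parameter, initialized to zeros.
    beta = 0 recovers plain gradient descent; beta = 0.9 smooths the steps.
    """
    for key in params:
        velocities[key] = beta * velocities[key] + (1 - beta) * grads[key]
        params[key] -= learning_rate * velocities[key]
    return params, velocities
```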


9. Question 9

Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $\mathcal{J}(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for $\mathcal{J}$? (Check all that apply)

Try initializing all the weights to zero

Try using Adam   √

Try mini-batch gradient descent  √

Try tuning the learning rate $\alpha$  √

Try better random initialization for the weights  √
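
Regarding "better random initialization", one common choice is He initialization, which scales each layer's weights by the size of the previous layer; a minimal sketch under that assumption (the helper name and layer sizes are hypothetical):

```python
import numpy as np

def initialize_he(layer_dims, seed=0):
    """He initialization: W[l] ~ N(0, 2/n_{l-1}), b[l] = 0 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = (rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
                           * np.sqrt(2.0 / layer_dims[l - 1]))
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

# Example: a network with 784 inputs, two hidden layers, and 10 outputs.
params = initialize_he([784, 128, 64, 10])
```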


10. Question 10

Which of the following statements about Adam is False?

Adam should be used with batch gradient computations, not with mini-batches.  √

Adam combines the advantages of RMSProp and momentum.

We usually use “default” values for the hyperparameters $\beta_1$, $\beta_2$, and $\varepsilon$ in Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$).

The learning rate hyperparameter $\alpha$ in Adam usually needs to be tuned.
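
For reference, a minimal sketch of a single Adam update with the default hyperparameters listed above (the parameter and gradient dictionaries are hypothetical); the same step applies whether the gradients come from a mini-batch or the full batch.

```python
import numpy as np

def adam_step(params, grads, v, s, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update. v and s hold the momentum and RMSProp moving averages."""
    for key in params:
        # Momentum-style moving average of the gradients.
        v[key] = beta1 * v[key] + (1 - beta1) * grads[key]
        # RMSProp-style moving average of the squared gradients.
        s[key] = beta2 * s[key] + (1 - beta2) * grads[key] ** 2
        # Bias-corrected estimates (t is the update count, starting at 1).
        v_hat = v[key] / (1 - beta1 ** t)
        s_hat = s[key] / (1 - beta2 ** t)
        params[key] -= learning_rate * v_hat / (np.sqrt(s_hat) + epsilon)
    return params, v, s
```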

--------------------------------------------------------- Chinese version ---------------------------------------------------------

    1. When the input is the 7th example from the 8th mini-batch, which notation denotes the 3rd layer's activations?

      • 【√】 $a^{[3]\{8\}(7)}$

      Note: the superscript $[i]\{j\}(k)$ means layer $i$, mini-batch $j$, example $k$.

    2. Which of these statements about mini-batch gradient descent is correct?

      • 【 】 You should implement mini-batch gradient descent without an explicit for-loop over the different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).
      • 【 】 Training one epoch (one pass through the whole training set) with mini-batch gradient descent is faster than training one epoch with batch gradient descent.
      • 【√】 One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.

      Note: vectorization does not let you compute multiple mini-batches at the same time.

    3. Why is the best mini-batch size usually not 1 and not m, but something in between?

      • 【√】 If the mini-batch size is 1, you lose the benefit of vectorizing across the examples in the mini-batch.
      • 【√】 If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making any progress.
    4. If your model's cost J, plotted against the number of iterations, looks like the figure below, then:
      [Figure 1: plot of cost J versus the number of iterations]

      • 【√】 If you are using mini-batch gradient descent, this looks acceptable. But if you are using batch gradient descent, something is wrong.

      Note: mini-batch gradient descent shows some oscillation because individual mini-batches can contain noisy data. Batch gradient descent, however, should reach a lower J on every iteration until it converges to the optimum.

    5. Suppose the temperature in Casablanca over the first three days of January is the same:
      Jan 1st: $\theta_1 = 10^{\circ}C$

      Jan 2nd: $\theta_2 = 10^{\circ}C$

      Suppose you use an exponentially weighted average with $\beta = 0.5$ to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{corrected}$ is the value computed with bias correction, which of the following is correct?

      • 【√】 $v_2 = 7.5$, $v_2^{corrected} = 10$
    6. Which of the following is NOT a good learning rate decay scheme?

      • 【√】 $\alpha = e^t \alpha_0$

      Note: this makes the learning rate explode instead of decaying.

    7. You use an exponentially weighted average on the London temperature dataset, tracking the temperature with $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. The red line below was computed using $\beta = 0.9$. What happens to the red curve as you vary $\beta$?
      [Figure 2: exponentially weighted average (red line) over the temperature data]

      • 【√】 Increasing $\beta$ shifts the red line slightly to the right.
      • 【√】 Decreasing $\beta$ creates more oscillation within the red line.
    8. Consider this figure:
      [Figure 3: three optimization trajectories labeled (1), (2), (3)]
      These plots were generated with gradient descent, with gradient descent with momentum ($\beta = 0.5$), and with gradient descent with momentum ($\beta = 0.9$). Which curve corresponds to which algorithm?
      • 【√】 (1) is gradient descent. (2) is gradient descent with momentum (small $\beta$). (3) is gradient descent with momentum (large $\beta$).

    9. Suppose batch gradient descent in a deep network is taking too long to find parameter values that achieve a small value of the cost function $\mathcal{J}(W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for $\mathcal{J}$?

      • 【√】 Try using the Adam algorithm
      • 【√】 Try better random initialization for the weights
      • 【√】 Try tuning the learning rate $\alpha$
      • 【√】 Try mini-batch gradient descent
      • 【 】 Try initializing all the weights to zero
    10. Which of the following statements about the Adam algorithm is false?

      • 【√】 Adam should be used with batch gradient computations, not with mini-batches.

      Note: Adam can be used with either batch or mini-batch gradient descent.

Original post: https://www.cnblogs.com/Dar-/p/9393032.html