Deep Learning 5: Gradient Descent

Gradients add up at forks. If a variable branches out to different parts of the circuit, then the gradients that flow back to it will add.

Example:

There are 3 paths from x to f, and 2 paths from y to f.

At the fork of x, the variable branches out into 3 paths that converge at f, so when computing df/dx the 3 gradients flowing back along those paths are summed.

At the fork of y, the variable branches out into 2 paths that converge at f, so when computing df/dy the 2 gradients flowing back along those paths are summed.
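
The original circuit figure is not reproduced here; as a concrete stand-in, the sketches below assume the hypothetical function f(x, y) = x*y + x*x + x + y, which gives exactly 3 paths from x to f (through x*y, x*x, and x) and 2 paths from y to f (through x*y and y).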

forward pass:
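
A minimal sketch of the forward pass for this assumed circuit, breaking f into one node per path:

```python
# hypothetical circuit: f(x, y) = x*y + x*x + x + y
x, y = 3.0, -2.0

# forward pass: evaluate each branch, then combine them at f
a = x * y            # uses x (path 1 of 3) and y (path 1 of 2)
b = x * x            # uses x (path 2 of 3)
c = x                # uses x (path 3 of 3)
d = y                # uses y (path 2 of 2)
f = a + b + c + d    # -6.0 + 9.0 + 3.0 + (-2.0) = 4.0
```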

backward pass:
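
And the corresponding backward pass, where the per-path gradients are summed at each fork:

```python
# backward pass: df/df = 1.0 flows back equally through the sum node
df_da = df_db = df_dc = df_dd = 1.0

# gradient contribution from each path
dx_from_a = df_da * y       # d(x*y)/dx = y   -> -2.0
dx_from_b = df_db * 2 * x   # d(x*x)/dx = 2x  ->  6.0
dx_from_c = df_dc * 1.0     # d(x)/dx   = 1   ->  1.0
dy_from_a = df_da * x       # d(x*y)/dy = x   ->  3.0
dy_from_d = df_dd * 1.0     # d(y)/dy   = 1   ->  1.0

# gradients add up at the forks
df_dx = dx_from_a + dx_from_b + dx_from_c   # 5.0 (= y + 2x + 1)
df_dy = dy_from_a + dy_from_d               # 4.0 (= x + 1)
```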

The size of the mini-batch is a hyperparameter but it is not very common to cross-validate it. It is usually based on memory constraints (if any), or set to some value, e.g. 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2.
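
To make the mini-batch idea concrete, here is a minimal mini-batch gradient descent sketch on a made-up linear regression problem (the data, the loss_and_grad helper, and all hyperparameter values are illustrative assumptions, not from the original post):

```python
import numpy as np

# toy data for illustration: linear regression with Gaussian noise
rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w + 0.1 * rng.normal(size=N)

def loss_and_grad(w, X_batch, y_batch):
    """Mean squared error loss and its gradient on one mini-batch."""
    err = X_batch @ w - y_batch
    loss = 0.5 * np.mean(err ** 2)
    grad = X_batch.T @ err / len(y_batch)
    return loss, grad

w = np.zeros(D)
batch_size = 64        # power of 2, per the memory/vectorization note above
learning_rate = 1e-2

for it in range(500):
    # sample a random mini-batch and take one gradient step on it
    idx = rng.choice(N, size=batch_size, replace=False)
    loss, grad = loss_and_grad(w, X[idx], y[idx])
    w -= learning_rate * grad
```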

References:

1. http://cs231n.github.io/optimization-2/

2. https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

Original article: https://www.cnblogs.com/wordchao/p/9233247.html