DTU DeepLearning: exercise 6

Does batch_size have any affects in results quality? how to set the optimal batch size and number of iterations?

1. the batch size, a : the number of training data feed into neural network once.

2. the number of iterations, b : how many time we should feed into NN.

a*b = the number of training examples.

Batch Gradient Descent. Batch Size = Size of Training Set
Stochastic Gradient Descent. Batch Size = 1
Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

from here. The "sample size" you're talking about is referred to as batch size,

To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Here's the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types. [2]

Batch gradient descent,
Online stochastic gradient descent:
Mini-batch stochastic gradient descent:

Note that with 1, the loss function is no longer a random variable and is not a stochastic approximation.

SGD converges faster than normal "batch" gradient descent because it updates the weights after looking at a randomly selected subset of the training set. Let

Batch gradient descent updates the weights

Each time we take a sample and update our weights it is called a mini-batch. Each time we run through the entire dataset, it's called an epoch.

Let's say that we have some data vector

C = ⌈ T / B ⌉

For simplicity we can assume that T is evenly divisible by B. Although, when this is not the case, as it often is not, proper weight should be assigned to each mini-batch as a function of its size.

An iterative algorithm for SGD with

Note: in real life we're reading these training example data from memory and, due to cache pre-fetching and other memory tricks done by your computer, your algorithm will run faster if the memory accesses are coalesced, i.e. when you read the memory in order and don't jump around randomly. So, most SGD implementations shuffle the dataset and then load the examples into memory in the order that they'll be read.

The major parameters for the vanilla (no momentum) SGD described above are:

Learning Rate:

I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.

ϵ (t) : N \to R

If you want to have the learning rate fixed, just define epsilon as a constant function.

Batch Size

Batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal is going to be, the higher it is, the longer it will take to compute the gradient for each step.

Citations & Further Reading:

https://medium.com/@BorisAKnyazev/tutorial-on-graph-neural-networks-for-computer-vision-and-beyond-part-1-3d9fada3b80d

from here: when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. When you put m examples in a minibatch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)).