Machine Learning Interview

I do NOT update this article any more. Notes of this kind will be kept personal starting today.

Q1: Assuming that we train the neural network on the same total number of training examples, how should we set the batch size and the number of iterations? (Here batch size × number of iterations = number of training examples shown to the network, with the same training example potentially shown several times.)

It has been observed in practice that training with a larger batch leads to a significant degradation in the quality of the model, as measured by its ability to generalize. Large-batch methods tend to converge to sharp minimizers of the training and testing functions, and sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers. Large-batch methods are almost invariably attracted to regions with sharp minima and, unlike small-batch methods, are unable to escape the basins of those minimizers.
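Purely as an illustration of the fixed budget in the question (the 60,000-example budget and the candidate batch sizes below are arbitrary illustrative choices), the same number of examples shown corresponds to very different numbers of parameter updates:

```python
# Fixed budget of examples shown to the network: batch_size * num_iterations.
examples_shown = 60_000  # illustrative number, e.g. one pass over an MNIST-sized dataset

for batch_size in (32, 256, 4096):
    num_iterations = examples_shown // batch_size
    print(f"batch size {batch_size:4d} -> {num_iterations:4d} parameter updates")

# Smaller batches give many noisy updates (which tend to find flat minima that generalize);
# larger batches give few smooth updates (which tend to settle in sharp minima).
```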

When the curvature along the gradient direction is strong relative to the step size, a gradient descent step can actually move uphill. To see why, take a second-order Taylor series approximation of the cost function f(x) around the current point and substitute the gradient descent step, as sketched below.
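A sketch of the standard second-order analysis (the symbols here are the usual ones: gradient \(g\) and Hessian \(H\) of \(f\), both evaluated at the current point \(x^{(0)}\), and learning rate \(\varepsilon\)):

\[
f(x) \approx f(x^{(0)}) + (x - x^{(0)})^{\top} g + \tfrac{1}{2}\,(x - x^{(0)})^{\top} H\, (x - x^{(0)}).
\]

Substituting the gradient descent step \(x = x^{(0)} - \varepsilon g\) gives

\[
f(x^{(0)} - \varepsilon g) \approx f(x^{(0)}) - \varepsilon\, g^{\top} g + \tfrac{1}{2}\,\varepsilon^{2}\, g^{\top} H\, g,
\]

so whenever \(\tfrac{1}{2}\varepsilon^{2}\, g^{\top} H g > \varepsilon\, g^{\top} g\) (strong curvature along the gradient direction, or too large a step), the step increases the cost instead of decreasing it.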

Q2: Why do we need orthogonalization in neural network training?

In the deeplearning.ai sense, orthogonalization means having separate controls ("knobs") that each affect one objective at a time: first make the network fit the training set well, then the dev set, then the test set, and finally the real world, with a distinct remedy for each stage. Keeping these knobs independent makes it much easier to diagnose which stage is failing and to fix it without disturbing the others. (With acknowledgment to Week 1, Course 2, deeplearning.ai.)

Q3: Why do we use logistic regression rather than linear regression for classification?

Classification requires outputs that can be interpreted as probabilities, i.e. values between 0 and 1, which linear regression does not guarantee. In addition, the normally distributed error assumption behind linear regression does not hold for binary targets, so logistic regression, which passes a linear score through the sigmoid, is the appropriate model.
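A minimal numpy sketch on synthetic data (the data, learning rate, and iteration count are arbitrary assumptions) showing that least-squares fitted values are not confined to [0, 1] while the logistic sigmoid keeps predictions in (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # bias column + 2 features
w_true = np.array([0.0, 2.0, -1.0])
y = (X @ w_true + rng.normal(scale=0.5, size=n) > 0).astype(float)  # binary labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Linear regression (least squares): fitted values fall outside [0, 1].
w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
print("linear range  :", (X @ w_lin).min(), (X @ w_lin).max())

# Logistic regression fitted by gradient descent on the cross-entropy loss.
w = np.zeros(3)
for _ in range(5000):
    p = sigmoid(X @ w)
    w -= 0.5 * X.T @ (p - y) / n      # gradient of the average negative log-likelihood
print("logistic range:", sigmoid(X @ w).min(), sigmoid(X @ w).max())
```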

Q4: Derive multinomial logistic regression.

Thanks to https://en.wikipedia.org/wiki/Multinomial_logistic_regression
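A brief sketch of the standard derivation, following the setup in that article (K classes, weight vectors \(\beta_1, \dots, \beta_{K-1}\), with class K used as the reference): model the log-odds of each class against the reference as linear in \(x\),

\[
\ln \frac{\Pr(y = k \mid x)}{\Pr(y = K \mid x)} = \beta_k^{\top} x, \qquad k = 1, \dots, K - 1,
\]

so that \(\Pr(y = k \mid x) = \Pr(y = K \mid x)\, e^{\beta_k^{\top} x}\). Because the K probabilities must sum to one,

\[
\Pr(y = K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\beta_j^{\top} x}},
\qquad
\Pr(y = k \mid x) = \frac{e^{\beta_k^{\top} x}}{1 + \sum_{j=1}^{K-1} e^{\beta_j^{\top} x}},
\]

which is the softmax form. The parameters \(\beta_k\) are then estimated by maximizing the log-likelihood \(\sum_i \ln \Pr(y_i \mid x_i)\), e.g. by gradient ascent.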

Q5: Why is gradient descent with momentum usually better than plain gradient descent?

Gradient descent with momentum keeps an exponentially weighted average of the recently computed gradients and uses that average as the update direction. This damps the oscillations across the steep directions of the cost surface while accumulating speed along the consistent direction of descent, so it approaches the optimum faster than plain gradient descent.
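A minimal numpy sketch on a toy elongated quadratic \(f(w) = \tfrac{1}{2}(w_1^2 + 25\,w_2^2)\) (the cost function, \(\beta = 0.9\), the learning rate, and the starting point are assumed purely for illustration), using the exponentially weighted average form of the momentum update:

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5 * (w1^2 + 25 * w2^2): steep along w2, shallow along w1.
    return np.array([w[0], 25.0 * w[1]])

w, v = np.array([10.0, 1.0]), np.zeros(2)
beta, lr = 0.9, 0.02
for _ in range(500):
    g = grad(w)
    v = beta * v + (1.0 - beta) * g   # exponentially weighted average of recent gradients
    w = w - lr * v                    # oscillation along the steep w2 axis is damped
print(w)                              # close to the optimum at (0, 0)
```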

Q6: What is the difference between max pooling and average pooling?

We perform pooling to increase invariance to small translations, to reduce computational cost (2×2 max or average pooling with stride 2 discards 75% of the activations), and to summarize the low-level features in each neighbourhood.

Max pooling keeps the strongest activation in each window, so it emphasizes the most salient features such as edges, whereas average pooling summarizes the window smoothly. In global average pooling, a tensor with dimensions h×w×d is reduced to dimensions 1×1×d.
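A minimal numpy sketch (the random feature map and its shape are assumed for illustration) of 2×2 max pooling, 2×2 average pooling, and global average pooling:

```python
import numpy as np

def pool2x2(x, mode="max"):
    """2x2 pooling with stride 2 on an (h, w, d) feature map; keeps 1/4 of the values."""
    h, w, d = x.shape
    x = x[:h - h % 2, :w - w % 2]               # drop odd rows/cols for simplicity
    blocks = x.reshape(h // 2, 2, w // 2, 2, d)
    if mode == "max":
        return blocks.max(axis=(1, 3))          # keep the strongest activation per window
    return blocks.mean(axis=(1, 3))             # smooth summary of each neighbourhood

def global_average_pool(x):
    """Reduce an (h, w, d) tensor to (1, 1, d) by averaging over spatial positions."""
    return x.mean(axis=(0, 1), keepdims=True)

x = np.random.randn(8, 8, 16)
print(pool2x2(x, "max").shape)        # (4, 4, 16)
print(pool2x2(x, "avg").shape)        # (4, 4, 16)
print(global_average_pool(x).shape)   # (1, 1, 16)
```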

Q7: Why do we use weight decay?

To avoid over-fitting, it is possible to regularize the cost function. An easy way to do that is to place a zero-mean Gaussian prior over the weights, which is equivalent to adding an L2 penalty term to the cost. The gradient descent step then acquires an extra term coming from the regularization, and this term causes each weight to decay in proportion to its own size, which is where the name comes from; a sketch is given below.
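In standard notation (the symbols are assumed here: original cost \(C_0\), regularization strength \(\lambda\), learning rate \(\eta\)):

\[
\tilde{C}(w) = C_0(w) + \frac{\lambda}{2}\,\lVert w \rVert^2,
\qquad
w \leftarrow w - \eta\,\frac{\partial C_0}{\partial w} - \eta \lambda\, w
= (1 - \eta\lambda)\, w - \eta\,\frac{\partial C_0}{\partial w},
\]

so every update first shrinks ("decays") the weight by the factor \(1 - \eta\lambda\) and then applies the usual gradient step.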

Q8: What is a deconvolutional layer?

"Deconvolution layer" (also called transposed or fractionally strided convolution) is a very unfortunate name; the layer simply zero-pads its input and then applies an ordinary convolution, which upsamples the feature map. For example, if we take a 2×2 input, pad it with 2 zeros on each side, and convolve with a 3×3 kernel at stride 1, the output size is (2 + 2×2 − 3)/1 + 1 = 4, i.e. the 2×2 map becomes 4×4. For visualizations see https://datascience.stackexchange.com/questions/6107/what-are-deconvolutional-layers
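A minimal PyTorch sketch (torch is assumed to be available; the single channel and random input are arbitrary) showing that the zero-padded convolution and the transposed-convolution layer both turn a 2×2 map into a 4×4 map:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)                     # a 2x2 single-channel feature map

# "Deconvolution" viewed as an ordinary convolution over a zero-padded input:
conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=2, bias=False)
print(conv(x).shape)                            # torch.Size([1, 1, 4, 4])

# The transposed-convolution layer produces the same output size directly:
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, bias=False)
print(deconv(x).shape)                          # torch.Size([1, 1, 4, 4])
```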

Q9: Why don't we use MSE for classification?

MSE applied after a sigmoid or softmax output causes vanishing gradients: when the output saturates at the wrong answer, the gradient of the squared error with respect to the pre-activation is multiplied by the nearly zero derivative of the activation, so learning stalls. The cross-entropy loss cancels that derivative and keeps the gradient proportional to the prediction error.
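A sketch for the binary (sigmoid) case, with pre-activation \(z\), output \(a = \sigma(z)\), and target \(y\) (the multiclass softmax case behaves the same way). For the squared error \(L = \tfrac{1}{2}(a - y)^2\),

\[
\frac{\partial L}{\partial z} = (a - y)\,\sigma'(z) = (a - y)\,a(1 - a),
\]

which is close to zero whenever the output saturates (\(a \approx 0\) or \(a \approx 1\)), even if the prediction is completely wrong. For the cross-entropy \(L = -\,[\,y \ln a + (1 - y)\ln(1 - a)\,]\),

\[
\frac{\partial L}{\partial z} = a - y,
\]

so the gradient stays proportional to the error and training does not stall.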

Original article: https://www.cnblogs.com/cxxszz/p/8566391.html