Loss Function & Regularization
Batch normalization greatly smooths the loss landscape and improves gradient predictiveness and β-smoothness, making the task of navigating the terrain to find a good (ideally global) error minimum much easier (the computation itself is sketched after this list).
More freedom in setting the initial learning rate. Large initial learning rates will not cause the optimizer to overshoot the minimum, and can lead to quicker convergence.
Accelerate the learning rate decay.
Remove dropout. One can often get away with not using dropout layers when using batch normalization, since dropout can hurt performance and/or slow down training, while batch normalization itself introduces an additional form of resistance to overfitting.
Reduce L2 weight regularization.
Solving the vanishing gradient problem.
Solving the exploding gradient problem.
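To make the mechanism concrete, here is a minimal NumPy sketch of the batch-norm forward pass referenced above (the function and argument names are illustrative, and the running statistics used at inference time are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Illustrative batch-norm forward pass over one mini-batch.

    x:     (batch_size, num_features) activations
    gamma: (num_features,) learned scale
    beta:  (num_features,) learned shift
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift
```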
For example, suppose we train our network only on images of black cats. If we then try to apply this network to images of cats of other colors, it is obvious that we are not going to do well: the training set and the prediction set both contain cat images, but their distributions differ slightly. In other words, if an algorithm has learned some X to Y mapping and the distribution of X changes (covariate shift), we may need to retrain the learning algorithm, even if the true mapping from X to Y remains unchanged.
Batch normalization allows each layer of a network to learn a little more independently of the other layers.
We can use higher learning rates because batch normalization ensures that no activation grows extremely large or extremely small. As a result, parts of the network that previously could not train effectively will start to train.
It reduces overfitting because it has a slight regularization effect. Similar to dropout, it adds some noise to each hidden layer’s activations. Therefore, if we use batch normalization, we can use less dropout, which is a good thing because we do not lose as much information. However, we should not rely on batch normalization alone for regularization; it is better to use it together with dropout.
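To illustrate the “less dropout when batch normalization is present” point, here is a hedged PyTorch-style sketch of a small fully connected block; the layer sizes and the reduced dropout rate of 0.2 are placeholder choices, not values prescribed above:

```python
import torch.nn as nn

# BatchNorm right after the linear layer; dropout is kept, but at a
# lower rate than usual, since batch normalization already injects
# some regularizing noise into the activations.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),  # normalizes activations over each mini-batch
    nn.ReLU(),
    nn.Dropout(p=0.2),    # reduced dropout (a common default would be 0.5)
    nn.Linear(256, 10),
)
```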
Regularization refers to the practice of constraining/regularizing the model to keep it from learning overly complex concepts, thereby reducing the risk of overfitting.
Dropout Regularization
L2 Regularization
L1 Regularization
Dropout often performs best among these regularizers: it has both a weight-regularization effect and induces sparsity.
L1 Regularization has a tendency to produce sparse weights, whereas L2 Regularization produces small weights.
Regularization hyperparameters for CONV and FC layers should be tuned separately (see the sketch after this list).
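As a rough sketch of how these regularizers combine in practice (PyTorch-style; the penalty strengths and layer sizes are placeholders): L2 is typically applied through the optimizer’s weight decay, L1 can be added to the loss by hand, and dropout sits inside the model.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1))

# L2 regularization via the optimizer's weight decay.
# In practice, CONV and FC parameters can be given separately tuned
# strengths through optimizer parameter groups.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

def l1_penalty(model, lam=1e-5):
    # L1 term added to the loss; tends to drive many weights to exactly zero.
    return lam * sum(p.abs().sum() for p in model.parameters())

x, y = torch.randn(32, 20), torch.randn(32, 1)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y) + l1_penalty(model)
loss.backward()
optimizer.step()
```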
We use mini-batches because they tend to make training converge more quickly and allow us to parallelize computations.
Neural networks are trained to minimize a loss function of the following form:
Figure 1: Loss function. Adapted from Keskar et al. [1].
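For reference, the loss in that figure has the standard empirical-risk form (notation assumed here: M training samples, per-sample loss f_i, parameters x):

$$
L(x) = \frac{1}{M} \sum_{i=1}^{M} f_i(x)
$$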
Stochastic gradient descent computes the gradient on a subset of the training data, B_k, as opposed to the entire training dataset.
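A minimal NumPy sketch of one such step, assuming a user-supplied grad_fn(params, batch) that returns the gradient averaged over the mini-batch (all names here are illustrative):

```python
import numpy as np

def sgd_step(params, grad_fn, data, batch_size=64, lr=0.1, rng=None):
    """One stochastic gradient step on a randomly sampled mini-batch B_k."""
    if rng is None:
        rng = np.random.default_rng()
    # Sample the mini-batch indices from the full training set
    idx = rng.choice(len(data), size=batch_size, replace=False)
    # The gradient is estimated on the mini-batch only, not the whole dataset
    grad = grad_fn(params, data[idx])
    return params - lr * grad
```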
Usually, smaller batch sizes perform better.
Training with small batch sizes tends to converge to flat minimizers, where the loss varies only slightly in a neighborhood of the minimizer, whereas large batch sizes converge to sharp minimizers, where the loss varies rapidly [1].
Small batch sizes perform best with smaller learning rates, while large batch sizes do best on larger learning rates.
Linear scaling rule: when the minibatch size is multiplied by k, multiply the learning rate by k.
When the right learning rate is chosen, larger batch sizes can train faster, especially when parallelized.
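A tiny sketch of the linear scaling rule mentioned above (the base values are arbitrary examples):

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    # Multiply the learning rate by the same factor k that multiplies the batch size
    k = batch_size / base_batch_size
    return base_lr * k

# e.g. a base setup of lr=0.1 at batch size 256, scaled up to batch size 1024
print(scaled_lr(0.1, 256, 1024))  # 0.4
```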