Loss Function Regularization

Batch Normalization

Batch normalization greatly reduces the variation in the loss landscape and improves gradient predictiveness and β-smoothness, making the task of navigating the terrain to find a good (ideally global) error minimum much easier. In practice, this has several consequences (see the sketch after the list below):

  • More freedom in setting the initial learning rate. Large initial learning rates will not result in missing out on the minimum during optimization, and can lead to quicker convergence.

  • Accelerate the learning rate decay.

  • Remove dropout. One can often get away with not using dropout layers when using batch normalization, since dropout can hurt and/or slow down the training process. Batch normalization introduces an additional form of resistance to overfitting.

  • Reduce L2 weight regularization.

  • Mitigate the vanishing gradient problem.

  • Mitigate the exploding gradient problem.
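
To make these points concrete, here is a minimal tf.keras sketch (the architecture, layer sizes, and learning rates are illustrative assumptions, not taken from the text above) of the same classifier built with and without batch normalization; the batch-normalized variant drops dropout and uses a larger initial learning rate.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_model(use_batchnorm: bool) -> tf.keras.Model:
    """Small fully connected classifier; all hyperparameters are illustrative."""
    model = models.Sequential()
    model.add(layers.Input(shape=(784,)))
    for units in (256, 128):
        # The bias is redundant when batch norm's beta follows the dense layer.
        model.add(layers.Dense(units, use_bias=not use_batchnorm))
        if use_batchnorm:
            # Normalize the pre-activations over the mini-batch.
            model.add(layers.BatchNormalization())
        model.add(layers.Activation("relu"))
        if not use_batchnorm:
            # Without batch norm, lean more heavily on dropout for regularization.
            model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))

    # Batch norm tolerates a much larger initial learning rate.
    lr = 1e-2 if use_batchnorm else 1e-3
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```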

For example, suppose we train a network only on images of black cats. If we then apply it to images of colored cats, it obviously will not do well: the training set and the prediction set both contain cat images, but their distributions differ slightly. In other words, if an algorithm has learned some X to Y mapping and the distribution of X then changes, we may need to retrain the algorithm so that the distribution it was trained on matches the new distribution of X.

Source: Deeplearning.ai, Why Does Batch Norm Work? (C2W3L06)

Batch normalization allows each layer of a network to learn a little more independently of the other layers.

https://arxiv.org/pdf/1502.03167v3.pdf
  • We can use higher learning rates because batch normalization makes sure that no activation goes really high or really low. As a result, things that previously couldn't be trained will start to train.

  • It reduces overfitting because it has a slight regularization effect. Similar to dropout, it adds some noise to each hidden layer's activations. Therefore, if we use batch normalization, we can use less dropout, which is a good thing because we are not going to lose a lot of information. However, we should not depend only on batch normalization for regularization; it is better to use it together with dropout.
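
For reference, the transform defined in the paper linked above first normalizes each activation using the mini-batch mean and variance, then rescales it with the learned parameters γ and β (ε is a small constant for numerical stability):

```latex
\begin{aligned}
\mu_{\mathcal{B}} &= \frac{1}{m}\sum_{i=1}^{m} x_i
  &&\text{(mini-batch mean)}\\
\sigma_{\mathcal{B}}^{2} &= \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^{2}
  &&\text{(mini-batch variance)}\\
\hat{x}_i &= \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}
  &&\text{(normalize)}\\
y_i &= \gamma\,\hat{x}_i + \beta
  &&\text{(scale and shift)}
\end{aligned}
```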

Effect of Regularization

Regularization refers to the practice of constraining/regularizing the model to keep it from learning overly complex concepts, thereby reducing the risk of overfitting.

Regularization Methods

  • Dropout Regularization

  • L2 Regularization

  • L1 Regularization

Effects of Methods

  • Dropout has the best performance among these regularizers; it has both a weight-regularization effect and induces sparsity.

  • L1 regularization tends to produce sparse weights, whereas L2 regularization produces small weights.

  • Regularization hyperparameters for CONV and FC layers should be tuned separately (see the sketch below).
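
A minimal tf.keras sketch of the three methods listed above (layer sizes and penalty strengths are illustrative assumptions); note that the conv and fully connected blocks get their own regularization hyperparameters so they can be tuned separately:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

# Illustrative strengths only -- conv and fully connected layers get
# separately tunable regularization hyperparameters.
CONV_L2 = 1e-4
FC_L1 = 1e-5
FC_L2 = 1e-3
FC_DROPOUT = 0.5

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    # Convolutional block with its own (weaker) L2 penalty.
    layers.Conv2D(32, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(CONV_L2)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    # Fully connected block: L1 (sparse weights) + L2 (small weights),
    # followed by dropout, typically the strongest of the three.
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l1_l2(l1=FC_L1, l2=FC_L2)),
    layers.Dropout(FC_DROPOUT),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```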

Effect of Batch size

We use mini-batches because they tend to converge more quickly and allow us to parallelize computations.

What is Batch Size

Neural networks are trained to minimize a loss function of the following form:

Figure 1: Loss function. Adapted from Keskar et al [1].
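
The equation shown in Figure 1 is the usual empirical-risk form, reconstructed here for reference, with M the number of training examples, l_i the loss contributed by example x_i, and θ the network parameters:

```latex
L(\theta) \;=\; \frac{1}{M}\sum_{i=1}^{M} l_i\!\left(x_i, \theta\right)
```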

Stochastic gradient descent computes the gradient on a subset of the training data, B_k, as opposed to the entire training dataset.

Figure 2: Stochastic gradient descent update equation. Adapted from Keskar et al [1].
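
The update in Figure 2 is the standard mini-batch SGD step, where α_k is the learning rate and B_k the mini-batch sampled at iteration k:

```latex
\theta_{k+1} \;=\; \theta_k \;-\; \alpha_k \,\frac{1}{|B_k|}\sum_{i \in B_k} \nabla_{\theta}\, l_i\!\left(x_i, \theta_k\right)
```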

Usually Small Batch Sizes Perform Better

Figure 5: Training and validation loss curves for different batch sizes.
Figure 23: Training and validation loss for different batch sizes, with adjusted learning rates.

Training with small batch sizes tends to converge to flat minimizers, where the loss varies only slightly within a small neighborhood of the minimizer, whereas large batch sizes converge to sharp minimizers, where the loss varies sharply [1].

  • Small batch sizes perform best with smaller learning rates, while large batch sizes do best on larger learning rates.

  • Linear scaling rule: when the mini-batch size is multiplied by k, multiply the learning rate by k (see the sketch after this list).

  • When the right learning rate is chosen, larger batch sizes can train faster, especially when parallelized.
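
A tiny sketch of the linear scaling rule above (the base batch size and base learning rate are placeholder values):

```python
BASE_BATCH_SIZE = 32   # batch size at which BASE_LR was originally tuned
BASE_LR = 0.01         # illustrative base learning rate


def scaled_learning_rate(batch_size: int) -> float:
    """Linear scaling rule: if the mini-batch size is multiplied by k,
    multiply the learning rate by the same factor k."""
    k = batch_size / BASE_BATCH_SIZE
    return BASE_LR * k


print(scaled_learning_rate(256))  # 32 -> 256 is k = 8, so 0.01 -> 0.08
```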
