AlexNet
Krizhevsky, Sutskever, Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS 2012
AlexNet is the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012.
Prior to ILSVRC 2012, competitors mostly used feature-engineering techniques combined with a classifier (e.g. an SVM).
AlexNet marked a breakthrough in deep learning: a CNN substantially reduced the error rate and won first place in ILSVRC 2012.
The highlights of this paper:
Breakthrough in Deep Learning using CNN for image classification.
ReLU (Rectified Linear Unit)
Multiple GPUs
Local Response Normalization
Overlapping Pooling
Data Augmentation
Dropout
Other Details of Learning Parameters
Results
Note that group convolution is applied here. In the grouped layers (the 2nd, 4th, and 5th convolutions), the kernels are split into two groups, one per GPU; e.g. 256 kernels of 5×5×48 → 2 × (128 kernels of 5×5×48).
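A quick PyTorch sketch (my own, not the authors' code) of what this grouping means for the 2nd layer: with `groups=2`, the 256 kernels are split into two groups of 128, and each kernel sees only 48 of the 96 input channels.

```python
import torch.nn as nn

conv2 = nn.Conv2d(in_channels=96, out_channels=256, kernel_size=5, padding=2, groups=2)
print(conv2.weight.shape)   # torch.Size([256, 48, 5, 5]) -> 2 x (128 kernels of 5x5x48)
```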
AlexNet contains eight layers:
Input: 224×224×3 input images
1st: Convolutional Layer: 96 kernels of size 11×11×3 (stride: 4, pad: 0) → 55×55×96 feature maps. Then Local Response Normalization → 55×55×96 feature maps. Then 3×3 Overlapping Max Pooling (stride: 2) → 27×27×96 feature maps.
2nd: Convolutional Layer: 256 kernels of size 5×5×48 (stride: 1, pad: 2) → 27×27×256 feature maps. Then Local Response Normalization → 27×27×256 feature maps. Then 3×3 Overlapping Max Pooling (stride: 2) → 13×13×256 feature maps.
3rd: Convolutional Layer: 384 kernels of size 3×3×256 (stride: 1, pad: 1) → 13×13×384 feature maps.
4th: Convolutional Layer: 384 kernels of size 3×3×192 (stride: 1, pad: 1) → 13×13×384 feature maps.
5th: Convolutional Layer: 256 kernels of size 3×3×192 (stride: 1, pad: 1) → 13×13×256 feature maps. Then 3×3 Overlapping Max Pooling (stride: 2) → 6×6×256 feature maps.
6th: Fully Connected (Dense) Layer of 4096 neurons
7th: Fully Connected (Dense) Layer of 4096 neurons
8th: Fully Connected (Dense) Output Layer of 1000 neurons (one per class). Softmax is used for calculating the loss.
In total, there are about 60 million parameters to be trained!
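Below is a minimal PyTorch sketch of this architecture (my own reconstruction, not the authors' code). `groups=2` stands in for the two-GPU split in the 2nd, 4th, and 5th convolutional layers, and the printed parameter count lands at roughly 61 million, matching the "60 million" figure above. Note that getting 55×55 maps from a 224×224 input requires a padding of 2 in the first layer (or, equivalently, a 227×227 input), a well-known quirk of the paper's Figure 2.

```python
# A sketch of AlexNet in PyTorch; layer ordering follows the paper
# (ReLU -> Local Response Normalization -> overlapping max pooling).
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),      # 224x224x3 -> 55x55x96
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),                       # 55 -> 27
    nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2),      # 2 x (128 kernels of 5x5x48)
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),                       # 27 -> 13
    nn.Conv2d(256, 384, kernel_size=3, padding=1),               # cross-GPU layer: full 3x3x256 kernels
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2),     # 2 x (192 kernels of 3x3x192)
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2),     # 2 x (128 kernels of 3x3x192)
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                       # 13 -> 6
)
classifier = nn.Sequential(
    nn.Linear(256 * 6 * 6, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),                                        # 1000-way output; softmax is applied in the loss
)
model = nn.Sequential(features, nn.Flatten(), classifier)
print(sum(p.numel() for p in model.parameters()))                 # 60,965,224 -> the paper's "60 million"
```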
Training uses stochastic gradient descent with:
Batch size: 128
Momentum: 0.9
Weight Decay: 0.0005
Initialize the weights in each layer from a zero-mean Gaussian distribution with std 0.01.
Bias: initialized to 1 for the 2nd, 4th, and 5th convolutional layers and the fully-connected layers; initialized to 0 for the remaining layers.
Learning rate: 0.01, equal for all layers, divided by 10 when the validation error stopped improving.
We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005.
We found that this small amount of weight decay was important for the model to learn.
We trained the network for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.
Batch size: 128. Momentum v: 0.9. Weight decay: 0.0005. Learning rate ϵ: 0.01, divided by 10 manually when the validation error rate stopped improving; this reduction was applied three times before training finished.
Training set of 1.2 million images; the network is trained for roughly 90 cycles, taking five to six days on two NVIDIA GTX 580 3GB GPUs.
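As a rough sketch of this recipe in modern PyTorch (the loops `train_one_epoch` and `evaluate` are placeholders of my own, and `ReduceLROnPlateau` stands in for the paper's manual divide-by-10 heuristic):

```python
# Hedged sketch of the training setup: SGD with momentum 0.9, weight decay 0.0005,
# learning rate 0.01 divided by 10 when the validation error plateaus.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

for epoch in range(90):                        # roughly 90 passes over the 1.2M images
    train_one_epoch(model, optimizer)          # placeholder training loop (batch size 128)
    val_error = evaluate(model)                # placeholder validation-error computation
    scheduler.step(val_error)                  # lowers lr when val_error stops improving
```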
Initialization
The weights in each layer are initialized from a zero-mean Gaussian distribution with standard deviation 0.01.
Bias:
constant 1 for the second, fourth, and fifth convolutional layers and the fully-connected layers ← provides ReLU with positive inputs
constant 0 for the remaining layers
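A small PyTorch sketch of this initialization, assuming `model` is the network built earlier; the bookkeeping that maps "2nd, 4th, 5th" to layer objects is my own:

```python
import torch.nn as nn

def init_alexnet_weights(model):
    """Zero-mean Gaussian weights (std 0.01); bias 1 for the 2nd, 4th, and 5th
    conv layers and the hidden fully-connected layers, bias 0 elsewhere."""
    conv_idx = 0
    linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, mean=0.0, std=0.01)
        if isinstance(m, nn.Conv2d):
            conv_idx += 1
            nn.init.constant_(m.bias, 1.0 if conv_idx in (2, 4, 5) else 0.0)
        elif isinstance(m, nn.Linear):
            # the paper gives the hidden FC layers bias 1 and the output layer bias 0
            nn.init.constant_(m.bias, 0.0 if m is linears[-1] else 1.0)

init_alexnet_weights(model)
```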
Before AlexNet, tanh was the common activation function. AlexNet uses ReLU instead, which reaches a 25% training error rate six times faster than tanh.
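For reference, the ReLU nonlinearity is f(x) = max(0, x), which does not saturate for positive inputs, whereas f(x) = tanh(x) saturates on both sides and slows gradient descent down.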
At the time, the NVIDIA GTX 580 GPU had only 3GB of memory. That is why the architecture is split into two paths and the convolutions are run on 2 GPUs; cross-GPU communication occurs only at the 3rd convolutional layer (and the fully-connected layers).
Thus, using 2 GPUs is a response to the memory limit, NOT a way of speeding up training.
Compared with a network with only half as many kernels in each convolutional layer (a single path), the two-GPU network reduces the top-1 and top-5 error rates by 1.7% and 1.2% respectively.
ReLUs have the desirable property that they do not require input normalization to prevent them from saturating.
In AlexNet, local response normalization (LRN) is used. It is different from batch normalization, as the formula below shows. Normalization helps to speed up convergence.
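For reference, the paper's response normalization is (with $a^i_{x,y}$ the ReLU activity of kernel $i$ at position $(x, y)$, $N$ the number of kernels in the layer, and hyperparameters $k=2$, $n=5$, $\alpha=10^{-4}$, $\beta=0.75$):

$$
b^i_{x,y} = a^i_{x,y} \Big/ \Bigg(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^j_{x,y}\big)^2 \Bigg)^{\beta}
$$

The sum runs over $n$ adjacent kernel maps at the same spatial position, so LRN normalizes across channels, whereas batch normalization normalizes each channel across a mini-batch.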
Nowadays, batch normalization is used instead of local response normalization.
With local response normalization, Top-1 and top-5 error rates are reduced by 1.4% and 1.2% respectively.
Overlapping pooling is pooling with a stride smaller than the kernel size, while non-overlapping pooling is pooling with a stride equal to or larger than the kernel size.
With overlapping pooling, Top-1 and top-5 error rates are reduced by 0.4% and 0.3% respectively.
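As a quick sanity check (PyTorch, my own example), the overlapping scheme (stride 2, 3×3 window) and the non-overlapping scheme (stride 2, 2×2 window) compared in the paper produce output grids of the same size:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                           # e.g. the 1st-layer feature maps
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)      # z=3, s=2
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)  # z=2, s=2
print(overlapping(x).shape)      # torch.Size([1, 96, 27, 27])
print(non_overlapping(x).shape)  # torch.Size([1, 96, 27, 27])
```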
The dataset turns out to be insufficient to learn so many parameters without considerable overfitting.
The dataset is therefore artificially enlarged using label-preserving transformations:
The first form consists of generating image translations and horizontal reflections,
by extracting random 224×224 patches (and their horizontal reflections) from the 256×256 images.
This increases the size of the training set by a factor of 2048 (32 × 32 possible translations × 2 for reflection).
This is the reason why the input images in Figure 2 are 224 × 224 × 3-dimensional.
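A sketch of this crop-and-flip augmentation with torchvision transforms; the Resize/CenterCrop preprocessing down to 256×256 is my reading of the paper's setup, not something spelled out above:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),                  # shorter side -> 256
    transforms.CenterCrop(256),              # central 256x256 patch
    transforms.RandomCrop(224),              # random 224x224 translation
    transforms.RandomHorizontalFlip(p=0.5),  # random horizontal reflection
    transforms.ToTensor(),
])
```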
The second form consists of altering the intensities of the RGB channels in training images:
PCA is performed on the set of RGB pixel values over the training set, and to each training image multiples of the found principal components are added, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1.
This scheme reduces the top-1 error rate by over 1%.
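A rough numpy sketch of this PCA colour augmentation (sometimes called "fancy PCA"); the array `images` (shape N×H×W×3, float RGB) and the function name `fancy_pca` are placeholders of my own:

```python
import numpy as np

def fancy_pca(image, eigvals, eigvecs, sigma=0.1):
    """Add [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every RGB pixel of `image`."""
    alphas = np.random.normal(0.0, sigma, size=3)   # a_i ~ N(0, 0.1), drawn once per image
    delta = eigvecs @ (alphas * eigvals)            # 3-vector added to each pixel
    return image + delta                            # broadcasts over (H, W, 3)

pixels = images.reshape(-1, 3)                      # all RGB values in the training set
eigvals, eigvecs = np.linalg.eigh(np.cov(pixels, rowvar=False))
augmented = fancy_pca(images[0], eigvals, eigvecs)
```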
Dropout
Instead of combining the predictions of many different models, dropout makes the neural network sample a different architecture for every input, while all these architectures share weights.
setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation.
So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights.
This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
We use dropout in the first two fully-connected layers of Figure 2.
Without dropout, the network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.
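A tiny PyTorch demonstration of the behaviour above (my own toy example): with dropout active, each forward pass samples a different thinned network over the same shared weights, and at test time PyTorch's inverted dropout needs no extra rescaling:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Dropout(p=0.5))
x = torch.randn(1, 8)

fc.train()       # dropout active: two passes give different outputs
print(fc(x))
print(fc(x))

fc.eval()        # dropout disabled at test time
print(fc(x))
```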