ShuffleNet
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
ShuffleNet (CVPR 2018) pursues the best accuracy within very limited computational budgets of tens or hundreds of MFLOPs, targeting common mobile platforms such as drones, robots, and smartphones. By shuffling channels between grouped convolutions, ShuffleNet outperforms MobileNetV1. On an ARM device, ShuffleNet achieves an actual ~13× speedup over AlexNet while maintaining comparable accuracy.
Prerequisite: understand the concept of 'Grouped Convolution'.
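As a quick illustration (a minimal PyTorch sketch of our own, not from the paper): a grouped convolution splits the input channels into g groups and runs an independent convolution on each group, which divides the parameter count and pointwise cost by g.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)  # batch of one, 8 channels, 32x32 feature map

dense = nn.Conv2d(8, 16, kernel_size=1)               # full 1x1 conv: every output sees all 8 inputs
grouped = nn.Conv2d(8, 16, kernel_size=1, groups=4)   # 4 groups: each output sees only 2 inputs

print(dense(x).shape, grouped(x).shape)               # both (1, 16, 32, 32)
print(sum(p.numel() for p in dense.parameters()),     # 8*16 + 16 = 144 weights
      sum(p.numel() for p in grouped.parameters()))   # (8/4)*16 + 16 = 48 weights
```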
"If we allow group convolution to obtain input data from different groups (as shown in Fig 1 (b)), the input and output channels will be fully related.This can be efficiently and elegantly implemented by a channel shuffle operation (Fig 1 (c)): suppose a convolutional layer with g groups whose output has g × n channels; we first reshape the output channel dimension into (g, n), transposing and then flattening it back as the input of next layer. Note that the operation still takes effect even if the two convolutions have different numbers of groups. Moreover, channel shuffle is also differentiable, which means it can be embedded into network structures for end-to-end training."
(a) Bottleneck Unit: This is a standard residual bottleneck unit, but with a depthwise convolution in the middle. It can also be treated as a bottleneck form of the depthwise separable convolution used in MobileNetV2.
Even though depthwise convolution usually has very low theoretical complexity, we find it difficult to efficiently implement on low-power mobile devices, which may result from a worse computation/memory access ratio compared with other dense operations.
In ShuffleNet units, depthwise convolution is intentionally used only on the bottleneck feature maps, to keep this overhead as small as possible.
(b) ShuffleNet Unit: The first and second 1×1 convolutions are replaced by group convolutions. A channel shuffle is applied after the first 1×1 convolution.
(c) ShuffleNet Unit with Stride=2: When stride 2 is used, a 3×3 average pooling with stride 2 is added on the shortcut path (halving the spatial resolution), and the element-wise addition is replaced with channel concatenation, which makes it easy to enlarge the channel dimension with little extra computation cost.
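Putting (b) and (c) together, a ShuffleNet unit can be sketched roughly as follows. This is a simplified PyTorch version under our own assumptions (bottleneck width of 1/4 of the output channels, BatchNorm after every convolution, no ReLU after the depthwise convolution, as in the paper); it is illustrative rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    # reshape (g, n) -> transpose -> flatten, as described earlier
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class ShuffleNetUnit(nn.Module):
    def __init__(self, in_ch, out_ch, groups=3, stride=1):
        super().__init__()
        self.stride, self.groups = stride, groups
        # with stride 2 the branch output leaves room for the concatenated shortcut channels
        branch_out = out_ch - in_ch if stride == 2 else out_ch
        mid = out_ch // 4  # bottleneck channels: 1/4 of the output channels
        self.gconv1 = nn.Conv2d(in_ch, mid, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.dwconv = nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                                groups=mid, bias=False)  # 3x3 depthwise conv
        self.bn2 = nn.BatchNorm2d(mid)
        self.gconv2 = nn.Conv2d(mid, branch_out, 1, groups=groups, bias=False)
        self.bn3 = nn.BatchNorm2d(branch_out)

    def forward(self, x):
        out = F.relu(self.bn1(self.gconv1(x)))
        out = channel_shuffle(out, self.groups)   # shuffle after the first 1x1 group conv
        out = self.bn2(self.dwconv(out))          # no ReLU after the depthwise conv
        out = self.bn3(self.gconv2(out))
        if self.stride == 2:
            shortcut = F.avg_pool2d(x, 3, stride=2, padding=1)
            return F.relu(torch.cat([shortcut, out], dim=1))  # concat instead of add
        return F.relu(x + out)                                # stride 1: residual addition
```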
Given an input of size c × h × w and bottleneck channels m, a ResNet unit requires hw(2cm + 9m²) FLOPs and a ResNeXt unit requires hw(2cm + 9m²/g) FLOPs, while a ShuffleNet unit only requires hw(2cm/g + 9m) FLOPs, where g is the number of groups in the group convolutions. For a given computational budget, ShuffleNet can therefore use wider feature maps.
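Plugging in concrete numbers makes the gap visible; below is a small sanity-check sketch with channel sizes chosen by us for illustration:

```python
# FLOPs per unit for c input channels, m bottleneck channels, an h x w map, g groups
def resnet_flops(c, m, h, w):        return h * w * (2 * c * m + 9 * m * m)
def resnext_flops(c, m, h, w, g):    return h * w * (2 * c * m + 9 * m * m / g)
def shufflenet_flops(c, m, h, w, g): return h * w * (2 * c * m / g + 9 * m)

c, m, h, w, g = 240, 60, 28, 28, 3
print(resnet_flops(c, m, h, w))         # ~48.0 MFLOPs
print(resnext_flops(c, m, h, w, g))     # ~31.0 MFLOPs
print(shufflenet_flops(c, m, h, w, g))  # ~7.9 MFLOPs
```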
Here g = 1 means no pointwise group convolution is used. Models with group convolutions (g > 1) consistently perform better than their counterparts without pointwise group convolutions (g = 1).
With similar accuracy, ShuffleNet is much more efficient than VGGNet, GoogLeNet, AlexNet and SqueezeNet.
Compared with AlexNet, the ShuffleNet 0.5× model still achieves a ~13× actual speedup at comparable classification accuracy (the theoretical speedup is 18×).