With padding: W x H x C --> (W+S-1)/S x (H+S-1)/S x C
Without padding: W x H x C --> (W-w+S)/S x (H-h+S)/S x C
(W, H, C: input width, height, channels; w, h: kernel width, height; S: stride)
Cross-correlation with strides of 3 and 2 for height and width, respectively.
A stride 2 convolution w/o padding [1]
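A minimal sketch of this output-size arithmetic in plain Python (integer division stands in for the floor; the function names are mine, not from the text):

    # Output size along one dimension, following the formulas above
    def out_size_no_padding(in_size, kernel, stride):
        return (in_size - kernel + stride) // stride      # (W - w + S) / S

    def out_size_same_padding(in_size, stride):
        return (in_size + stride - 1) // stride           # (W + S - 1) / S = ceil(W / S)

    print(out_size_no_padding(7, 3, 2))   # 3  (e.g. width 7, kernel 3, stride 2)
    print(out_size_same_padding(7, 2))    # 4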
Filter vs Kernel
For a 2D filter (single channel), kernel and filter are the same.
For a 3D filter (multiple channels), the filter is the collection of stacked kernels, one kernel per channel.
Difference between “layer” (“filter”) and “channel” (“kernel”)
2D Convolution: multiple channels
The filter has the same depth (channel) as the input matrix.
The output is a 2D matrix.
Example: The input is a 5x5x3 matrix. The filter is a 3x3x3 matrix.
The first step of 2D convolution for multiple channels: each of the kernels in the filter is applied to one of the three channels in the input layer, separately. The image is adapted from this link.
Then, the three resulting channels are summed together (element-wise addition) to form one single channel (3x3x1).
Another way to think about 2D convolution: think of the process as sliding a 3D filter matrix through the input layer. Notice that the input layer and the filter have the same depth (channel number = kernel number). The 3D filter moves only in 2 directions, the height & width of the image (that’s why such an operation is called 2D convolution, even though a 3D filter is used to process 3D volumetric data). The output is a one-layer matrix.
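A small numpy sketch of the 5 x 5 x 3 example above; random values stand in for the actual image and weights:

    import numpy as np

    H, W, D = 5, 5, 3                 # input: 5 x 5 x 3
    h = 3                             # filter: 3 x 3 x 3
    x = np.random.rand(H, W, D)
    f = np.random.rand(h, h, D)

    out = np.zeros((H - h + 1, W - h + 1))
    for i in range(H - h + 1):
        for j in range(W - h + 1):
            # element-wise multiply across all 3 channels, then sum to one number
            out[i, j] = np.sum(x[i:i+h, j:j+h, :] * f)

    print(out.shape)                  # (3, 3): a one-layer (3 x 3 x 1) output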
3D Convolution
A generalized form of convolution where the filter depth is smaller than the input channel depth, so the filter also slides along the channel axis. The filter moves in three directions: height, width, and channel.
The output is a 3D matrix.
In 3D convolution, a 3D filter can move in all 3 directions (height, width, and channel of the image). At each position, element-wise multiplication and addition provide one number. Since the filter slides through a 3D space, the output numbers are arranged in a 3D space as well. The output is therefore 3D data.
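A minimal PyTorch sketch of 3D convolution; the axis the text calls “channel” plays the role of the depth axis of Conv3d here, and the sizes are illustrative:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 16, 28, 28)        # (batch, channels, depth, height, width)
    conv3d = nn.Conv3d(in_channels=1, out_channels=1, kernel_size=3)
    print(conv3d(x).shape)                   # torch.Size([1, 1, 14, 26, 26]): a 3D output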
1 x 1 Convolution
Input: H x W x D. Filtering with a 1 x 1 x D filter produces an output of H x W x 1.
1 x 1 convolution, where the filter size is 1 x 1 x D.
Initially proposed in Network-in-Network (2013). Widely used after being introduced in Inception (2014).
Dimensionality reduction for efficient computations
HxWxD --> HxWx1
Efficient low dimensional embedding, or feature pooling
Applying nonlinearity again after convolution
after a 1x1 conv, a non-linear activation (ReLU, etc.) can be added, as in the sketch below
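A minimal PyTorch sketch of a 1 x 1 convolution followed by a non-linearity; the channel counts (192 -> 32) are illustrative, not from the text:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 192, 28, 28)            # H x W x D with D = 192 (NCHW layout)
    pointwise = nn.Sequential(
        nn.Conv2d(192, 32, kernel_size=1),     # 32 filters of size 1 x 1 x 192
        nn.ReLU(),                             # non-linearity applied again after the 1x1 conv
    )
    print(pointwise(x).shape)                  # torch.Size([1, 32, 28, 28])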
Cost of Convolution
Calculation cost for a convolution depends on:
Input size: i*i*D
Kernel Size: k*k*D
Stride: s
Padding: p
The output image (o x o x 1) then becomes: o = (i - k + 2p)/s + 1
For an o x o output, the required operations are:
o x o repetitions of { (k x k x D) multiplications and (k x k x D - 1) additions }
In terms of multiplications
For an input of size H x W x D, a 2D convolution (stride = 1, padding = 0) with Nc kernels of size h x h x D requires:
Total multiplications: Nc x h x h x D x (H-h+1) x (W-h+1)
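Plugging in the 7 x 7 x 3 input and 128 kernels of 3 x 3 x 3 used in the depthwise-separable example below (plain Python arithmetic):

    H, W, D = 7, 7, 3          # input size
    h, Nc = 3, 128             # kernel size and number of kernels
    mults = Nc * h * h * D * (H - h + 1) * (W - h + 1)
    print(mults)               # 86400 (= 128 * 27 * 25)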
Spatially Separable Convolution
Not used much in deep learning. It decomposes a convolution into two separate operations.
Example: A Sobel kernel can be divided into a 3 x 1 and a 1 x 3 kernel.
Spatially separable convolution with 1 channel.
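A quick numpy check of this decomposition (the Sobel x-kernel is the outer product of a 3 x 1 and a 1 x 3 kernel):

    import numpy as np

    sobel_x = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]])
    col = np.array([[1], [2], [1]])             # 3 x 1 kernel
    row = np.array([[-1, 0, 1]])                # 1 x 3 kernel
    print(np.array_equal(sobel_x, col @ row))   # True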
Depthwise Separable Convolution
Commonly used in deep learning, e.g. MobileNet and Xception. It consists of two steps: (1) depthwise convolution, (2) 1x1 (pointwise) convolution.
For example (standard 2D convolution): input 7x7x3 --> 128 filters of 3x3x3 --> output 5x5x128
Step1: Depthwise convolution
Each single filter is separated into its per-channel kernels (e.g. 3 kernels of 3x3x1).
Each kernel convolves with only 1 channel of the input, giving a 5x5x1 map per kernel.
Then, stack the maps to get the output of this step (5x5x3).
Depthwise separable convolution — first step: Instead of using a single filter of size 3 x 3 x 3 as in 2D convolution, we use 3 kernels separately. Each kernel has size 3 x 3 x 1. Each kernel convolves with 1 channel of the input layer (1 channel only, not all channels!). Each such convolution provides a map of size 5 x 5 x 1. We then stack these maps together to create a 5 x 5 x 3 image, so the output of this step has size 5 x 5 x 3.
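A minimal PyTorch sketch of this first (depthwise) step, using groups=in_channels so that each 3 x 3 x 1 kernel sees only one channel; shapes follow the 7 x 7 x 3 example:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 7, 7)                                        # 7 x 7 x 3 input (NCHW)
    depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)   # 3 kernels of 3 x 3 x 1
    print(depthwise(x).shape)                                          # torch.Size([1, 3, 5, 5])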
Step 2: 1*1 Convolution
Apply a 1x1 convolution with a 1x1x3 kernel to get a 5x5x1 map.
Apply 128 such 1x1 convolutions to get a 5x5x128 map.
Depthwise separable convolution — second step: apply multiple 1 x 1 convolutions to modify depth.
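And a sketch of the second (pointwise) step on the 5 x 5 x 3 intermediate output, again assuming PyTorch:

    import torch
    import torch.nn as nn

    y = torch.randn(1, 3, 5, 5)                                  # output of the depthwise step
    pointwise = nn.Conv2d(3, 128, kernel_size=1, bias=False)     # 128 filters of size 1 x 1 x 3
    print(pointwise(y).shape)                                    # torch.Size([1, 128, 5, 5])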
Standard 2D convolution vs Depthwise Convolution
Standard 2D convolution
The overall process of depthwise separable convolution.
Total multiplications: D x h x h x 1 x (H-h+1) x (W-h+1) + Nc x 1 x 1 x D x (H-h+1) x (W-h+1) = (h x h + Nc) x D x (H-h+1) x (W-h+1)
The ratio of multiplications (depthwise separable / standard 2D) is (h x h + Nc) / (h x h x Nc) = 1/Nc + 1/h^2
If Nc >> h, this is approximately 1/h^2; e.g. for 5x5 filters, standard 2D convolution needs about 25 times more multiplications.
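Checking the ratio with the 7 x 7 x 3 input / 128-filter example (plain Python arithmetic):

    H, W, D, h, Nc = 7, 7, 3, 3, 128
    standard  = Nc * h * h * D * (H - h + 1) * (W - h + 1)     # 86400
    separable = (h * h + Nc) * D * (H - h + 1) * (W - h + 1)   # 10275
    print(separable / standard)                                # ~0.119 = 1/Nc + 1/h^2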
Grouped Convolution
Introduced in AlexNet (2012) to do parallel convolutions. The filters are separated into different groups. Each group performs a standard 2D convolution over a certain depth of channels. Then the outputs of the groups are concatenated depth-wise.
Grouped convolution with 2 filter groups
Model-Parallelization for efficient training
each group can be handled by a different GPU
Better than data parallelization using batches
Efficient Computation
Standard: h x w x Din x Dout
Grouped (2 groups): 2 x (h x w x Din/2 x Dout/2) = (1/2) x (h x w x Din x Dout)
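A small PyTorch sketch confirming the halving with 2 filter groups (the values of Din, Dout, h are illustrative):

    import torch.nn as nn

    Din, Dout, h = 8, 16, 3
    standard = nn.Conv2d(Din, Dout, kernel_size=h, bias=False)
    grouped  = nn.Conv2d(Din, Dout, kernel_size=h, groups=2, bias=False)
    print(sum(p.numel() for p in standard.parameters()))   # h*h*Din*Dout = 1152
    print(sum(p.numel() for p in grouped.parameters()))    # half of that: 576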
Shuffled Grouped Convolution
Introduced by ShuffleNet (2017) for computation-efficient convolution. The idea is to mix up the information from different filter groups so that information can flow between the channel groups.
Previously, the group operation was performed on the 3x3 spatial convolution, but not on the 1x1 convolution. ShuffleNet suggested applying group convolution to the 1x1 convolution as well.
Group convolution of 1x1 filters instead of NxN filters (N>1).
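A minimal sketch of the channel-shuffle operation itself (reshape, transpose, flatten), assuming an NCHW tensor in PyTorch:

    import torch

    def channel_shuffle(x, groups):
        # mix channels across groups: (N, C, H, W) -> (N, g, C/g, H, W) -> transpose -> flatten
        n, c, h, w = x.shape
        x = x.view(n, groups, c // groups, h, w)
        x = x.transpose(1, 2).contiguous()
        return x.view(n, c, h, w)

    x = torch.randn(1, 8, 5, 5)
    print(channel_shuffle(x, groups=2).shape)   # torch.Size([1, 8, 5, 5]), channels interleaved across groups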