SSD
SSD: Single Shot MultiBox Detector (by W. Liu, C. Szegedy et al., 2016), an object detector scoring over 74% mAP at 59 FPS on Pascal VOC and COCO.
Single Shot: this means that the tasks of object localization and classification are done in a _single_ forward pass of the network
MultiBox: this is the name of a technique for bounding box regression developed by Szegedy et al. (we will briefly cover it shortly)
Detector: The network is an object detector that also classifies those detected objects
Based on VGG-16, with the FC layers removed and a set of auxiliary convolutional layers added (from conv6 onwards). These extra layers extract features at multiple scales and progressively decrease the size of the input to each subsequent layer.
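As a rough sketch of this multi-scale design (not the authors' code; the channel counts and input size below are illustrative), each auxiliary block halves the spatial resolution, and every intermediate output is kept as a feature map for the detection heads:

```python
# Minimal sketch of SSD-style auxiliary conv layers appended after the
# VGG-16 backbone. Each block halves the spatial size (stride 2), yielding
# a pyramid of feature maps for multi-scale detection.
import torch
import torch.nn as nn

class AuxiliaryLayers(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1024, 256, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(512, 128, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
                          nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True)),
        ])

    def forward(self, x):
        feature_maps = []
        for block in self.blocks:
            x = block(x)
            feature_maps.append(x)  # keep every scale for the detection heads
        return feature_maps

x = torch.randn(1, 1024, 19, 19)       # e.g. backbone output (illustrative size)
for fmap in AuxiliaryLayers()(x):
    print(fmap.shape)                   # spatial size halves at each block
```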
The bounding box regression technique of SSD is inspired by Szegedy’s work on MultiBox. MultiBox starts with the priors as predictions and attempts to regress closer to the ground truth bounding boxes. At the end, MultiBox only retains the top K predictions that have minimised both location (LOC) and confidence (CONF) losses.
MultiBox: a method for fast class-agnostic bounding box coordinate proposals.
Class-agnostic means it proposes a bounding box for an object without classifying it
MultiBox contains 11 priors per feature map cell on the 8x8, 6x6, 4x4, 3x3 and 2x2 feature maps, and only one on the 1x1 feature map, resulting in a total of 1420 priors per image
11*(8*8) + 11*(6*6) + 11*(4*4) + 11*(3*3) + 11*(2*2) + 1*(1*1) = 1419 + 1 = 1420
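A quick check of that count (a throwaway snippet, not from the paper):

```python
# Verify the MultiBox prior count: 11 priors per cell on the listed
# feature maps, plus a single prior on the 1x1 map.
feature_map_sizes = [8, 6, 4, 3, 2]
priors = 11 * sum(s * s for s in feature_map_sizes) + 1 * (1 * 1)
print(priors)  # 1420
```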
MultiBox’s loss function also combined two critical components:
multibox_loss = confidence_loss + alpha * location_loss
Confidence Loss: this measures how confident the network is of the objectness of the computed bounding box. Categorical cross-entropy is used to compute this loss.
Location Loss: this measures how far away the network’s predicted bounding boxes are from the ground truth ones from the training set. L2-Norm is used here.
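A minimal sketch of this combined loss, assuming per-prior class logits and box offsets as inputs (the names and default alpha are illustrative; the L2 location loss follows the MultiBox description above):

```python
# Sketch of the MultiBox-style loss described above (illustrative, not the
# authors' implementation): cross-entropy for confidence, L2 for location.
import torch
import torch.nn.functional as F

def multibox_loss(class_logits, box_preds, class_targets, box_targets, alpha=1.0):
    # class_logits: (num_priors, num_classes), class_targets: (num_priors,)
    confidence_loss = F.cross_entropy(class_logits, class_targets)
    # box_preds / box_targets: (num_priors, 4) offsets; L2-norm as in MultiBox
    location_loss = F.mse_loss(box_preds, box_targets)
    return confidence_loss + alpha * location_loss
```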
The SSD paper uses around 6 default bounding boxes per feature map cell.
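As an illustration of how roughly 6 default boxes per cell can be laid out on one feature map (the scale and aspect ratios below are assumptions for the sketch, not the exact values from the paper):

```python
# Generate default boxes for one feature map: one box per aspect ratio per
# cell, centred on the cell. Scale and aspect ratios are illustrative.
import itertools

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
    boxes = []  # (cx, cy, w, h) in relative [0, 1] coordinates
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * ar ** 0.5, scale / ar ** 0.5))
        # extra square box at a larger scale, bringing the total to 6 per cell
        boxes.append((cx, cy, scale * 1.5, scale * 1.5))
    return boxes

print(len(default_boxes(fmap_size=8, scale=0.2)))  # 8*8*6 = 384 boxes
```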
The SSD paper makes the following additional observations:
more default boxes result in more accurate detection, although there is an impact on speed
having MultiBox on multiple layers results in better detection as well, due to the detector running on features at multiple resolutions
80% of the time is spent on the base VGG-16 network: this means that with a faster and equally accurate network SSD’s performance could be even better
SSD confuses objects with similar categories (e.g. animals). This is probably because locations are shared for multiple classes
SSD-500 (the highest resolution variant using 512x512 input images) achieves best mAP on Pascal VOC2007 at 76.8%, but at the expense of speed, where its frame rate drops to 22 fps. SSD-300 is thus a much better trade-off with 74.3 mAP at 59 fps.
SSD produces worse performance on smaller objects, as they may not appear across all feature maps. Increasing the input image resolution alleviates this problem but does not completely address it