Intro. 3D Object Detection
Background (ground) removal → spatiotemporal clustering → classification
There are two key differences between a point cloud and an image: 1) the point cloud is a sparse representation while an image is dense, and 2) the point cloud is 3D while the image is 2D.
3D data is crucial for self-driving cars, autonomous robots, and virtual and augmented reality. Unlike 2D images, which are represented as dense pixel arrays, 3D data can be represented as a polygonal mesh, a volumetric voxel grid, a point cloud, etc.
image from: Create 3D model from a single 2D image in PyTorch
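To make the sparse-set versus dense-grid contrast concrete, here is a minimal numpy sketch (the .bin path is a placeholder; the four-floats-per-point layout follows the KITTI convention):

```python
import numpy as np

# A 2D image is a dense grid: every pixel has a value.
image = np.zeros((375, 1242, 3), dtype=np.uint8)   # H x W x RGB

# A LiDAR point cloud is an unordered, sparse set of points.
# KITTI convention: one float32 quadruple (x, y, z, reflectance) per point.
points = np.fromfile("velodyne/000000.bin", dtype=np.float32).reshape(-1, 4)

# ~100k scattered points versus ~466k densely packed pixels:
print(points.shape, image.shape)
```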
PointNet (2017)
VoxelNet (2018)
SECOND (2018)
PointPillars (2019)
ContFuse (2018)
Frustum PointNets (2018)
MV3D (2017)
AVOD (2018)
PIXOR++ (2018)
Point-cloud based
Projection: project the point cloud onto a 2D plane, e.g. the front view or the bird's eye view (BEV).
Volumetric: encode the point cloud into a volumetric voxel grid before processing it.
PointNet: learn directly on the raw, unordered point set with a PointNet-style architecture (see the sketch below).
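The core idea behind the PointNet route is permutation invariance: a shared per-point MLP followed by a symmetric (max) pooling, so the output does not depend on point order. A stripped-down PyTorch sketch, omitting the input/feature transform networks of the full model:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style encoder: a shared per-point MLP followed by a
    symmetric max-pool, so the output is invariant to point ordering."""
    def __init__(self, in_dim=4, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )

    def forward(self, points):                 # points: (B, N, in_dim)
        per_point = self.mlp(points)           # same MLP for every point: (B, N, feat_dim)
        global_feat, _ = per_point.max(dim=1)  # order-invariant pooling: (B, feat_dim)
        return global_feat

x = torch.randn(2, 1024, 4)                    # 2 clouds, 1024 points, (x, y, z, intensity)
print(TinyPointNet()(x).shape)                 # torch.Size([2, 64])
```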
Fusion based
Combine two or more sensor inputs (e.g. LiDAR + camera) to improve the overall performance of 3D object detection (3DOD).
Early fusion, late fusion, deep fusion
Recent methods tend to view the LiDAR point cloud from a bird's eye view (BEV, 2D):
MV3D, AVOD, PIXOR, Complex-YOLO, PointPillars
BEV preserves object scale (no perspective distortion).
Convolutions in BEV preserve the local range information.
However, the bird's eye view tends to be extremely sparse, which makes direct application of convolutional neural networks impractical and inefficient.
Typical methods discretize the ground plane into a grid of roughly 10 cm × 10 cm cells and perform feature encoding for each grid cell.
Should these features be encoded manually, or can the feature extraction be learned?
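As a concrete reference point for the manual option, here is a minimal numpy sketch of one common hand-crafted BEV encoding (the ranges, cell size, and channels are illustrative, loosely following typical KITTI setups):

```python
import numpy as np

def bev_pseudo_image(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), cell=0.1):
    """Hand-crafted ('fixed') BEV encoding: 10 cm x 10 cm cells, with simple
    per-cell statistics as channels (occupancy, max height, point count)."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((3, ny, nx), dtype=np.float32)
    bev[1].fill(-np.inf)                                # max-height channel

    ix = ((points[:, 0] - x_range[0]) / cell).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(np.int64)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    ix, iy, z = ix[keep], iy[keep], points[keep, 2]

    bev[0, iy, ix] = 1.0                                # occupancy
    np.maximum.at(bev[1], (iy, ix), z)                  # max height per cell
    np.add.at(bev[2], (iy, ix), 1.0)                    # point count per cell
    bev[1][bev[0] == 0] = 0.0                           # empty cells back to zero
    return bev  # (3, 800, 704) pseudo-image for a standard 2D CNN
```

With roughly 100k points spread over 800 × 704 cells, the vast majority of cells stay empty, which is exactly the sparsity issue noted above. The methods below learn the encoding instead.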
VoxelNet, Vote3Deep, PointRCNN
VoxelNet
One of the first methods to truly do end-to-end learning in this domain.
VoxelNet divides the space into voxels, applies a PointNet to each voxel, followed by a 3D convolutional middle layer to consolidate the vertical axis, after which a 2D convolutional detection architecture is applied.
Slow (~4 Hz) due to the 3D convolutions.
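A shape-level PyTorch sketch of that pipeline, for intuition only: the real network stacks several VFE layers and three 3D convolutional middle layers, and the scatter from voxel features back to a dense grid is stubbed out here.

```python
import torch
import torch.nn as nn

class VoxelNetSketch(nn.Module):
    """Shape-level sketch of the VoxelNet pipeline; layer counts and sizes are
    illustrative, not the paper's exact configuration."""
    def __init__(self):
        super().__init__()
        # VFE: per-point MLP inside each voxel (VoxelNet augments each point
        # with its offset to the voxel centroid, giving 7 input features).
        self.vfe = nn.Sequential(nn.Linear(7, 64), nn.ReLU())
        # 3D convolutional middle layer; stride 2 consolidates the vertical axis.
        self.middle = nn.Conv3d(64, 64, 3, stride=(2, 1, 1), padding=1)
        # Stand-in for the 2D convolutional detection head (RPN).
        self.head = nn.Conv2d(128, 2, 1)

    def forward(self, voxels, dense_shape=(1, 64, 4, 200, 176)):
        # voxels: (V, T, 7) -- up to T points grouped into each of V voxels.
        feats = self.vfe(voxels).max(dim=1).values      # PointNet-style pool -> (V, 64)
        # Scattering the V voxel features back into a dense (D, H, W) grid is
        # omitted; use a dummy dense tensor of the right shape instead.
        dense = torch.zeros(dense_shape)
        mid = self.middle(dense)                        # (1, 64, 2, 200, 176)
        bev = mid.flatten(1, 2)                         # vertical axis -> channels: (1, 128, 200, 176)
        return self.head(bev)                           # 2D detection over the BEV map

out = VoxelNetSketch()(torch.randn(100, 35, 7))
print(out.shape)  # torch.Size([1, 2, 200, 176])
```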
SECOND
An improvement on VoxelNet that speeds up inference with sparse 3D convolutions, but 3D convolution is still the bottleneck.
Frustum PointNet
Uses PointNets to segment and classify the point cloud inside a frustum obtained by projecting a 2D image detection into 3D. It achieved high benchmark performance compared to other fusion methods, but its multi-stage design makes end-to-end learning impractical.
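The first stage (lifting a 2D detection into a 3D frustum) reduces to a point-in-box test after pinhole projection. A minimal numpy sketch, assuming the points are already transformed into the camera frame and ignoring lens distortion:

```python
import numpy as np

def points_in_frustum(points_cam, box2d, K):
    """Keep LiDAR points (in camera coordinates) whose pinhole projection
    lands inside a 2D detection box -- i.e. inside the frustum that the box
    sweeps out into 3D. K is the 3x3 camera intrinsic matrix."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = K[0, 0] * x / z + K[0, 2]         # pinhole projection to pixels
    v = K[1, 1] * y / z + K[1, 2]
    xmin, ymin, xmax, ymax = box2d
    in_front = z > 0                      # only points in front of the camera
    in_box = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    return points_cam[in_front & in_box]  # input to the segmentation PointNet
```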
Early works used 3D convolutional networks for detection, but they are quite slow.
Recent works improve run-time by projecting the 3D point cloud onto either (1) the ground plane (BEV) or (2) the image plane.
Fixed Encoder
For these methods, the point cloud is commonly organized into voxels, and the set of voxels in each vertical column is encoded into a fixed-length, hand-crafted feature vector to form a pseudo-image, which can then be processed by a standard image detection architecture.
MV3D, AVOD: fuse LiDAR with camera input in a two-stage pipeline.
PIXOR, Complex-YOLO: single-stage pipelines.
Learned Encoder
PointNet: learns from unordered point sets, enabling full end-to-end learning.
VoxelNet: a PointNet-based encoder applied to LiDAR points, followed by 3D and 2D convolutional layers.
Useful GitHub project for LiDAR 3D object detection, from OpenMMLab:
https://github.com/open-mmlab/OpenPCDet
OpenPCDet is a clear, simple, self-contained open source project for LiDAR-based 3D object detection. It is a general PyTorch-based codebase for 3D object detection from point clouds, and it currently supports multiple state-of-the-art methods with highly refactored code for both one-stage and two-stage 3D detection frameworks.
Reference: [집콕] 자율주행 인공지능 시스템 ("Autonomous Driving AI System", in Korean)