Intro. 3D Object Detection
Background (ground) removal → spatiotemporal clustering → classification
There are two key differences between a point cloud and an image: 1) the point cloud is a sparse representation while an image is dense, and 2) the point cloud is 3D while the image is 2D.
3D data is crucial for self-driving cars, autonomous robots, and virtual and augmented reality. Unlike 2D images, which are represented as dense pixel arrays, 3D data can be represented as a polygonal mesh, a volumetric voxel grid, a point cloud, etc.
image from: Create 3D model from a single 2D image in PyTorch
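To make the sparse-set versus dense-grid contrast concrete, here is a minimal numpy sketch (the .bin path is a placeholder; the four-floats-per-point layout follows the KITTI convention):

```python
import numpy as np

# A 2D image is a dense grid: every pixel has a value.
image = np.zeros((375, 1242, 3), dtype=np.uint8)   # H x W x RGB

# A LiDAR point cloud is an unordered, sparse set of points.
# KITTI convention: one float32 quadruple (x, y, z, reflectance) per point.
points = np.fromfile("velodyne/000000.bin", dtype=np.float32).reshape(-1, 4)

# ~100k scattered points versus ~466k densely packed pixels:
print(points.shape, image.shape)
```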
PointNet (2017)
VoxelNet (2018)
SECOND (2018)
PointPillars (2019)
ContFuse (2018)
Frustum PointNets (2018)
MV3D (2017)
AVOD (2018)
PIXOR++ (2018)
Point-cloud based
Projection: project the point cloud onto a 2D plane, e.g. the front view or the bird's eye view (BEV).
Volumetric: encode the point cloud into a volumetric voxel grid before processing it.
PointNet: learn directly on the raw, unordered point set with a PointNet-style architecture (see the sketch below).
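The core idea behind the PointNet route is permutation invariance: a shared per-point MLP followed by a symmetric (max) pooling, so the output does not depend on point order. A stripped-down PyTorch sketch, omitting the input/feature transform networks of the full model:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style encoder: a shared per-point MLP followed by a
    symmetric max-pool, so the output is invariant to point ordering."""
    def __init__(self, in_dim=4, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )

    def forward(self, points):                 # points: (B, N, in_dim)
        per_point = self.mlp(points)           # same MLP for every point: (B, N, feat_dim)
        global_feat, _ = per_point.max(dim=1)  # order-invariant pooling: (B, feat_dim)
        return global_feat

x = torch.randn(2, 1024, 4)                    # 2 clouds, 1024 points, (x, y, z, intensity)
print(TinyPointNet()(x).shape)                 # torch.Size([2, 64])
```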
Fusion based
Combine two or more sensor inputs (e.g. LiDAR + camera) to improve the overall performance of 3D object detection (3DOD).
Early fusion, late fusion, deep fusion
Recent methods tend to view the LiDAR point cloud from a bird's eye view (BEV, 2D):
MV3D, AVOD, PIXOR, Complex-YOLO, PointPillars
BEV preserves object scale (no perspective distortion).
Convolutions in BEV preserve the local range information.
However, the bird's eye view tends to be extremely sparse, which makes direct application of convolutional neural networks impractical and inefficient.
Typical methods discretize the ground plane into a grid of roughly 10 cm × 10 cm cells and perform feature encoding for each grid cell.
Should these features be encoded manually, or can the feature extraction be learned?
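As a concrete reference point for the manual option, here is a minimal numpy sketch of one common hand-crafted BEV encoding (the ranges, cell size, and channels are illustrative, loosely following typical KITTI setups):

```python
import numpy as np

def bev_pseudo_image(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), cell=0.1):
    """Hand-crafted ('fixed') BEV encoding: 10 cm x 10 cm cells, with simple
    per-cell statistics as channels (occupancy, max height, point count)."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((3, ny, nx), dtype=np.float32)
    bev[1].fill(-np.inf)                                # max-height channel

    ix = ((points[:, 0] - x_range[0]) / cell).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(np.int64)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    ix, iy, z = ix[keep], iy[keep], points[keep, 2]

    bev[0, iy, ix] = 1.0                                # occupancy
    np.maximum.at(bev[1], (iy, ix), z)                  # max height per cell
    np.add.at(bev[2], (iy, ix), 1.0)                    # point count per cell
    bev[1][bev[0] == 0] = 0.0                           # empty cells back to zero
    return bev  # (3, 800, 704) pseudo-image for a standard 2D CNN
```

With roughly 100k points spread over 800 × 704 cells, the vast majority of cells stay empty, which is exactly the sparsity issue noted above. The methods below learn the encoding instead.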
VoxelNet, Vote3Deep, PointRCNN
VoxelNet
One of the first methods to truly do end-to-end learning in this domain.
VoxelNet divides the space into voxels, applies a PointNet to each voxel, followed by a 3D convolutional middle layer to consolidate the vertical axis, after which a 2D convolutional detection architecture is applied.
Slow (~4 Hz) due to the 3D convolutions.
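A shape-level PyTorch sketch of that pipeline, for intuition only: the real network stacks several VFE layers and three 3D convolutional middle layers, and the scatter from voxel features back to a dense grid is stubbed out here.

```python
import torch
import torch.nn as nn

class VoxelNetSketch(nn.Module):
    """Shape-level sketch of the VoxelNet pipeline; layer counts and sizes are
    illustrative, not the paper's exact configuration."""
    def __init__(self):
        super().__init__()
        # VFE: per-point MLP inside each voxel (VoxelNet augments each point
        # with its offset to the voxel centroid, giving 7 input features).
        self.vfe = nn.Sequential(nn.Linear(7, 64), nn.ReLU())
        # 3D convolutional middle layer; stride 2 consolidates the vertical axis.
        self.middle = nn.Conv3d(64, 64, 3, stride=(2, 1, 1), padding=1)
        # Stand-in for the 2D convolutional detection head (RPN).
        self.head = nn.Conv2d(128, 2, 1)

    def forward(self, voxels, dense_shape=(1, 64, 4, 200, 176)):
        # voxels: (V, T, 7) -- up to T points grouped into each of V voxels.
        feats = self.vfe(voxels).max(dim=1).values      # PointNet-style pool -> (V, 64)
        # Scattering the V voxel features back into a dense (D, H, W) grid is
        # omitted; use a dummy dense tensor of the right shape instead.
        dense = torch.zeros(dense_shape)
        mid = self.middle(dense)                        # (1, 64, 2, 200, 176)
        bev = mid.flatten(1, 2)                         # vertical axis -> channels: (1, 128, 200, 176)
        return self.head(bev)                           # 2D detection over the BEV map

out = VoxelNetSketch()(torch.randn(100, 35, 7))
print(out.shape)  # torch.Size([1, 2, 200, 176])
```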
SECOND
An improvement on VoxelNet that speeds up inference with sparse 3D convolutions, but 3D convolution is still the bottleneck.
Frustum PointNet
Uses PointNets to segment and classify the point cloud inside a frustum obtained by projecting a 2D image detection into 3D. It achieved high benchmark performance compared to other fusion methods, but its multi-stage design makes end-to-end learning impractical.
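The first stage (lifting a 2D detection into a 3D frustum) reduces to a point-in-box test after pinhole projection. A minimal numpy sketch, assuming the points are already transformed into the camera frame and ignoring lens distortion:

```python
import numpy as np

def points_in_frustum(points_cam, box2d, K):
    """Keep LiDAR points (in camera coordinates) whose pinhole projection
    lands inside a 2D detection box -- i.e. inside the frustum that the box
    sweeps out into 3D. K is the 3x3 camera intrinsic matrix."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = K[0, 0] * x / z + K[0, 2]         # pinhole projection to pixels
    v = K[1, 1] * y / z + K[1, 2]
    xmin, ymin, xmax, ymax = box2d
    in_front = z > 0                      # only points in front of the camera
    in_box = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    return points_cam[in_front & in_box]  # input to the segmentation PointNet
```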
Early works used 3D convolutional networks for detection, but they are quite slow.
Recent works improve run-time by projecting the 3D point cloud onto either (1) the ground plane (BEV) or (2) the image plane.
Fixed Encoder
For these methods, the point cloud is commonly organized into voxels, and the set of voxels in each vertical column is encoded into a fixed-length, hand-crafted feature vector to form a pseudo-image, which can then be processed by a standard image detection architecture.
MV3D, AVOD: fuse LiDAR with camera input in a two-stage pipeline.
PIXOR, Complex-YOLO: single-stage pipelines.
Learned Encoder
PointNet: learns from unordered point sets, enabling full end-to-end learning.
VoxelNet: a PointNet-based encoder applied to LiDAR points, followed by 3D and 2D convolutional layers.
Useful GitHub project for LiDAR 3D object detection, from OpenMMLab:
https://github.com/open-mmlab/OpenPCDet
OpenPCDet is a clear, simple, self-contained open source project for LiDAR-based 3D object detection. It is a general PyTorch-based codebase for 3D object detection from point clouds, and it currently supports multiple state-of-the-art methods with highly refactored code for both one-stage and two-stage 3D detection frameworks.
Reference: [집콕] 자율주행 인공지능 시스템 ("Autonomous Driving AI System", in Korean)