Point RCNN
Shi, Shaoshuai, Xiaogang Wang, and Hongsheng Li. "Pointrcnn: 3d object proposal generation and detection from point cloud." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. https://arxiv.org/abs/1812.04244
Github: https://github.com/sshaoshuai/PointRCNN
PointRCNN: 3D object detection from raw point cloud. The whole framework is composed of two stages:
stage-1 for the bottom-up 3D proposal generation from points
Instead of using bird's-eye-view projections or voxels as previous models do, proposals are generated directly from the raw points
stage-2 for refining proposals in the canonical coordinates to obtain the final detection results.
Transforms the pooled points of each proposal to canonical coordinates to learn better local spatial features, which are combined with the global semantic features of each point learned in stage-1 for accurate box refinement and confidence prediction.
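A minimal sketch of this two-stage flow (PyTorch; the module names are hypothetical, a single linear layer stands in for the PointNet++ backbone, and NMS/pooling between the stages is omitted), not the repo's implementation:

```python
import torch
import torch.nn as nn

class TwoStageSketch(nn.Module):
    """Toy stand-in for the PointRCNN pipeline, not the repo's code."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Linear(3, feat_dim)          # stand-in for the PointNet++ backbone
        self.seg_head = nn.Linear(feat_dim, 1)          # stage-1: foreground/background score
        self.box_head = nn.Linear(feat_dim, 7)          # stage-1: (x, y, z, h, w, l, theta) per point
        self.refine_head = nn.Linear(feat_dim + 3, 7)   # stage-2: box refinement

    def forward(self, points):                          # points: (N, 3)
        feats = torch.relu(self.backbone(points))       # per-point semantic features
        fg_scores = self.seg_head(feats)                 # stage-1 segmentation
        proposals = self.box_head(feats)                 # one proposal per foreground point
        # In the real model, proposals are NMS-filtered, points are pooled per proposal,
        # canonically transformed, and refined; here the point features are reused directly.
        refined = self.refine_head(torch.cat([feats, points], dim=1))
        return fg_scores, proposals, refined

pts = torch.rand(1024, 3)
scores, proposals, refined = TwoStageSketch()(pts)
print(scores.shape, proposals.shape, refined.shape)      # (1024, 1) (1024, 7) (1024, 7)
```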
VoxelNet
One-stage network
3D voxel-based; uses 3D convolutions (inefficient)
Point RCNN
Two-stage network
Operates on the raw point cloud directly to generate region proposals
AVOD, F-PointNet: top-down. Create ROIs first, then use the points inside them.
PointRCNN: bottom-up. Uses the points themselves to generate ROIs.
In 2D image detection, two-stage networks generate proposals first and then refine them in the second stage.
Direct extension of the two-stage methods from 2D to 3D is non-trivial due to the huge 3D search space and the irregular format of point clouds.
See the comparison with AVOD and F-PointNet above (top-down manner).
This paper generates 3D proposals in a bottom-up manner.
Specifically, we learn point-wise features to segment the raw point cloud and to generate 3D proposals from the segmented foreground points simultaneously.
To learn discriminative point-wise features for describing the raw point clouds, we utilize PointNet++ as the backbone network.
The foreground segmentation and 3D box proposal generation are performed simultaneously.
For point segmentation, the ground-truth segmentation mask is naturally provided by the 3D ground-truth boxes.
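As a rough illustration (not the repo's code) of how that mask can be derived: a point is labeled foreground if it lies inside any ground-truth box. The sketch assumes the (x, y, z, h, w, l, θ) box convention described below, with the X-Z plane as the ground and boxes anchored at their centers:

```python
import numpy as np

def points_in_box(points, box):
    """Return a boolean mask of the points that fall inside one rotated 3D box."""
    cx, cy, cz, h, w, l, theta = box
    shifted = points - np.array([cx, cy, cz])            # move into the box frame
    c, s = np.cos(-theta), np.sin(-theta)                # undo the yaw around the Y axis
    x_loc = c * shifted[:, 0] - s * shifted[:, 2]
    z_loc = s * shifted[:, 0] + c * shifted[:, 2]
    return (np.abs(x_loc) <= l / 2) & (np.abs(z_loc) <= w / 2) & (np.abs(shifted[:, 1]) <= h / 2)

def foreground_mask(points, gt_boxes):
    """Ground-truth segmentation mask: foreground = inside any ground-truth box."""
    mask = np.zeros(len(points), dtype=bool)
    for box in gt_boxes:
        mask |= points_in_box(points, box)
    return mask
```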
During training, the foreground points are known directly by applying the 3D bounding-box ground truth as the foreground segmentation labels.
During inference, foreground point segmentation and then bin-based 3D box generation are performed sequentially.
Used focal loss for training: there is a large imbalance between the number of foreground and background points.
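A minimal sketch of the standard focal loss used for this imbalanced foreground/background classification; alpha and gamma below are the common defaults, not necessarily the values used in the repo:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: (N,) float tensors; targets are 1.0 for foreground, 0.0 for background."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()      # down-weights easy (background) points
```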
During training, we only require the box regression head to regress 3D bounding box locations from foreground points.
A 3D bounding box is represented as (x, y, z, h, w, l, θ) in the LiDAR coordinate system, where (x, y, z) is the object center location, (h, w, l) is the object size, and θ is the object orientation from the bird's-eye view.
we propose bin-based regression losses for estimating 3D bounding boxes of objects.
Split the surrounding area of each foreground point into a series of discrete bins along the X and Z axes; the center target then becomes a bin classification plus a residual regression within the chosen bin.
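To make the bin scheme concrete, the sketch below encodes one ground-plane axis: the offset from the foreground point is shifted into a fixed search range, the bin index is the classification target, and the normalized within-bin offset is the regression target. The search_range and bin_size values are illustrative, not the paper's exact hyper-parameters:

```python
import numpy as np

def encode_bin_target(center_gt, point_coord, search_range=3.0, bin_size=0.5):
    """Encode a 1-D center offset as (bin index, normalized in-bin residual)."""
    offset = center_gt - point_coord                      # signed offset from the foreground point
    shifted = np.clip(offset + search_range, 0.0, 2 * search_range - 1e-3)
    bin_idx = int(shifted // bin_size)                    # classification target
    residual = (shifted - (bin_idx + 0.5) * bin_size) / bin_size  # regression target
    return bin_idx, residual

print(encode_bin_target(center_gt=1.3, point_coord=0.0))  # -> roughly (8, 0.1) with these settings
```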
The bounding box is enlarged to encode additional context information. The pooled points in each ROI use 1) coordinates, 2) intensity, 3) segmentation mask, 4) semantic features.
From each proposed 3D box, enlarge the size by a constant margin and keep the points (coordinates, reflection intensity, segmentation mask, semantic features) inside the enlarged box.
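A minimal region-pooling sketch under the same box convention as above; the margin value and the function name pool_proposal are illustrative, not taken from the repo:

```python
import numpy as np

def pool_proposal(points, intensity, seg_mask, sem_feats, box, margin=1.0):
    """Keep the per-point inputs that fall inside the proposal enlarged by `margin`."""
    cx, cy, cz, h, w, l, theta = box
    h, w, l = h + 2 * margin, w + 2 * margin, l + 2 * margin   # enlarged extents for context
    shifted = points - np.array([cx, cy, cz])
    c, s = np.cos(-theta), np.sin(-theta)                      # undo the yaw around the Y axis
    x_loc = c * shifted[:, 0] - s * shifted[:, 2]
    z_loc = s * shifted[:, 0] + c * shifted[:, 2]
    keep = (np.abs(x_loc) <= l / 2) & (np.abs(z_loc) <= w / 2) & (np.abs(shifted[:, 1]) <= h / 2)
    return points[keep], intensity[keep], seg_mask[keep], sem_feats[keep]
```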
A canonical coordinate system for each 3D box enables the box refinement stage to learn better local spatial features for each proposal (a sketch follows the list below):
(1) the origin is located at the center of the box proposal;
(2) the local X′ and Z′ axes are approximately parallel to the ground plane, with X′ pointing towards the head direction of the proposal and the other Z′ axis perpendicular to X′;
(3) the Y′ axis remains the same as that of the LiDAR coordinate system.
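A minimal sketch of this canonical transformation, assuming the same ground-plane convention (X-Z is the ground plane, θ is the yaw from bird's-eye view): translate the pooled points to the proposal center and rotate them so X′ points along the proposal heading.

```python
import numpy as np

def canonical_transform(pooled_points, box):
    """Translate pooled points to the proposal center and align X' with its heading."""
    cx, cy, cz, _, _, _, theta = box
    shifted = pooled_points - np.array([cx, cy, cz])      # origin at the proposal center
    c, s = np.cos(-theta), np.sin(-theta)                 # rotate around Y by the proposal yaw
    x_c = c * shifted[:, 0] - s * shifted[:, 2]
    z_c = s * shifted[:, 0] + c * shifted[:, 2]
    return np.stack([x_c, shifted[:, 1], z_c], axis=1)    # (X', Y', Z') per point
```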
The stage-2 refinement input is the concatenation of (sketched below):
(1) Canonically transformed local spatial points,
(2) Extra features (reflection intensity, segmentation mask, and the Euclidean distance d(p) of each point from the origin, included because the canonical transformation discards absolute depth),
(3) The global semantic feature f(p) of each point from stage-1.
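A minimal sketch of assembling this per-point stage-2 input; the shapes and the plain concatenation are illustrative (the paper first embeds the local features to the same dimension as the semantic features before merging):

```python
import numpy as np

def build_refinement_input(canonical_xyz, reflection, seg_mask, depth, sem_feats):
    """canonical_xyz: (M, 3); reflection/seg_mask/depth: (M,); sem_feats: (M, C)."""
    local = np.concatenate(
        [canonical_xyz, reflection[:, None], seg_mask[:, None], depth[:, None]], axis=1
    )
    # The paper embeds `local` to match the semantic feature dimension before merging;
    # this sketch simply concatenates the raw features per pooled point.
    return np.concatenate([local, sem_feats], axis=1)      # (M, 6 + C)
```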