Point RCNN

Shi, Shaoshuai, Xiaogang Wang, and Hongsheng Li. "Pointrcnn: 3d object proposal generation and detection from point cloud." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. https://arxiv.org/abs/1812.04244

Github: https://github.com/sshaoshuai/PointRCNN

Introduction

PointRCNN: 3D object detection from raw point cloud. The whole framework is composed of two stages:

  1. stage-1 for the bottom-up 3D proposal generation from points

  • Instead of projecting to a bird's-eye view or voxelizing the point cloud as in previous models

  2. stage-2 for refining proposals in the canonical coordinates to obtain the final detection results.

    • Transforms the pooled points of each proposal to canonical coordinates to learn better local spatial features, which are combined with the global semantic features of each point learned in stage-1 for accurate box refinement and confidence prediction.

Comparison with VoxelNet

VoxelNet

  • One-stage network

  • 3D voxel based; uses 3D convolutions (inefficient)

Point RCNN

  • Two-stage network

  • Operates directly on the raw point cloud to generate region proposals

Comparison with AVOD and Frustum-PointNet

AVOD, F-PointNet: top-down. Generate ROIs first, then use the points inside them.

PointRCNN: bottom-up. Uses point-wise segmentation to generate ROIs.

Architecture

1. Bottom-Up 3D Proposal Generation

In 2D image detection, two-stage networks generate proposals first and then refine them in the second stage.

Direct extension of the two-stage methods from 2D to 3D is non-trivial due to the huge 3D search space and the irregular format of point clouds.

  • See the comparison with AVOD and Frustum-PointNet above (top-down manner).

This paper generates 3D proposals in a bottom-up manner.

Specifically, we learn point-wise features to segment the raw point cloud and to generate 3D proposals from the segmented foreground points simultaneously.

Learning point cloud representations.

To learn discriminative point-wise features for describing the raw point cloud, we utilize PointNet++ as the backbone network.

Foreground point segmentation.

The foreground segmentation and 3D box proposal generation are performed simultaneously.

For point segmentation, the ground-truth segmentation mask is naturally provided by the 3D ground-truth boxes.

At training time, the foreground points are obtained directly from the 3D ground-truth boxes: points inside a ground-truth box are labeled foreground.

At inference time, foreground point segmentation and bin-based 3D box generation are performed sequentially.

Focal loss is used for training because of the imbalance between the number of foreground and background points.
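A minimal NumPy sketch of a binary focal loss over per-point foreground probabilities (the α = 0.25, γ = 2 defaults are the standard focal-loss settings; the function name and array shapes are illustrative, not the paper's code):

```python
import numpy as np

def focal_loss(p_fg, is_fg, alpha=0.25, gamma=2.0):
    """Binary focal loss on per-point foreground probabilities.

    p_fg  : (N,) predicted probability that each point is foreground.
    is_fg : (N,) 0/1 ground-truth mask (1 = point inside a GT box).
    """
    eps = 1e-7
    p_t = np.where(is_fg == 1, p_fg, 1.0 - p_fg)        # probability of the true class
    alpha_t = np.where(is_fg == 1, alpha, 1.0 - alpha)   # class-balance weight
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)
    return loss.mean()
```

The (1 − p_t)^γ factor down-weights the many easy background points, so the loss is dominated by the hard, rarer foreground points.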

Bin-based 3D bounding box generation.

During training, we only require the box regression head to regress 3D bounding box locations from foreground points.

A 3D bounding box is represented as (x, y, z, h, w, l, θ) in the LiDAR coordinate system, where (x, y, z) is the object center location, (h, w, l) is the object size, and θ is the object orientation from the bird's-eye view.

we propose bin-based regression losses for estimating 3D bounding boxes of objects.

  • split the surrounding area of each foreground point into a series of discrete bins along the X and Z axes.
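Below is a small sketch of how a bin-based target along one axis could be built for a foreground point, under assumed hyper-parameters (search range S = 3 m, bin size δ = 0.5 m): the bin index becomes a classification target and the normalized residual inside the bin becomes a regression target. The helper name is illustrative, not the paper's code.

```python
import numpy as np

def bin_target(center_gt, point, search_range=3.0, bin_size=0.5):
    """Bin-based localization target along one axis (X or Z).

    center_gt : ground-truth box center coordinate on this axis.
    point     : coordinate of the foreground point on this axis.
    Returns the bin index (classification target) and the normalized
    residual inside that bin (regression target).
    """
    offset = center_gt - point                       # relative offset to recover
    assert -search_range <= offset <= search_range, "outside the search range"
    bin_idx = int(np.floor((offset + search_range) / bin_size))
    bin_center = bin_idx * bin_size + bin_size / 2 - search_range
    residual = (offset - bin_center) / bin_size      # normalized intra-bin residual
    return bin_idx, residual

# Example: a foreground point 1.3 m away from the GT center along X
bin_idx, res = bin_target(center_gt=10.0, point=8.7)   # -> bin 8, residual 0.1
```

Turning the coarse localization into a classification over bins plus a small residual regression is what makes the proposal localization more robust than direct regression of the full offset.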

2. Point cloud region pooling

The bounding box is enlarged to encode additional context information. Pooled points in each ROI carry 1) coordinates, 2) reflection intensity, 3) the segmentation mask, and 4) the learned semantic features.

From the proposed 3D box, enlarge it by η and keep the points (coordinates, reflection intensity, segmentation mask, semantic features) inside the enlarged box.
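A minimal sketch of this pooling step, assuming a box parameterized by center/size/yaw, η = 1.0 m, and y treated as the vertical axis (matching the canonical-frame description below); the function and parameter names are illustrative:

```python
import numpy as np

def pool_points_in_proposal(points, feats, center, size, yaw, eta=1.0):
    """Keep the points (and their per-point features) inside a proposal
    box enlarged by eta metres on each side.

    points : (N, 3) point coordinates; feats : (N, C) per-point inputs
             (reflection intensity, segmentation mask, stage-1 features).
    center : (3,) box center; size : (3,) box extents along the local axes;
    yaw    : heading angle about the vertical axis.
    """
    # translate into the proposal frame, then undo the proposal heading
    local = points - np.asarray(center)
    c, s = np.cos(yaw), np.sin(yaw)
    rot_inv = np.array([[ c, 0.0, -s],
                        [0.0, 1.0, 0.0],
                        [ s, 0.0,  c]])    # rotation by -yaw about the vertical axis
    local = local @ rot_inv.T

    half = np.asarray(size) / 2.0 + eta    # enlarged half-extents
    inside = np.all(np.abs(local) <= half, axis=1)
    return points[inside], feats[inside]
```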

3. Canonical 3D bounding box refinement

A canonical coordinate system for each 3D box proposal enables the box refinement stage to learn better local spatial features for each proposal (a small transform sketch follows the list):

(1) the origin is located at the center of the box proposal;

(2) the local X′ and Z′ axes are approximately parallel to the ground plane, with X′ pointing towards the head direction of the proposal and the other Z′ axis perpendicular to X′;

(3) the Y′ axis remains the same as that of the LiDAR coordinate system.
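A minimal sketch of the canonical transform, using the same assumed axis convention as the pooling sketch above (the exact rotation sign convention is an assumption):

```python
import numpy as np

def to_canonical(points, center, heading):
    """Transform the pooled points of one proposal into its canonical frame:
    origin at the proposal center, X' along the proposal heading, Z'
    perpendicular to X' in the ground plane, Y' kept as the vertical axis.
    """
    shifted = points - np.asarray(center)            # (1) origin at the box center
    c, s = np.cos(heading), np.sin(heading)
    # (2) rotate by -heading about the vertical axis so X' aligns with the heading
    rot_inv = np.array([[ c, 0.0, -s],
                        [0.0, 1.0, 0.0],
                        [ s, 0.0,  c]])
    return shifted @ rot_inv.T                       # (3) Y' left unchanged
```

Because every proposal is normalized into the same pose-free frame, the refinement network only has to learn residuals relative to the proposal instead of absolute poses.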

Feature Learning for Box Proposal Refinement

Concatenation of the following (a small per-point concatenation sketch follows the list):

  • (1) Canonically transformed local spatial points

  • (2) Extra features (reflection intensity, segmentation mask, Euclidean distance of the point to the sensor origin)

  • (3) Global semantic features f(p) from stage-1
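A small illustrative sketch of this per-point concatenation (the shapes and names are assumptions; the paper additionally embeds the local features with a few layers before merging them with the global features, which is omitted here):

```python
import numpy as np

def build_refinement_input(canonical_xyz, reflection, seg_mask, dist_to_sensor, global_feats):
    """Per-point input to the stage-2 refinement network: canonical coordinates,
    the extra hand-crafted features, and the stage-1 semantic features,
    concatenated along the channel dimension.
    """
    local = np.concatenate(
        [canonical_xyz,                  # (N, 3) canonically transformed coordinates
         reflection[:, None],            # (N, 1) laser reflection intensity
         seg_mask[:, None],              # (N, 1) stage-1 segmentation mask
         dist_to_sensor[:, None]],       # (N, 1) per-point distance to the sensor origin
        axis=1)
    return np.concatenate([local, global_feats], axis=1)   # (N, 6 + C)
```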
