# PointPillars


PointPillars: Fast Encoders for Object Detection From Point Clouds

Lang, Alex H., et al. "Pointpillars: Fast encoders for object detection from point clouds." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

[**Paper link**](https://arxiv.org/pdf/1812.05784.pdf)

**Github:** <https://github.com/nutonomy/second.pytorch>

PointPillars is a method for 3D object detection that enables end-to-end learning with only 2D convolutional layers. It builds on PointNet and uses only LiDAR input.

It uses an encoder that learns features on pillars (vertical columns) of the point cloud to predict 3D oriented boxes for objects.

**Advantages**

* By learning features instead of relying on fixed encoders, PointPillars can leverage the full information represented by the point cloud.
* By operating on pillars instead of voxels, there is no need to hand-tune the binning of the vertical direction.
* Pillars are fast because all key operations can be formulated as 2D convolutions, which are extremely efficient on a GPU; the network runs at 62–105 Hz.

**Contributions**

* A point cloud encoder and network that operates on the point cloud to enable end-to-end training of a 3D object detection network.
* All computations on pillars can be posed as dense 2D convolutions, enabling inference at 62 Hz, a factor of 2–4x faster than other methods.
* Experiments on the KITTI dataset demonstrate state-of-the-art results on cars, pedestrians, and cyclists on both the BEV and 3D benchmarks.

## Network

The network consists of three main stages (Figure 2):

1. A feature encoder network that converts the point cloud to a sparse **pseudo-image**.
2. A 2D convolutional backbone that processes the pseudo-image into a high-level representation.
3. A detection head that detects and regresses 3D boxes.

![](/files/-MfHf8Pue0YXFMbBzHC9)

## **Feature Encoder (Pillar feature net):**

Pillars are used instead of voxels to avoid 3D convolutions.

The feature encoder converts the point cloud into a sparse pseudo-image. First, the point cloud is divided into a grid in the x-y plane, creating a set of pillars. Each point, a 4-dimensional vector (x, y, z, reflectance), is then augmented to a 9-dimensional vector with the following additional information:

* Xc, Yc, Zc: offsets of the point from the arithmetic mean of all points in its pillar, in each dimension.
* Xp, Yp: offsets of the point from the pillar center in the x-y plane.

Hence, each point now carries the information **D = \[x, y, z, r, Xc, Yc, Zc, Xp, Yp]**.
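The decoration above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the 0.16 m pillar size and the integer hashing of pillar indices are assumptions made here for the sake of the example.

```python
# Sketch of the 9-dim point decoration: each LiDAR point (x, y, z, r) is
# augmented with its offsets from the pillar's point mean (Xc, Yc, Zc)
# and from the pillar's geometric center (Xp, Yp).
import numpy as np

def decorate_points(points, pillar_size=0.16):
    """points: (num_points, 4) array of (x, y, z, r) -> (num_points, 9) features."""
    xy_idx = np.floor(points[:, :2] / pillar_size).astype(np.int64)  # pillar index per point
    keys = xy_idx[:, 0] * 100000 + xy_idx[:, 1]  # hashable pillar key (assumes a bounded grid)
    feats = np.empty((points.shape[0], 9), dtype=points.dtype)
    feats[:, :4] = points
    for k in np.unique(keys):
        mask = keys == k
        mean_xyz = points[mask, :3].mean(axis=0)             # arithmetic mean of the pillar
        center_xy = (xy_idx[mask][0] + 0.5) * pillar_size    # geometric pillar center
        feats[mask, 4:7] = points[mask, :3] - mean_xyz       # Xc, Yc, Zc
        feats[mask, 7:9] = points[mask, :2] - center_xy      # Xp, Yp
    return feats
```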

[Images from here](https://becominghuman.ai/pointpillars-3d-point-clouds-bounding-box-detection-and-tracking-pointnet-pointnet-lasernet-67e26116de5a )

![img](https://miro.medium.com/max/1050/1*Ub48-QRdruvuY__HxWqnJg.png)

![](/files/-MfHfb38HnjXeL3hz4yZ)

Feature Encoder creates pillars on the point cloud. Then each point is converted to a 9-dimensional vector encapsulating information about the pillar it belongs to.

![](/files/-MfHfeyqcqqHC3Kd4Idp)

For each pillar k, zero padding is applied when the number of points Nk is smaller than N; if a pillar holds more than N points, they are randomly sampled down to N.


A simplified PointNet is applied: a linear layer (with BatchNorm and ReLU) maps the (D, P, N) tensor to (C, P, N), and max pooling over the N dimension yields a (C, P) tensor, which is scattered back to the pillar locations to form the pseudo-image.

The paper experimented with 8000, 12000, and 16000 pillars (P) and used C = 64.
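The (D, P, N) → (C, P) step can be sketched as follows. This is a NumPy toy version assuming a (P, N, D) memory layout and random weights in place of the learned layer; BatchNorm and the scatter back to the pseudo-image are omitted.

```python
# Simplified-PointNet sketch: a shared linear layer lifts each decorated
# point from D=9 to C=64 channels, then max pooling over the N points of
# each pillar produces one C-dimensional feature vector per pillar.
import numpy as np

def pillar_feature_net(pillars, C=64, rng=np.random.default_rng(0)):
    """pillars: (P, N, D) zero-padded pillar tensor -> (P, C) pillar features."""
    P, N, D = pillars.shape
    W = rng.standard_normal((D, C)) * 0.1  # stand-in for the learned shared weights
    x = np.maximum(pillars @ W, 0.0)       # linear + ReLU, shape (P, N, C)
    return x.max(axis=1)                   # max pool over the N points

pillars = np.zeros((1200, 100, 9), dtype=np.float32)  # demo size; the paper uses far more pillars
features = pillar_feature_net(pillars)                # -> (1200, 64)
```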

![](/files/-MfHfiz5VkZ7Zr-FSLxv)

## **Backbone**

![img](https://miro.medium.com/max/1050/1*u55iWRkyiqN2sD4dkAdNBw.png)

An example of a backbone: the Region Proposal Network (RPN) used in PointPillars. The image is taken from the [VoxelNet](https://arxiv.org/pdf/1711.06396.pdf) paper, which originally proposed this network.

The backbone consists of sequential 2D convolutional layers that learn features from the transformed input at different scales. The input to the RPN is the feature map provided by the *feature net*.

The network has three blocks of fully convolutional layers. The first layer of each block downsamples the feature map by half via a convolution with stride 2, followed by a sequence of stride-1 convolutions (×q means q applications of the filter). After each convolution layer, BN and ReLU are applied. The output of every block is then upsampled to a fixed size and concatenated to **construct the high-resolution feature map.**
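The shape flow through the backbone can be traced with a small sketch. The 496×432 pseudo-image size and the per-block channel counts are assumptions (a common KITTI-style configuration), not values stated on this page.

```python
# Shape-only walk-through of the top-down backbone: three blocks each
# halve the spatial resolution, then every block's output is upsampled
# to a common size and concatenated channel-wise.

def backbone_shapes(h, w, strides=(2, 2, 2), channels=(64, 128, 256), up_c=128):
    block_shapes = []
    for s, ch in zip(strides, channels):
        h, w = h // s, w // s              # the stride-2 first conv halves the map
        block_shapes.append((ch, h, w))
    up_h, up_w = block_shapes[0][1], block_shapes[0][2]  # common upsampled resolution
    concat_c = up_c * len(block_shapes)    # concatenation along the channel axis
    return block_shapes, (concat_c, up_h, up_w)

blocks, fused = backbone_shapes(496, 432)
# blocks: [(64, 248, 216), (128, 124, 108), (256, 62, 54)]
# fused:  (384, 248, 216) -- the high-resolution feature map fed to the head
```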

## Loss Function

The loss function is optimized using Adam. The total loss is

![](/files/-MfHfqDgO36vuJlEAnyN)

**Localization Regression**

The same smooth-L1 regression loss as SECOND is used, over the residuals of the box parameters (x, y, z, w, l, h, θ).

![](/files/-MfHfsgL4N1IvUbXzCy7)
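For reference, the SECOND-style residuals can be written out in LaTeX (reconstructed from the SECOND and PointPillars papers, with $d^a = \sqrt{(w^a)^2 + (l^a)^2}$ the anchor's diagonal):

```latex
\Delta x = \frac{x^{gt} - x^a}{d^a}, \quad
\Delta y = \frac{y^{gt} - y^a}{d^a}, \quad
\Delta z = \frac{z^{gt} - z^a}{h^a}, \\
\Delta w = \log\frac{w^{gt}}{w^a}, \quad
\Delta l = \log\frac{l^{gt}}{l^a}, \quad
\Delta h = \log\frac{h^{gt}}{h^a}, \quad
\Delta\theta = \sin\!\left(\theta^{gt} - \theta^a\right), \\
\mathcal{L}_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b)
```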

**Classification loss**

Focal loss is used for anchor classification.

![](/files/-MfHfw64Asy73zUssiLe)
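The focal loss can be sketched in a few lines of NumPy. The α = 0.25 and γ = 2 settings are the ones the PointPillars paper adopts; the function below is a minimal per-anchor illustration, not the full multi-class head.

```python
# Focal-loss sketch: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).
# The (1 - p_t)^gamma factor down-weights easy anchors so that the rare,
# hard positive anchors dominate the classification loss.
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """p: predicted object probability; target: 1 for positive anchors, 0 otherwise."""
    p_t = np.where(target == 1, p, 1.0 - p)  # probability of the true class
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy, confident anchor contributes far less than a hard one:
easy = focal_loss(np.array([0.99]), np.array([1]))
hard = focal_loss(np.array([0.10]), np.array([1]))
```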

## Performance

![](/files/-MfHgGFBcfm5xydCPEEC)

