VoxelNet

Zhou, Yin, and Oncel Tuzel. "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.

Introduction

To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations

  • for example, a bird’s eye view projection.

We remove the need for manual feature engineering on 3D point clouds and propose VoxelNet,

a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network.

VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer.

Contributions

  • VoxelNet directly operates on sparse 3D points and avoids the information bottleneck introduced by manual feature engineering.

  • An efficient implementation of VoxelNet that benefits both from the sparse point structure and from efficient parallel processing on the voxel grid.

Architecture

Voxel feature encoding (VFE) layer enables inter-point interaction within a voxel, by combining point-wise features with a locally aggregated feature.

Stacking multiple VFE layers allows learning complex features for characterizing local 3D shape information.

Specifically, VoxelNet divides the point cloud into equally spaced 3D voxels, encodes each voxel via stacked VFE layers, and then 3D convolution further aggregates local voxel features, transforming the point cloud into a high-dimensional volumetric representation.

Finally, an RPN consumes the volumetric representation and yields the detection result.

1. Feature Learning Network

Voxel Partition and Grouping:

The input point cloud spans a 3D range (D, H, W) along the z, y, x axes. Partition this space into equally spaced voxels of size (vD, vH, vW), yielding a grid of D' = D/vD, H' = H/vH, W' = W/vW voxels.

This is not a projected bird's-eye view; the full 3D structure is kept.

Group the 3D points according to the voxel they reside in.
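
A minimal numpy sketch of this partition-and-grouping step. The range and voxel size below follow the paper's car-detection setting (Z ∈ [−3, 1], Y ∈ [−40, 40], X ∈ [0, 70.4] m; vD = 0.4, vH = 0.2, vW = 0.2); the function name and the dict-of-lists grouping are illustrative, not the paper's implementation:

```python
import numpy as np

def group_points(points,
                 pc_range=(-3.0, -40.0, 0.0, 1.0, 40.0, 70.4),
                 voxel_size=(0.4, 0.2, 0.2)):
    """Assign each point to a (d, h, w) voxel index and group points by voxel.

    points: (N, 4) array of (x, y, z, reflectance).
    pc_range: (z_min, y_min, x_min, z_max, y_max, x_max) in metres (assumed).
    voxel_size: (vD, vH, vW) along (z, y, x).
    """
    z_min, y_min, x_min, z_max, y_max, x_max = pc_range
    vD, vH, vW = voxel_size

    # Keep only points inside the cropped range.
    mask = ((points[:, 2] >= z_min) & (points[:, 2] < z_max) &
            (points[:, 1] >= y_min) & (points[:, 1] < y_max) &
            (points[:, 0] >= x_min) & (points[:, 0] < x_max))
    points = points[mask]

    # Integer voxel coordinates (d, h, w) for every point.
    d = ((points[:, 2] - z_min) / vD).astype(np.int64)
    h = ((points[:, 1] - y_min) / vH).astype(np.int64)
    w = ((points[:, 0] - x_min) / vW).astype(np.int64)

    # Group: map each occupied voxel index to the points residing in it.
    voxels = {}
    for point, key in zip(points, zip(d, h, w)):
        voxels.setdefault(key, []).append(point)
    return {k: np.stack(v) for k, v in voxels.items()}
```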

Random Sampling

One scan frame contains roughly 100k points, and processing all of them is computationally very heavy.

Randomly sample a fixed number T of points from each voxel that contains more than T points (see the sketch after this list):

  • to decrease computational cost;

  • to decrease the imbalance of points between voxels and so reduce sampling bias.
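
A short sketch of this per-voxel sampling step, assuming T = 35 (the cap the paper uses for car detection) and the `voxels` dict from the grouping sketch above:

```python
import numpy as np

def sample_voxels(voxels, T=35, seed=0):
    """Cap every voxel at T points by sampling without replacement."""
    rng = np.random.default_rng(seed)
    sampled = {}
    for key, pts in voxels.items():
        if len(pts) > T:
            # Only over-full voxels are subsampled; small voxels are kept.
            idx = rng.choice(len(pts), size=T, replace=False)
            pts = pts[idx]
        sampled[key] = pts
    return sampled
```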

Stacked Voxel Feature Encoding

The key innovation is the chain of VFE layers.

For each non-empty voxel, repeat the following VFE process: augment each point with its offset from the voxel centroid, transform the augmented points through a fully connected layer (linear, BN, ReLU) into point-wise features, element-wise max-pool over the voxel's points to obtain a locally aggregated feature, and concatenate it back onto each point-wise feature.
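
A minimal PyTorch sketch of one VFE layer as just described; the class name and tensor shapes are assumptions, and masking of zero-padded point slots is omitted for brevity (a real implementation must mask them):

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """One VFE layer: shared FCN -> point-wise features, max-pool ->
    locally aggregated feature, then per-point concatenation."""

    def __init__(self, c_in, c_out):
        super().__init__()
        assert c_out % 2 == 0
        self.fcn = nn.Linear(c_in, c_out // 2)
        self.bn = nn.BatchNorm1d(c_out // 2)

    def forward(self, x):
        # x: (K, T, c_in) -- K non-empty voxels with up to T points each.
        K, T, _ = x.shape
        pw = self.fcn(x).view(K * T, -1)          # shared linear layer
        pw = torch.relu(self.bn(pw)).view(K, T, -1)
        agg = pw.max(dim=1, keepdim=True).values  # (K, 1, c_out/2)
        # Broadcast the aggregated feature to every point and concatenate.
        return torch.cat([pw, agg.expand(-1, T, -1)], dim=2)  # (K, T, c_out)
```

The paper stacks VFE-1(7, 32) and VFE-2(32, 128) — the 7-dim input is (x, y, z, r) plus the three offsets to the voxel centroid — then applies a final FCN and an element-wise max-pool to obtain one 128-dim feature per voxel.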

The final output, after processing all non-empty voxels, is a sparse 4D tensor of shape C × D' × H' × W'; empty voxels contribute zeros, which is what keeps the representation sparse.
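
A short sketch of scattering the per-voxel features into that dense tensor, assuming the car setting's grid of 10 × 400 × 352 (the range divided by the voxel size) and a `coords` tensor built from the voxel indices above:

```python
import torch

def scatter_to_dense(feats, coords, grid=(10, 400, 352)):
    """feats: (K, C) voxel features; coords: (K, 3) long tensor of (d, h, w)."""
    C = feats.shape[1]
    Dp, Hp, Wp = grid                      # (D', H', W')
    dense = feats.new_zeros(C, Dp, Hp, Wp)
    # Empty voxels stay zero; only the K occupied cells are written.
    dense[:, coords[:, 0], coords[:, 1], coords[:, 2]] = feats.t()
    return dense
```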

2. Convolutional Middle Layers

We use ConvMD(c_in, c_out, k, s, p) to represent an M-dimensional convolution operator, where c_in and c_out are the numbers of input and output channels, and k, s, and p are the M-dimensional vectors for kernel size, stride, and padding respectively.

  • e.g. k = (k, k, k) for a 3D kernel of equal size in each dimension

Each convolutional middle layer applies a 3D convolution followed by BN and ReLU.
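
A PyTorch sketch of the middle layers in this ConvMD notation; the three-layer configuration below follows the car-detection setting reported in the paper, but treat it as illustrative:

```python
import torch.nn as nn

def conv3d(c_in, c_out, k, s, p):
    # ConvMD(c_in, c_out, k, s, p) for M = 3, followed by BN and ReLU.
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=k, stride=s, padding=p),
        nn.BatchNorm3d(c_out),
        nn.ReLU(inplace=True),
    )

middle_layers = nn.Sequential(
    conv3d(128, 64, 3, (2, 1, 1), (1, 1, 1)),
    conv3d(64, 64, 3, (1, 1, 1), (0, 1, 1)),
    conv3d(64, 64, 3, (2, 1, 1), (1, 1, 1)),
)
# (B, 128, 10, 400, 352) -> (B, 64, 2, 400, 352); the depth axis is then
# folded into the channels to give a (B, 128, 400, 352) BEV map for the RPN.
```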

3D convolution is slow and is the main computational bottleneck here; PointPillars later removes it entirely (see PointPillars).

The convolutional middle layers aggregate voxel-wise features within a progressively expanding receptive field, adding more context to the shape description.

3. Region Proposal Network

See Faster R-CNN; VoxelNet's RPN is a modification of the region proposal network of Faster R-CNN. The middle-layer output is reshaped into a 2D bird's-eye-view feature map and passed through three blocks of 2D convolutions; each block's output is upsampled to a common resolution, concatenated, and mapped by 1×1 convolutions to a probability score map and a bounding box regression map.
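
A simplified PyTorch sketch of this RPN. The block depths, upsampling parameters, and head channel counts (two anchors per location, seven regression targets each) follow my reading of the paper's architecture figure and should be treated as an approximation, not the exact implementation:

```python
import torch
import torch.nn as nn

def block(c_in, c_out, n, stride):
    """One stride-`stride` conv followed by n stride-1 convs, each BN+ReLU."""
    layers = [nn.Conv2d(c_in, c_out, 3, stride, 1),
              nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    for _ in range(n):
        layers += [nn.Conv2d(c_out, c_out, 3, 1, 1),
                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class RPN(nn.Module):
    def __init__(self, c_in=128):
        super().__init__()
        self.b1 = block(c_in, 128, 3, stride=2)
        self.b2 = block(128, 128, 5, stride=2)
        self.b3 = block(128, 256, 5, stride=2)
        # Upsample every block's output to the resolution of block 1.
        self.up1 = nn.ConvTranspose2d(128, 256, 1, 1)
        self.up2 = nn.ConvTranspose2d(128, 256, 2, 2)
        self.up3 = nn.ConvTranspose2d(256, 256, 4, 4)
        self.score = nn.Conv2d(768, 2, 1)   # 2 anchors per location
        self.reg = nn.Conv2d(768, 14, 1)    # 7 box parameters per anchor

    def forward(self, x):
        x1 = self.b1(x)                     # 1/2 resolution
        x2 = self.b2(x1)                    # 1/4
        x3 = self.b3(x2)                    # 1/8
        f = torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)
        return self.score(f), self.reg(f)
```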
