VoxelNet
Last updated
Zhou, Yin, and Oncel Tuzel. "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.
To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example a bird's-eye view projection.
We remove the need for manual feature engineering on 3D point clouds and propose VoxelNet,
a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network.
VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer.
VoxelNet directly operates on sparse 3D points and avoids information bottlenecks introduced by manual feature engineering.
The paper also proposes an efficient way to implement VoxelNet that benefits both from the sparse point structure and from efficient parallel processing on the voxel grid.
The voxel feature encoding (VFE) layer enables inter-point interaction within a voxel by combining point-wise features with a locally aggregated feature.
Stacking multiple VFE layers allows learning complex features for characterizing local 3D shape information.
Specifically, VoxelNet divides the point cloud into equally spaced 3D voxels, encodes each voxel via stacked VFE layers, and then 3D convolution further aggregates local voxel features, transforming the point cloud into a high-dimensional volumetric representation.
Finally, an RPN consumes the volumetric representation and yields the detection result.
The input is 3D data spanning a range of (D, H, W). Partition the input space with a 3D voxel grid of voxel size (vD, vH, vW), giving a grid of D' = D/vD, H' = H/vH, W' = W/vW voxels.
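The voxel partition above amounts to an integer division of each point's coordinates by the voxel size. A minimal NumPy sketch (the ranges and voxel sizes here are illustrative assumptions, not the paper's exact values):

```python
import numpy as np

# Hypothetical point cloud: 1000 points with (x, y, z) coordinates
rng = np.random.default_rng(0)
points = rng.uniform(low=[0.0, -40.0, -3.0], high=[70.0, 40.0, 1.0], size=(1000, 3))

voxel_size = np.array([0.2, 0.2, 0.4])       # illustrative (vW, vH, vD) in meters
grid_origin = np.array([0.0, -40.0, -3.0])   # minimum corner of the grid

# Integer voxel coordinate for every point
voxel_coords = np.floor((points - grid_origin) / voxel_size).astype(np.int64)
```

Points sharing the same `voxel_coords` row belong to the same voxel, which is what the grouping step below relies on.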
This is not a projected bird's-eye view; the full 3D structure is kept.
Group the 3D points according to the voxel they reside in.
One scan frame contains ~100k points, and processing all of them is computationally expensive.
Randomly sample T points from each voxel that contains more than T points.
This decreases computational cost,
and decreases the point imbalance between voxels, reducing sampling bias.
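The grouping and random-sampling steps can be sketched as follows (voxel size, point count, and the cap T = 35 from the paper; everything else is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(low=0.0, high=10.0, size=(5000, 3))  # toy point cloud
voxel_size = 1.0
T = 35  # maximum number of points kept per voxel

# Group point indices by their voxel coordinate
coords = np.floor(points / voxel_size).astype(np.int64)
voxels = {}
for i, c in enumerate(map(tuple, coords)):
    voxels.setdefault(c, []).append(i)

# Randomly keep at most T points per voxel
sampled = {
    c: (idx if len(idx) <= T else list(rng.choice(idx, size=T, replace=False)))
    for c, idx in voxels.items()
}
```

Voxels with fewer than T points are kept as-is; only the overfull voxels are downsampled, which is what evens out the per-voxel point counts.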
The key innovation is the chain of VFE layers.
For each voxel, repeat the following process.
The final output, after processing all voxels, is a sparse 4D tensor of size C x D' x H' x W'.
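A single VFE layer can be sketched in PyTorch as below. The shared fully connected network, the element-wise max pool over the voxel, and the point-wise/aggregate concatenation follow the paper's description; the exact module layout (and the omission of the paper's BatchNorm inside the FCN, dropped here for brevity) is an assumption:

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """Sketch of a voxel feature encoding (VFE) layer.

    Each point feature passes through a shared linear layer, is
    max-pooled over the voxel into a locally aggregated feature,
    and that aggregate is concatenated back onto every point.
    """
    def __init__(self, c_in, c_out):
        super().__init__()
        # Output concatenates point-wise and aggregated features,
        # so the shared FCN maps to c_out // 2 channels.
        self.fcn = nn.Sequential(nn.Linear(c_in, c_out // 2), nn.ReLU())

    def forward(self, x):                           # x: (num_voxels, T, c_in)
        pw = self.fcn(x)                            # point-wise features
        agg = pw.max(dim=1, keepdim=True).values    # per-voxel aggregate
        agg = agg.expand(-1, x.shape[1], -1)        # broadcast to all T points
        return torch.cat([pw, agg], dim=-1)         # (num_voxels, T, c_out)

vfe = VFELayer(7, 32)                  # 7-dim input: x, y, z, r + centroid offsets
out = vfe(torch.randn(10, 35, 7))      # 10 voxels, 35 points each
```

Stacking several such layers (the "chain" above) lets later layers mix point-wise detail with progressively richer voxel-level context.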
We use ConvMD(c_in, c_out, k, s, p) to represent an M-dimensional convolution operator, where c_in and c_out are the numbers of input and output channels, and k, s, and p are the M-dimensional vectors for kernel size, stride, and padding respectively,
e.g. k = (k, k, k) for 3D.
Each convolutional middle layer applies a 3D convolution followed by BN and ReLU.
3D convolution is slow and is the bottleneck of this architecture; see PointPillars, which avoids it.
The convolutional middle layers aggregate voxel-wise features within a progressively expanding receptive field, adding more context to the shape description.
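In PyTorch terms, ConvMD with M = 3 maps directly onto `nn.Conv3d`. The sketch below stacks three such layers; the channel, stride, and padding choices follow the paper's car-detection setup, but treat the exact numbers (and the shrunken input shape used here in place of the paper's 400 x 352 grid) as assumptions:

```python
import torch
import torch.nn as nn

def conv_md(c_in, c_out, k, s, p):
    """ConvMD(c_in, c_out, k, s, p) for M = 3: Conv3d + BN + ReLU."""
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=k, stride=s, padding=p),
        nn.BatchNorm3d(c_out),
        nn.ReLU(),
    )

# Convolutional middle layers; stride 2 along depth shrinks D' while
# H' and W' are preserved, expanding the receptive field per layer.
middle = nn.Sequential(
    conv_md(128, 64, 3, (2, 1, 1), (1, 1, 1)),
    conv_md(64, 64, 3, (1, 1, 1), (0, 1, 1)),
    conv_md(64, 64, 3, (2, 1, 1), (1, 1, 1)),
)

x = torch.randn(1, 128, 10, 40, 36)   # (N, C, D', H', W'), toy-sized grid
y = middle(x)                          # depth collapses from 10 to 2
```

After the middle layers, the depth dimension is small enough that channel and depth can be reshaped together into a 2D feature map for the RPN.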
See Faster R-CNN; VoxelNet's RPN is a modification of the Faster R-CNN region proposal network.