# VoxelNet

Zhou, Yin, and Oncel Tuzel. "Voxelnet: End-to-end learning for point cloud based 3d object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

{% embed url="https://arxiv.org/pdf/1711.06396.pdf" %}

## Introduction

To interface a highly sparse LiDAR point cloud with a **region proposal network (RPN),** most existing efforts have focused on *hand-crafted* feature representations

* for example, a bird’s eye view projection.

We remove the need for manual feature engineering on 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network.

VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer.

### Contribution

* directly operates on sparse 3D points and avoids information bottlenecks introduced by manual feature engineering.
* an efficient method to implement VoxelNet which benefits both from the sparse point structure and efficient parallel processing on the voxel grid.

## Architecture

![](https://3698175758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwtzMy_pbrChIExFtN%2Fuploads%2Fgit-blob-9ce7c0b0cf655849d380a514d621c1f13e31da52%2Fimage.png?alt=media)

![](https://3698175758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwtzMy_pbrChIExFtN%2Fuploads%2Fgit-blob-6ca25650c002aa4330616f28616b2822ce40797b%2Fimage.png?alt=media)

**Voxel feature encoding (VFE) layer** enables inter-point interaction within a voxel, by combining point-wise features with a locally aggregated feature.

Stacking multiple VFE layers allows learning complex features for characterizing local 3D shape information.

Specifically, VoxelNet divides the point cloud into equally spaced 3D voxels, encodes each voxel via stacked VFE layers, and then 3D convolution further aggregates local voxel features, transforming the point cloud into a high-dimensional volumetric representation.

Finally, a RPN consumes the volumetric representation and yields the detection result. This efficient algorithm benefits both from the sparse point structure and efficient parallel processing on the voxel grid.

### 1. Feature Learning Network

#### Voxel Partition and Grouping:

The input point cloud spans a 3D space of size (D, H, W) along the z, y, and x axes. Partition this space into equally spaced voxels of size (vD, vH, vW), yielding a grid of D/vD × H/vH × W/vW voxels.

> This is not a projected bird's-eye view

Group the 3D points according to the voxel they reside in.
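The partition and grouping steps above can be sketched in numpy as follows; the point-cloud range and voxel sizes here are illustrative stand-ins, not the paper's KITTI settings:

```python
import numpy as np

def group_points_by_voxel(points, voxel_size, range_min):
    """Assign each point (x, y, z, reflectance) to a voxel and group points per voxel."""
    # Integer voxel coordinates along (x, y, z): floor((p - range_min) / voxel_size)
    coords = np.floor((points[:, :3] - range_min) / voxel_size).astype(np.int64)
    groups = {}
    for point, coord in zip(points, coords):
        groups.setdefault(tuple(coord), []).append(point)
    # Only non-empty voxels appear as keys, matching the sparse point structure
    return {k: np.stack(v) for k, v in groups.items()}

points = np.array([[0.10, 0.20, 0.30, 0.9],
                   [0.15, 0.25, 0.35, 0.5],
                   [1.40, 0.20, 0.30, 0.1]])
voxels = group_points_by_voxel(points,
                               voxel_size=np.array([0.5, 0.5, 0.5]),
                               range_min=np.array([0.0, 0.0, 0.0]))
# The first two points share voxel (0, 0, 0); the third lands in voxel (2, 0, 0).
```

Because LiDAR returns are highly sparse, most voxels stay empty, which is why only the occupied voxels are stored.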

#### Random Sampling

A single scan frame contains \~100k points, and processing all of them is computationally expensive.

Randomly sample a fixed number T of points from each voxel that contains more than T points,

* to decrease computational cost
* to reduce the imbalance of points between voxels, which lowers sampling bias
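A minimal sketch of the per-voxel sampling, assuming sampling without replacement (T = 35 here is only an example cap):

```python
import numpy as np

def sample_voxel(points, T, rng):
    """Keep at most T points per voxel; subsample without replacement when over T."""
    if len(points) <= T:
        return points
    idx = rng.choice(len(points), size=T, replace=False)
    return points[idx]

rng = np.random.default_rng(0)
voxel_points = rng.normal(size=(120, 4))      # a crowded voxel with 120 points
sampled = sample_voxel(voxel_points, T=35, rng=rng)  # capped at 35 points
```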

#### Stacked Voxel Feature Encoding

The key innovation is the chain of VFE layers.

For each non-empty voxel, repeat the following process:

![](https://3698175758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwtzMy_pbrChIExFtN%2Fuploads%2Fgit-blob-a70b6a19144e629e724e8bd41e14b1d1204ef2dd%2Fimage.png?alt=media)

The final output after processing all voxels is a **sparse 4D tensor of size C × D' × H' × W'**.
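A hedged numpy sketch of a single VFE layer: each point is augmented with its offset to the voxel centroid, passed through a point-wise fully connected layer (linear + ReLU here; the paper's FCN also includes batch normalization, omitted for brevity), max-pooled element-wise into a locally aggregated feature, and concatenated back onto every point-wise feature. The weights are random stand-ins for learned parameters:

```python
import numpy as np

def vfe_layer(voxel_points, W, b):
    """One VFE layer over the t points of a voxel; returns (t, 2m) features."""
    # 1. Augment each point (x, y, z, r) with its offset to the voxel centroid.
    centroid = voxel_points[:, :3].mean(axis=0)
    augmented = np.hstack([voxel_points, voxel_points[:, :3] - centroid])  # (t, 7)
    # 2. Point-wise FCN mapping each augmented point to an m-dim feature space.
    pointwise = np.maximum(augmented @ W + b, 0.0)                         # (t, m)
    # 3. Element-wise max pooling -> one locally aggregated feature per voxel.
    aggregated = pointwise.max(axis=0)                                     # (m,)
    # 4. Concatenate point-wise and aggregated features for inter-point interaction.
    return np.hstack([pointwise, np.broadcast_to(aggregated, pointwise.shape)])

rng = np.random.default_rng(0)
voxel = rng.normal(size=(10, 4))            # 10 points with (x, y, z, reflectance)
W, b = rng.normal(size=(7, 16)), np.zeros(16)
out = vfe_layer(voxel, W, b)                # (10, 32): 16 point-wise + 16 aggregated
```

Stacking such layers (feeding `out` into the next layer's FCN) and max-pooling after the last one yields the single feature vector per voxel that fills the sparse 4D tensor.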

### 2. Convolutional Middle Layers

We use ConvMD(c_in, c_out, **k**, **s**, **p**) to denote an M-dimensional convolution operator, where c_in and c_out are the numbers of input and output channels, and **k**, **s**, and **p** are M-dimensional vectors giving the kernel size, stride, and padding, respectively.

* e.g. **k** = (k, k, k) for a 3D convolution with a cubic kernel
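Per dimension, the spatial size produced by ConvMD follows the standard convolution arithmetic; a small sketch (the example numbers are illustrative, not the paper's exact layer configuration):

```python
def conv_out_size(size, k, s, p):
    """Output spatial size of a convolution along one dimension."""
    return (size + 2 * p - k) // s + 1

# A layer like Conv3D(c_in, c_out, 3, s=(2, 1, 1), p=(1, 1, 1)) halves the
# depth axis while preserving the other two:
depth_out = conv_out_size(10, k=3, s=2, p=1)    # 10 -> 5
width_out = conv_out_size(400, k=3, s=1, p=1)   # 400 -> 400
```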

Each convolutional middle layer applies 3D convolution followed by batch normalization (BN) and ReLU.

> 3D convolution is slow and is the bottleneck of this architecture. See PointPillars for a follow-up that avoids it.

The convolutional middle layers aggregate voxel-wise features within a progressively expanding receptive field, adding more context to the shape description.

### 3. Region Proposal Network

The detection head is a modification of the region proposal network (RPN) of Faster R-CNN; see Faster R-CNN for background.

![](https://3698175758-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MAwtzMy_pbrChIExFtN%2Fuploads%2Fgit-blob-c914c561b96aafec6b5584b6680031f7beaccbbb%2Fimage.png?alt=media)

