# PointPillars


PointPillars: Fast Encoders for Object Detection From Point Clouds

Lang, Alex H., et al. "Pointpillars: Fast encoders for object detection from point clouds." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

[**Paper link**](https://arxiv.org/pdf/1812.05784.pdf)

**Github:** <https://github.com/nutonomy/second.pytorch>

PointPillars is a method for 3D object detection that enables end-to-end learning with only 2D convolutional layers. It builds on PointNet and uses only LiDAR input.

It uses an encoder that learns features on pillars (vertical columns) of the point cloud to predict 3D oriented boxes for objects.

**Advantages**

* By learning features instead of relying on fixed encoders, PointPillars can leverage the full information represented by the point cloud.
* By operating on pillars instead of voxels, there is no need to hand-tune the binning of the vertical direction.
* Pillars are fast because all key operations can be formulated as 2D convolutions, which are extremely efficient on a GPU; the network runs at 62–105 Hz.

**Contributions**

* A point cloud encoder and network that operates on the point cloud to enable end-to-end training of a 3D object detection network.
* All computations on pillars can be posed as dense 2D convolutions, enabling inference at 62 Hz, a factor of 2–4x faster than other methods.
* Experiments on the KITTI dataset demonstrate state-of-the-art results on cars, pedestrians, and cyclists on both the BEV and 3D benchmarks.

## Network

The network consists of three main stages (Figure 2):

1. A feature encoder network that converts the point cloud to a sparse **pseudo-image**.
2. A 2D convolutional backbone that processes the pseudo-image into a high-level representation.
3. A detection head that detects and regresses 3D boxes.

![](/files/-MfHf8Pue0YXFMbBzHC9)

## **Feature Encoder (Pillar feature net):**

Pillars are used instead of voxels to avoid 3D convolutions.

The feature encoder converts the point cloud into a sparse pseudo-image. First, the point cloud is divided into a grid in the x-y plane, creating a set of pillars. Each point, a 4-dimensional vector (x, y, z, reflectance), is then augmented to a 9-dimensional vector with the following additional information:

* Xc, Yc, Zc: offsets of the point from the arithmetic mean of all points in its pillar, in each dimension.
* Xp, Yp: offsets of the point from the pillar center in the x-y plane.

Hence, each point now carries the information **D = \[x, y, z, r, Xc, Yc, Zc, Xp, Yp]**.
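The decoration above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the 0.16 m pillar size and the integer hashing of pillar indices are assumptions made here for the sake of the example.

```python
# Sketch of the 9-dim point decoration: each LiDAR point (x, y, z, r) is
# augmented with its offsets from the pillar's point mean (Xc, Yc, Zc)
# and from the pillar's geometric center (Xp, Yp).
import numpy as np

def decorate_points(points, pillar_size=0.16):
    """points: (num_points, 4) array of (x, y, z, r) -> (num_points, 9) features."""
    xy_idx = np.floor(points[:, :2] / pillar_size).astype(np.int64)  # pillar index per point
    keys = xy_idx[:, 0] * 100000 + xy_idx[:, 1]  # hashable pillar key (assumes a bounded grid)
    feats = np.empty((points.shape[0], 9), dtype=points.dtype)
    feats[:, :4] = points
    for k in np.unique(keys):
        mask = keys == k
        mean_xyz = points[mask, :3].mean(axis=0)             # arithmetic mean of the pillar
        center_xy = (xy_idx[mask][0] + 0.5) * pillar_size    # geometric pillar center
        feats[mask, 4:7] = points[mask, :3] - mean_xyz       # Xc, Yc, Zc
        feats[mask, 7:9] = points[mask, :2] - center_xy      # Xp, Yp
    return feats
```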

[Images from here](https://becominghuman.ai/pointpillars-3d-point-clouds-bounding-box-detection-and-tracking-pointnet-pointnet-lasernet-67e26116de5a )

![img](https://miro.medium.com/max/1050/1*Ub48-QRdruvuY__HxWqnJg.png)

![](/files/-MfHfb38HnjXeL3hz4yZ)

Feature Encoder creates pillars on the point cloud. Then each point is converted to a 9-dimensional vector encapsulating information about the pillar it belongs to.

![](/files/-MfHfeyqcqqHC3Kd4Idp)

For each pillar k, zero padding is applied when the number of points Nk is smaller than N; if a pillar holds more than N points, they are randomly sampled down to N.


A simplified PointNet is applied: a linear layer (with BatchNorm and ReLU) maps the (D, P, N) tensor to (C, P, N), and max pooling over the N dimension yields a (C, P) tensor, which is scattered back to the pillar locations to form the pseudo-image.

The paper experimented with 8000, 12000, and 16000 pillars (P) and used C = 64.
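The (D, P, N) → (C, P) step can be sketched as follows. This is a NumPy toy version assuming a (P, N, D) memory layout and random weights in place of the learned layer; BatchNorm and the scatter back to the pseudo-image are omitted.

```python
# Simplified-PointNet sketch: a shared linear layer lifts each decorated
# point from D=9 to C=64 channels, then max pooling over the N points of
# each pillar produces one C-dimensional feature vector per pillar.
import numpy as np

def pillar_feature_net(pillars, C=64, rng=np.random.default_rng(0)):
    """pillars: (P, N, D) zero-padded pillar tensor -> (P, C) pillar features."""
    P, N, D = pillars.shape
    W = rng.standard_normal((D, C)) * 0.1  # stand-in for the learned shared weights
    x = np.maximum(pillars @ W, 0.0)       # linear + ReLU, shape (P, N, C)
    return x.max(axis=1)                   # max pool over the N points

pillars = np.zeros((1200, 100, 9), dtype=np.float32)  # demo size; the paper uses far more pillars
features = pillar_feature_net(pillars)                # -> (1200, 64)
```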

![](/files/-MfHfiz5VkZ7Zr-FSLxv)

## **Backbone**

![img](https://miro.medium.com/max/1050/1*u55iWRkyiqN2sD4dkAdNBw.png)

An example of a backbone: the Region Proposal Network (RPN) used in PointPillars. The image is taken from the [VoxelNet](https://arxiv.org/pdf/1711.06396.pdf) paper, which originally proposed this network.

The backbone consists of sequential 2D convolutional layers that learn features from the transformed input at different scales. The input to the RPN is the feature map provided by the *feature net*.

The network has three blocks of fully convolutional layers. The first layer of each block downsamples the feature map by half via a convolution with stride 2, followed by a sequence of stride-1 convolutions (×q means q applications of the filter). After each convolution layer, BN and ReLU are applied. The output of every block is then upsampled to a fixed size and concatenated to **construct the high-resolution feature map.**
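The shape flow through the backbone can be traced with a small sketch. The 496×432 pseudo-image size and the per-block channel counts are assumptions (a common KITTI-style configuration), not values stated on this page.

```python
# Shape-only walk-through of the top-down backbone: three blocks each
# halve the spatial resolution, then every block's output is upsampled
# to a common size and concatenated channel-wise.

def backbone_shapes(h, w, strides=(2, 2, 2), channels=(64, 128, 256), up_c=128):
    block_shapes = []
    for s, ch in zip(strides, channels):
        h, w = h // s, w // s              # the stride-2 first conv halves the map
        block_shapes.append((ch, h, w))
    up_h, up_w = block_shapes[0][1], block_shapes[0][2]  # common upsampled resolution
    concat_c = up_c * len(block_shapes)    # concatenation along the channel axis
    return block_shapes, (concat_c, up_h, up_w)

blocks, fused = backbone_shapes(496, 432)
# blocks: [(64, 248, 216), (128, 124, 108), (256, 62, 54)]
# fused:  (384, 248, 216) -- the high-resolution feature map fed to the head
```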

## Loss Function

The loss function is optimized using Adam. The total loss is

![](/files/-MfHfqDgO36vuJlEAnyN)

**Localization Regression**

The same smooth-L1 regression loss as SECOND is used, over the residuals of the box parameters (x, y, z, w, l, h, θ).

![](/files/-MfHfsgL4N1IvUbXzCy7)
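For reference, the SECOND-style residuals can be written out in LaTeX (reconstructed from the SECOND and PointPillars papers, with $d^a = \sqrt{(w^a)^2 + (l^a)^2}$ the anchor's diagonal):

```latex
\Delta x = \frac{x^{gt} - x^a}{d^a}, \quad
\Delta y = \frac{y^{gt} - y^a}{d^a}, \quad
\Delta z = \frac{z^{gt} - z^a}{h^a}, \\
\Delta w = \log\frac{w^{gt}}{w^a}, \quad
\Delta l = \log\frac{l^{gt}}{l^a}, \quad
\Delta h = \log\frac{h^{gt}}{h^a}, \quad
\Delta\theta = \sin\!\left(\theta^{gt} - \theta^a\right), \\
\mathcal{L}_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b)
```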

**Classification loss**

Focal loss is used for anchor classification.

![](/files/-MfHfw64Asy73zUssiLe)
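The focal loss can be sketched in a few lines of NumPy. The α = 0.25 and γ = 2 settings are the ones the PointPillars paper adopts; the function below is a minimal per-anchor illustration, not the full multi-class head.

```python
# Focal-loss sketch: FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).
# The (1 - p_t)^gamma factor down-weights easy anchors so that the rare,
# hard positive anchors dominate the classification loss.
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """p: predicted object probability; target: 1 for positive anchors, 0 otherwise."""
    p_t = np.where(target == 1, p, 1.0 - p)  # probability of the true class
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy, confident anchor contributes far less than a hard one:
easy = focal_loss(np.array([0.99]), np.array([1]))
hard = focal_loss(np.array([0.10]), np.array([1]))
```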

## Performance

![](/files/-MfHgGFBcfm5xydCPEEC)

