Frustum PointNets
Qi, Charles R., Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. "Frustum PointNets for 3D Object Detection from RGB-D Data." CVPR 2018. https://arxiv.org/pdf/1711.08488.pdf
Authors of PointNet, PointNet++
Github: https://github.com/charlesq34/frustum-pointnets
Explores how to extend the PointNet architecture to 3D object detection.
Uses a 2D RGB image together with a depth point cloud (RGB-D data)
Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects.
One key challenge: how to efficiently propose possible locations of 3D objects in 3D space?
Option 1: 3D box candidates by sliding windows
Option 2: 3D region proposal network
But both of these 3D search strategies are computationally expensive.
Proposed: Reduce the search space by taking advantage of mature 2D object detectors
First, we extract the 3D bounding frustum of an object by extruding 2D bounding boxes from image detectors.
Then, within the 3D space trimmed by each of the 3D frustums, we consecutively perform 3D object instance segmentation and amodal 3D bounding box regression using two variants of PointNet.
The segmentation network predicts the 3D mask of the object of interest (i.e. instance segmentation); and the regression network estimates the amodal 3D bounding box
In contrast to treating Depth Data as 2D maps for CNNs, our method is more 3D-centric as we lift depth maps to 3D point clouds and process them using 3D tools.
a few transformations are applied successively on 3D coordinates, which align point clouds into a sequence of more constrained and canonical frames.
learning in 3D space can better exploit the geometric and topological structure of 3D space
With a known camera projection matrix, a 2D bounding box can be lifted to a frustum (with near and far planes specified by depth sensor range) that defines a 3D search space for the object. We then collect all points within the frustum to form a frustum point cloud.
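A minimal sketch of this frustum extraction step, assuming a KITTI-style 3x4 camera projection matrix and points already expressed in the camera frame; the function names are illustrative, not the repository's actual API:

```python
import numpy as np

def project_to_image(pts_3d, P):
    """Project N x 3 camera-frame points to N x 2 pixel coordinates."""
    pts_hom = np.hstack([pts_3d, np.ones((pts_3d.shape[0], 1))])  # N x 4 homogeneous
    uv = pts_hom @ P.T                                            # N x 3
    return uv[:, :2] / uv[:, 2:3]

def extract_frustum_points(pts_3d, P, box2d, near=0.1, far=80.0):
    """Keep points whose projection falls inside the 2D box and whose
    depth lies between the near and far planes of the frustum."""
    xmin, ymin, xmax, ymax = box2d
    uv = project_to_image(pts_3d, P)
    depth = pts_3d[:, 2]  # z is depth in the camera frame
    in_box = (uv[:, 0] >= xmin) & (uv[:, 0] < xmax) & \
             (uv[:, 1] >= ymin) & (uv[:, 1] < ymax)
    in_range = (depth > near) & (depth < far)
    return pts_3d[in_box & in_range]
```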
This paper uses an FPN-based 2D detector,
trained on ImageNet classification and COCO object detection, then fine-tuned on the KITTI 2D object detection dataset
Orientation Normalization
Normalize the frustums by rotating them toward a center view such that the center axis of the frustum is orthogonal to the image plane. This normalization helps improve the rotation-invariance of the algorithm.
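A minimal sketch of this normalization, assuming KITTI-style camera coordinates (x right, y down, z forward) so the rotation is about the y-axis; frustum_center_xz stands for the ray through the 2D box center and is an illustrative input, not the repo's API:

```python
import numpy as np

def rotate_to_center_view(pts_3d, frustum_center_xz):
    """Rotate frustum points about the y-axis so the ray through the
    2D box center points along +z (orthogonal to the image plane)."""
    x, z = frustum_center_xz
    angle = np.arctan2(x, z)               # angle of the center ray w.r.t. +z
    c, s = np.cos(-angle), np.sin(-angle)  # rotate by -angle to cancel it
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return pts_3d @ rot_y.T
```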
How to obtain the 3D location of the object?
Option 1: regress 3D object locations (e.g., by 3D bounding box) from a depth map using 2D CNNs
not easy, as occluding objects and background clutter are common in natural scenes
It is easier to segment points in a 3D point cloud than in a 2D image or depth map, where pixels from distant objects can end up right next to each other.
The network takes a point cloud in frustum and predicts a probability score for each point that indicates how likely the point belongs to the object of interest.
Each frustum contains exactly one object of interest.
Rather than regressing the absolute 3D location of the object whose offset from the sensor may vary in large ranges (e.g. from 5m to beyond 50m in KITTI data), we predict the 3D bounding box center in a local coordinate system – 3D mask coordinates as shown in Fig. 4 (c).
Our segmentation PointNet is learning the occlusion and clutter patterns as well as recognizing the geometry for the object of a certain category.
For example, if we know the object of interest is a pedestrian, then the segmentation network can use this prior to find geometries that look like a person.
Specifically, in our architecture we encode the semantic category as a one-hot class vector (k dimensional for the pre-defined k categories) and concatenate the one-hot vector to the intermediate point cloud features.
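A minimal sketch of this conditioning step (illustrative names, NumPy only): the same k-dimensional one-hot vector is appended to every point's intermediate feature:

```python
import numpy as np

def concat_class_vector(point_feats, class_id, num_classes):
    """point_feats: N x C intermediate features; returns N x (C + k)."""
    one_hot = np.zeros(num_classes, dtype=point_feats.dtype)
    one_hot[class_id] = 1.0
    tiled = np.tile(one_hot, (point_feats.shape[0], 1))   # N x k, same vector per point
    return np.concatenate([point_feats, tiled], axis=1)
```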
After 3D instance segmentation, points that are classified as the object of interest are extracted (“masking” in Fig. 2).
Further normalize the coordinates to boost the translational invariance of the algorithm:
we transform the point cloud into a local coordinate frame by subtracting the centroid from the XYZ values (Fig. 4 (c)).
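A minimal sketch of the masking plus centroid subtraction, assuming the segmentation net outputs two logits per point (background vs. object); names are illustrative:

```python
import numpy as np

def to_mask_coordinates(frustum_pts, seg_logits):
    """frustum_pts: N x 3 points, seg_logits: N x 2 (background vs. object)."""
    mask = seg_logits[:, 1] > seg_logits[:, 0]   # per-point object decision
    object_pts = frustum_pts[mask]
    centroid = object_pts.mean(axis=0)           # origin of the mask coordinate frame
    return object_pts - centroid, centroid       # keep the centroid to undo the shift later
```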
This module estimates the object’s amodal oriented 3D bounding box by using a box regression PointNet together with a preprocessing transformer network.
We find that the origin of the mask coordinate frame (Fig. 4 (c)) may still be quite far from the amodal box center. We therefore propose to use a light-weight regression PointNet (T-Net) to estimate the true center of the complete object and then transform the coordinate such that the predicted center becomes the origin (Fig. 4 (d)).
We explicitly supervise our translation network to predict center residuals from the mask coordinate origin to real object center.
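A minimal sketch of the residual bookkeeping, following the paper's decomposition C_pred = C_mask + ΔC_t-net + ΔC_box-net; only the coordinate shifts are shown, the networks themselves are stand-ins:

```python
import numpy as np

def shift_to_object_coordinates(mask_pts, tnet_center_residual):
    """Re-center the masked points so the T-Net's predicted center
    becomes the origin (Fig. 4 (d))."""
    return mask_pts - tnet_center_residual

def predict_absolute_center(mask_centroid, tnet_residual, boxnet_residual):
    """Stack the residuals back up to recover the box center in the
    frustum (camera) frame: C_pred = C_mask + dC_t-net + dC_box-net."""
    return mask_centroid + tnet_residual + boxnet_residual
```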
The box estimation network predicts amodal bounding boxes in the 3D object coordinate frame (Fig. 4 (d)).
Our model both classifies size/heading (NS scores for size, NH scores for heading) into the pre-defined categories and predicts residual numbers for each category (3×NS residual dimensions for height, width, length; NH residual angles for heading). In the end the net outputs 3 + 4×NS + 2×NH numbers in total.
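A minimal sketch of how such an output vector could be decoded, assuming NS size templates (an NS x 3 array) and NH equally spaced heading bins; the exact slot layout below is an assumption for illustration, not the repo's format:

```python
import numpy as np

def decode_box_output(output, size_templates, num_heading_bins):
    """output: vector of length 3 + 4*NS + 2*NH."""
    ns, nh = size_templates.shape[0], num_heading_bins
    center = output[0:3]                                   # residual box center (3)
    size_scores = output[3:3 + ns]                         # NS size classification scores
    size_res = output[3 + ns:3 + 4 * ns].reshape(ns, 3)    # 3*NS size residuals (h, w, l)
    head_scores = output[3 + 4 * ns:3 + 4 * ns + nh]       # NH heading classification scores
    head_res = output[3 + 4 * ns + nh:]                    # NH heading residual angles

    s = int(np.argmax(size_scores))
    h = int(np.argmax(head_scores))
    size = size_templates[s] + size_res[s]
    heading = h * (2 * np.pi / nh) + head_res[h]           # angle around the up-axis
    return center, size, heading
```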
Simultaneously optimize the three nets involved (3D instance segmentation PointNet, T-Net and amodal box estimation PointNet) with multi-task losses
L_multi-task = L_seg + λ(L_c1-reg + L_c2-reg + L_h-cls + L_h-reg + L_s-cls + L_s-reg + γ·L_corner), where
the corner loss is the sum of the distances between the eight corners of a predicted box and a ground truth box
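A minimal sketch of that corner distance, including the paper's trick of also comparing against a heading-flipped ground-truth box and taking the minimum so a 180-degree heading error is not doubly penalized; the corners themselves can come from a helper like the one sketched after the output line below:

```python
import numpy as np

def corner_distance(pred_corners, gt_corners, gt_corners_flipped):
    """All inputs are 8 x 3 arrays of box corner coordinates."""
    d = np.linalg.norm(pred_corners - gt_corners, axis=1).sum()
    d_flipped = np.linalg.norm(pred_corners - gt_corners_flipped, axis=1).sum()
    return min(d, d_flipped)
```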
Output: a 3D bounding box parameterized by 7 numbers: its center (cx, cy, cz), size (h, w, l), and heading angle (around the up-axis).
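A minimal sketch of turning those 7 parameters into the eight box corners, assuming z is the up-axis and the center is the geometric center of the box; conventions differ between datasets, so treat this as illustrative only:

```python
import numpy as np

def box_to_corners(center, size, heading):
    """center: (cx, cy, cz), size: (h, w, l), heading: rotation around the up-axis."""
    h, w, l = size
    # Axis-aligned corners around the origin (length along x, width along y, height along z).
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    z = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
    corners = np.stack([x, y, z], axis=1)                  # 8 x 3
    c, s = np.cos(heading), np.sin(heading)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
    return corners @ rot_z.T + np.asarray(center)
```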