Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds

1 Introduction

The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance.

It consists of a backbone network followed by two parallel network branches for

1) bounding box regression and

2) point mask prediction.

3D-BoNet is single-stage, anchor-free and end-to-end trainable.

It does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting.

Core problems on 3D geometric data such as point clouds include semantic segmentation, object detection and instance segmentation.

Point clouds are inherently unordered, unstructured and non-uniform.

SGPN: learns to group per-point features through a similarity matrix.

Similarly, ASIS, JSIS3D, MASC and 3D-BEVIS apply the same per-point feature grouping pipeline to segment 3D instances.

Mo et al.: formulate instance segmentation as a per-point feature classification problem in PartNet.

The methods above do not explicitly detect object boundaries, and require a post-processing step such as mean-shift clustering [6] to obtain the final instance labels, which is computationally heavy.

The paper proposes a framework for 3D instance segmentation, where objects are loosely but uniquely detected through a single-forward stage using efficient MLPs, and then each instance is precisely segmented through a simple point-level binary classifier. To this end, we introduce a new bounding box prediction module together with a series of carefully designed loss functions to directly learn object boundaries.

It first uses an existing backbone network to extract a local feature vector for each point and a global feature vector for the whole input point cloud.

The backbone is followed by two branches:

1) instance-level bounding box prediction

2) point-level mask prediction for instance segmentation.

The bounding box prediction branch aims to predict a bounding box for each instance without relying on predefined spatial anchors or a region proposal network.

Learning instance boxes involves two critical issues:

1) the number of total instances is variable, i.e., from 1 to many,

2) there is no fixed order for all instances.

This box prediction branch simply takes the global feature vector as input and directly outputs a large and fixed number of bounding boxes together with confidence scores. These scores are used to indicate whether the box contains a valid instance or not. To supervise the network, we design a novel bounding box association layer followed by a multi-criteria loss function.

Given a set of ground-truth instances, we need to determine which of the predicted boxes best fit them. We formulate this association process as an optimal assignment problem with an existing solver.

After the boxes have been optimally associated, our multi-criteria loss function not only minimizes the Euclidean distance of paired boxes, but also maximizes the coverage of valid points inside of predicted boxes.

The purpose of the point mask prediction branch is to classify whether each point inside of a bounding box belongs to the valid instance or the background.

The framework distinguishes from all existing 3D instance segmentation approaches:

1) Compared with the proposal-free pipeline, our method segments instances with high objectness by explicitly learning 3D object boundaries.

2) Compared with the widely-used proposal-based approaches, our framework does not require expensive and dense proposals.

3) Our framework is remarkably efficient, since the instance-level masks are learnt in a single-forward pass without requiring any post-processing steps.

Key contributions:

• We propose a new framework for instance segmentation on 3D point clouds. The framework is single-stage, anchor-free and end-to-end trainable, without requiring any post-processing steps.

• We design a novel bounding box association layer followed by a multi-criteria loss function to supervise the box prediction branch.

• We demonstrate significant improvement over baselines and provide intuition behind our design choices through extensive ablation studies.

2 3D-BoNet

2.1 Overview

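Input: a point cloud $P \in \mathbb{R}^{N \times k_0}$ with N points in total, where $k_0$ is the number of channels such as the location $\{x, y, z\}$ and color $\{r, g, b\}$ of each point. The backbone extracts per-point local features $F_l \in \mathbb{R}^{N \times k}$ and a global feature vector $F_g \in \mathbb{R}^{1 \times k}$, where k is the length of feature vectors.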

During training, the predicted bounding boxes B and the ground truth boxes are fed into a box association layer.

The output of the association layer is a list of association indices A.

The indices reorganize the predicted boxes, such that each ground truth box is paired with a unique predicted box for subsequent loss calculation.

The predicted bounding box scores are also reordered accordingly before calculating loss.

The reordered predicted bounding boxes are then fed into the multi-criteria loss function.

This loss function aims to not only minimize the Euclidean distance between each ground truth box and the associated predicted box, but also maximize the coverage of valid points inside of each predicted box.

Both the bounding box association layer and the multi-criteria loss function are used only for network training.

To predict a point-level binary mask for each instance, every predicted box, together with the previous local and global features $F_l$ and $F_g$, is further fed into the point mask prediction branch.

2.2 Bounding Box Prediction

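Bounding Box Encoding: each rectangular box is parameterized by its two min-max vertices, i.e. the corner with the smallest and the corner with the largest x, y, z coordinates.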

Neural Layers:

H is a predefined, fixed upper bound on the number of bounding boxes that the whole network is expected to predict.
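To make the fixed-size output concrete, here is a minimal sketch of how a flat regression output could be turned into H boxes and H confidence scores. The layout of 6 numbers per box, the per-axis sorting that enforces a valid min/max corner pair, and the names `decode_boxes` and `decode_scores` are assumptions made for illustration, not details taken from these notes.

```python
import math

def decode_boxes(flat, num_boxes):
    """Reshape 6*H regressed numbers into H boxes of (min_vertex, max_vertex).

    Assumption: each box occupies 6 consecutive values, two 3D vertices.
    """
    if len(flat) != 6 * num_boxes:
        raise ValueError("expected 6 values per box")
    boxes = []
    for h in range(num_boxes):
        a = flat[6 * h: 6 * h + 3]
        b = flat[6 * h + 3: 6 * h + 6]
        # Sort per axis so the first vertex is always the minimum corner.
        lo = [min(x, y) for x, y in zip(a, b)]
        hi = [max(x, y) for x, y in zip(a, b)]
        boxes.append((lo, hi))
    return boxes

def decode_scores(logits):
    """Squash one raw value per box into a confidence score in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]

# Hypothetical raw output for H = 2 boxes.
raw = [0.0, 1.0, 2.0, 3.0, 0.0, 5.0,
       1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
boxes = decode_boxes(raw, num_boxes=2)
scores = decode_scores([0.0, 4.0])
```

Whatever the number of real instances in the scene, the output always has H entries; the scores are what later separate valid boxes from unused ones.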

Bounding Box Association Layer:

Optimal Association Formulation:

A: a boolean association matrix where $A_{i,j} = 1$ if and only if the i-th predicted box is assigned to the j-th ground truth box. A is also called the association index.

C: the association cost matrix, where $C_{i,j}$ represents the cost of assigning the i-th predicted box to the j-th ground truth box.
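The association can be illustrated on a tiny cost matrix. The sketch below is not the Hungarian algorithm: it simply enumerates every way of giving each ground truth box a distinct predicted box and keeps the cheapest, which is only feasible for very small H and T. For real sizes, a solver such as `scipy.optimize.linear_sum_assignment` would be the natural choice. The function names `associate` and `reorder` are hypothetical.

```python
from itertools import permutations

def associate(cost):
    """Brute-force optimal assignment (a stand-in for the Hungarian algorithm).

    cost[i][j] is the cost of assigning predicted box i to ground truth box j,
    with H rows (predictions) and T columns (ground truth), T <= H.
    Returns (idx, total) where idx[j] is the predicted box paired with box j.
    """
    num_pred, num_gt = len(cost), len(cost[0])
    best_idx, best_total = None, float("inf")
    for idx in permutations(range(num_pred), num_gt):
        total = sum(cost[i][j] for j, i in enumerate(idx))
        if total < best_total:
            best_idx, best_total = list(idx), total
    return best_idx, best_total

def reorder(items, idx):
    """Move matched items to the front (in ground truth order), rest after."""
    matched = set(idx)
    rest = [i for i in range(len(items)) if i not in matched]
    return [items[i] for i in idx + rest]

# H = 3 predicted boxes, T = 2 ground truth boxes.
cost = [[4.0, 1.0],
        [2.0, 3.0],
        [0.5, 5.0]]
idx, total = associate(cost)
ordered = reorder(["pred0", "pred1", "pred2"], idx)
```

After reordering, the first T predictions line up one-to-one with the ground truth boxes and the remaining H - T predictions are the unmatched ones, which is the layout the losses expect.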

Association Matrix Calculation:

1) Euclidean Distance between Vertices

2) Soft Intersection-over-Union on Points

The deeper the corresponding point is inside of the box, the higher the value. The farther away the point is outside, the smaller the value.

3) Cross-Entropy Score

Criterion (1) guarantees the geometric boundaries of the learnt boxes, while criteria (2) and (3) maximize the coverage of valid points and overcome the non-uniformity of point clouds.
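The three criteria can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the soft point-in-box probability is modelled as a clipped sigmoid of a signed inside/outside quantity, the constants `THETA1` and `THETA2` are assumed values, and all function names are hypothetical. Each cost is written so that lower means a better match.

```python
import math

THETA1, THETA2 = 100.0, 20.0  # assumed sharpness and clipping constants

def point_in_box_prob(point, box):
    """Soft probability that a point lies inside box = (min_vertex, max_vertex)."""
    lo, hi = box
    probs = []
    for p, a, b in zip(point, lo, hi):
        d = (a - p) * (p - b)  # positive inside along this axis, negative outside
        d = max(min(THETA1 * d, THETA2), -THETA2)
        probs.append(1.0 / (1.0 + math.exp(-d)))
    return min(probs)  # the point must be inside along every axis

def vertex_distance_cost(pred_box, gt_box):
    """Criterion 1: mean squared distance over the 6 vertex coordinates."""
    sq = [(p - g) ** 2
          for pv, gv in zip(pred_box, gt_box)
          for p, g in zip(pv, gv)]
    return sum(sq) / 6.0

def soft_iou_cost(q_pred, q_gt):
    """Criterion 2: negative soft IoU of per-point inside probabilities."""
    inter = sum(a * b for a, b in zip(q_pred, q_gt))
    union = sum(q_pred) + sum(q_gt) - inter
    return -inter / union

def cross_entropy_cost(q_pred, q_gt, eps=1e-8):
    """Criterion 3: cross-entropy between predicted and target probabilities."""
    total = 0.0
    for q, g in zip(q_pred, q_gt):
        total += g * math.log(q + eps) + (1.0 - g) * math.log(1.0 - q + eps)
    return -total / len(q_pred)

unit_box = ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
centre = point_in_box_prob((0.5, 0.5, 0.5), unit_box)
near_edge = point_in_box_prob((0.95, 0.5, 0.5), unit_box)
outside = point_in_box_prob((2.0, 0.5, 0.5), unit_box)
```

The per-point probabilities for a predicted box and the 0/1 membership of the same points in a ground truth box feed criteria (2) and (3), so a box is rewarded for covering the right points even where the point density is uneven.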

Loss Functions

Multi-criteria Loss for Box Prediction:

Loss for Box Score Prediction:

After being reordered by the association index A, the ground truth scores for the first T scores are all '1', and '0' for the remaining H - T invalid scores. Cross-entropy loss is used for this binary classification task:

This loss function rewards the correctly predicted bounding boxes, while implicitly penalizing the cases where multiple similar boxes are regressed for a single instance.
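A minimal sketch of this score loss, assuming the scores have already been reordered so that the first T entries belong to matched boxes and that the loss is averaged over all H scores. The function name `box_score_loss` and the `eps` stabilizer are illustrative choices.

```python
import math

def box_score_loss(scores, num_valid, eps=1e-8):
    """Binary cross-entropy over H reordered box scores.

    The first num_valid (= T) scores have target 1, the remaining H - T have
    target 0. Averaging over H is an assumption of this sketch.
    """
    total = 0.0
    for t, s in enumerate(scores):
        target = 1.0 if t < num_valid else 0.0
        total += target * math.log(s + eps) + (1.0 - target) * math.log(1.0 - s + eps)
    return -total / len(scores)

# H = 4 predicted boxes, T = 2 ground truth instances.
good = box_score_loss([0.9, 0.9, 0.1, 0.1], num_valid=2)
# A third box also claims to be valid, e.g. a duplicate of an existing instance.
duplicate = box_score_loss([0.9, 0.9, 0.9, 0.1], num_valid=2)
```

Because only one box per instance receives target 1, a duplicate box with a high score lands among the H - T unmatched entries and raises the loss, which is the implicit penalty described above.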

2.3 Point Mask Prediction

Neural Layers:

Sigmoid is used as the last activation function.

Loss Function:

Due to the imbalance between instance and background point numbers, we use focal loss [29] with default hyper-parameters instead of the standard cross-entropy loss to optimize this branch. Only the valid T paired masks are used for the loss.
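For reference, a minimal sketch of binary focal loss on per-point mask probabilities. The defaults `alpha=0.25` and `gamma=2.0` are the commonly used focal loss defaults and are assumed here; averaging over points and the function name `focal_loss` are also assumptions of this sketch.

```python
import math

def focal_loss(probs, labels, alpha=0.25, gamma=2.0, eps=1e-8):
    """Binary focal loss averaged over points.

    probs:  sigmoid outputs, one per point, for a single instance mask.
    labels: 1 if the point belongs to the instance, 0 for background.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        a_t = alpha if y == 1 else 1.0 - alpha  # class balancing weight
        # (1 - p_t)^gamma down-weights points that are already well classified.
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t + eps)
    return total / len(probs)

easy = focal_loss([0.9], [1])  # confident and correct
hard = focal_loss([0.1], [1])  # confident and wrong
```

In training this would be evaluated only on the T masks whose boxes were paired with a ground truth instance; the unmatched H - T masks contribute nothing.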

2.4 End-to-End Implementation

Backbone: PointNet++

Optimizer: Adam

Starting from its initial value, the learning rate is divided by 2 every 20 epochs.
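The step decay can be written as a one-line schedule. The initial learning rate is left as a parameter because its value is not given here; `learning_rate` is a hypothetical helper name.

```python
def learning_rate(epoch, initial_lr, step=20, factor=2.0):
    """Step decay: divide the learning rate by `factor` every `step` epochs."""
    return initial_lr / (factor ** (epoch // step))

# With a placeholder initial value of 1.0, epochs 0-19 use 1.0,
# epochs 20-39 use 0.5, epochs 40-59 use 0.25, and so on.
schedule = [learning_rate(e, initial_lr=1.0) for e in (0, 19, 20, 45)]
```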

Hungarian algorithm: used to solve the above optimal association problem.
