CenterPoint: Center-based 3D Object Detection and Tracking

Task: 3D Object Detection & Tracking (LiDAR)
Method: center-based, anchor-free, two-stage
Venue: CVPR
Year: 2021
Paper: https://arxiv.org/abs/2006.11275
Code: https://github.com/tianweiy/CenterPoint

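Abstract

Mainstream 3D point-cloud detectors inherit the 2D detection recipe: anchor boxes + NMS + IoU matching. This paper proposes CenterPoint, which carries CenterNet's center-point representation over to 3D LiDAR detection and tracking. CenterPoint locates object centers with a center heatmap on the BEV feature map, regresses the 3D attributes (size, orientation, velocity), and refines each bbox in a second stage using point features. Tracking needs only nearest-neighbor matching. CenterPoint reaches 65.5 NDS / 58.0 mAP (detection) and 63.8 AMOTA (tracking) on nuScenes, and 71.8 L2 mAPH (vehicle) on Waymo, all state of the art at the time.

Core claim: 3D objects are a natural fit for a center-point representation. In BEV, 3D boxes almost never overlap and there is no viewpoint ambiguity, so center point + attribute regression is simpler and more efficient than anchor + NMS, and outputting velocity directly enables end-to-end tracking.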

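Problem and Motivation

The mainstream 3D detection pipeline carries several layers of redundancy:

Problem	Symptom
Complex anchor design	Each class needs preset anchor orientations and sizes; the parameters depend on dataset statistics
Expensive NMS	3D NMS has to compute rotated IoU (≈10× slower than 2D) and gets in the way of end-to-end training
Hard angle regression	Orientation is discontinuous at 0°/360°, which destabilizes regression; it usually needs $\sin/\cos$ encoding or discretization into classification bins
Detection and tracking are separate	Tracking after detection needs a Kalman filter + Hungarian matching, adding system complexity and latency
Rotated-IoU matching is brittle	For elongated objects a slight offset can make IoU drop sharply (especially for long, thin trucks)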
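
The last row can be made concrete with a small worked example. This is only a sketch under simplifying assumptions: the boxes are axis-aligned (no rotation), the box sizes are made up for illustration, and `bev_iou_axis_aligned` is a hypothetical helper, not a function from the CenterPoint code base. It compares how much IoU an elongated box and a square box lose under the same 0.5 m sideways shift.

```python
def bev_iou_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned BEV boxes given as (cx, cy, length, width)."""
    ax, ay, al, aw = box_a
    bx, by, bl, bw = box_b
    # Overlap along each axis, clamped at zero when the boxes are disjoint
    ox = max(0.0, min(ax + al / 2, bx + bl / 2) - max(ax - al / 2, bx - bl / 2))
    oy = max(0.0, min(ay + aw / 2, by + bw / 2) - max(ay - aw / 2, by - bw / 2))
    inter = ox * oy
    union = al * aw + bl * bw - inter
    return inter / union


# Hypothetical sizes: a 10 m x 2.5 m truck and a 4 m x 4 m square object,
# each compared against a copy of itself shifted sideways by 0.5 m.
shift = 0.5
truck = (0.0, 0.0, 10.0, 2.5)
square = (0.0, 0.0, 4.0, 4.0)

iou_truck = bev_iou_axis_aligned(truck, (0.0, shift, 10.0, 2.5))
iou_square = bev_iou_axis_aligned(square, (0.0, shift, 4.0, 4.0))
print(f"truck IoU:  {iou_truck:.3f}")
print(f"square IoU: {iou_square:.3f}")
```

Both pairs have the same 0.5 m center distance, yet the narrow box loses more IoU. A distance-based criterion treats the two cases identically, which is the intuition behind matching on centers instead of IoU.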

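Core pain point: the 2D recipe of anchors + NMS + IoU matching is unnatural in 3D. In BEV, objects almost never overlap and there is no perspective occlusion, so why preset anchors and run NMS at all?

Key Insights

Insight 1: the center-point representation is naturally unambiguous in BEV

In 2D detection, objects occlude each other heavily and several anchors can cover the same object, so NMS is mandatory. In BEV, however:

Objects do not overlap: 3D boxes have almost no overlap in BEV (only 0.2% on nuScenes)
No viewpoint dependence: the BEV center is uniquely determined and does not change with camera pose
No anchor parameters: heatmap peaks are predicted directly, with no preset class/size/orientation anchor combinations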

$$\hat{Y}_{x,y,c} = \exp\left(-\frac{(x-\tilde{p}_x)^2 + (y-\tilde{p}_y)^2}{2\sigma_c^2}\right)$$

where $\tilde{p}$ is the projection of the GT 3D center onto the BEV map and $\sigma_c$ scales adaptively with object size.
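
To illustrate how such a target is rendered, here is a minimal pure-Python sketch. It assumes the Gaussian form used by CenterNet, takes a fixed `sigma` instead of a size-dependent one, and keeps the per-cell maximum when several objects would overlap. `draw_gaussian` and the 7×7 grid are illustrative choices, not the repository's implementation.

```python
import math


def draw_gaussian(heatmap, center, sigma):
    """Splat exp(-d^2 / (2 sigma^2)) around `center`, keeping the max per cell."""
    cx, cy = center
    for y, row in enumerate(heatmap):
        for x in range(len(row)):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            row[x] = max(row[x], math.exp(-d2 / (2.0 * sigma ** 2)))
    return heatmap


# Hypothetical 7x7 single-class heatmap with one object centered at cell (x=3, y=2)
H, W = 7, 7
hm = [[0.0] * W for _ in range(H)]
draw_gaussian(hm, center=(3, 2), sigma=1.0)
for row in hm:
    print(" ".join(f"{v:.2f}" for v in row))
```

The value is exactly 1 at the center cell and decays smoothly around it, so cells near the true center are penalized less than far-away ones during training.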

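Insight 2: second-stage point features correct the center error

A single stage regresses every attribute directly at the heatmap peak, but the heatmap resolution is limited (stride=8), so center localization carries a quantization error. In its second stage, CenterPoint samples BEV features at points on the predicted box, 8 face centers + the bottom-face center (9 points in total), and an MLP predicts the bbox correction.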

$$\text{refine}(\hat{b}) = \text{MLP}\left(\text{BilinearSample}\left(\mathcal{F}_{BEV},\; \{\hat{b}_{face_i}\}_{i=1}^{9}\right)\right)$$

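Why it works:

The face-center points are spread over the object's boundary and interior and provide rich geometric information
Lighter than an RoI-pooling second stage followed by fully connected layers (9 points → MLP, vs. a whole RoI grid)
Gains +2.2 mAP / +2.0 NDS on nuScenes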
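
The sampling step of the second stage can be sketched as follows. This is a pure-Python illustration of bilinear sampling at arbitrary points on a feature map; the choice of points is left to the caller, the MLP that consumes the concatenated vector is omitted, and both function names are hypothetical.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate feat[row][col] (a list of channel values) at (x, y).

    x indexes columns and y indexes rows, both in feature-map cells;
    coordinates are clamped to the map border.
    """
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    out = []
    for c in range(len(feat[0][0])):
        top = (1 - tx) * feat[y0][x0][c] + tx * feat[y0][x1][c]
        bot = (1 - tx) * feat[y1][x0][c] + tx * feat[y1][x1][c]
        out.append((1 - ty) * top + ty * bot)
    return out


def gather_box_features(feat, points):
    """Concatenate the sampled feature vectors of all points into one vector."""
    vec = []
    for x, y in points:
        vec.extend(bilinear_sample(feat, x, y))
    return vec


# Toy 3x3 map with 2 channels: channel 0 stores the column, channel 1 the row
feat = [[[float(col), float(row)] for col in range(3)] for row in range(3)]
points = [(0.5, 0.5), (1.0, 2.0), (1.25, 0.75)]  # hypothetical sample points
box_vec = gather_box_features(feat, points)
print(box_vec)
```

Because the toy features are linear in the coordinates, each sampled vector simply reproduces the query point, which makes the interpolation easy to check by eye. In a real model the concatenated vector would be fed to the refinement MLP.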

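Insight 3: velocity prediction + nearest neighbor = zero-cost tracking

The detection head additionally regresses a 2D BEV velocity $\hat{\mathbf{v}} = (\hat{v}_x, \hat{v}_y)$. At tracking time each current detection is projected back to the previous frame by subtracting velocity × Δt from its predicted position, and is then matched to the existing tracks by greedy nearest neighbor on BEV distance.

$$\hat{\mathbf{p}}^{(t)}_{projected} = \hat{\mathbf{p}}^{(t)} - \hat{\mathbf{v}}^{(t)} \cdot \Delta t$$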

Once matched, the detection inherits the track ID. No Kalman filter, no ReID, no Hungarian algorithm; the inference overhead is ≈ 0.
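
A minimal sketch of this association step is shown below. It assumes the velocity is the displacement per unit time from the previous frame to the current one, so each detection is moved back by `velocity * dt` before matching. The dictionary layout, the `max_dist` gate and the function name `greedy_track` are illustrative assumptions, not the repository's tracker.

```python
import math


def greedy_track(tracks, detections, dt, max_dist):
    """Associate detections with previous-frame tracks by greedy nearest neighbor.

    tracks:     list of dicts {"id": int, "center": (x, y)} from frame t-1
    detections: list of dicts {"center": (x, y), "velocity": (vx, vy), "score": float}
    Returns the detections, each with an added "id" (matched id or a fresh one).
    """
    next_id = max((t["id"] for t in tracks), default=-1) + 1
    free = set(range(len(tracks)))
    # Higher-confidence detections get to pick first
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        (x, y), (vx, vy) = det["center"], det["velocity"]
        # Project the current center back to where the object was at t-1
        px, py = x - vx * dt, y - vy * dt
        best, best_d = None, max_dist
        for i in free:
            tx, ty = tracks[i]["center"]
            d = math.hypot(px - tx, py - ty)
            if d < best_d:
                best, best_d = i, d
        if best is None:
            det["id"] = next_id  # unmatched: start a new track
            next_id += 1
        else:
            det["id"] = tracks[best]["id"]
            free.remove(best)
    return detections


tracks = [{"id": 0, "center": (0.0, 0.0)}, {"id": 1, "center": (10.0, 0.0)}]
detections = [
    {"center": (1.0, 0.0), "velocity": (2.0, 0.0), "score": 0.9},
    {"center": (10.5, 0.0), "velocity": (1.0, 0.0), "score": 0.8},
    {"center": (50.0, 50.0), "velocity": (0.0, 0.0), "score": 0.7},
]
result = greedy_track(tracks, detections, dt=0.5, max_dist=2.0)
print([d["id"] for d in result])
```

The first two detections land exactly on the old track centers after back-projection and inherit ids 0 and 1; the third is farther than `max_dist` from every track and opens a new one.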

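Compared with CenterTrack: CenterTrack takes the previous frame's image and heatmap as input (4 extra channels + extra forward-pass cost). CenterPoint's velocity is a plain 2-channel regression head that does not take the previous frame's detections or heatmap as input, so it adds essentially no cost.

Three numbers to remember:

65.5 NDS / 58.0 mAP: nuScenes detection SOTA
63.8 AMOTA: nuScenes tracking SOTA
71.8 L2 mAPH: Waymo vehicle detection SOTA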

Method Design

Overall architecture

                   LiDAR Point Cloud
                           ↓
             ┌──────────────────────────┐
             │  3D Backbone             │
             │  (VoxelNet/PointPillars) │
             └──────────────────────────┘
                           ↓
                    BEV Feature Map
                    [H/8, W/8, C]
                           ↓
        ┌──────────┬─────────┬────────┬─────────┬──────────┐
        ↓          ↓         ↓        ↓         ↓          ↓
     Heatmap   Sub-voxel  Height    Size    Rotation   Velocity
     [H,W,K]   offset[2]    [1]      [3]    [sin,cos]     [2]
        ↓                                                  ↓
   detect peaks ←── gather attributes at peaks ──→  v·Δt → tracking
        ↓
  Stage-1 Proposals (center, z, size, yaw)
        ↓
  ┌──────────────────────────────────┐
  │   Stage 2: Point Feature Refine  │
  │   9 face-center points → MLP     │
  └──────────────────────────────────┘
        ↓
  Refined 3D Boxes + Scores + Track IDs

First-stage detection heads:

Head	Output size	Purpose
Center Heatmap	$H \times W \times K$	Per-class center probability
Sub-voxel Offset	$H \times W \times 2$	Corrects the BEV center quantization error
Height-above-ground	$H \times W \times 1$	Object center height $z$
3D Size	$H \times W \times 3$	$l, w, h$
Rotation	$H \times W \times 2$	$\sin\theta, \cos\theta$ (no angle discontinuity)
Velocity	$H \times W \times 2$	BEV velocity $(v_x, v_y)$
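
How the integer peak location and the sub-voxel offset combine into a metric center can be sketched as below. The function name and argument layout are hypothetical; the stride, voxel size and range in the usage lines are the nuScenes values listed in this document's training configuration.

```python
def decode_center(ix, iy, offset, height, voxel_size, pc_range_min, stride):
    """Map a heatmap peak (integer cell + predicted offset) to metric LiDAR coordinates."""
    off_x, off_y = offset
    # One heatmap cell spans stride * voxel_size metres along each BEV axis
    x = (ix + off_x) * stride * voxel_size[0] + pc_range_min[0]
    y = (iy + off_y) * stride * voxel_size[1] + pc_range_min[1]
    return (x, y, height)


# Peak at cell (90, 90) with a predicted offset of (0.5, 0.25) cells
x, y, z = decode_center(
    ix=90, iy=90, offset=(0.5, 0.25), height=-1.0,
    voxel_size=(0.075, 0.075), pc_range_min=(-54.0, -54.0), stride=8,
)
print(round(x, 3), round(y, 3), z)
```

Without the offset head, every center would snap to a multiple of the cell size, which is the quantization error the second column of the table refers to.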

Loss function:

$$L = L_{heatmap}^{focal} + \lambda_{reg} \sum_{attr \in \{off,\, z,\, size,\, rot,\, vel\}} L_1^{attr}$$
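
The heatmap term is not spelled out above. Assuming it follows the penalty-reduced focal loss of CenterNet, and writing $\hat{Y}$ for the target heatmap defined earlier and $P$ for the sigmoid output of the heatmap head (a symbol introduced here only for illustration), it would read:

$$L_{heatmap}^{focal} = -\frac{1}{N} \sum_{x,y,c} \begin{cases} (1 - P_{x,y,c})^{\alpha} \log P_{x,y,c} & \text{if } \hat{Y}_{x,y,c} = 1 \\ (1 - \hat{Y}_{x,y,c})^{\beta} \, P_{x,y,c}^{\alpha} \log(1 - P_{x,y,c}) & \text{otherwise} \end{cases}$$

Here $N$ is the number of objects in the sample, and $\alpha = 2$, $\beta = 4$ are the values used in the CenterNet paper. The $(1 - \hat{Y})^{\beta}$ factor down-weights the penalty for cells close to a true center.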

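Key code (adapted from center_head.py)

SepHead: an independent regression branch per attribute:

import torch.nn as nn


class SepHead(nn.Module):
    """One small conv branch per output (e.g. hm, reg, height, dim, rot, vel)."""

    def __init__(self, in_channels, heads, head_conv=64, final_kernel=1):
        super().__init__()
        self.heads = heads  # {name: (out_channels, num_conv)}
        for head, (classes, num_conv) in heads.items():
            layers, c_in = [], in_channels
            # (num_conv - 1) hidden Conv-BN-ReLU blocks
            for _ in range(num_conv - 1):
                layers += [
                    nn.Conv2d(c_in, head_conv, kernel_size=final_kernel,
                              padding=final_kernel // 2, bias=True),
                    nn.BatchNorm2d(head_conv),
                    nn.ReLU(),
                ]
                c_in = head_conv  # later layers consume head_conv channels
            # Final conv maps to this head's output channels
            layers.append(nn.Conv2d(c_in, classes, kernel_size=final_kernel,
                                    padding=final_kernel // 2, bias=True))
            setattr(self, head, nn.Sequential(*layers))

    def forward(self, x):
        # Run every branch on the shared BEV feature map
        return {head: getattr(self, head)(x) for head in self.heads}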

Core decoding and post-processing logic:

def predict(self, example):
    """Stage-1: heatmap → peaks → gather attributes"""
    preds_dict = self.forward(example)
    # Decode: take the top-K heatmap peaks, then gather the regression attributes there
    batch_hm = torch.sigmoid(preds_dict['hm'])
    batch_reg = preds_dict['reg']      # sub-voxel offset
    batch_height = preds_dict['height']
    batch_rot = preds_dict['rot']      # [sin, cos]
    batch_dim = preds_dict['dim']      # [l, w, h]
    batch_vel = preds_dict['vel']      # [vx, vy]

    # topK from heatmap, then gather
    scores, inds, clses, ys, xs = _topk(batch_hm, K=500)
    # ...
    return boxes, scores, labels

def post_processing(self, batch_box_preds, batch_scores, batch_labels):
    """Post-processing: score-threshold filtering of the decoded boxes"""
    ret_list = []
    for box_preds, scores, labels in zip(
            batch_box_preds, batch_scores, batch_labels):
        # Score threshold filtering
        mask = scores > self.post_cfg.score_threshold
        box_preds = box_preds[mask]
        scores = scores[mask]
        labels = labels[mask]
        # No NMS needed for center-based detection!
        # (BEV peaks already non-overlapping)
        ret_list.append({
            'box3d_lidar': box_preds,
            'scores': scores,
            'label_preds': labels,
        })
    return ret_list
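
The peak selection that stands in for NMS can be sketched in pure Python as follows. It keeps a cell only if it is the maximum of its 3×3 neighborhood, which is what comparing a heatmap with its 3×3 max-pooled copy achieves, and then keeps the `k` highest scores. `local_peaks` is a hypothetical helper written for this note; the released code may organize this step differently.

```python
def local_peaks(heatmap, k, threshold=0.0):
    """Return up to k (score, x, y) peaks that are maxima of their 3x3 neighborhood."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            s = heatmap[y][x]
            if s <= threshold:
                continue
            # Same effect as keeping cells where heatmap == max_pool3x3(heatmap)
            neighborhood = [
                heatmap[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            if s >= max(neighborhood):
                peaks.append((s, x, y))
    peaks.sort(reverse=True)  # highest score first
    return peaks[:k]


# Toy single-class heatmap with two objects
hm = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.3, 0.0, 0.1],
    [0.1, 0.3, 0.2, 0.4, 0.7],
    [0.0, 0.0, 0.1, 0.3, 0.2],
]
print(local_peaks(hm, k=10))
```

Only the two true maxima survive; the 0.4 cell next to the 0.7 peak is suppressed without computing any box overlap.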

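Experiments and Analysis

Main results (nuScenes test set)

Method	mAP	NDS	AMOTA	Latency
PointPillars (anchor)	40.1	55.0	-	31ms
CBGS (VoxelNet+anchor)	52.8	63.3	-	78ms
CenterPoint-Pillar	50.3	60.2	-	31ms
CenterPoint-Voxel	58.0	65.5	63.8	75ms

CenterPoint gets tracking at essentially zero cost (63.8 AMOTA, far above the 28.3 of the camera-based CenterTrack), while its detection accuracy leads the best anchor-based method in the table (CBGS) by 5.2 mAP.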

Main results (Waymo validation set)

Method	Vehicle L2 mAPH	Ped L2 mAPH	Cyc L2 mAPH
PointPillars (anchor)	56.6	53.3	56.0
CenterPoint-Pillar	66.0	62.6	63.3
CenterPoint-Voxel	71.8	67.5	68.8

The margin is larger on Waymo: +15.2 vehicle L2 mAPH over PointPillars.

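Ablations (nuScenes validation set)

Configuration	mAP	NDS	Insight tested
CenterPoint Stage-1 only	54.2	63.8	-
+ Stage-2 refine (9-point)	56.4	65.8	Insight 2
Anchor-based (same backbone)	53.3	63.5	Insight 1
Center-based (same backbone)	54.2	63.8	Insight 1, +0.9 mAP

Key findings:

Anchor-free vs. anchor-based: with the same VoxelNet backbone, center-based beats anchor-based by +0.9 mAP / +0.3 NDS and needs no NMS (faster inference)
The second stage contributes substantially: +2.2 mAP / +2.0 NDS, with the largest gains on big objects (truck +4.0, bus +3.2)
9-point sampling vs. other strategies: the 9 face-center points beat 4 corner points (by +0.8 mAP) and a single center point (by +1.4 mAP), giving the best balance of information and compute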

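Tracking ablation

Tracking strategy	AMOTA	Takeaway
IoU matching only	57.3	-
Position matching only	61.9	Position > IoU (more robust to rotation)
Position + velocity	63.8	Insight 3

Velocity prediction lifts AMOTA from 61.9 to 63.8 (+1.9); most of the gain comes from better association of fast-moving objects.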

Engineering Practice

Training configuration

Backbone: VoxelNet (sparse convolution, output stride=8)
Voxel Size: [0.075m, 0.075m, 0.2m] (nuScenes) / [0.1m, 0.1m, 0.15m] (Waymo)
BEV Range: [-54m, 54m] × [-54m, 54m] (nuScenes)
Batch: 4 GPUs × 4 samples
Optimizer: AdamW (lr=1e-3, weight_decay=0.01, one-cycle scheduler)
Epochs: 20 (nuScenes) / 36 (Waymo)
CBGS: class-balanced grouping and sampling (balances the number of samples per class)
GT Augmentation: pastes GT boxes and their interior points from a database to add training samples for rare classes
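
As a quick sanity check on these settings, the following arithmetic sketch derives the BEV grid size and the heatmap resolution along one axis from the nuScenes range, voxel size and stride listed above. `bev_grid` is a hypothetical helper used only for this calculation.

```python
def bev_grid(range_min, range_max, voxel_xy, stride):
    """Voxel-grid size and head-resolution size along one BEV axis."""
    cells = round((range_max - range_min) / voxel_xy)
    return cells, cells // stride


cells, feat = bev_grid(-54.0, 54.0, 0.075, 8)
print(cells, feat)      # 1440 180
cell_metres = 0.075 * 8
print(cell_metres)      # metres covered by one heatmap cell
```

With these values each heatmap cell covers roughly 0.6 m per side, which is why the sub-voxel offset head and the second-stage refinement matter for precise localization.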

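Reproduction notes

No NMS: max-pooling (kernel=3) on the center heatmap already acts as NMS, so post-processing needs no additional NMS. This is a key source of CenterPoint's speed.
$\sin/\cos$ orientation encoding: regressing $(\sin\theta, \cos\theta)$ avoids the angle discontinuity. The L1 loss is applied to the two regressed values directly; $\text{atan2}$ is only needed at decode time to recover the angle, which lands in $\theta \in (-\pi, \pi]$.
CBGS is critical on nuScenes: class frequencies are extremely imbalanced (car is 60%+, bicycle under 1%); balanced sampling with CBGS raises mAP by ≈ 6 points.
Two-stage training: train Stage-1 to completion, freeze the backbone + Stage-1 heads, then train the Stage-2 MLP alone for 5 epochs. Training both stages end to end actually gives lower accuracy.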
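
The orientation note can be illustrated with a small worked example. It assumes the angle is recovered with `atan2` at decode time; the helper names `encode_yaw` and `decode_yaw` are made up for this sketch.

```python
import math


def encode_yaw(theta):
    """Regression target: (sin, cos) of the yaw angle."""
    return (math.sin(theta), math.cos(theta))


def decode_yaw(sin_pred, cos_pred):
    """Recover yaw in (-pi, pi]; the inputs need not be normalised."""
    return math.atan2(sin_pred, cos_pred)


# Two nearly identical headings on either side of the +-pi wrap-around
a, b = math.pi - 0.01, -math.pi + 0.01
raw_gap = abs(a - b)                     # close to 2*pi in raw angle space
sa, ca = encode_yaw(a)
sb, cb = encode_yaw(b)
enc_gap = abs(sa - sb) + abs(ca - cb)    # small L1 gap between the encodings
decoded = decode_yaw(2.0 * sa, 2.0 * ca) # scaling the pair leaves the angle unchanged
print(raw_gap, enc_gap, decoded)
```

A loss on the raw angle would see these two headings as almost 2π apart, while an L1 loss on the encoded pair sees them as nearly identical. Because `atan2` ignores the magnitude of its arguments, the network output does not have to lie exactly on the unit circle.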

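Research Takeaways

Transferable ideas

BEV center-based paradigm: CenterPoint shows that anchor-free detection in BEV can outperform anchor-based detection on the same backbone, which directly influenced the detection-head design of BEVFusion
Velocity-based tracking: regressing velocity as a detection attribute removes the tracker's dependence on an external motion model; nearly all later 3D detectors also output velocity
Second-stage point-feature refinement: sampling features at points on the predicted box surface is lighter than RoI pooling, and was picked up by follow-up work such as TransFusion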

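Limitations

Relies on the BEV non-overlap assumption: the NMS-free design assumes objects do not overlap in BEV. In extremely dense scenes (e.g. stacked cargo) or with vertical overlap (e.g. vehicles on the upper and lower decks of an overpass), heatmap peaks may merge
LiDAR-specific: the head assumes BEV features as input (obtained from 3D voxelization), so it does not apply directly to camera-only setups. Image features must first be lifted to BEV (e.g. LSS, BEVFormer) before a CenterPoint-style head can be attached
Greedy matching is not optimal: this is the same limitation as in CenterTrack; in dense scenes greedy matching can produce suboptimal associations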

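Technical impact

Established the BEV center-based detection paradigm: nearly all later LiDAR / multi-modal detectors (TransFusion, BEVFusion, UniTR, etc.) adopt a CenterPoint-style center-heatmap head
Zero-cost tracking unifies detection and tracking: with velocity regression, MOT no longer needs a standalone tracking module, which pushed the trend toward joint detection and tracking
Inherits the center-point philosophy of CenterNet → CenterTrack: from 2D detection → 2D tracking → 3D detection + tracking, the center-point representation comes full circle for LiDAR 3D tasks with CenterPoint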