OVTrack: Open-Vocabulary Multiple Object Tracking

TL;DR

First method and benchmark for open-vocabulary multi-object tracking.

Abstract

The ability to recognize, localize and track dynamic objects in a scene is fundamental to many real-world applications, such as self-driving and robotic systems. Yet traditional multiple object tracking (MOT) benchmarks cover only a few object categories, which hardly represent the multitude of objects encountered in the real world. This leaves contemporary MOT methods limited to a small set of pre-defined object categories. In this paper, we address this limitation by tackling a novel task, open-vocabulary MOT, which aims to evaluate tracking beyond pre-defined training categories. We further develop OVTrack, an open-vocabulary tracker that is capable of tracking arbitrary object classes. Its design is based on two key ingredients. First, it leverages vision-language models for both classification and association via knowledge distillation. Second, it uses a data hallucination strategy based on denoising diffusion probabilistic models to learn robust appearance features. The result is an extremely data-efficient open-vocabulary tracker that sets a new state of the art on the large-scale, large-vocabulary TAO benchmark, while being trained solely on static images.
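To make the idea of classifying with a vision-language model more concrete, here is a minimal sketch of how a region embedding could be matched against text embeddings of class names. It is an illustration under stated assumptions, not the OVTrack implementation: the function names, the temperature value, and the toy 3-dimensional vectors are all hypothetical, and in practice the embeddings would come from a vision-language model rather than being written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_region(region_embedding, text_embeddings, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities.

    `temperature` is an assumed value chosen for illustration only.
    """
    names = list(text_embeddings)
    logits = [cosine(region_embedding, text_embeddings[n]) / temperature
              for n in names]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}


# Hypothetical toy embeddings standing in for text-encoder outputs.
text_embeddings = {
    "car": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "unicycle": [0.0, 0.0, 1.0],
}

# Hypothetical embedding of one detected image region.
region_embedding = [0.1, 0.2, 0.9]

probs = classify_region(region_embedding, text_embeddings)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Because the class set is just a dictionary of text embeddings, a new category can be queried at test time by adding one more entry, without retraining the classifier. This is the property that makes the vocabulary open.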

Data Generation Results

[Figure: paired image comparisons, generated vs. original]

TAO Open-vocabulary MOT [test set]

| Method | Base TETA | Novel TETA |
|---|---|---|
| QDTrack | 25.8 | 20.2 |
| TETer | 29.2 | 21.7 |
| DeepSORT (ViLD) | 24.5 | 17.2 |
| Tracktor++ (ViLD) | 26.0 | 18.0 |
| OVTrack (Ours) | 32.6 | 24.1 |

TAO Closed-set MOT [test set]

TETA benchmark

| Method | Backbone | TETA | LocA | AssocA | ClsA |
|---|---|---|---|---|---|
| QDTrack (CVPR 2021) | ResNet-101 | 30.0 | 50.5 | 27.4 | 12.1 |
| TETer (ECCV 2022) | ResNet-101 | 33.3 | 51.6 | 35.0 | 13.2 |
| OVTrack (Ours) | ResNet-50 | 34.7 | 49.3 | 36.7 | 18.1 |
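The TETA scores above can be sanity-checked against their components. The sketch below assumes that TETA is the arithmetic mean of the localization (LocA), association (AssocA) and classification (ClsA) scores, which is how the metric is commonly defined. The helper name `teta_from_parts` is ours, and the numbers are copied from the table.

```python
def teta_from_parts(loc_a, assoc_a, cls_a):
    """Assumed definition: TETA is the mean of its three sub-scores."""
    return (loc_a + assoc_a + cls_a) / 3.0


# (reported TETA, LocA, AssocA, ClsA), copied from the table above.
rows = {
    "QDTrack": (30.0, 50.5, 27.4, 12.1),
    "TETer": (33.3, 51.6, 35.0, 13.2),
    "OVTrack": (34.7, 49.3, 36.7, 18.1),
}

recomputed = {}
for name, (teta, loc_a, assoc_a, cls_a) in rows.items():
    recomputed[name] = teta_from_parts(loc_a, assoc_a, cls_a)
    print(f"{name}: reported {teta:.1f}, recomputed {recomputed[name]:.2f}")
```

Under this assumption, all three reported values agree with the recomputed means up to rounding. The breakdown also shows where OVTrack's gain comes from: its LocA is slightly lower than both baselines, while its AssocA and ClsA are higher.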

TAO benchmark

| Method | Backbone | Track AP50 | Track AP75 | Track AP |
|---|---|---|---|---|
| SORT-TAO (ECCV 2020) | ResNet-101 | 13.2 | - | - |
| QDTrack (CVPR 2021) | ResNet-101 | 15.9 | 5.0 | 10.6 |
| GTR (CVPR 2022) | ResNet-101 | 20.4 | - | - |
| TAC (ECCV 2022) | ResNet-101 | 17.7 | 5.8 | 7.3 |
| BIV (ECCV 2022) | ResNet-101 | 19.6 | 7.3 | 13.6 |
| OVTrack (Ours) | ResNet-50 | 21.2 | 10.6 | 15.9 |

BibTeX

@InProceedings{ovtrack,
    author    = {Li, Siyuan and Fischer, Tobias and Ke, Lei and Ding, Henghui and Danelljan, Martin and Yu, Fisher},
    title     = {OVTrack: Open-Vocabulary Multiple Object Tracking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {5567-5577}
}
Tobias Fischer
Ph.D. Student in Computer Vision

Tobias Fischer is a Ph.D. student at the Computer Vision and Geometry Group of ETH Zürich.