FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training
Abstract
This paper is on Few-Shot Object Detection (FSOD),
where given a few templates (examples) depicting a novel
class (not seen during training), the goal is to detect all
of its occurrences within a set of images. From a practical perspective, an FSOD system must fulfil the following
desiderata: (a) it must be usable as is, without requiring any
fine-tuning at test time, (b) it must be able to process an arbitrary number of novel objects concurrently while supporting
an arbitrary number of examples from each class and (c) it
must achieve accuracy comparable to a closed system. Towards satisfying (a)-(c), in this work, we make the following
contributions: We introduce, for the first time, a simple, yet
powerful, few-shot detection transformer (FS-DETR) based
on visual prompting that can address both desiderata (a) and
(b). Our system builds upon the DETR framework, extending it based on two key ideas: (1) feed the provided visual
templates of the novel classes as visual prompts during test
time, and (2) “stamp” these prompts with pseudo-class embeddings (akin to soft prompting), which are then predicted
at the output of the decoder. Importantly, we show that
our system is not only more flexible than existing methods,
but also takes a step towards satisfying desideratum (c).
Specifically, it is significantly more accurate than all methods
that do not require fine-tuning, and it even matches or
outperforms the current state-of-the-art fine-tuning-based
methods on the most well-established benchmarks (PASCAL
VOC & MSCOCO).
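The two key ideas above, feeding visual templates as prompts and "stamping" them with pseudo-class embeddings, can be sketched as follows. All names, shapes, and the similarity-based readout are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

# Toy sketch of the abstract's two key ideas (all names, shapes, and the
# similarity-based readout are illustrative assumptions, not FS-DETR's
# actual implementation).
d = 8                # embedding dimension
num_classes = 3      # novel classes provided at test time
shots = 2            # templates (examples) per class

rng = np.random.default_rng(0)
# (1) Visual template features, e.g. pooled backbone features per template.
templates = rng.normal(size=(num_classes, shots, d))

# (2) Pseudo-class embeddings, one per in-prompt class slot (learned in the
# real model; orthonormal here to keep the example deterministic).
pseudo_class_emb = np.eye(num_classes, d)

# "Stamp" each template with its slot's pseudo-class embedding, akin to
# soft prompting, and flatten into a prompt sequence for the decoder.
prompts = (templates + pseudo_class_emb[:, None, :]).reshape(-1, d)

# At the decoder output, each object query predicts which pseudo-class slot
# (if any) it matches; mimicked here by similarity to the slot embeddings.
query_out = pseudo_class_emb[1] + 0.05       # toy decoder output near slot 1
logits = query_out @ pseudo_class_emb.T
pred = int(np.argmax(logits))
print(prompts.shape, pred)                   # (6, 8) 1
```

Because the novel classes enter only through the prompt sequence, the number of classes and the number of shots per class can vary freely at test time without any re-training, which is what desiderata (a) and (b) require.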