Prompting Visual-Language Models for Dynamic Facial Expression Recognition
Abstract
This paper presents a novel visual-language model called DFER-CLIP, which is
based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a
textual part. For the visual part, based on the CLIP image encoder, a temporal model
consisting of several Transformer encoders is introduced for extracting temporal facial
expression features, and the final feature embedding is obtained as a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour that
is related to the classes (facial expressions) that we are interested in recognising – those
descriptions are generated using large language models, like ChatGPT. This is in contrast
to works that use only the class names, and it captures the relationships between the
expressions more accurately. Alongside the textual descriptions, we introduce a learnable token that
helps the model learn relevant context information for each expression during training.
Extensive experiments demonstrate the effectiveness of the proposed method and show
that our DFER-CLIP achieves state-of-the-art results compared with current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks. Code is
publicly available at https://github.com/zengqunzhao/DFER-CLIP.
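
To make the described pipeline concrete, the following is a minimal PyTorch sketch of a DFER-CLIP-style model reconstructed only from the abstract above: per-frame CLIP image features are aggregated by Transformer encoders through a learnable "class" token, LLM-generated expression descriptions are encoded by the CLIP text encoder together with learnable context parameters, and recognition is performed via cosine similarity. All module and parameter names (e.g. `TemporalTransformer`, `DFERCLIPSketch`) are illustrative assumptions rather than the authors' code; see the linked repository for the official implementation.

```python
# Minimal sketch of a DFER-CLIP-style architecture, based only on the abstract.
# Names and design details here are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class TemporalTransformer(nn.Module):
    """Transformer encoders over per-frame CLIP features; the output of a
    learnable "class" token serves as the video-level embedding."""

    def __init__(self, dim: int, depth: int = 2, heads: int = 8):
        super().__init__()
        self.class_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim)
        cls = self.class_token.expand(frame_feats.size(0), -1, -1)
        tokens = torch.cat([cls, frame_feats], dim=1)
        return self.encoder(tokens)[:, 0]  # video embedding taken from the class token


class DFERCLIPSketch(nn.Module):
    def __init__(self, clip_model, tokenized_descriptions, dim: int = 512):
        super().__init__()
        self.clip = clip_model                      # pretrained CLIP backbone (dim must match it)
        self.temporal = TemporalTransformer(dim)
        # Learnable per-class context; here simplified to an additive term on the
        # text embeddings rather than learnable prompt tokens.
        self.context = nn.Parameter(torch.zeros(tokenized_descriptions.size(0), dim))
        self.descriptions = tokenized_descriptions  # LLM-generated description per expression

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, num_frames, C, H, W)
        b, t = video.shape[:2]
        frame_feats = self.clip.encode_image(video.flatten(0, 1)).float().view(b, t, -1)
        video_emb = self.temporal(frame_feats)

        text_emb = self.clip.encode_text(self.descriptions).float() + self.context
        video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        return video_emb @ text_emb.t()             # cosine-similarity logits per expression
```

In this sketch the highest-scoring row of the returned logits gives the predicted expression; during training those logits would feed a standard cross-entropy loss, which is a common choice for CLIP-based classifiers but is stated here as an assumption.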