Self-Supervised Facial Representation Learning with Facial Region Awareness
Abstract
Self-supervised pre-training has proven effective in learning transferable representations that benefit various visual tasks. This paper asks: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal are limited to treating each face image as a whole, i.e., learning consistent facial representations at the image level, which overlooks the consistency of local facial representations (i.e., facial regions like the eyes, nose, etc.). In this work, we propose Facial Region Awareness (FRA), a novel self-supervised facial representation learning framework that learns consistent global and local facial representations. Specifically, we explicitly enforce the consistency of facial regions by matching local facial representations across views, where the local representations are extracted with learned heatmaps highlighting the facial regions. Inspired by mask prediction in supervised semantic segmentation, we obtain the heatmaps via cosine similarity between the per-pixel projection of the feature maps and "facial mask embeddings" computed from learnable positional embeddings, which use an attention mechanism to look up facial regions across the whole face image. To learn such heatmaps, we formulate the learning of the facial mask embeddings as a deep clustering problem, assigning pixel features from the feature maps to the embeddings.
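To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the two core operations: computing region heatmaps as cosine similarities between per-pixel feature projections and facial mask embeddings, and pooling pixel features with those heatmaps into local representations that can be matched across views. All function names, the softmax over regions, and the stop-gradient on the second view are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def region_heatmaps(feat_map, mask_embed):
    # feat_map:   (B, C, H, W) per-pixel projection of the backbone feature map
    # mask_embed: (K, C) facial mask embeddings for K facial regions
    # returns:    (B, K, H, W) heatmaps highlighting the facial regions
    pixels = F.normalize(feat_map, dim=1)    # unit-norm pixel features
    embeds = F.normalize(mask_embed, dim=1)  # unit-norm region embeddings
    sim = torch.einsum("bchw,kc->bkhw", pixels, embeds)  # cosine similarity
    return sim.softmax(dim=1)  # soft assignment of each pixel to a region (assumed)

def local_representations(feat_map, heatmaps):
    # Pool pixel features with the heatmaps: one local vector per facial region.
    weights = heatmaps.flatten(2)                          # (B, K, HW)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize over pixels
    pixels = feat_map.flatten(2)                           # (B, C, HW)
    return torch.einsum("bkn,bcn->bkc", weights, pixels)   # (B, K, C)

def local_consistency_loss(local_a, local_b):
    # Cross-view consistency: pull matched region features of two augmented
    # views together; detaching one side mimics a stop-gradient target.
    return 1 - F.cosine_similarity(local_a, local_b.detach(), dim=-1).mean()

Under the same assumptions, the deep clustering step mentioned above would supervise these heatmaps by assigning pixel features to the mask embeddings (e.g., with a balanced-assignment objective); it is omitted here for brevity.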
Transfer learning results on facial classification and regression tasks show that FRA outperforms previous pre-trained models. More importantly, using ResNet as the unified backbone for all tasks, FRA achieves performance comparable to or better than state-of-the-art (SOTA) methods on facial analysis tasks.