dc.contributor.author	Wu, Q
dc.contributor.author	Yang, T
dc.contributor.author	Liu, Z
dc.contributor.author	Wu, B
dc.contributor.author	Shan, Y
dc.contributor.author	Chan, AB
dc.date.accessioned	2024-07-22T09:42:03Z
dc.date.available	2024-07-22T09:42:03Z
dc.date.issued	2023-08-22
dc.identifier.citation	Q. Wu, T. Yang, Z. Liu, B. Wu, Y. Shan and A. B. Chan, "DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 2023, pp. 14561-14571, doi: 10.1109/CVPR52729.2023.01399. Keywords: {Visualization; Video tracking; Object segmentation; Pattern recognition; Object tracking; Task analysis; Image reconstruction; Self-supervised or unsupervised representation learning}	en_US
dc.identifier.issn	1063-6919
dc.identifier.uri	https://qmro.qmul.ac.uk/xmlui/handle/123456789/98287
dc.description.abstract	In this paper, we study masked autoencoder (MAE) pre-training on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos. We show that our DropMAE is a strong and efficient temporal matching learner, which achieves better fine-tuning results on matching-based tasks than the ImageNet-based MAE with 2× faster pre-training speed. Moreover, we also find that motion diversity in pre-training videos is more important than scene diversity for improving the performance on VOT and VOS. Our pre-trained DropMAE model can be directly loaded in existing ViT-based trackers for fine-tuning without further modifications. Notably, DropMAE sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets. Our code and pre-trained models are available at https://github.com/jimmy-dq/DropMAE.git	en_US
dc.format.extent	14561 - 14571
dc.publisher	IEEE	en_US
dc.rights	© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
dc.title	DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks	en_US
dc.type	Conference Proceeding	en_US
dc.identifier.doi	10.1109/CVPR52729.2023.01399
pubs.notes	Not known	en_US
pubs.publication-status	Published	en_US
pubs.volume	2023-June	en_US
rioxxterms.funder	Default funder	en_US
rioxxterms.identifier.project	Default project	en_US
rioxxterms.funder.project	b215eee3-195d-4c4f-a85d-169a4331c138	en_US
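
The abstract above describes spatial-attention dropout at a high level: during masked reconstruction of a video frame pair, part of the within-frame (spatial) attention is suppressed so the decoder must lean on the other frame, encouraging temporal correspondence learning. The following is a minimal, hypothetical PyTorch sketch of that idea only; the function name, tensor shapes, uniform dropout probability p, and renormalisation step are illustrative assumptions and do not reproduce the authors' adaptive scheme (see the linked repository for the actual implementation).

import torch

def spatial_attention_dropout(attn, num_patches_per_frame, p=0.1):
    # attn: (B, heads, 2*N, 2*N) softmax attention over two concatenated frames.
    B, H, L, _ = attn.shape
    N = num_patches_per_frame
    # Mark query/key pairs that lie within the same frame (spatial attention).
    same_frame = torch.zeros(L, L, dtype=torch.bool, device=attn.device)
    same_frame[:N, :N] = True
    same_frame[N:, N:] = True
    # Randomly suppress a fraction p of those within-frame entries,
    # leaving cross-frame (temporal) attention untouched.
    drop = (torch.rand(B, H, L, L, device=attn.device) < p) & same_frame
    attn = attn.masked_fill(drop, 0.0)
    # Renormalise so each query's attention weights still sum to one.
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

# Toy usage: two frames of 196 patches each, 8 heads, batch size 2.
attn = torch.softmax(torch.randn(2, 8, 392, 392), dim=-1)
attn = spatial_attention_dropout(attn, num_patches_per_frame=196, p=0.1)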

