
dc.contributor.author: Luo, D
dc.contributor.author: Huang, J
dc.contributor.author: Gong, S
dc.contributor.author: Jin, H
dc.contributor.author: Liu, Y
dc.date.accessioned: 2024-06-25T08:08:31Z
dc.date.available: 2024-06-25T08:08:31Z
dc.date.issued: 2024-01-01
dc.identifier.uri: https://qmro.qmul.ac.uk/xmlui/handle/123456789/97660
dc.description.abstract: Accurate video moment retrieval (VMR) requires universal visual-textual correlations that can handle unknown vocabulary and unseen scenes. However, the learned correlations are likely either biased when derived from a limited amount of moment-text data, which is hard to scale up because of the prohibitive annotation cost (fully-supervised), or unreliable when only video-text pairwise relationships are available without fine-grained temporal annotations (weakly-supervised). Recently, vision-language models (VLMs) have demonstrated a new transfer learning paradigm that benefits different vision tasks through universal visual-textual correlations derived from large-scale vision-language pairwise web data; this has also shown benefits to VMR through fine-tuning in the target domains. In this work, we propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment, without the need to access VMR data. To this end, we devise a conditional feature refinement module that generates boundary-aware visual features conditioned on text queries, enabling better understanding of moment boundaries. Additionally, we design a bottom-up proposal generation strategy that mitigates the impact of domain discrepancies and breaks down complex-query retrieval into individual action retrievals, thereby maximising the benefits of the VLM. Extensive experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm, especially in the novel-word and novel-location out-of-distribution setups. [en_US]
dc.format.extent: 5452 - 5461
dc.publisher: IEEE [en_US]
dc.title: Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [en_US]
dc.type: Conference Proceeding [en_US]
dc.rights.holder: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
dc.identifier.doi: 10.1109/WACV57701.2024.00538
pubs.notes: Not known [en_US]
pubs.publication-status: Published [en_US]
rioxxterms.funder: Default funder [en_US]
rioxxterms.identifier.project: Default project [en_US]
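
To illustrate the bottom-up proposal generation described in the abstract above, here is a minimal Python sketch. It assumes per-frame embeddings from a frozen VLM (e.g., CLIP-style, 512-d) and a text-query embedding are already computed; contiguous frames whose query similarity exceeds a threshold are grouped into candidate moments. The function names, the thresholding rule, and the mean-similarity scoring are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def cosine_sim(frames: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Per-frame cosine similarity between frame embeddings and a query embedding."""
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return frames @ query

def bottom_up_proposals(sims: np.ndarray, thresh: float):
    """Group contiguous frames whose similarity exceeds `thresh` into candidate
    moments (start, end, score) — a simple bottom-up grouping over frame-level
    scores, not a learned proposal network."""
    proposals, start = [], None
    for t, s in enumerate(sims):
        if s >= thresh and start is None:
            start = t                       # open a new candidate moment
        elif s < thresh and start is not None:
            proposals.append((start, t - 1, float(sims[start:t].mean())))
            start = None                    # close the current moment
    if start is not None:                   # moment running to the last frame
        proposals.append((start, len(sims) - 1, float(sims[start:].mean())))
    return sorted(proposals, key=lambda p: p[2], reverse=True)

# Random stand-ins for frozen-VLM embeddings (hypothetical shapes/values).
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(120, 512))   # one embedding per video frame
query_feat = rng.normal(size=512)           # embedding of the text query
sims = cosine_sim(frame_feats, query_feat)
print(bottom_up_proposals(sims, thresh=float(np.quantile(sims, 0.9)))[:3])

For a complex query, the same routine could be run once per constituent action phrase and the resulting moments merged, mirroring the abstract's decomposition of complex-query retrieval into individual action retrievals.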

