Show simple item record

dc.contributor.authorApostolidis, E
dc.date.accessioned2023-09-06T08:00:04Z
dc.date.available2023-09-06T08:00:04Z
dc.date.issued2023
dc.identifier.urihttps://qmro.qmul.ac.uk/xmlui/handle/123456789/90479
dc.description.abstractIn this thesis, we address the task of video summarization using unsupervised deep-learning architectures. Video summarization aims to generate a short summary by selecting the most informative and important frames (key-frames) or fragments (key-fragments) of the full-length video, and presenting them in temporally-ordered fashion. Our objective is to overcome observed weaknesses of existing video summarization approaches that utilize RNNs for modeling the temporal dependence of frames, related to: i) the small influence of the estimated frame-level importance scores in the created video summary, ii) the insufficiency of RNNs to model long-range frames' dependence, and iii) the small amount of parallelizable operations during the training of RNNs. To address the first weakness, we propose a new unsupervised network architecture, called AC-SUM-GAN, which formulates the selection of important video fragments as a sequence generation task and learns this task by embedding an Actor-Critic model in a Generative Adversarial Network. The feedback of a trainable Discriminator is used as a reward by the Actor-Critic model in order to explore a space of actions and learn a value function (Critic) and a policy (Actor) for video fragment selection. To tackle the remaining weaknesses, we investigate the use of attention mechanisms for video summarization and propose a new supervised network architecture, called PGL-SUM, that combines global and local multi-head attention mechanisms which take into account the temporal position of the video frames, in order to discover different modelings of the frames' dependencies at different levels of granularity. Based on the acquired experience, we then propose a new unsupervised network architecture, called CA-SUM, which estimates the frames' importance using a novel concentrated attention mechanism that focuses on non-overlapping blocks in the main diagonal of the attention matrix and takes into account the attentive uniqueness and diversity of the associated frames of the video. All the proposed architectures have been extensively evaluated on the most commonly-used benchmark datasets, demonstrating their competitiveness against other approaches and documenting the contribution of our proposals on advancing the current state-of-the-art on video summarization. Finally, we make a first attempt on producing explanations for the video summarization results. Inspired by relevant works in the Natural Language Processing domain, we propose an attention-based method for explainable video summarization and we evaluate the performance of various explanation signals using our CA-SUM architecture and two benchmark datasets for video summarization. The experimental results indicate the advanced performance of explanation signals formed using the inherent attention weights, and demonstrate the ability of the proposed method to explain the video summarization results using clues about the focus of the attention mechanism.en_US
dc.language.isoenen_US
dc.publisherQueen Mary University of Londonen_US
dc.titleVideo Summarization Using Unsupervised Deep Learningen_US
dc.typeThesisen_US
pubs.notesNot knownen_US
rioxxterms.funderDefault funderen_US
rioxxterms.identifier.projectDefault projecten_US


Files in this item

Thumbnail

This item appears in the following Collection(s)

  • Theses [4235]
    Theses Awarded by Queen Mary University of London

Show simple item record