Show simple item record

dc.contributor.author: Manco, I
dc.contributor.author: Benetos, E
dc.contributor.author: Quinton, E
dc.contributor.author: Fazekas, G
dc.contributor.author: International Joint Conference on Neural Networks (IJCNN)
dc.date.accessioned: 2021-05-25T15:07:11Z
dc.date.available: 2021-04-10
dc.date.available: 2021-05-25T15:07:11Z
dc.date.issued: 2021-07-18
dc.identifier.uri: https://qmro.qmul.ac.uk/xmlui/handle/123456789/72068
dc.description.abstract: Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the task of generating a natural language description of music audio content in a human-like manner. To this end, we present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention. Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs through a multimodal encoder and leverages pre-training on audio data to obtain representations that effectively capture and summarise musical features in the input. Evaluation of the generated captions through automatic metrics shows that our method outperforms a baseline designed for non-music audio captioning. Through an ablation study, we find that this performance boost can be mainly attributed to pre-training of the audio encoder, while other design choices (modality fusion, decoding strategy and the use of attention) contribute only marginally. Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval. (en_US)
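As an illustration of the temporal attention the abstract refers to (a sketch only, not the authors' MusCaps implementation; the function name and dot-product scoring are assumptions), the caption decoder can attend over per-frame audio features by softmax-weighting them against its current hidden state:

```python
import numpy as np

def temporal_attention(frame_feats, decoder_state):
    """Weight audio frame features by their relevance to the decoder state.

    frame_feats: (T, D) array of per-frame audio encoder features.
    decoder_state: (D,) current hidden state of the caption decoder.
    Returns a (D,) context vector and the (T,) attention weights.
    """
    scores = frame_feats @ decoder_state             # (T,) dot-product scores
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time steps
    context = weights @ frame_feats                  # weighted sum of frames
    return context, weights

# Toy example: 4 audio frames with 3-dimensional features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 3))
state = rng.standard_normal(3)
ctx, w = temporal_attention(feats, state)
```

At each decoding step the context vector summarises the frames most relevant to the next word, which is how attention lets a caption refer to different temporal regions of the audio.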
dc.format.extent: ? - ? (8)
dc.publisher: IEEE (en_US)
dc.title: MusCaps: generating captions for music audio (en_US)
dc.type: Conference Proceeding (en_US)
pubs.author-url: https://ilariamanco.com/ (en_US)
pubs.notes: Not known (en_US)
pubs.publication-status: Accepted (en_US)
pubs.publisher-url: https://www.ijcnn.org/ (en_US)
dcterms.dateAccepted: 2021-04-10
qmul.funder: UKRI Centre for Doctoral Training in Artificial Intelligence and Music::Engineering and Physical Sciences Research Council (en_US)

