    MusCaps: generating captions for music audio 

    View/Open
    Accepted version (1.349 MB)
    Pagination
    ? - ? (8)
    Publisher
    IEEE
    Publisher URL
    https://www.ijcnn.org/
    Abstract
    Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the task of generating a natural language description of music audio content in a human-like manner. To this end, we present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention. Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs through a multimodal encoder and leverages pre-training on audio data to obtain representations that effectively capture and summarise musical features in the input. Evaluation of the generated captions through automatic metrics shows that our method outperforms a baseline designed for non-music audio captioning. Through an ablation study, we show that this performance boost can be mainly attributed to pre-training of the audio encoder, while other design choices (modality fusion, decoding strategy and the use of attention) contribute only marginally. Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval.
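    The architecture outlined above can be illustrated in a few lines. The following is a minimal sketch (not the authors' released code) of a convolutional audio encoder feeding an LSTM caption decoder with additive temporal attention; the log-mel input, layer sizes, vocabulary size and all names are hypothetical choices, written here in PyTorch.

    # Minimal sketch of a CNN audio encoder + attention LSTM decoder.
    # All dimensions, names and the log-mel input format are assumptions,
    # not the published MusCaps implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioEncoder(nn.Module):
        """1-D CNN over log-mel frames -> sequence of audio features."""
        def __init__(self, n_mels=128, feat_dim=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_mels, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            )

        def forward(self, mel):                      # mel: (batch, n_mels, time)
            return self.conv(mel).transpose(1, 2)    # (batch, time, feat_dim)

    class AttentionDecoder(nn.Module):
        """LSTM decoder with additive temporal attention over audio features."""
        def __init__(self, vocab_size, feat_dim=256, embed_dim=256, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
            self.att_feat = nn.Linear(feat_dim, hidden_dim)
            self.att_hid = nn.Linear(hidden_dim, hidden_dim)
            self.att_out = nn.Linear(hidden_dim, 1)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, feats, captions):          # feats: (B, T, F), captions: (B, L)
            B, L = captions.shape
            h = feats.new_zeros(B, self.lstm.hidden_size)
            c = feats.new_zeros(B, self.lstm.hidden_size)
            logits = []
            for t in range(L):
                # additive attention over the temporal axis of the audio features
                scores = self.att_out(torch.tanh(
                    self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))  # (B, T, 1)
                alpha = F.softmax(scores, dim=1)
                context = (alpha * feats).sum(dim=1)                       # (B, F)
                x = torch.cat([self.embed(captions[:, t]), context], dim=-1)
                h, c = self.lstm(x, (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)        # (B, L, vocab_size)

    # Example forward pass with dummy inputs
    encoder, decoder = AudioEncoder(), AttentionDecoder(vocab_size=1000)
    mel = torch.randn(2, 128, 400)                   # batch of log-mel spectrograms
    caps = torch.randint(0, 1000, (2, 12))           # tokenised caption prefixes
    logits = decoder(encoder(mel), caps)             # shape: (2, 12, 1000)

    As the abstract notes, most of the reported performance gain comes from pre-training the audio encoder on audio data, so in practice the encoder weights would typically be initialised from such a pre-trained model rather than trained from scratch.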
    Authors
    Manco, I; Benetos, E; Quinton, E; Fazekas, G; International Joint Conference on Neural Networks (IJCNN)
    URI
    https://qmro.qmul.ac.uk/xmlui/handle/123456789/72068
    Collections
    • Electronic Engineering and Computer Science [2688]