
dc.contributor.author: Liang, J
dc.contributor.author: Liu, X
dc.contributor.author: Liu, H
dc.contributor.author: Phan, H
dc.contributor.author: Benetos, E
dc.contributor.author: Plumbley, M
dc.contributor.author: Wang, W
dc.contributor.author: 24th Annual Conference of the International Speech Communication Association (INTERSPEECH)
dc.date.accessioned: 2023-06-05T10:03:55Z
dc.date.available: 2023-05-17
dc.date.available: 2023-06-05T10:03:55Z
dc.date.issued: 2023-08-20
dc.identifier.uri: https://qmro.qmul.ac.uk/xmlui/handle/123456789/88690
dc.description.abstract: Contrastive language-audio pretraining (CLAP) has become a new paradigm to learn audio concepts with audio-text pairs. CLAP models have shown unprecedented performance as zero-shot classifiers on downstream tasks. To further adapt CLAP with domain-specific knowledge, a popular method is to finetune its audio encoder with available labelled examples. However, this is challenging in low-shot scenarios, as the number of annotations is limited relative to the model size. In this work, we introduce a Training-efficient (Treff) adapter to rapidly learn with a small set of examples while maintaining the capacity for zero-shot classification. First, we propose a cross-attention linear model (CALM) to map a set of labelled examples and test audio to test labels. Second, we find that initialising CALM as a cosine measurement improves our Treff adapter even without training. The Treff adapter outperforms metric-based methods in few-shot settings and yields results competitive with fully-supervised methods. [en_US]
dc.format.extent: ? - ? (5)
dc.relation.isreplacedby: 123456789/88692
dc.relation.isreplacedby: https://qmro.qmul.ac.uk/xmlui/handle/123456789/88692
dc.title: Adapting Language-Audio Models as Few-Shot Audio Learners [en_US]
dc.type: Conference Proceeding [en_US]
pubs.notes: Not known [en_US]
pubs.publication-status: Accepted [en_US]
dcterms.dateAccepted: 2023-05-17
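The CALM mechanism and cosine initialisation described in the abstract above can be illustrated with a short sketch. What follows is a minimal, hypothetical PyTorch rendering, not the authors' implementation: it assumes precomputed CLAP-style audio and text embeddings, and the function names, the softmax scale, and the mixing weight alpha are illustrative assumptions.

import torch
import torch.nn.functional as F

def calm_logits(test_emb, support_embs, support_labels, scale=10.0):
    # Cross-attention linear model (CALM), as sketched in the abstract:
    # attend from the test audio embedding over the labelled support set,
    # then mix the support labels with the attention weights. With no
    # learned projections this reduces to a cosine measurement, the
    # initialisation the abstract reports works even without training.
    q = F.normalize(test_emb, dim=-1)         # (d,) test audio embedding
    k = F.normalize(support_embs, dim=-1)     # (N, d) labelled examples
    attn = F.softmax(scale * (k @ q), dim=0)  # (N,) cosine attention weights
    return attn @ support_labels              # (C,) few-shot class scores

def treff_logits(test_emb, text_embs, support_embs, support_labels, alpha=0.5):
    # Blend CLAP's zero-shot scores (audio-text cosine similarity) with
    # CALM's few-shot scores; alpha is a hypothetical mixing weight.
    zero_shot = F.normalize(test_emb, dim=-1) @ F.normalize(text_embs, dim=-1).T
    few_shot = calm_logits(test_emb, support_embs, support_labels)
    return alpha * few_shot + (1.0 - alpha) * zero_shot

# Toy usage with random embeddings: 5 labelled examples, 10 classes.
d, N, C = 512, 5, 10
support_labels = F.one_hot(torch.randint(C, (N,)), C).float()
scores = treff_logits(torch.randn(d), torch.randn(C, d),
                      torch.randn(N, d), support_labels)
print(scores.argmax().item())  # predicted class index

Retaining the zero-shot term is what lets such an adapter keep CLAP's zero-shot behaviour when few or no labelled examples are available.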

