Vocal imitation for query by vocalisation

Mehrabi, Adib

View/Open

MEHRABI_Adib_PhD_Final_230418.pdf (5.797Mb)

Publisher

Queen Mary University of London

Abstract

The human voice presents a rich and powerful medium for expressing sonic ideas such as musical sounds. This capability extends beyond the sounds used in speech, evidenced for example in the art form of beatboxing, and recent studies highlighting the utility of vocal imitation for communicating sonic concepts. Meanwhile, the advance of digital audio has resulted in huge libraries of sounds at the disposal of music producers and sound designers. This presents a compelling search problem: with larger search spaces, the task of navigating sound libraries has become increasingly difficult. The versatility and expressive nature of the voice provides a seemingly ideal medium for querying sound libraries, raising the question of how well humans are able to vocally imitate musical sounds, and how we might use the voice as a tool for search. In this thesis we address these questions by investigating the ability of musicians to vocalise synthesised and percussive sounds, and by evaluating the suitability of different audio features for predicting the perceptual similarity between vocal imitations and imitated sounds. In the first experiment, musicians were tasked with imitating synthesised sounds with one or two time-varying feature envelopes applied. The results show that participants were able to imitate pitch, loudness, and spectral centroid features accurately, and that imitation accuracy was generally preserved when the imitated stimuli combined two, not necessarily congruent, features. This demonstrates the viability of using the voice as a natural means of expressing time series of two features simultaneously. The second experiment consisted of two parts. In a vocal production task, musicians were asked to imitate drum sounds. Listeners were then asked to rate the similarity between the imitations and sounds from the same category (e.g. kick, snare, etc.).
The results show that drum sounds received the highest similarity ratings when rated against their imitations (as opposed to imitations of another sound), and overall more than half the imitated sounds were correctly identified with above chance accuracy from the imitations, although this varied considerably between drum categories. The findings from the vocal imitation experiments highlight the capacity of musicians to vocally imitate musical sounds, and some limitations of non-verbal vocal expression. Finally, we investigated the performance of different audio features as predictors of perceptual similarity between the imitations and imitated sounds from the second experiment. We show that features learned using convolutional auto-encoders outperform a number of popular heuristic features for this task, and that preservation of temporal information is more important than spectral resolution for differentiating between the vocal imitations and same-category drum sounds.

Authors

Mehrabi, Adib

URI

http://qmro.qmul.ac.uk/xmlui/handle/123456789/36693

Collections

Theses [4121]

Licence information

The copyright of this thesis rests with the author and no quotation from it or information derived from it may be published without the prior written consent of the author