Vocal imitation for query by vocalisation
Abstract
The human voice presents a rich and powerful medium for expressing sonic
ideas such as musical sounds. This capability extends beyond the sounds used
in speech, evidenced for example in the art form of beatboxing, and recent
studies highlighting the utility of vocal imitation for communicating sonic concepts.
Meanwhile, the advance of digital audio has resulted in huge libraries of
sounds at the disposal of music producers and sound designers. This presents
a compelling search problem: with larger search spaces, the task of navigating
sound libraries has become increasingly difficult. The versatility and expressive
nature of the voice provide a seemingly ideal medium for querying sound
libraries, raising the question of how well humans are able to vocally imitate
musical sounds, and how we might use the voice as a tool for search. In this
thesis we address these questions by investigating the ability of musicians to
vocalise synthesised and percussive sounds, and evaluate the suitability of different
audio features for predicting the perceptual similarity between vocal
imitations and imitated sounds.
In the first experiment, musicians were tasked with imitating synthesised
sounds with one or two time-varying feature envelopes applied. The results
show that participants were able to imitate pitch, loudness, and spectral centroid
features accurately, and that imitation accuracy was generally preserved
when the imitated stimuli combined two, not necessarily congruent, features.
This demonstrates the viability of using the voice as a natural means of
expressing time series of two features simultaneously.
The second experiment consisted of two parts. In a vocal production task,
musicians were asked to imitate drum sounds. Listeners were then asked to
rate the similarity between the imitations and sounds from the same category
(e.g. kick, snare etc.). The results show that drum sounds received the highest
similarity ratings when rated against their imitations (as opposed to imitations
of another sound), and overall more than half the imitated sounds were
correctly identified with above-chance accuracy from the imitations, although
this varied considerably between drum categories.
The findings from the vocal imitation experiments highlight the capacity
of musicians to vocally imitate musical sounds, and some limitations of non-
verbal vocal expression. Finally, we investigated the performance of different
audio features as predictors of perceptual similarity between the imitations and
imitated sounds from the second experiment. We show that features learned
using convolutional auto-encoders outperform a number of popular heuristic
features for this task, and that preservation of temporal information is more
important than spectral resolution for differentiating between the vocal imitations
and same-category drum sounds.
Authors
Mehrabi, Adib