Analysis, Disentanglement, and Conversion of Singing Voice Attributes
Abstract
Voice conversion is a prominent area of research that can typically be described as the replacement of acoustic cues relating to the perceived identity of a voice. Over almost a decade, deep learning has emerged as a transformative solution for this multifaceted task, offering advancements that address different conditions and challenges in the field. One intriguing avenue for researchers in Music Information Retrieval is singing voice conversion, a task that has only been subjected to neural network analysis and synthesis techniques over the last four years. The conversion of singing voice attributes introduces new considerations, including working with limited datasets, adhering to musical context restrictions, and considering how expression in singing is manifested in such attributes. Voice conversion with respect to singing techniques, for example, has received little attention, even though its impact on the music industry would be considerable. This thesis therefore delves into problems related to vocal perception, limited datasets, and attribute disentanglement in the pursuit of optimal performance for the conversion of scarcely labelled attributes, covered across three research chapters.

The first of these chapters describes the collection of perceptual pairwise dissimilarity ratings for singing techniques from participants. These ratings were subsequently subjected to clustering algorithms and compared against existing ground-truth labels. The results confirm the viability of using existing singing technique-labelled datasets for singing technique conversion (STC) via supervised machine learning strategies. A dataset of dissimilarity ratings and timbral maps was generated, illustrating how register and gender conditions affect perception.

In response to these findings, an adapted version of an existing voice conversion system, in conjunction with an existing labelled dataset, was developed. This served as the first implementation of a model for zero-shot STC, although it exhibited varying levels of success. An alternative method of attribute conversion was therefore considered as a means of performing satisfactorily realistic STC. By refining ‘voice identity’ conversion for singing, future research can disentangle this attribute, along with more deterministic attributes (such as pitch, loudness, and phonetics), from an input signal, exposing information related to unlabelled attributes.

Final experiments in refining voice identity conversion for the singing domain were conducted as a stepping stone towards unlabelled attribute conversion. By performing comparative analyses between different features, between the singing and speech domains, and between alternative loss functions, the most suitable process for singing voice attribute conversion (SVAC) could be established. In summary, this thesis documents a series of experiments that explore different aspects of the singing voice and conversion techniques in the pursuit of a convincing SVAC system.
Authors
O'Connor, B