dc.contributor.author | Luo, Y-J | |
dc.contributor.author | Dixon, S | |
dc.contributor.author | ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | |
dc.date.accessioned | 2024-07-09T10:25:12Z | |
dc.date.available | 2024-07-09T10:25:12Z | |
dc.date.issued | 2024-04-14 | |
dc.identifier.uri | https://qmro.qmul.ac.uk/xmlui/handle/123456789/97926 | |
dc.description.abstract | The class of disentangled sequential auto-encoders factorises speech into time-invariant (global) and time-variant (local) representations for speaker identity and linguistic content, respectively. Many existing models employ this assumption to tackle zero-shot voice conversion (VC), which converts the speaker characteristics of any given utterance to those of any novel speaker while preserving the linguistic content. However, balancing capacity between the two representations is intricate, as the global representation tends to collapse due to its lower information capacity along the time axis than that of the local representation. We propose a simple and effective dropout technique that applies an information bottleneck to the local representation via multiplicative Gaussian noise, in order to encourage use of the global one. We endow existing zero-shot VC models with the proposed method and show significant improvements in speaker conversion in terms of speaker verification acceptance rate, with comparable or better intelligibility measured in character error rate. | en_US |
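The multiplicative Gaussian noise bottleneck described in the abstract can be sketched as follows. This is a minimal illustration based only on the abstract's description, not the paper's implementation: the function name `gaussian_dropout` and the choice of drawing per-element noise from N(1, sigma^2), with sigma^2 supplied by the posterior variance of the local latent, are assumptions.

```python
import numpy as np

def gaussian_dropout(local_z, posterior_var, rng=None):
    """Multiplicative Gaussian dropout on the local representation.

    Each element of `local_z` is scaled by noise drawn from N(1, sigma^2),
    where sigma^2 is taken (hypothetically) from the posterior variance of
    the local latent. Noisier dimensions carry less reliable information,
    so the bottleneck limits the local path's capacity and encourages the
    model to rely on the global (speaker) representation instead.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=1.0, scale=np.sqrt(posterior_var))
    return local_z * noise
```

At training time the noise is sampled per step; at inference it would typically be disabled (equivalent to passing zero variance), leaving the local representation unchanged.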
dc.format.extent | 11676 - 11680 | |
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE) | en_US |
dc.title | Posterior Variance-Parameterised Gaussian Dropout: Improving Disentangled Sequential Autoencoders for Zero-Shot Voice Conversion | en_US |
dc.type | Conference Proceeding | en_US |
dc.rights.holder | © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | |
dc.identifier.doi | 10.1109/icassp48485.2024.10447835 | |
pubs.notes | Not known | en_US |
rioxxterms.funder | Default funder | en_US |
rioxxterms.identifier.project | Default project | en_US |
qmul.funder | UKRI Centre for Doctoral Training in Artificial Intelligence and Music::Engineering and Physical Sciences Research Council | en_US |
rioxxterms.funder.project | b215eee3-195d-4c4f-a85d-169a4331c138 | en_US |