Posterior Variance-Parameterised Gaussian Dropout: Improving Disentangled Sequential Autoencoders for Zero-Shot Voice Conversion
Pages: 11676–11680
DOI: 10.1109/icassp48485.2024.10447835
Abstract
The class of disentangled sequential autoencoders factorises speech into a time-invariant (global) representation for speaker identity and a time-variant (local) representation for linguistic content. Many existing models rely on this assumption to tackle zero-shot voice conversion (VC), which converts the speaker characteristics of any given utterance to those of any novel speaker while preserving the linguistic content. However, balancing capacity between the two representations is intricate, as the global representation tends to collapse: it has lower information capacity along the time axis than the local representation. We propose a simple and effective dropout technique that applies an information bottleneck to the local representation via multiplicative Gaussian noise, encouraging the model to make use of the global one. We endow existing zero-shot VC models with the proposed method and show significant improvements in speaker conversion, measured by speaker verification acceptance rate, with comparable or better intelligibility, measured by character error rate.
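The core mechanism described in the abstract, multiplicative Gaussian noise acting as an information bottleneck on the local representation, can be sketched as follows. This is a minimal illustration, not the paper's exact posterior variance-based parameterisation; the function name and the `alpha` hyperparameter are hypothetical.

```python
import numpy as np

def gaussian_dropout(z_local, alpha=0.25, rng=None):
    """Apply multiplicative Gaussian noise to a local representation.

    Each element is scaled by noise drawn from N(1, alpha), which limits
    the information the local (time-variant) path can carry and thereby
    encourages the model to route speaker information through the global
    (time-invariant) representation. `alpha` controls the bottleneck
    strength; it is an illustrative hyperparameter, not the paper's
    posterior-variance parameterisation.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=1.0, scale=np.sqrt(alpha), size=z_local.shape)
    return z_local * noise
```

Because the noise has unit mean, the representation is unbiased in expectation; only its reliability as a channel is reduced, which is what pressures the decoder to rely on the cleaner global code.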