This blog post presents a one-shot voice conversion technique in which a variational autoencoder (VAE) is used to disentangle speech factors. We show that VAEs are able to disentangle speaker identity and linguistic content from speech acoustic features. Modifying these factors allows transformation of the voice. We also show that the disentangled representation generalizes relatively well to utterances from unseen languages.
Voice Conversion (VC) is the task of transforming a source speaker's spoken utterances so that they sound as if produced in a target speaker's voice. This requires capturing the target speaker's identity while preserving the phonetic content spoken by the source speaker. Many approaches have been proposed to tackle this problem; however, most prior work requires a parallel spoken corpus and a sufficient amount of data to learn the target speaker's voice.
Variational Autoencoder for learning disentangled speech representation
We use a recently proposed architecture, the Factorized Hierarchical VAE (FHVAE). A standard VAE imposes no structure on the latent variable z, but assuming structure for z can help exploit the inherent structure in the data. FHVAEs are able to uncover disentangled representations from a non-parallel speech corpus with numerous speakers.
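To make the factorization idea concrete, here is a minimal PyTorch sketch of a hierarchically structured VAE encoder/decoder, not the exact FHVAE architecture from the paper. It splits the latent space into a per-frame variable z1 (intended to capture linguistic content) and a single per-utterance variable z2 (intended to capture speaker identity). All class, dimension, and layer choices below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoFactorVAESketch(nn.Module):
    """Illustrative sketch of a structured VAE: z1 is a per-frame
    (segment-level) latent for content; z2 is a per-utterance
    (sequence-level) latent for speaker identity. Not the exact FHVAE."""

    def __init__(self, feat_dim=80, hidden=128, z1_dim=32, z2_dim=32):
        super().__init__()
        # Utterance-level encoder: summarizes the whole sequence into z2.
        self.enc_z2 = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mu_z2 = nn.Linear(hidden, z2_dim)
        self.logvar_z2 = nn.Linear(hidden, z2_dim)
        # Frame-level encoder: infers z1 per frame, conditioned on z2.
        self.enc_z1 = nn.GRU(feat_dim + z2_dim, hidden, batch_first=True)
        self.mu_z1 = nn.Linear(hidden, z1_dim)
        self.logvar_z1 = nn.Linear(hidden, z1_dim)
        # Decoder reconstructs acoustic features from both latents.
        self.dec = nn.GRU(z1_dim + z2_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def reparameterize(self, mu, logvar):
        # Standard VAE reparameterization trick.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, x):
        # x: (batch, frames, feat_dim) acoustic features, e.g. mel spectra.
        _, h2 = self.enc_z2(x)
        mu2, lv2 = self.mu_z2(h2[-1]), self.logvar_z2(h2[-1])
        z2 = self.reparameterize(mu2, lv2)          # speaker factor
        z2_rep = z2.unsqueeze(1).expand(-1, x.size(1), -1)
        h1, _ = self.enc_z1(torch.cat([x, z2_rep], dim=-1))
        mu1, lv1 = self.mu_z1(h1), self.logvar_z1(h1)
        z1 = self.reparameterize(mu1, lv1)          # per-frame content factor
        d, _ = self.dec(torch.cat([z1, z2_rep], dim=-1))
        return self.out(d), (mu1, lv1), (mu2, lv2)
```

For voice conversion, one would encode a source utterance to obtain z1 (content), encode a target-speaker utterance to obtain its z2, and decode the source z1 with the target z2, so the reconstructed features carry the source content in the target voice. Training would add the usual VAE reconstruction and KL terms, which are omitted here for brevity.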