This blog post presents a one-shot voice conversion technique in which a variational autoencoder (VAE) is used to disentangle speech factors. We show that VAEs can disentangle speaker identity and linguistic content from speech acoustic features, and that modifying these factors transforms the voice. We also show that the disentangled representation performs relatively well on utterances from unseen languages.

Voice Conversion

Voice Conversion (VC) is the task of transforming a source speaker’s utterances so that they sound as if spoken by a target speaker. The converted speech must capture the target speaker’s identity while preserving the phonetic content spoken by the source speaker. Many approaches have been proposed to tackle this problem; however, most prior work requires a parallel spoken corpus and a sufficient amount of data to learn the target speaker’s voice.

Variational Autoencoders for Learning Disentangled Speech Representations

We use a recently proposed architecture, the Factorized Hierarchical VAE (FHVAE). Standard VAEs impose no structure on the latent variable z, yet assuming structure for z can help exploit the inherent structure in the data. FHVAEs factor the latent space into a segment-level variable that captures linguistic content and a sequence-level variable that captures speaker identity, and they can uncover this disentangled representation from a non-parallel speech corpus with many speakers.
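The conversion step itself reduces to a latent swap: encode both utterances, keep the source’s segment-level (content) latents, and replace the sequence-level (speaker) latent with the target’s. The sketch below illustrates that data flow with toy linear encoders and a decoder in NumPy; the dimensions, weights, and function names are illustrative assumptions, not the actual FHVAE model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: spectrogram segments flattened to 1600-dim vectors.
# All weights here are random stand-ins for trained FHVAE parameters.
SEG_DIM, Z1_DIM, Z2_DIM = 1600, 32, 32

# Hypothetical linear "encoders" and "decoder" (posterior means only).
W_z1 = rng.standard_normal((SEG_DIM, Z1_DIM)) * 0.01    # segment-level (content)
W_z2 = rng.standard_normal((SEG_DIM, Z2_DIM)) * 0.01    # sequence-level (speaker)
W_dec = rng.standard_normal((Z1_DIM + Z2_DIM, SEG_DIM)) * 0.01

def encode(segments):
    """Return per-segment content latents z1 and a single utterance-level
    speaker latent z2 (averaged over segments)."""
    z1 = segments @ W_z1                 # varies from segment to segment
    z2 = (segments @ W_z2).mean(axis=0)  # shared across the whole utterance
    return z1, z2

def decode(z1, z2):
    """Reconstruct acoustic segments from content latents plus one speaker latent."""
    z2_tiled = np.tile(z2, (z1.shape[0], 1))
    return np.concatenate([z1, z2_tiled], axis=1) @ W_dec

# Source utterance (content we keep) and target utterance (voice we want).
src = rng.standard_normal((12, SEG_DIM))
tgt = rng.standard_normal((9, SEG_DIM))

z1_src, _ = encode(src)   # linguistic content of the source
_, z2_tgt = encode(tgt)   # speaker identity of the target

converted = decode(z1_src, z2_tgt)
print(converted.shape)    # (12, 1600): source content rendered in the target voice
```

In a real system the encoders and decoder are trained neural networks and the reconstructed segments are vocoded back to a waveform, but the swap of the sequence-level latent is the essence of the one-shot conversion.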


ObEN is an artificial intelligence company that creates complete virtual identities for consumers and celebrities in the emerging digital world. ObEN provides Personal AI that simulates a person’s voice, face, and personality, enabling never-before-possible social and virtual interactions. Founded in 2014, ObEN is a Softbank Ventures Korea and HTC Vive X portfolio company and is located at Idealab in Pasadena, California.