This blog post presents a one-shot voice conversion technique in which a variational autoencoder (VAE) is used to disentangle speech factors. We show that VAEs are able to disentangle speaker identity and linguistic content from speech acoustic features, and that modifying these factors allows transformation of the voice. We also show that the representation disentanglement performs relatively well on utterances from unseen languages.

Voice Conversion

Voice Conversion (VC) is the task of converting a source speaker's spoken sentences so that they sound as if uttered by a target speaker. The converted speech must carry the target speaker's identity while preserving the phonetic content spoken by the source speaker. Many approaches have been proposed to tackle this problem; however, most prior work requires a parallel spoken corpus and a substantial amount of data to learn the target speaker's voice.

Variational Autoencoder for learning disentangled speech representation

We use a recently proposed architecture, the Factorized Hierarchical VAE (FHVAE). A standard VAE imposes no structure on the latent variable z, but imposing structure on z can help exploit the inherent structure in the data. FHVAEs are able to uncover disentangled representations from a non-parallel speech corpus with numerous speakers.
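The key structural assumption can be illustrated with a toy generative sketch: an FHVAE factorizes the latent space into a sequence-level latent (shared across all segments of an utterance, capturing speaker-like factors) and segment-level latents (varying within the utterance, capturing content). The code below is a minimal numpy illustration of that factorization, not the actual model; the names and scales are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent dimensionality (illustrative)

def sample_sequence(n_segments, mu2_scale=1.0, z_scale=0.5):
    # Sequence-level latent z2: shared across all segments of one
    # utterance, drawn around a per-sequence mean mu2 (this level
    # captures speaker/channel factors in the FHVAE formulation).
    mu2 = rng.normal(0.0, mu2_scale, D)
    z2 = mu2 + rng.normal(0.0, z_scale, D)
    # Segment-level latents z1: drawn independently per segment
    # (this level captures content that changes within an utterance).
    z1 = rng.normal(0.0, 1.0, (n_segments, D))
    return z1, z2

z1, z2 = sample_sequence(n_segments=10)
print(z1.shape, z2.shape)  # (10, 16) (16,)
```

In the real model both latents are inferred from speech features by encoder networks; the point here is only the two-level hierarchy.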

They can enable VC with very limited data from the target speaker, since they infer speaker identity information from data without supervision. To perform VC, given a source and a target utterance, we compute speaker and linguistic embeddings from the utterances. We then move the source speaker embedding towards the average of the target speaker embeddings.
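This conversion step can be sketched in a few lines of numpy. The function below is a hypothetical helper (not from the paper's code): it keeps the segment-level (linguistic) embeddings intact and shifts the sequence-level (speaker) embedding towards the target average; in the actual system the resulting embeddings would then be passed through the FHVAE decoder to synthesize speech.

```python
import numpy as np

def convert_speaker(z1_source, z2_source, z2_targets, alpha=1.0):
    # z1_source: (T, D) segment-level (linguistic) embeddings of the
    # source utterance; left unchanged so the content is preserved.
    # z2_source: (D,) sequence-level (speaker) embedding of the source.
    # z2_targets: (N, D) speaker embeddings from N target utterances.
    z2_target_mean = z2_targets.mean(axis=0)
    # Move the source speaker embedding towards the target average;
    # alpha=1.0 replaces it entirely.
    z2_converted = z2_source + alpha * (z2_target_mean - z2_source)
    return z1_source, z2_converted

rng = np.random.default_rng(0)
z1 = rng.normal(size=(120, 16))    # placeholder linguistic embeddings
z2_src = rng.normal(size=16)       # placeholder source speaker embedding
z2_tgt = rng.normal(size=(5, 16))  # embeddings from 5 target utterances
z1_out, z2_out = convert_speaker(z1, z2_src, z2_tgt)
print(np.allclose(z2_out, z2_tgt.mean(axis=0)))  # True
```

With a single target utterance (N=1), this reduces to the one-shot setting described in this post.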

Improving the State of the Art

We trained two FHVAEs: 1) one on the TIMIT English speech corpus with 462 speakers, and 2) one on a proprietary Chinese corpus with 5,200 speakers. For testing, we use 1) four CMU ARCTIC speakers and 2) four speakers from the THCHS-30 Chinese corpus. We perform all permutations of intra-lingual and cross-lingual conversions using the English and Chinese FHVAEs, as we are interested in how the systems perform with seen versus unseen speakers and languages.

We first investigate the models' performance visually. The following figure shows a 2D plot (computed using PCA) of speaker embeddings obtained from the TIMIT-trained FHVAE. Each point represents a single utterance, and colors encode speaker and language: blue dots are English females, light blue dots are Chinese females, red dots are English males, and orange dots are Chinese males. In the left subfigures, the embeddings are computed from one utterance; in the right subfigures, from five utterances.
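A 2D plot like this can be produced by projecting the speaker embeddings onto their top two principal components. Below is a minimal numpy sketch of the PCA projection (the embeddings here are random placeholders standing in for model outputs; a scatter plot colored by speaker/language would follow):

```python
import numpy as np

def pca_2d(X):
    # Project the rows of X (n_utterances, D) onto the top-2
    # principal axes found via SVD of the centered data.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(1)
emb = rng.normal(size=(40, 32))  # placeholder speaker embeddings
proj = pca_2d(emb)
print(proj.shape)  # (40, 2)
```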

In all subplots, the female and male embedding clusters are clearly separated. Furthermore, the plot shows that embeddings of the same speaker fall near the same location. When five utterances are used to compute the embedding, the variation is visibly smaller than when only one sentence is used, which shows that the speaker embedding computed from the model is sensitive to sentence-level variation. It is also interesting to note that when both the TIMIT and Chinese corpora are used for training, the speaker embeddings are further apart, suggesting a better-behaved model. One phenomenon we notice is that the speaker embeddings for different languages and genders fall in different locations.
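The reduced variation from averaging over five utterances is exactly what one would expect from averaging noisy estimates of a per-speaker mean. A toy simulation (hypothetical Gaussian embeddings, not model outputs) makes this concrete:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 16
speaker_mean = rng.normal(size=D)
# 100 per-utterance embeddings scattered around the speaker mean.
utt_embs = speaker_mean + 0.5 * rng.normal(size=(100, D))

# 5-utterance averages: 20 groups of 5.
avg5 = utt_embs.reshape(20, 5, D).mean(axis=1)

# Mean squared deviation from the true speaker mean.
var1 = ((utt_embs - speaker_mean) ** 2).mean()
var5 = ((avg5 - speaker_mean) ** 2).mean()
print(var5 < var1)  # True: averaging 5 utterances cuts variance ~5x
```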

The phonetic-context embedding matrices computed over the utterances (compressed using PCA) are shown in the following figure for the sentence “She had your dark suit in greasy wash water all year”.


Ideally, we want the matrices to be close to each other, since the phonetic-context embedding is supposed to be speaker-independent. The figure shows that the embeddings are indeed close at similar time frames. Some minor discrepancies between the embeddings remain, which leaves room for further improvement via the model architecture and/or a larger speech corpus.
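Closeness between two such matrices can also be quantified rather than eyeballed, for instance with frame-wise cosine similarity. The sketch below uses synthetic data and a hypothetical helper (`framewise_cosine` is not from the paper); it assumes the two matrices are frame-aligned renditions of the same sentence:

```python
import numpy as np

def framewise_cosine(A, B):
    # A, B: (T, D) phonetic-context embedding matrices from two
    # frame-aligned utterances of the same sentence.
    num = (A * B).sum(axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return num / den

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 16))
B = A + 0.1 * rng.normal(size=(50, 16))  # nearly identical content
print(framewise_cosine(A, B).mean() > 0.9)  # True
```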

We perform subjective tests to evaluate speech quality and speaker similarity, selecting a GMM-MAP system as the baseline. We employ 40 listeners per test to score the stimuli.

In the following plot, we show the speech quality Comparative Mean Opinion Score (CMOS) for between-language and between-gender conversions. In this test, listeners heard two stimuli A and B with the same content, generated from the same source speaker but under two different processing conditions, and were then asked to indicate whether they thought B was better or worse than A, using a five-point scale: +2 (much better), +1 (somewhat better), 0 (same), -1 (somewhat worse), and -2 (much worse).
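Computing a CMOS from such responses is just mapping each answer to its score and averaging. A minimal stdlib sketch (the response strings here are made up for illustration):

```python
from statistics import mean

# Five-point CMOS scale from the listening test.
SCALE = {"much better": 2, "somewhat better": 1, "same": 0,
         "somewhat worse": -1, "much worse": -2}

# Hypothetical responses from four listeners for one stimulus pair.
responses = ["much better", "same", "somewhat better", "somewhat worse"]
cmos = mean(SCALE[r] for r in responses)
print(cmos)  # 0.5: on average, B rated slightly better than A
```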

We also perform a speaker similarity test to assess how well the method converts speaker identity. In this test, listeners heard two stimuli A and B with different content, and were then asked to indicate whether they thought A and B were spoken by the same speaker or by two different speakers, using a five-point scale: +2 (definitely same), +1 (probably same), 0 (unsure), -1 (probably different), and -2 (definitely different).


Here are some samples of one-shot VC using only one utterance from target speaker.

Male English to Female English:

  • Original Male:
  • Original Female:
  • Conversion Male2Female:

Male Chinese to Female English:

  • Original Male:
  • Original Female:
  • Conversion Male2Female:

For more details, please take a look at our paper (to be presented at Interspeech 2018) and samples.



ObEN is an artificial intelligence company that creates complete virtual identities for consumers and celebrities in the emerging digital world. ObEN provides Personal AI that simulates a person’s voice, face and personality, enabling never before possible social and virtual interactions. Founded in 2014, ObEN is a Softbank Ventures Korea and HTC Vive X portfolio company and is located at Idealab in Pasadena, California.