CONSUMER TTS-PERSONALIZATION

In the TTS Personalization (TTS-P) project, the goal is to personalize the voice of a third-party Chinese TTS system so that it sounds like the user while also capturing the user’s speaking style. This is done with a small training set of 35 sentences (totaling just 2-3 minutes of speech) recorded by the user under ordinary conditions on a regular phone. The training sentences were designed by ObEN’s team of linguists to be phonetically and tonally rich. Some examples of the training sentences are:

  • “对酒当歌，人生几何。” (“Let us sing over wine; how long does life last?”)
  • “小学二年级的时候” (“When I was in the second grade of elementary school”)
  • “捐款给慈善机构” (“Donating money to a charity”)
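To give a concrete sense of what “phonetically and tonally rich” means in practice, the sketch below counts the distinct Mandarin syllables and tones covered by a candidate script, using the pypinyin package for grapheme-to-pinyin conversion. This is an illustrative coverage check only, not ObEN’s actual script-design procedure.

    from pypinyin import lazy_pinyin, Style

    sentences = [
        "对酒当歌，人生几何。",
        "小学二年级的时候",
        "捐款给慈善机构",
    ]

    # Count the distinct base syllables and tones covered by the script.
    syllables, tones = set(), set()
    for sentence in sentences:
        for syl in lazy_pinyin(sentence, style=Style.TONE3):
            if syl[-1].isdigit():       # e.g. "dui4" -> syllable "dui", tone 4
                syllables.add(syl[:-1])
                tones.add(int(syl[-1]))
            elif syl.isalpha():         # neutral-tone syllable carries no digit
                syllables.add(syl)
                tones.add(0)

    print(f"{len(syllables)} distinct syllables; tones covered: {sorted(tones)}")

A richer script drives both counts toward full coverage of Mandarin’s syllable inventory and its four lexical tones plus the neutral tone.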

Figure 1 – Overview of the TTS personalization project
As shown in Block 1 of Figure 1, ObEN requires each user to record the 35 scripted sentences under normal conditions. ObEN’s current focus is the Chinese market, but our team of linguists is also developing training data sets in other languages for upcoming and alternative use cases.

Figure 2 – Transcribing the audio recordings (Block 2 of Figure 1)

As shown in Block 2 of Figure 1, after the 35 recordings are acquired from the target user, the speech signals are enhanced by automatically removing noise and obviously superfluous segments (e.g., silence). ObEN uses several speech enhancement methods, ranging from purely signal-processing-based techniques to data-driven, deep-learning-based approaches that operate on cepstral, spectral, and other spectral-feature variants. Forced alignment is then used to produce phonetic and syllabic alignment information. This time alignment is valuable because it lets us pair corresponding speech units between the source and target and learn a mapping between them.
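The following is a minimal sketch of this enhancement step, assuming a simple spectral-subtraction denoiser and energy-based silence trimming built on librosa; the file names, frame parameters, and the quiet-frame noise-estimation heuristic are illustrative placeholders rather than ObEN’s actual methods, and forced alignment is assumed to run afterwards on the cleaned audio.

    import numpy as np
    import librosa
    import soundfile as sf

    def enhance(in_path, out_path, sr=16000, n_fft=512, hop=128):
        y, _ = librosa.load(in_path, sr=sr)

        # Spectral subtraction: estimate the noise spectrum from the 10%
        # lowest-energy frames and subtract it, flooring the result to
        # limit musical-noise artifacts.
        stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
        mag, phase = np.abs(stft), np.angle(stft)
        quiet = np.argsort(mag.mean(axis=0))[: max(1, mag.shape[1] // 10)]
        noise = mag[:, quiet].mean(axis=1, keepdims=True)
        clean = np.maximum(mag - noise, 0.05 * mag)
        y = librosa.istft(clean * np.exp(1j * phase), hop_length=hop)

        # Remove obviously superfluous leading/trailing silence.
        y, _ = librosa.effects.trim(y, top_db=30)
        sf.write(out_path, y, sr)

    enhance("user_sentence_01.wav", "user_sentence_01_clean.wav")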

Once these data are prepared, mappings are learned between the source TTS voice and intonation and the target avatar voice and intonation. The modeling is split into two separate modules, sketched in the example that follows the list below:

  • Voice color mapping (spectral mapping), as shown in Block 3 of Figure 1
  • Speaking style mapping (pitch contour and phonetic/syllable duration mapping), as shown in Block 4 of Figure 1
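The sketch below illustrates both modules under strong simplifying assumptions: time-aligned parallel frame pairs are taken as given (a by-product of the alignment step above), the spectral features are generic MFCC-style frames, and the models shown (a small MLP regressor for voice color, a Gaussian mean-variance transform of log-F0 for speaking style) are stand-ins for whatever ObEN actually deploys.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Block 3: voice color mapping. src_frames / tgt_frames are time-aligned
    # parallel spectral frames of shape (n_frames, n_dims).
    def train_spectral_mapper(src_frames, tgt_frames):
        model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500)
        model.fit(src_frames, tgt_frames)
        return model                      # later: model.predict(tts_frames)

    # Block 4: speaking style mapping for the pitch contour, via a classic
    # log-F0 transform that matches the target speaker's pitch statistics.
    def train_pitch_mapper(src_f0, tgt_f0):
        src_lf0 = np.log(src_f0[src_f0 > 0])      # voiced frames only
        tgt_lf0 = np.log(tgt_f0[tgt_f0 > 0])
        mu_s, sd_s = src_lf0.mean(), src_lf0.std()
        mu_t, sd_t = tgt_lf0.mean(), tgt_lf0.std()

        def convert(f0):
            voiced = f0 > 0
            lf0 = np.where(voiced, np.log(np.maximum(f0, 1e-8)), 0.0)
            lf0 = (lf0 - mu_s) / sd_s * sd_t + mu_t
            return np.where(voiced, np.exp(lf0), 0.0)  # keep unvoiced at 0
        return convert

Phone and syllable durations can be handled with the same statistics-matching idea, using the duration information produced by the forced alignment.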

ObEN performs subjective listening experiments to assess the performance of our systems in terms of speaker similarity and speech quality. Stimuli are rated on the Mean Opinion Score (MOS) scale by native Chinese listeners recruited for the tests. As a quality baseline, we compare our output against the third-party TTS system that we are personalizing: the baseline TTS achieves a MOS of 3.42, whereas our TTS-P achieves 3.17.
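For context, the snippet below shows one common way to aggregate raw listener ratings into a MOS with a confidence interval; the ratings in the example are made-up placeholders, not the actual listening-test data.

    import numpy as np
    from scipy import stats

    def mos_with_ci(ratings, confidence=0.95):
        """Mean Opinion Score with a t-distribution confidence interval."""
        r = np.asarray(ratings, dtype=float)
        half_width = stats.sem(r) * stats.t.ppf((1 + confidence) / 2, len(r) - 1)
        return r.mean(), half_width

    ratings = [4, 3, 4, 3, 3, 4, 3, 4, 4, 3]   # 1-5 scale, placeholder values
    mos, ci = mos_with_ci(ratings)
    print(f"MOS = {mos:.2f} +/- {ci:.2f} (95% CI)")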

Samples

[Audio: the original TTS output, followed by the user recording and the corresponding TTS-P output for each of six target speakers (Male 1, Male 2, Male 3, Female 1, Female 2, Female 3).]


ObEN's proprietary artificial intelligence technology quickly combines a person's 2D image and voice to create a personal 3D avatar. Transport your personal avatar into virtual reality and augmented reality environments and enjoy deeper, more social, and more memorable experiences. Founded in 2014, ObEN is an HTC VIVE X portfolio company located in Pasadena, California at leading technology incubator Idealab.