One of the key technologies at ObEN is the personalization of voice identity: transforming an input voice (e.g., from a Text-To-Speech system) so that it is perceptually similar to a target voice (e.g., a celebrity’s or a user’s voice). Existing technologies, known as Voice Conversion and based on a statistical mapping of the speakers’ acoustic information, can achieve a reasonable identity transformation from several minutes of speech, but their performance is often limited in two main aspects:
- The process rarely yields artifact-free output: the converted speech is commonly perceived as degraded in quality.
- An inter-gender transformation (transforming a female voice to a male voice or vice versa) often results in lower perceived similarity scores (when compared to the target voice). Notably, a robust transformation of the gender type is not always achieved.
Figure 1 – Conventional voice timbre transformation schema
The latter problem can be explained by the significant differences in the voice production apparatus between males and females (and similarly, between children and adults), which make the transformation considerably more challenging than in the intra-gender case. Note also that a lack of naturalness in the speech is easier to perceive when the transformation fails to achieve a perceived change of the speaker’s gender.
A well-known strategy to achieve speaker gender transformation is based on the normalization of the vocal-tract length conditions, which are considered to underlie the main differences perceived in voice timbre between speakers of different genders. Despite its effectiveness and popularity in speech synthesis and recognition studies, the computation of the normalization information is not always robust and straightforward enough for real applications.
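As a rough illustration of the vocal-tract length normalization idea, the sketch below warps the frequency axis of a spectral envelope so that its peaks (formants) move up or down. It uses a bilinear (first-order all-pass) warp, which is one common choice in the literature and not necessarily what ObEN uses. The function names, the warping factor `alpha`, and the toy envelope are all assumptions made for the example.

```python
import math

def bilinear_warp(omega, alpha):
    """Warp a normalized angular frequency in [0, pi] with a first-order
    all-pass (bilinear) function. Keeps 0 and pi fixed for |alpha| < 1."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

def interpolate(x, xs, ys):
    """Linearly interpolate ys, sampled on the uniform grid xs, at x."""
    step = xs[1] - xs[0]
    pos = min(max((x - xs[0]) / step, 0.0), len(xs) - 1.0)
    lo = min(int(pos), len(xs) - 2)
    frac = pos - lo
    return (1.0 - frac) * ys[lo] + frac * ys[lo + 1]

def warp_envelope(envelope, alpha):
    """Return the envelope with its frequency axis warped.

    Assumption for this sketch: alpha > 0 moves spectral peaks toward
    higher frequencies (a shorter vocal tract), alpha < 0 toward lower
    frequencies (a longer vocal tract).
    """
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between -1 and 1")
    n = len(envelope)
    grid = [math.pi * k / (n - 1) for k in range(n)]
    # The inverse of the warp with factor alpha is the warp with -alpha,
    # so each output bin reads the input at the inversely warped frequency.
    return [interpolate(bilinear_warp(w, -alpha), grid, envelope)
            for w in grid]

# Toy envelope with a single formant-like peak at bin 80 of 257.
toy_envelope = [math.exp(-((k - 80) / 4.0) ** 2) for k in range(257)]
raised = warp_envelope(toy_envelope, 0.1)    # peak moves to a higher bin
lowered = warp_envelope(toy_envelope, -0.1)  # peak moves to a lower bin
```

In practice the hard part is the one the text points out: estimating a reliable warping factor for a given pair of speakers. Here `alpha` is simply supplied by hand.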
Figure 2 – Proposed approximation-based voice timbre transformation schema
ObEN’s proprietary technique for automatic gender transformation introduces a voice approximation stage that boosts the performance of our voice personalization technology. Using only a small amount of speech data (~30 seconds), the technique achieves a robust gender-based timbre transformation, notably reducing the acoustic gap between input and target voices in our personalization process under low-degradation conditions.
We show below an example of both types of gender transformation, including the target voices (first row), the original input voices (second row), and a modified version of the inputs matching the prosody of the targets (third row). Although a forced prosodic alignment of this nature generally degrades the sound quality, we use it for comparison purposes, so that the differences related exclusively to the modification of the voice timbre are easier to perceive.
Note that ObEN’s voice approximation technique (fourth row) yields a convincing gender transformation towards the perceived timbre of the target voices.
Inputs After Prosody Alignment (degradations appear due to the forced pitch and duration modification)
Aligned Inputs After Voice Approximation
About the Author: Fernando is a Principal Speech Research Scientist at ObEN