A system and method for converting text to speech is disclosed. The text is decomposed into a sequence of phonemes and a text feature matrix constructed to define the manner in which the phonemes are pronounced and accented. A spectrum generator then queries a neural network to produce normalized spectrograms based on the input of the sequence of phonemes and features. Normalized spectrograms are fixed-length spectrograms with uniform temporal length (i.e., data size), which enables them to be effectively encoded into a neural network representation. A duration generator output a plurality of durations that are associated with phonemes. A speech synthesizer modifies the temporal length (i.e., de-normalizes) of each normalized spectrogram based on the associated duration, and then combines the plurality of modified spectrograms into speech. To de-normalize the spectrograms retrieved from the neural network, the normalized spectrograms are generally expanded in time or compressed in time, thereby producing variable length spectrograms which yield speech that is realistic sounding.

Share

ObEN is an artificial intelligence company that creates complete virtual identities for consumers and celebrities in the emerging digital world. ObEN provides Personal AI that simulates a person’s voice, face and personality, enabling never before possible social and virtual interactions. Founded in 2014, ObEN is a Softbank Ventures Korea and HTC Vive X portfolio company and is located at Idealab in Pasadena, California.