This work investigates techniques that select training data from small, found corpuses in order to improve the naturalness of synthesized text-to-speech voices. The approach outlined in this paper examines different metrics to detect and reject segments of training data that can degrade the performance of the system. We conducted experiments on two small datasets extracted from Mandarin Chinese audiobooks that have different characteristics in terms of recording conditions, narrator, and transcriptions. We show that using a even smaller, yet carefully selected, set of data can lead to a text-to-speech system able to generate more natural speech than a system trained on the complete dataset. Three metrics related to the narrator’s articulation proposed in the paper give significant improvements in naturalness.


ObEN is an artificial intelligence company that creates complete virtual identities for consumers and celebrities in the emerging digital world. ObEN provides Personal AI that simulates a person’s voice, face and personality, enabling never before possible social and virtual interactions. Founded in 2014, ObEN is a Softbank Ventures Korea and HTC Vive X portfolio company and is located at Idealab in Pasadena, California.