The ObEN research team is developing AI tools to identify, capture, and mimic human motion from any video and automatically apply those movements to a PAI.


To bring facial motion to life on your PAI avatar, the ObEN team is researching more accurate facial tracking technology that works with smartphone cameras or webcams.


ObEN’s PAI technology creates an avatar that looks and sounds like you, capable of speaking multiple languages including Chinese, Japanese, Korean, and English.


ObEN’s PAIs can be used by people all over the world – here our co-founder and COO demonstrates PAIs created for native Mandarin speakers.


The first of ObEN’s celebrity PAI collaborations debuted at the 2019 Shanghai World AI Conference.


ObEN’s PAI technology was used to create a life-like avatar for Lucas Cochran, Tech Correspondent of the Discovery Channel’s Daily Planet.

ADRIAN’S PAI (K11 Chairman)

The world’s first PAI retail concierge, developed for K11 Shanghai Art Mall featuring K11 Founder and Chairman Adrian Cheng – personalizing the retail experience with AI.


ObEN’s PAIs are capable of performing a variety of movements, generated using our AI technology. What you can dream up, your PAI can do.


ObEN’s speech animation technology automatically generates full body gesture animation given an audio clip as input.


ObEN’s lip syncing technology achieves state-of-the-art audio-driven real-time lip animation and is user and language independent.


With ObEN’s Voice Conversion technology, your PAI can speak any language in your voice. Our speech AI takes voice recordings in any language and reproduces them in your own voice, letting you create personalized, interactive content that speaks natively to people around the world.

Polyglot PAI


Tackling a major challenge in TTS technology, we are creating digital voices that can “speak” with a variety of emotions, making them more human and better able to connect us with our audio experiences.

Original Voice
TTS Voice
TTS Happy
TTS Angry


A short voice sample is all we need to take any speaking voice and transform it into a pitch-perfect singing voice. Can you tell which voice is human and which is AI?

Aijia's Original Voice
TTS Voice
PAI Singing
PAI/Human Duet


Voice conversion system and method with variance and spectrum compensation (10,249,314)

A voice conversion system for generating realistic, natural-sounding target speech is disclosed. The voice conversion system preferably comprises a neural network for converting the source speech data to estimated target speech data; a global variance correction module; a modulation spectrum correction module; and a waveform generator.

Read More
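The global variance correction module named in the abstract addresses a well-known weakness of statistical converters: predicted features come out over-smoothed, with less variance than natural speech. A minimal sketch of such a correction is below; the function name and toy data are invented for illustration and are not taken from the patent.

```python
import numpy as np

def global_variance_correction(features, target_std):
    """Rescale each feature dimension so its variance matches the target's.

    Statistical converters tend to shrink variance (over-smoothing); this
    stretches each dimension back out to the target speaker's variance
    while preserving the per-dimension mean.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant dimensions
    return (features - mean) / std * target_std + mean

# Toy demonstration: over-smoothed features with half the target variance.
rng = np.random.default_rng(0)
converted = rng.normal(0.0, 0.5, size=(200, 4))   # std ~0.5 per dimension
target_std = np.full(4, 1.0)                      # target speaker std ~1.0
corrected = global_variance_correction(converted, target_std)
print(corrected.std(axis=0))                      # matches target_std
```

Real systems apply this kind of correction per utterance or per phoneme class; the sketch applies it globally, which is the simplest form of the idea.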

System and method for the analysis and synthesis of periodic and non-periodic components of speech signals (10,354,671)

A voice coder configured to resolve periodic and aperiodic components of spectra is disclosed. The method of voice coding includes parsing the speech signal into a plurality of speech frames; for each of the plurality of speech frames: (a) generating the spectra for the speech frame, (b) parsing the spectra of the speech frame into a plurality of...

Read More
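The first steps the abstract describes, parsing the signal into frames and generating a spectrum for each frame, can be sketched as follows. The frame and hop sizes are assumed values for 16 kHz audio, not figures from the patent.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def frame_spectra(frames):
    """Windowed magnitude spectrum per frame."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

sr = 16000
t = np.arange(sr) / sr                      # 1 s of audio at 16 kHz
signal = np.sin(2 * np.pi * 440 * t)        # a purely periodic component
frames = frame_signal(signal)
spectra = frame_spectra(frames)
peak_bin = spectra[0].argmax()
print(peak_bin * sr / 400)                  # recovers the 440 Hz tone
```

A pure tone is entirely periodic; resolving the aperiodic (noise-like) residual on top of this, as the patent claims, is the harder part the sketch omits.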

Method and system for speech-to-singing voice conversion (10,008,193)

A singing voice conversion system configured to generate a song in the voice of a target singer based on a song in the voice of a source singer is disclosed. The embodiment utilizes two complementary approaches to voice timbre conversion. Both combine the natural prosody of a source singer with the pitch of the target singer, typically the...

Read More

Text to speech synthesis using deep neural network with constant unit length spectrogram (10,186,252)

A system and method for converting text to speech is disclosed. The text is decomposed into a sequence of phonemes and a text feature matrix constructed to define the manner in which the phonemes are pronounced and accented. A spectrum generator then queries a neural network to produce normalized spectrograms based on the input of the sequence of...

Read More
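The front end the abstract describes, decomposing text into a phoneme sequence and constructing a feature matrix that encodes how each phoneme is pronounced, might look like this in miniature. The lexicon, feature layout, and names are invented for illustration; a real system would use a full pronouncing dictionary and richer linguistic features.

```python
import numpy as np

# Tiny stand-in lexicon mapping words to phoneme sequences.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
PHONEMES = sorted({p for prons in LEXICON.values() for p in prons})

def text_to_features(text):
    """Decompose text into phonemes and build a per-phoneme feature matrix."""
    phones = [p for word in text.lower().split() for p in LEXICON[word]]
    # One row per phoneme: a one-hot identity plus one extra column as a
    # placeholder stress/accent flag (left at 0 in this sketch).
    matrix = np.zeros((len(phones), len(PHONEMES) + 1))
    for i, p in enumerate(phones):
        matrix[i, PHONEMES.index(p)] = 1.0
    return phones, matrix

phones, feats = text_to_features("hello world")
print(phones)        # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
print(feats.shape)   # (8, 8): 8 phonemes, 7 identity dims + stress flag
```

In the patented system this matrix is the input a neural network consumes to produce normalized spectrograms; the sketch stops at the feature construction step.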

Voice conversion using deep neural network with intermediate voice training (10,186,251)

A system and method of converting source speech to target speech using intermediate speech data is disclosed. The method comprises identifying intermediate speech data that match target voice training data based on acoustic features; performing dynamic time warping to match the second set of acoustic features of intermediate speech data and the...

Read More
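Dynamic time warping, the alignment step named in the abstract, matches two feature sequences of different lengths by finding a minimum-cost monotonic alignment between them. A minimal DTW on 1-D features, independent of the patent's actual implementation:

```python
import numpy as np

def dtw_cost(a, b):
    """Total cost of the best monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# The same feature contour spoken at two different rates aligns perfectly.
slow = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0])
fast = np.array([0.0, 1.0, 2.0])
print(dtw_cost(slow, fast))   # 0.0: DTW absorbs the tempo difference
```

In the patented method the sequences are multidimensional acoustic feature vectors rather than scalars, but the alignment recursion is the same.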

Creation and application of audio avatars from human voices (9,324,318)

A subject voice is characterized and altered to mimic a target voice while maintaining the verbal message of the subject voice. Thus, the words and message are the same as in the original voice, but the voice that conveys the words and message in the altered voice is different. Audio signals corresponding to the altered voice are output, for...

Read More


Understanding Beauty via Deep Facial Features

The concept of beauty has been debated by philosophers and psychologists for centuries, but most definitions are subjective and metaphysical, and deficient in accuracy, generality, and scalability. In this paper, we present a novel study on mining beauty semantics of facial attributes based on big data, with an attempt to objectively construct...

Read More

Face Beautification: Beyond Makeup Transfer

Facial appearance plays an important role in our social lives. Subjective perception of women’s beauty depends on various face-related (e.g., skin, shape, hair) and environmental (e.g., makeup, lighting, angle) factors. Similar to cosmetic surgery in the physical world, virtual face beautification is an emerging field with many open issues...

Read More

Digital Twin: Acquiring High-Fidelity 3D Avatar from a Single Image

We present an approach to generate a high-fidelity 3D face avatar with a high-resolution UV texture map from a single image. To estimate the face geometry, we use a deep neural network to directly predict vertex coordinates of the 3D face model from the given image. The 3D face geometry is further refined by a non-rigid deformation process to more...

Read More

Data Selection for Improving Naturalness of TTS Voices Trained on Small Found Corpuses

This work investigates techniques that select training data from small, found corpuses in order to improve the naturalness of synthesized text-to-speech voices. The approach outlined in this paper examines different metrics to detect and reject segments of training data that can degrade the performance of the system. We conducted experiments on...

Read More
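The core idea, scoring candidate training segments with a quality metric and rejecting those that would degrade the synthesized voice, can be sketched as a simple outlier filter. The metric values and threshold below are invented stand-ins for the metrics the paper actually evaluates.

```python
import numpy as np

def select_segments(segments, scores, z_max=1.5):
    """Keep segments whose metric score lies within z_max standard
    deviations of the corpus mean; reject the rest as likely degrading."""
    scores = np.asarray(scores, dtype=float)
    z = np.abs(scores - scores.mean()) / scores.std()
    return [seg for seg, keep in zip(segments, z <= z_max) if keep]

segments = ["utt01", "utt02", "utt03", "utt04", "utt05"]
scores = [0.91, 0.88, 0.90, 0.15, 0.89]   # utt04 is a noisy outlier
print(select_segments(segments, scores))  # drops 'utt04'
```

With a small found corpus, every rejected segment shrinks the training set, so the threshold trades data quantity against data quality, which is exactly the tension the paper investigates.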

A Spectrally Weighted Mixture of Least Square Error and Wasserstein Discriminator Loss for Generative SPSS

Generative networks can create an artificial spectrum based on their conditional distribution estimates instead of predicting only the mean value, as the Least Square (LS) solution does. This is promising since the LS predictor is known to oversmooth features leading to muffling effects. However, modeling a whole distribution instead of a single mean...

Read More
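The loss the title describes mixes a Least Square term with a Wasserstein critic term, weighted per spectral bin. A toy sketch of such a mixture follows, with an invented weighting scheme that trusts LS at low frequencies and the critic at high frequencies; the real paper's weighting and critic are more involved.

```python
import numpy as np

def mixed_loss(pred, target, critic_score, n_bins):
    """Per-bin weighted mix of LS error and a (negated) critic score.

    w = 0 means pure LS for that bin; w = 1 means pure critic term.
    """
    w = np.linspace(0.0, 1.0, n_bins)
    ls = (1.0 - w) * (pred - target) ** 2   # LS term, strongest at low bins
    wass = w * -critic_score                # generator maximizes critic score
    return float((ls + wass).sum())

n_bins = 4
pred = np.array([0.5, 0.5, 0.5, 0.5])      # toy 4-bin spectrum
target = np.array([0.5, 0.5, 0.5, 0.5])
critic = np.zeros(n_bins)                  # neutral critic for the demo
print(mixed_loss(pred, target, critic, n_bins))   # 0.0 when both terms vanish
```

The motivation from the abstract is visible in the structure: the LS term anchors the prediction where over-smoothing matters less, while the adversarial term is free to sharpen the bins where muffling is audible.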

Show, Attend and Translate: Unsupervised Image Translation with Self-Regularization and Attention

Image translation between two domains is a class of problems aiming to learn mapping from an input image in the source domain to an output image in the target domain. It has been applied to numerous domains, such as data augmentation, domain adaptation and unsupervised training. When paired training data is not accessible, image translation...

Read More

A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment

Voice conversion (VC) aims at conversion of speaker characteristic without altering content. Due to training data limitations and modeling imperfections, it is difficult to achieve believable speaker mimicry without introducing processing artifacts; performance assessment of VC, therefore, usually involves both speaker similarity and quality...

Read More

The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods

We present the Voice Conversion Challenge 2018, designed as a follow up to the 2016 edition with the aim of providing a common framework for evaluating and comparing different state-of-the-art voice conversion (VC) systems. The objective of the challenge was to perform speaker conversion (i.e., transform the vocal identity) of a source speaker to...

Read More

ESTHER : Extremely Simple Image Translation Through Self-Regularization

Image translation between two domains is a class of problems where the goal is to learn the mapping from an input image in the source domain to an output image in the target domain. It has important applications such as data augmentation, domain adaptation, and unsupervised training. When paired training data are not accessible, the mapping...

Read More

Investigation of using disentangled and interpretable representations for one-shot cross-lingual voice conversion

This blog post presents a one-shot voice conversion technique, in which a variational autoencoder (VAE) is used to disentangle speech factors. We show that VAEs are able to disentangle the speaker identity and linguistic content from speech acoustic features. Modification of these factors allows transformation of the voice. We show that the...

Read More

One-shot Voice Conversion using Variational Autoencoders

This blog post presents a one-shot voice conversion technique, in which a variational autoencoder (VAE) is used to disentangle speech factors. We show that VAEs are able to disentangle the speaker identity and linguistic content from speech acoustic features. Modification of these factors allows transformation of the voice. We show that the...

Read More

Voice Approximation for Inter-Gender Voice Personalization

One of the key technologies at ObEN is the personalization of voice identity, consisting of a transformation of an input voice (e.g., from a Text-To-Speech system) to render it perceptually similar to a target one (e.g., a celebrity, or a user’s voice). Although some existing technologies, known as Voice Conversion and based on a statistical...

Read More

ObEN is an artificial intelligence company building a decentralized AI platform for Personal AI (PAI): intelligent 3D avatars that look, sound, and behave like the individual user. Deployed on the Project PAI blockchain, ObEN’s technology enables users to create, use, and manage their own PAI on a secure, decentralized platform, enabling never-before-possible social and virtual interactions. Founded in 2014, ObEN is a K11, Tencent, Softbank Ventures Korea, and HTC Vive X portfolio company located at Idealab in Pasadena, California.

130 West Union Street, Pasadena, CA 91103 |
© 2017 ObEN, Inc. All rights reserved

Privacy Policy
and Terms of Use
