Language is essential for human interaction, but so is the emotion behind it. Expressing feelings like happiness, sadness, anger, and frustration is crucial for conveying our messages and connecting with others. While generative AI is advanced in many areas, it has struggled with understanding and processing human emotions. Typecast, a startup using AI to create synthetic voices and videos, claims to have made significant strides with its new Cross-Speaker Emotion Transfer technology.
This new tech allows users to apply emotions recorded from another’s voice to their own while keeping their unique style. This makes content creation faster and more efficient. It’s available now through Typecast’s My Voice Maker feature.
Typecast CEO and co-founder Taesu Kim noted that AI actors haven't yet mastered human emotional range, which has been a significant limitation. With the new Cross-Speaker Emotion Transfer, however, anyone can use AI actors to convey real emotional depth with just a small sample of their voice.
Decoding emotions can be complex. Most emotions fit into six basic categories (happiness, sadness, anger, fear, surprise, and disgust), but that is not enough to capture the range of emotion in speech. Speaking isn't just about matching text to spoken words: humans can deliver the same sentence in countless ways, and the emotion can shift even within a single sentence or word.
For example, the sentence “How can you do this to me?” can sound completely different when said in a sad voice versus an angry one. Similarly, an emotion like “So sad because her father passed away but showing a smile” is complex and not easily categorized. Humans naturally speak with different emotions, making conversations rich and diverse.
Text-to-speech technology has improved rapidly alongside large language models such as ChatGPT, LaMDA, LLaMA, Bard, and Claude. Emotional text-to-speech, however, still requires large amounts of labeled data, which is hard to gather: capturing different emotional nuances through voice recordings is challenging and time-consuming.
Kim explained that conventional emotional speech synthesis requires every piece of training data to carry a specific emotion label, and often an extra emotion encoder or reference audio as well. That is a challenge because labeled data must exist for every emotion and every speaker, and mislabeled recordings and the difficulty of extracting emotion intensity create further problems. Cross-speaker emotion transfer becomes even harder when an unseen emotion is assigned to a speaker, often producing unnatural output, and controlling emotion intensity is also problematic.
To address these issues, the researchers first fed emotion labels into a generative deep neural network, a novel approach but one that proved insufficient for sophisticated emotional expression. They then developed an unsupervised learning algorithm that discerns speaking styles and emotions from a large database. The entire model was trained without emotion labels, producing numerical representations that can be plugged into text-to-speech algorithms.
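For a rough sense of what "training without emotion labels" can look like in practice, here is a minimal PyTorch sketch of a reference-style encoder conditioning a toy text-to-speech decoder. All class names, dimensions, and architecture choices are illustrative assumptions, not Typecast's actual model.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Compresses a reference mel-spectrogram into a fixed-size style vector."""
    def __init__(self, n_mels=80, style_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, style_dim)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)                     # h: (1, batch, 256)
        return torch.tanh(self.proj(h[-1]))      # style: (batch, style_dim)

class TextToSpeech(nn.Module):
    """Toy acoustic model: text tokens + speaker vector + style vector -> mel frames."""
    def __init__(self, vocab_size=256, spk_dim=64, style_dim=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256)
        self.decoder = nn.GRU(256 + spk_dim + style_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, tokens, speaker, style):   # tokens: (batch, T)
        x = self.embed(tokens)
        cond = torch.cat([speaker, style], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.decoder(torch.cat([x, cond], dim=-1))
        return self.to_mel(out)

# Training needs only (text, speaker, audio) triples: the style vector becomes
# whatever the encoder must capture to reconstruct the reference audio, so no
# emotion labels are required.
```

Because reconstruction is the only objective in such a setup, prosody and emotion end up encoded in the style vector simply because the decoder needs them to reproduce the recording.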
A perception neural network was then trained to translate natural-language emotion descriptions into those same representations. Because the technology learns from a large database of emotional voices, it removes the need for extensive recordings of every speaking style and emotion.
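The perception-network idea can be pictured as a text encoder that maps a free-form emotion description into the same latent style space as the audio encoder. The architecture and alignment loss below are assumptions for illustration only, not a description of Typecast's system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionTextEncoder(nn.Module):
    """Maps a tokenized emotion description (e.g. 'sad but smiling') to a style vector."""
    def __init__(self, vocab_size=30000, style_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256)
        self.rnn = nn.GRU(256, 256, batch_first=True)
        self.proj = nn.Linear(256, style_dim)

    def forward(self, tokens):                   # tokens: (batch, T)
        _, h = self.rnn(self.embed(tokens))
        return torch.tanh(self.proj(h[-1]))

def alignment_loss(described_style, audio_style):
    # Pull the description's embedding toward the style vector extracted from a
    # matching emotional recording; cosine distance is a simple stand-in here.
    return 1.0 - F.cosine_similarity(described_style, audio_style).mean()
```

Once the two spaces are aligned, a written description such as "so sad but showing a smile" can stand in for a reference recording when driving synthesis.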
The researchers achieved controllable emotion speech synthesis using latent representation, domain adversarial training, and cycle-consistency loss to separate the speaker from the style. The technology learns from vast quantities of recorded human voices to analyze emotional patterns, tones, and inflections. It successfully transfers emotion to a neutral speaker using just a few labeled samples, and emotion intensity can be controlled easily.
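Two of the ingredients named here, domain-adversarial training and intensity control, can be sketched roughly as follows. The gradient-reversal layer and the linear interpolation are common stand-in techniques, not Typecast's implementation, and the cycle-consistency term is only summarized in a comment.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class SpeakerAdversary(nn.Module):
    """Tries to guess the speaker from the style vector; the reversed gradient
    pushes the style encoder to hide speaker identity, separating speaker from style."""
    def __init__(self, style_dim=128, n_speakers=100):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(style_dim, 256), nn.ReLU(), nn.Linear(256, n_speakers)
        )

    def forward(self, style, lam=1.0):
        return self.classifier(GradReverse.apply(style, lam))

def apply_intensity(neutral_style, emotion_style, intensity=1.0):
    # Interpolate from the speaker's neutral delivery toward the target emotion:
    # 0.0 keeps the neutral style, 1.0 applies the full transferred emotion.
    return neutral_style + intensity * (emotion_style - neutral_style)

# Cycle-consistency (sketch): synthesize speaker A's voice with speaker B's
# emotion, re-encode the result, and penalize any drift in speaker identity or
# in the transferred style, which keeps the two factors cleanly separated.
```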
Users can record a basic snippet of their voice, apply a range of emotions, and the AI adapts to their specific voice characteristics. They can also select different types of emotional speech recorded by others, applying that style to their voice while preserving their unique identity. With just five minutes of recorded voice, users can express various emotions even if they initially spoke in a normal tone.
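Putting the hypothetical pieces above together, the user-facing flow might look something like the following sketch. It reuses the toy components defined earlier, the tensors are random placeholders, and none of the names come from Typecast's product.

```python
import torch

style_encoder = StyleEncoder()                    # from the sketches above
tts = TextToSpeech()

# 1. Enroll the user: a few minutes of plain speech yield a neutral style vector
#    (and, in a real system, a learned speaker embedding; random stand-in here).
user_neutral_mel = torch.randn(1, 500, 80)        # placeholder mel frames
user_neutral_style = style_encoder(user_neutral_mel)
user_speaker = torch.randn(1, 64)                 # stand-in speaker embedding

# 2. Borrow an emotion recorded by someone else.
donor_angry_mel = torch.randn(1, 500, 80)
donor_angry_style = style_encoder(donor_angry_mel)

# 3. Blend at 70% intensity and synthesize in the user's own voice.
blended = apply_intensity(user_neutral_style, donor_angry_style, intensity=0.7)
tokens = torch.randint(0, 256, (1, 40))           # placeholder text tokens
mel_out = tts(tokens, user_speaker, blended)      # (1, 40, 80) mel frames
```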
Typecast’s technology has been used by companies including Samsung Securities and LG Electronics, and the startup has raised $26.8 million since its founding in 2017. It is also working on applying its speech synthesis technology to facial expressions.
Controllability in generative AI is essential for content creation, especially in today’s fast-paced media environment where short-form videos dominate. High-quality, expressive voice is crucial for corporate messaging, and fast, affordable production is key. Manual work by human actors is inefficient, and these generative AI technologies help people and companies unleash their creative potential and boost productivity.