Startups like ElevenLabs have raised significant funds to develop proprietary algorithms and AI software for voice cloning: audio programs that mimic a user's voice.
Now, there’s a new option: OpenVoice, developed by researchers from MIT, Tsinghua University in Beijing, and AI startup MyShell. OpenVoice offers open-source voice cloning that is almost instantaneous and provides detailed controls not available on other platforms.
OpenVoice lets you clone voices with exceptional precision. You can adjust the tone, emotion, accent, rhythm, pauses, and intonation using just a small audio clip.
The team has also shared a link to its research paper describing how OpenVoice was developed, and provides access through the MyShell web app (account required) and Hugging Face (publicly accessible).
VentureBeat contacted lead researcher Zengyi Qin of MIT and MyShell. Qin stated that MyShell aims to benefit the entire research community by providing resources such as grants, datasets, and computing power to support open-source research. The core principle of MyShell is "AI for All."
Qin explained the motivation behind starting with an open-source voice cloning AI model. He mentioned that in the field of research, while language and vision have good open-source models, there was a need for a powerful, instant voice cloning model that everyone could customize. This gap led them to develop OpenVoice.
When I tested the new OpenVoice model on Hugging Face, I quickly generated a relatively convincing, though somewhat robotic, clone of my voice from a few seconds of unscripted speech.
Unlike other apps, I didn’t need to read a specific text; I just spoke freely for a few seconds, and the model created a voice clone that played back my provided text prompt almost instantly.
I could also change the “style” to various emotions like cheerful, sad, friendly, or angry using a dropdown menu and noticed significant changes in tone.
To create OpenVoice, the team used two AI models: a text-to-speech (TTS) model and a “tone converter.” The TTS model controls style parameters and languages and was trained on 30,000 audio samples from speakers with American, British, Chinese, and Japanese accents. The tone converter was trained on over 300,000 samples from more than 20,000 speakers. Both models convert human speech into phonemes, which are then used to clone the user’s voice with added emotional expression.
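The two-stage design described above can be sketched in code. This is a simplified, hypothetical illustration of the pipeline's structure, not the actual OpenVoice API: all class and function names here are invented for clarity. A base TTS model renders the text with the requested style, and a separate tone converter then maps the result toward the reference speaker's voice.

```python
# Hypothetical sketch of a two-stage voice-cloning pipeline in the spirit of
# OpenVoice's design (base TTS + tone converter). Names and data structures
# are illustrative placeholders, not the real OpenVoice implementation.

from dataclasses import dataclass


@dataclass
class Audio:
    samples: list   # placeholder for waveform data
    style: str      # e.g. "cheerful", "sad", "friendly", "angry"
    tone_color: str # speaker identity, simplified here to a label


def base_tts(text: str, style: str) -> Audio:
    """Stage 1: synthesize speech in a neutral base voice, applying the
    requested style parameters (emotion, accent, rhythm, etc.)."""
    return Audio(samples=[ord(c) for c in text], style=style, tone_color="base")


def extract_tone_color(reference_clip: Audio) -> str:
    """Derive the target speaker's tone identity from a short reference clip."""
    return reference_clip.tone_color


def convert_tone(audio: Audio, tone_color: str) -> Audio:
    """Stage 2: re-color the styled base audio with the reference
    speaker's tone, preserving the style chosen in stage 1."""
    return Audio(samples=audio.samples, style=audio.style, tone_color=tone_color)


# Usage: clone a voice from a few seconds of reference audio.
reference = Audio(samples=[0, 1, 2], style="neutral", tone_color="speaker_42")
styled = base_tts("Hello world", style="cheerful")
cloned = convert_tone(styled, extract_tone_color(reference))
print(cloned.style, cloned.tone_color)  # cheerful speaker_42
```

Splitting the problem this way is the "manageable subtasks" idea the researchers describe: the style controls live entirely in the first model, so the tone converter only has to learn speaker identity.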
The approach is conceptually simple but effective, using fewer compute resources than other methods like Meta’s Voicebox.
MyShell aims to develop the most flexible instant voice cloning model, allowing control over style, emotion, and accent in any language. The team achieved this by breaking down the complex task into manageable subtasks.
Founded in 2023 with a $5.6 million seed round led by INCE Capital, MyShell already has over 400,000 users. It describes itself as a decentralized and comprehensive platform for discovering, creating, and staking AI-native apps.
MyShell’s web app includes various AI characters and bots with different personalities, similar to Character.AI, and offers tools like an animated GIF maker and user-generated text-based RPGs.
To generate revenue, MyShell charges a monthly subscription to web app users and to third-party bot creators who want to promote their products. It also sells AI training data.