Camb AI, a startup based in Dubai that focuses on AI-driven content localization technologies, has just announced the release of Mars5, an advanced AI model for voice cloning. Although there are several models that can create digital voice replicas, such as those from ElevenLabs, Camb claims that Mars5 stands out by providing a much higher level of realism.
Based on early samples from the company, Mars5 not only copies the original voice but also captures complex aspects like rhythm, emotion, and intonation. Camb also supports nearly four times as many languages as ElevenLabs: Mars5 covers over 140 languages, including less common ones like Icelandic and Swahili, compared to ElevenLabs’ 36. However, the open-source version available on GitHub is currently limited to English. The multi-language version is available through the company’s paid service, Camb Studio.
“The level of prosody and realism that Mars5 is able to capture, even with just a few seconds of input, is unprecedented. This is a pivotal moment in speech,” said Akshat Prakash, the co-founder and CTO of the company.
Traditionally, voice cloning and text-to-speech conversion have been separate technologies. Voice cloning takes a voice sample to create a voice clone, while text-to-speech uses that clone to turn text into synthetic speech. These technologies could potentially make anyone appear as if they are saying anything. With Mars5, Camb AI combines both capabilities into one platform. Users only need to upload an audio file and provide the text content. The model then uses the voice in the audio file as a reference, capturing details like speaking style, emotion, and meaning, to synthesize the provided text as speech.
The company claims that Mars5 can handle various emotional tones and pitches, making it suitable for complex speech scenarios – whether the speaker is frustrated, commanding, calm, or spirited. This makes it ideal for content like sports commentary, movies, and anime.
To achieve this level of detail, Mars5 combines a 750M-parameter autoregressive model with a 450M-parameter non-autoregressive model, both working with 6 kbps EnCodec tokens. According to Prakash, the autoregressive model predicts the coarsest (lowest-level) codebook value, while the non-autoregressive model fills in the remaining codebook values through a process of discrete denoising diffusion.
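To make the division of labor between the two models concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the publicly documented EnCodec 24 kHz configuration (75 frames per second, 1,024-entry codebooks), which Camb has not confirmed for Mars5 in this article; the class and function names are hypothetical and exist only for illustration. Under those assumptions, a 6 kbps stream works out to 8 codebooks per frame, of which the autoregressive model would predict one and the non-autoregressive model the other seven.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecConfig:
    """Hypothetical container for codec settings (not Camb AI's code)."""
    bitrate_bps: int = 6000    # from the article: 6 kbps tokens
    frame_rate_hz: int = 75    # assumption: EnCodec 24 kHz frame rate
    codebook_size: int = 1024  # assumption: entries per codebook

    @property
    def bits_per_token(self) -> int:
        # Bits needed to index one codebook entry (1024 entries -> 10 bits).
        return (self.codebook_size - 1).bit_length()

    @property
    def num_codebooks(self) -> int:
        # Bitrate = frame rate * codebooks per frame * bits per token.
        return self.bitrate_bps // (self.frame_rate_hz * self.bits_per_token)


def token_budget(seconds: float, cfg: CodecConfig = CodecConfig()) -> dict:
    """Estimate how many tokens each model would predict for a clip.

    The autoregressive (AR) model handles the coarsest codebook only;
    the non-autoregressive (NAR) model fills in all remaining codebooks.
    """
    frames = round(seconds * cfg.frame_rate_hz)
    return {
        "frames": frames,
        "ar_tokens": frames,                             # one codebook
        "nar_tokens": frames * (cfg.num_codebooks - 1),  # the rest
    }
```

Calling `token_budget(5.0)` under these assumptions gives 375 frames, so roughly 375 tokens for the autoregressive model and 2,625 for the non-autoregressive one. If Mars5 uses a different frame rate or codebook size, the split changes accordingly.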
No formal benchmarks are available yet, but in early tests with samples, Mars5 outperformed popular speech synthesis models from companies like Metavoice and ElevenLabs. While the other models produced clear synthesized speech, Mars5’s results sounded much closer to the original voice.
Prakash said that although ElevenLabs trains on a larger dataset, Camb’s model design has proven better at learning the nuances of speech. He expects Mars5 to improve further with larger datasets and more training, with successive updates released on GitHub.
Camb AI is also working on another model called Boli, which is designed for translation with contextual understanding, correct grammar, and apt colloquialism. Boli aims to offer a more natural translation experience, particularly for languages that are low- to medium-resource. Prakash said that Boli’s translations are already outperforming mainstream tools like Google Translate and even generative models like ChatGPT.
Currently, both Mars5 and Boli are available through Camb’s platform, Camb Studio, covering 140 languages. The company also offers these capabilities as APIs to businesses and developers. While Prakash didn’t disclose the number of customers, he mentioned that Camb AI is collaborating with Major League Soccer, Tennis Australia, Maple Leaf Sports & Entertainment, major movie and music studios, and several government agencies.
In one notable achievement, Camb AI live-dubbed a Major League Soccer game into four languages simultaneously for more than two hours, an industry first. The company also translated post-match conferences at the Australian Open into multiple languages and translated the psychological thriller “Three” from Arabic to Mandarin.