Meta Introduces Audiobox: A Revolutionary AI for Voice Cloning and Ambient Sound Generation

Voice cloning is growing quickly thanks to advances in generative AI. The technology recreates a person’s voice, capturing their pitch, timbre, rhythms, mannerisms, and unique pronunciations.

Several startups like ElevenLabs have received substantial funding for voice cloning technology. Meanwhile, Meta Platforms, the parent company of Facebook, Instagram, WhatsApp, and Oculus VR, has launched its own free voice cloning program, Audiobox, though it comes with some limitations.

Audiobox was introduced today on Meta’s website by researchers at the Facebook AI Research (FAIR) lab, which describes it as a new foundation model for audio generation. It builds on Meta’s earlier Voicebox work and can generate voices and sound effects from voice inputs and text prompts, making it easy to create custom audio: enter a sentence or describe a sound, and Audiobox handles the rest. Users can even record their own voice and have Audiobox clone it.

Meta has developed a “family of models” through Audiobox: one for mimicking speech and another for producing ambient sounds and effects like dogs barking or sirens. These models are based on a shared self-supervised learning model known as Audiobox SSL.

Self-supervised learning (SSL) is a deep learning technique in which a model derives its training labels from the data itself rather than relying on human-annotated labels, as in conventional supervised learning. In a scientific paper, the researchers explained that because reliably labeled audio is scarce, they trained on unlabeled audio instead, which allows for greater data scaling and generalization.
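To make the self-supervised idea concrete, here is a minimal toy sketch of one common SSL setup, masked prediction, where parts of an unlabeled sequence are hidden and the hidden values themselves become the training targets. This is purely illustrative and is not Meta’s Audiobox SSL training code; the function name and the numeric "signal" are invented for the example.

```python
# Toy illustration of self-supervised labeling: training targets come
# from the data itself, not from human annotation.
# Illustrative only -- not Meta's actual Audiobox SSL pipeline.

def make_masked_pairs(sequence, mask_token=None, stride=3):
    """Turn an unlabeled sequence into (masked_input, target) pairs by
    hiding every `stride`-th element; the hidden value is the label."""
    pairs = []
    for i in range(0, len(sequence), stride):
        masked = list(sequence)
        target = masked[i]
        masked[i] = mask_token  # hide the value the model must predict
        pairs.append((masked, target))
    return pairs

# An "unlabeled" audio-like signal: no human-provided labels anywhere.
signal = [0.1, 0.4, 0.2, 0.9, 0.3, 0.7]
for masked, target in make_masked_pairs(signal):
    print(masked, "->", target)
```

Because the labels are manufactured from the raw data, this style of training can scale to the very large unlabeled corpora the paper describes, which is exactly the motivation the researchers cite.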

Leading AI models often rely heavily on human-created data for training, and Audiobox is no different. The FAIR researchers used “160K hours of speech (primarily English), 20K hours of music, and 6K hours of sound samples,” covering a wide range of audio types and acoustic conditions. This data includes speakers from over 150 countries, representing more than 200 primary languages to ensure fairness and diversity.

The paper did not specify the exact sources of this data or whether it is publicly available, which raises important questions, especially as multiple AI companies face lawsuits over training on potentially copyrighted material. According to Meta, Audiobox was trained on publicly available and licensed datasets, though specific sources were not disclosed.

To demonstrate Audiobox’s capabilities, Meta released interactive demos that let users record their voices and hear them cloned. Users can also generate new voices based on text descriptions or modify recorded voices. For example, a prompt like “dogs barking” can produce realistic sound effects.

However, Meta includes a disclaimer that Audiobox is a research demo and cannot be used for commercial purposes. Additionally, it’s not available to residents of Illinois or Texas due to state laws governing the collection of voice data.

Interestingly, despite Meta’s recent open-source commitment with its Llama 2 models, Audiobox is not open source. Instead of a public release, Meta plans to invite researchers and institutions to apply for grants to conduct safety and responsibility research with Audiobox, and is selectively granting access to researchers with a track record in speech research in order to improve research quality and support responsible AI work.

Currently, Audiobox cannot be used for business or by residents of certain states. However, with rapid advancements in AI, commercial versions may soon emerge, potentially from Meta or other companies.