Google’s Innovative Multimodal AI VideoPoet Stuns with its Visual Capabilities

Yesterday, I wondered if Google could ever nail an AI product release on the first go. It looks like they just might have, based on their latest research.

This week, a team of 31 researchers at Google Research introduced VideoPoet, a new large language model (LLM) built for a variety of video generation tasks.

The creation of an LLM specifically for video tasks is noteworthy. Most leading video generation models to date have used diffusion-based methods, widely considered the top performers in the field. These models typically start from a pretrained image model, such as Stable Diffusion, that produces high-quality images for individual frames, and are then fine-tuned for better temporal consistency across frames.

In contrast, Google Research chose to use an LLM, a type of AI model based on the transformer architecture, usually employed for text and code generation in applications like ChatGPT, Claude 2, or Llama 2. Instead of training it for text and code, they trained it to generate videos.
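
To make that concrete, here is a minimal, hypothetical sketch (in PyTorch, and emphatically not Google's actual code) of what "an LLM for video" means in practice: frames are first quantized into discrete visual tokens, and a standard decoder-only transformer is trained to predict the next token, exactly the way a text LLM predicts the next word. The vocabulary size, sequence length, and model dimensions below are made up for illustration.

```python
# Illustrative sketch only -- not VideoPoet's architecture or code.
# Once video frames are quantized into discrete "visual tokens," a
# decoder-only transformer can model them the way a text LLM models
# words: by predicting the next token in the sequence.
import torch
import torch.nn as nn

VOCAB_SIZE = 8192   # hypothetical size of the visual-token codebook
SEQ_LEN = 256       # tokens per training example (a short clip's worth)
D_MODEL = 512

class TinyVideoLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        # tokens: (batch, seq) integer visual-token ids
        b, t = tokens.shape
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # causal mask so each position only attends to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # logits over the next visual token

model = TinyVideoLM()
fake_clip_tokens = torch.randint(0, VOCAB_SIZE, (2, SEQ_LEN))
logits = model(fake_clip_tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), fake_clip_tokens[:, 1:].reshape(-1)
)
loss.backward()  # standard next-token-prediction training step
```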

Pre-training played a crucial role in this development. The team pre-trained the VideoPoet LLM on 270 million videos and over a billion text-and-image pairs from public sources, transforming that data into text embeddings, visual tokens, and audio tokens, which the AI model was conditioned on.
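
As a rough illustration of what that multimodal conditioning could look like (my own assumption, not a description of Google's pipeline), the toy sketch below packs text, video, and audio into a single token sequence that one transformer can be trained on. The stand-in tokenizers, special tokens, and tokens-per-frame figure are all hypothetical.

```python
# Toy sketch of multimodal packing -- assumed layout, not Google's pipeline.
# Each modality is mapped into a shared discrete token space, then the
# sequences are concatenated so a single transformer can be conditioned on
# text (and optionally image/audio) and asked to continue with video tokens.
from typing import List

# Hypothetical special tokens marking modality boundaries.
BOS_TEXT, BOS_VIDEO, BOS_AUDIO = 0, 1, 2

def toy_text_tokens(prompt: str) -> List[int]:
    # Stand-in for a real text tokenizer.
    return [3 + (ord(c) % 100) for c in prompt]

def toy_video_tokens(num_frames: int) -> List[int]:
    # Stand-in for a video tokenizer that quantizes frames into codebook ids.
    return [200 + i for i in range(num_frames * 4)]  # 4 tokens per frame (toy)

def toy_audio_tokens(num_steps: int) -> List[int]:
    # Stand-in for an audio codec that discretizes a waveform.
    return [500 + i for i in range(num_steps)]

def build_training_sequence(prompt: str, frames: int, audio_steps: int) -> List[int]:
    """Pack all modalities into one token stream: the model learns to predict
    the video (and audio) tokens that follow the text prefix."""
    return (
        [BOS_TEXT] + toy_text_tokens(prompt)
        + [BOS_VIDEO] + toy_video_tokens(frames)
        + [BOS_AUDIO] + toy_audio_tokens(audio_steps)
    )

seq = build_training_sequence("a dog surfing at sunset", frames=16, audio_steps=32)
print(len(seq), seq[:12])
```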

The results are remarkable, even when compared to top consumer-facing video generation models like Runway and Pika, the former being a Google investment.

One of VideoPoet’s significant advantages is its ability to produce longer, higher-quality clips with more consistent motion. Current diffusion-based video-generating AIs often struggle to maintain coherent large motions, producing glitchy movement after just a few frames. VideoPoet, by contrast, handles larger, more consistent motion across longer 16-frame videos, and supports capabilities such as simulating different camera motions, applying varied visual styles, and generating matching audio.

By integrating these capabilities within a single LLM, VideoPoet offers a seamless, all-in-one solution for video creation, eliminating the need for multiple specialized components.

In human evaluations, raters preferred clips generated by VideoPoet over those from other models such as Source-1, VideoCrafter, and Phenaki, finding that the VideoPoet clips followed prompts more closely and exhibited more interesting motion.

Furthermore, Google Research designed VideoPoet to produce videos in portrait orientation, catering to the mobile video market popularized by platforms like Snap and TikTok.

Looking ahead, Google Research aims to expand VideoPoet’s abilities to support “any-to-any” generation tasks, such as text-to-audio and audio-to-video, pushing the boundaries of video and audio generation.

The only drawback is that VideoPoet is not yet available for public use. We’ve reached out to Google for more information on its release and will update when we hear back. For now, we’ll have to wait eagerly to see how it stacks up against other tools on the market.