Microsoft’s New Model Elevates AI Video with Trajectory-Based Generation

AI companies are in a race to perfect video generation. Recently, several of them, including Stability AI and Pika Labs, have released models that can create videos from text and image prompts. Microsoft AI has taken this a step further with its new model, DragNUWA, which offers more precise control over video production.

DragNUWA enhances the traditional methods of text and image prompting by incorporating trajectory-based generation. This allows users to manipulate objects or entire video frames along specific paths, resulting in highly controlled video generation that covers semantic, spatial, and temporal aspects while maintaining high-quality output.

Microsoft has made the model weights and a demo for DragNUWA available to the community, inviting them to experiment with it. However, it’s important to remember that this is still a research project and is not yet perfect.

What makes DragNUWA stand out?

Historically, AI video generation has relied on text, image, or trajectory inputs, but each approach has struggled to deliver fine-grained control over the output. Text and images alone, for example, cannot capture the intricate motion details of a video; images and trajectories cannot adequately describe objects that only appear later in the video or their future movement; and language can be ambiguous when distinguishing abstract concepts, such as a real fish from a painting of a fish.

To address these issues, Microsoft’s AI team introduced DragNUWA in August 2023. This open-domain, diffusion-based video generation model integrates images, text, and trajectories to offer highly controlled video generation from semantic, spatial, and temporal aspects. Users can define text, image, and trajectory inputs to control elements like camera movements and object motion in the final video.

For instance, you could upload an image of a boat on a lake, add a text prompt like “a boat sailing in the lake,” and specify the boat’s trajectory. This would create a video showing the boat sailing in the intended direction, combining motion details, future object descriptions, and object distinctions.
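To make the trajectory input concrete, here is a minimal sketch of how the three conditioning signals could be bundled together and how a few user-drawn control points might be expanded into per-frame coordinates. The class and function names are illustrative only and are not DragNUWA's actual API.

```python
from dataclasses import dataclass

@dataclass
class DragRequest:
    """Hypothetical container for DragNUWA's three conditioning signals."""
    image_path: str                         # first frame, e.g. a boat on a lake
    prompt: str                             # text condition
    trajectory: list[tuple[float, float]]   # user-drawn control points (x, y) in pixels
    num_frames: int = 14                    # length of the clip to generate

def interpolate_trajectory(points, num_frames):
    """Expand sparse control points into one (x, y) position per output frame."""
    if len(points) < 2:
        return [points[0]] * num_frames
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1) * (len(points) - 1)   # position along the polyline
        lo = min(int(t), len(points) - 2)
        frac = t - lo
        x = points[lo][0] + frac * (points[lo + 1][0] - points[lo][0])
        y = points[lo][1] + frac * (points[lo + 1][1] - points[lo][1])
        frames.append((x, y))
    return frames

request = DragRequest(
    image_path="boat_on_lake.png",
    prompt="a boat sailing in the lake",
    trajectory=[(120, 300), (400, 280), (760, 250)],   # drag the boat left to right
)
per_frame_points = interpolate_trajectory(request.trajectory, request.num_frames)
```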

DragNUWA in Action

Released on Hugging Face

In the early version 1.5 of DragNUWA, now available on Hugging Face, Microsoft uses Stability AI’s Stable Video Diffusion model to animate images or objects along a specified path. Once refined, this technology could revolutionize video generation and editing, making it easy to transform backgrounds, animate images, and direct motion paths with simple lines.
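Because the 1.5 release builds on Stable Video Diffusion, the underlying image-to-video step can be reproduced with the publicly available checkpoint in Hugging Face's diffusers library. The sketch below shows only that base step, with no trajectory conditioning applied; the checkpoint name and settings are the standard public ones, not anything DragNUWA-specific.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load Stability AI's public image-to-video checkpoint (the model DragNUWA 1.5 builds on).
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# The pipeline expects a 1024x576 conditioning image.
image = load_image("boat_on_lake.png").resize((1024, 576))

# Generate a short clip from the still image; plain SVD offers no trajectory control.
frames = pipe(image, decode_chunk_size=8, num_frames=25).frames[0]
export_to_video(frames, "boat.mp4", fps=7)
```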

The AI community is excited about this development, viewing it as a significant advancement in creative AI. However, its real-world performance remains to be seen. Microsoft’s tests indicate that the model can achieve accurate camera movements and object motions with different drag trajectories.

DragNUWA supports complex curved trajectories, enabling objects to move along intricate paths. It allows for variable trajectory lengths, with longer trajectories producing larger motions. Additionally, it can control the trajectories of multiple objects simultaneously. According to Microsoft, no existing video generation model has achieved this level of trajectory control, which underscores DragNUWA's potential to advance controllable video generation in future applications.
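Extending the earlier illustrative sketch, controlling several objects at once would simply mean supplying one control-point list per object, with longer drags producing larger motions. Again, this mirrors the idea rather than DragNUWA's real input format, and it reuses the hypothetical interpolate_trajectory helper defined above.

```python
# Hypothetical multi-object input: one trajectory per object; longer drags mean larger motion.
trajectories = {
    "boat":  [(120, 300), (400, 280), (760, 250)],   # long drag across the frame
    "cloud": [(600, 80), (650, 85)],                  # short drag, subtle drift
}

per_object_points = {
    name: interpolate_trajectory(points, num_frames=14)
    for name, points in trajectories.items()
}
```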

This work adds to the growing body of research in AI video technology. Recently, Pika Labs garnered attention by launching a text-to-video interface similar to ChatGPT that produces high-quality short videos with various customizations.
