Researchers from Stanford University and Meta’s Facebook AI Research (FAIR) lab have created an AI system that enables virtual humans to perform synchronized actions with objects based solely on text instructions. The system, named CHOIS (Controllable Human-Object Interaction Synthesis), uses conditional diffusion models to produce interactions such as “lift the table above your head, walk, and put the table down.”
This new development, detailed in a paper published on arXiv, hints at a future where virtual beings can comprehend and execute language commands as smoothly as humans.
The CHOIS system generates realistic, synchronized movements between virtual humans and objects in 3D environments. Given the initial positions of the human and the object, plus a language description of the task, it produces a motion sequence that faithfully completes that task. For example, given “move the lamp closer to the sofa,” CHOIS produces a lifelike animation of an avatar picking up the lamp and placing it near the sofa.
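To make that pipeline concrete, here is a minimal sketch of conditional diffusion sampling over a motion sequence. This is not the authors' code: the toy denoiser, the frame count, the feature dimensions, and the conditioning layout are all illustrative assumptions; only the overall shape (noise in, condition-guided motion out) reflects the approach described in the paper.

```python
# Sketch: sample a human-object motion sequence from a conditional diffusion
# model, conditioned on the initial scene state and a text description.
import torch
import torch.nn as nn

T_FRAMES, D_MOTION, D_COND = 120, 75, 512  # frames; pose+object dims; condition dims (assumed)

class ToyDenoiser(nn.Module):
    """Stand-in for the learned denoising network (not the paper's architecture)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_MOTION + D_COND + 1, 256), nn.SiLU(),
            nn.Linear(256, D_MOTION),
        )

    def forward(self, x_t, cond, t):
        # Predict the noise in each frame from the noisy motion, the condition,
        # and the diffusion timestep.
        t_feat = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

@torch.no_grad()
def sample_motion(model, cond, steps=50):
    """DDPM-style ancestral sampling over the whole motion sequence."""
    x = torch.randn(1, T_FRAMES, D_MOTION)          # start from pure noise
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    for i in reversed(range(steps)):
        t = torch.full((1, 1, 1), i / steps)
        eps = model(x, cond, t)                     # predicted noise
        x = (x - betas[i] / torch.sqrt(1 - alpha_bar[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x  # (1, T_FRAMES, D_MOTION): human pose + object pose per frame

# Condition = initial human/object state + text embedding, tiled over frames.
cond = torch.randn(1, T_FRAMES, D_COND)             # placeholder conditioning
motion = sample_motion(ToyDenoiser(), cond)
print(motion.shape)
```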
What sets CHOIS apart is its use of sparse object waypoints alongside the language description: the waypoints mark key points along the object’s trajectory, ensuring the motion is not only physically plausible but also aligned with the high-level goal expressed in the instruction. CHOIS thereby couples language understanding with physically grounded motion generation, a challenging feat for traditional models: it translates a verbal command into a series of movements that respect the constraints of both the human body and the object involved.
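One plausible way to pack these two signals into a single per-frame condition is sketched below. The waypoint frames, the mask channel, and the 512-dimensional sentence embedding are assumptions for illustration, not the paper’s exact encoding.

```python
# Sketch: combine sparse object waypoints and a text embedding into one
# per-frame conditioning tensor for the denoiser.
import torch

T_FRAMES = 120
waypoint_xyz = torch.zeros(T_FRAMES, 3)    # target object position per frame
waypoint_mask = torch.zeros(T_FRAMES, 1)   # 1 where a waypoint is specified

# Sparse waypoints: only a few frames carry an object-position target.
for frame, xyz in [(0, (0.0, 0.0, 0.0)), (60, (1.0, 0.0, 1.5)), (119, (2.0, 0.0, 0.0))]:
    waypoint_xyz[frame] = torch.tensor(xyz)
    waypoint_mask[frame] = 1.0

text_emb = torch.randn(512)                # e.g. a frozen sentence embedding (assumed)
cond = torch.cat(
    [waypoint_xyz, waypoint_mask, text_emb.expand(T_FRAMES, -1)], dim=-1
)  # (T_FRAMES, 3 + 1 + 512): per-frame condition fed to the denoiser
```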
CHOIS’s approach stands out for its precise handling of contact points, such as hands touching objects, ensuring that the object’s movement tracks the avatar’s actions. The model uses specialized loss functions during training and guidance terms during sampling to maintain these physical constraints, marking significant progress toward AI that can understand and interact with the physical world the way humans do.
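A hedged sketch of what such a contact guidance term could look like follows: a penalty on the gap between the avatar’s hand and the object, whose gradient nudges the sample between denoising steps. The feature indices, the contact window, and the step size are hypothetical; the paper’s actual formulation may differ.

```python
# Sketch: contact-style guidance that pulls the hand toward the object
# during sampling by stepping against the gradient of a distance penalty.
import torch

HAND_SLICE = slice(0, 3)     # assumed location of hand xyz in the motion vector
OBJ_SLICE = slice(72, 75)    # assumed location of object xyz in the motion vector

def contact_loss(x, contact_frames):
    """Mean squared hand-object distance over frames where contact should hold."""
    hand = x[:, contact_frames, HAND_SLICE]
    obj = x[:, contact_frames, OBJ_SLICE]
    return ((hand - obj) ** 2).sum(dim=-1).mean()

def guide(x, contact_frames, step_size=0.1):
    """One gradient step enforcing the contact constraint mid-sampling."""
    x = x.detach().requires_grad_(True)
    loss = contact_loss(x, contact_frames)
    (grad,) = torch.autograd.grad(loss, x)
    return (x - step_size * grad).detach()

x = torch.randn(1, 120, 75)                          # a partially denoised sample
x = guide(x, contact_frames=list(range(30, 90)))     # contact assumed in frames 30-89
```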
The implications for computer graphics, AI, and robotics are substantial. In animation, CHOIS could streamline the creation of complex scenes, reducing the effort needed for detailed keyframe work. In virtual reality, it could enable more interactive experiences in which users direct virtual characters through natural language, producing lifelike interactions.
In AI and robotics, CHOIS represents a significant advance towards more intelligent and context-aware systems. Service robots, for instance, could better understand and perform various tasks described in human language, transforming industries like healthcare, hospitality, and domestic services. This capability moves AI closer to understanding both the “what” and the “how” of human instructions, allowing for adaptability and flexibility in complex tasks.
The work by Stanford and Meta researchers signifies a pivotal step in computer vision, natural language processing, and robotics. Their progress not only enhances current AI systems but also paves the way for future research into synthesizing human-object interactions in 3D environments based on language input, driving towards more sophisticated AI technologies.