Researchers at the University of Tokyo and Alternative Machine have created a humanoid robot system called Alter3, which can directly translate natural language commands into robot actions. This robot leverages the extensive knowledge in large language models (LLMs) like GPT-4 to perform complex tasks such as taking selfies or mimicking a ghost.
This development is part of a growing trend that combines the capabilities of foundation models with robotics. Although these systems are not yet commercially scalable, they have significantly advanced robotics research and show great potential.
How LLMs control robots
Alter3 uses GPT-4 as its backend model. When it receives a natural language instruction, the model first acts as a planner, breaking the goal down into the sequence of steps the robot must perform to accomplish the desired action.
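As a rough illustration of the planning stage, the sketch below prompts GPT-4 to decompose an instruction into numbered steps. It assumes the OpenAI Python client; the prompt wording and the function name (plan_actions) are illustrative, not taken from the Alter3 codebase.

```python
# Hypothetical sketch of the planning stage: GPT-4 breaks a natural language
# instruction into an ordered list of physical steps for the robot.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PLANNER_PROMPT = (
    "You control a humanoid robot with a head, two arms, and an upper torso. "
    "Break the user's instruction into a short numbered list of physical steps "
    "the robot should perform, one movement per step."
)

def plan_actions(instruction: str) -> str:
    """Ask the model to act as a planner and return a step-by-step plan."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PLANNER_PROMPT},
            {"role": "user", "content": instruction},
        ],
    )
    return response.choices[0].message.content

# Example: plan_actions("Take a selfie with your phone") might return steps
# such as "1. Raise the right arm toward the face. 2. Tilt the head. ..."
```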
The action plan is then passed to a coding agent, which generates the commands needed for the robot to execute each step. Since GPT-4 isn’t trained on Alter3’s programming commands, researchers use its in-context learning ability to adapt its behavior to the robot’s API. This involves providing a list of commands and examples in the prompt, allowing the model to map each step to one or more API commands for the robot to execute.
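The coding stage can be sketched in a similar way: the prompt carries a description of the robot's control API plus a few worked examples, and the model maps each plan step to API calls. This sketch continues the one above (it reuses the same client); the command name set_axis and the axis labels are hypothetical stand-ins for Alter3's actual control interface.

```python
# Hypothetical sketch of the coding stage: in-context learning maps plan steps
# to robot API calls. The API description and few-shot example are illustrative.
CODER_PROMPT = """You translate motion steps into robot control code.
Available command (hypothetical):
  set_axis(axis_name, value, duration)  # value in [0, 1], duration in seconds

Example:
Step: "Raise the right arm above the head"
Code:
  set_axis("right_shoulder_pitch", 1.0, 2.0)
  set_axis("right_elbow", 0.2, 2.0)

Return only code, one set_axis call per line."""

def generate_robot_code(client, plan: str) -> str:
    """Map a step-by-step plan to robot API commands via few-shot prompting."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CODER_PROMPT},
            {"role": "user", "content": f"Translate each step into code:\n{plan}"},
        ],
    )
    return response.choices[0].message.content
```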
Learning from human feedback
Language isn’t always precise for describing physical poses, so the action sequence generated by the model might not perfectly produce the desired behavior. To address this, the researchers added functionality for human feedback, allowing instructions like “Raise your arm a bit more.” These are sent to another GPT-4 agent, which adjusts the code and returns the revised action sequence to the robot. The refined action and code are stored in a database for future use.
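A feedback loop of this kind can be sketched as follows: the current code and the user's verbal correction go back to the model, and the revised code is saved under the motion's name for later reuse. The storage scheme (a plain dictionary standing in for the researchers' database) and the prompt wording are assumptions for illustration.

```python
# Hypothetical sketch of the refinement loop: a second GPT-4 call revises the
# generated code based on verbal feedback, and the result is cached for reuse.
motion_memory: dict[str, str] = {}  # stands in for the researchers' database

def refine_motion(client, name: str, code: str, feedback: str) -> str:
    """Revise robot control code using human feedback and store the result."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Revise the robot control code according to the feedback. Return only code."},
            {"role": "user", "content": f"Current code:\n{code}\n\nFeedback: {feedback}"},
        ],
    )
    revised = response.choices[0].message.content
    motion_memory[name] = revised  # reuse the refined motion the next time it is requested
    return revised

# Example: refine_motion(client, "take_selfie", code, "Raise your arm a bit more")
```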
The researchers tested Alter3 on various tasks, including everyday actions like taking selfies and drinking tea, as well as mimicry motions like pretending to be a ghost or a snake. They also evaluated the model’s ability to respond to scenarios requiring elaborate action planning.
GPT-4’s extensive knowledge of human behaviors and actions enables it to create more realistic behavior plans for humanoid robots like Alter3. The experiments showed that the robot could even mimic emotions such as embarrassment and joy.
More advanced models
The use of foundation models is becoming more common in robotics research. For instance, robotics startup Figure, valued at $2.6 billion, uses OpenAI models to understand human instructions and perform real-world actions. As multi-modality becomes standard in foundation models, robotics systems will better understand their environment and choose appropriate actions.
Alter3 is part of a group of projects that use off-the-shelf foundation models for reasoning and planning in robotics control systems. It doesn’t use a fine-tuned version of GPT-4, and the researchers note that the code can be applied to other humanoid robots.
Other projects like RT-2-X and OpenVLA use specialized foundation models designed to directly produce robotics commands. These models generally yield more stable results and can handle more tasks and environments, but they require technical expertise and are more costly to develop.
One often overlooked challenge in these projects is building robots capable of basic tasks like grasping objects, maintaining balance, and moving around. As AI and robotics research scientist Chris Paxton noted, “There’s a lot of other work that goes on at the level below that those models aren’t handling. And that’s the kind of stuff that is hard to do. And in a lot of ways, it’s because the data doesn’t exist.”