Foundation models have significantly improved robotics, leading to the development of vision-language-action (VLA) models that can handle objects, scenes, and tasks beyond their initial training data. However, adoption of these models has been held back by their closed nature and by the lack of best practices for deploying them and adapting them to new environments.
To tackle these issues, researchers from Stanford University, UC Berkeley, Toyota Research Institute, Google DeepMind, and other labs have introduced OpenVLA, an open-source VLA model trained on a diverse collection of real-world robot demonstrations.
According to the researchers, OpenVLA outperforms comparable models on robotic manipulation tasks and can be readily fine-tuned for multi-task settings that involve multiple objects. It is also optimized to run on consumer-grade GPUs, which keeps fine-tuning cost-effective.
Foundation models are becoming integral to robotics, and OpenVLA aims to make these models more accessible and customizable for various companies and research labs.
Classic robotic manipulation policies often fail to generalize beyond their training data and struggle with unseen objects and new task instructions. In contrast, large language models (LLMs) and vision-language models (VLMs) capture a broader range of world knowledge from their extensive pretraining datasets, enabling better generalization.
Recent research has started utilizing LLMs and VLMs in training robotic policies. One approach involves using pre-trained LLMs and VLMs as components in task planning and execution systems. Another method is training vision-language-action models (VLAs) from scratch to directly control robot actions. Notable examples of VLAs include RT-2 and RT-2-X, which have set new benchmarks for generalist robot policies.
However, current VLAs face two main challenges: they are closed, offering little visibility into their architecture, training procedures, and data, and there are no established best practices for adapting them to new robots, environments, and tasks.
The researchers argue that to foster future development and innovation, robotics needs open-source, versatile VLAs that support effective fine-tuning and adaptation, similar to the ecosystem around open-source language models.
OpenVLA is a 7-billion-parameter open-source VLA built on the Prismatic-7B vision-language model. It combines a two-part visual encoder that processes input images with a Llama-2 7B language backbone that handles language instructions.
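As a rough illustration of this design, the sketch below shows how patch features from two vision encoders can be fused and projected into a language model's token space. The encoder stand-ins and dimensions are illustrative assumptions, not the exact OpenVLA implementation.

```python
# Schematic sketch (not the official implementation): how a two-part visual
# encoder can feed a language-model backbone. Encoder stand-ins and
# dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class TwoPartVisualEncoder(nn.Module):
    def __init__(self, dim_a=1024, dim_b=1152, llm_dim=4096):
        super().__init__()
        # Stand-ins for two pretrained vision backbones producing patch features.
        self.encoder_a = nn.Identity()   # outputs features of size dim_a
        self.encoder_b = nn.Identity()   # outputs features of size dim_b
        # Projector maps the concatenated patch features into the LLM token space.
        self.projector = nn.Linear(dim_a + dim_b, llm_dim)

    def forward(self, feats_a, feats_b):
        # feats_*: (batch, num_patches, dim_*) patch embeddings from each encoder
        fused = torch.cat([feats_a, feats_b], dim=-1)
        return self.projector(fused)     # (batch, num_patches, llm_dim)

# The projected patch tokens are then fed to the Llama-2 backbone alongside
# the tokenized language instruction, like ordinary text tokens.
```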
To develop OpenVLA, the researchers fine-tuned the Prismatic model on 970,000 robot manipulation trajectories from the Open X-Embodiment dataset, which spans a wide variety of robot embodiments, tasks, and scenes. The model was configured to output special tokens that correspond to robot actions.
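One common way to make a language model emit actions is to discretize each continuous action dimension into a fixed number of bins and reserve a token ID for each bin. The sketch below illustrates the idea; the bin count, token offset, and 7-DoF action format are assumptions for illustration rather than OpenVLA's exact scheme.

```python
# Illustrative sketch of action tokenization: each continuous action dimension
# is discretized into N_BINS bins, and each bin index maps to a reserved token
# ID. Bin count and token offset are assumptions, not OpenVLA's exact values.
import numpy as np

N_BINS = 256                      # bins per action dimension (assumed)
ACTION_TOKEN_OFFSET = 31744       # hypothetical start of reserved token IDs

def actions_to_tokens(action, low, high):
    """Map a continuous action vector to discrete action-token IDs."""
    action = np.clip(action, low, high)
    bins = np.round((action - low) / (high - low) * (N_BINS - 1)).astype(int)
    return ACTION_TOKEN_OFFSET + bins

def tokens_to_actions(token_ids, low, high):
    """Invert the mapping at inference time to recover a continuous action."""
    bins = np.asarray(token_ids) - ACTION_TOKEN_OFFSET
    return low + bins / (N_BINS - 1) * (high - low)

# Example: a 7-DoF end-effector action (position delta, rotation delta, gripper)
low, high = np.full(7, -1.0), np.full(7, 1.0)
tokens = actions_to_tokens(np.array([0.1, -0.2, 0.0, 0.3, 0.0, 0.0, 1.0]), low, high)
recovered = tokens_to_actions(tokens, low, high)
```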
OpenVLA can take a natural language instruction like “wipe the table” along with an input image from a camera. It then reasons over the instruction and visual input to decide the sequence of action tokens needed for the robot to perform the task.
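In practice, a single control step looks roughly like the sketch below, which assumes the released checkpoint can be loaded through the Hugging Face transformers API and exposes a predict_action helper; the model ID, prompt format, and unnorm_key are assumptions that should be checked against the OpenVLA repository.

```python
# Hedged sketch of one inference step, assuming a Hugging Face-style checkpoint
# with a predict_action helper. Model ID, prompt format, and unnorm_key are
# assumptions; verify them against the OpenVLA release before use.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

image = Image.open("camera_frame.png")  # current camera observation
prompt = "In: What action should the robot take to wipe the table?\nOut:"

inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
# predict_action decodes the generated action tokens back into a continuous
# end-effector command (assumed helper exposed by the released codebase).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```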
The researchers state that OpenVLA outperforms the previous leading model, the 55-billion-parameter RT-2-X, in tasks using the WidowX and Google Robot embodiments.
Experiments with efficient fine-tuning strategies showed that OpenVLA handles a range of manipulation tasks, from object pick-and-place to cleaning a table, and that fine-tuning improves performance on tasks that map language instructions to multi-task, multi-object behaviors.
Unlike previous models that excel in either narrow single-instruction or diverse multi-instruction tasks, OpenVLA maintains a success rate of at least 50% across all tested tasks. This makes it a strong candidate for imitation learning tasks, particularly those involving a wide range of language instructions.
To make OpenVLA more accessible and compute-efficient, the researchers employed optimization techniques like low-rank adaptation (LoRA). This allowed them to fine-tune OpenVLA on a new task within 10-15 hours on a single A100 GPU, an 8x reduction in compute time compared to full fine-tuning. Model quantization techniques further reduced the model size, enabling it to run on consumer-grade GPUs without a significant drop in performance.
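The sketch below shows what such a parameter-efficient setup can look like using the Hugging Face peft and bitsandbytes integrations; the rank, target modules, and other hyperparameters are illustrative assumptions rather than the authors' exact recipe.

```python
# Hedged sketch of LoRA fine-tuning combined with 4-bit quantized loading,
# via the Hugging Face peft and bitsandbytes integrations. Checkpoint name
# and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit weights to fit on a smaller GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",              # assumed checkpoint name
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# Attach low-rank adapters so only a small set of parameters is trained.
lora_config = LoraConfig(
    r=32,                               # low-rank dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",        # adapt all linear projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # only a small fraction is trainable
```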
The researchers have released OpenVLA's model checkpoints, deployment and fine-tuning notebooks, and codebase as open-source resources to support future research on, and adaptation of, VLAs for robotics. The codebase supports fine-tuning on individual GPUs, training billion-parameter VLAs on multi-node GPU clusters, and modern optimization and parallelization techniques.
Looking ahead, the researchers plan to enhance OpenVLA by allowing it to support multiple image and proprioceptive inputs, as well as observation history. They also suggest that using vision-language models pre-trained on interleaved image and text data could further facilitate flexible-input VLA fine-tuning.