Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing π₀-FAST's 85.5%. We further demonstrate its broad applicability to real-world ALOHA manipulation and autonomous driving.
Our model unifies information from different modalities into a discrete interleaved sequence, which is modeled with an autoregressive Transformer. To enable unified modeling, images are discretized using vector-quantized (VQ) encoders, while actions are transformed into the frequency domain via the Discrete Cosine Transform (DCT) and then discretized into tokens. This causal multimodal sequence naturally preserves the temporal dynamics and causality required for real-world tasks. The model builds upon a pretrained vision-language model and follows a two-stage training strategy: (1) a post-training phase that adopts world-model training on large-scale datasets without requiring actions, and (2) a fine-tuning phase that interleaves actions into the sequence, enabling policy learning on downstream tasks.
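As an illustrative sketch of the action-tokenization idea (not the paper's exact tokenizer: the bin count, scale, and chunk length below are assumptions), a chunk of continuous actions can be mapped to discrete tokens by applying an orthonormal DCT over the time axis and uniformly quantizing the coefficients:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix; rows are frequency basis vectors."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)
    return m

def encode(chunk: np.ndarray, n_bins: int = 256, scale: float = 10.0) -> np.ndarray:
    """Map a (T, D) action chunk to discrete tokens in [0, n_bins)."""
    coeffs = dct_matrix(len(chunk)) @ chunk                       # time -> frequency
    q = np.clip(np.round(coeffs * scale), -n_bins // 2, n_bins // 2 - 1)
    return q.astype(int) + n_bins // 2

def decode(tokens: np.ndarray, n_bins: int = 256, scale: float = 10.0) -> np.ndarray:
    """Invert the quantization and the DCT."""
    coeffs = (tokens - n_bins // 2) / scale
    return dct_matrix(len(tokens)).T @ coeffs                     # frequency -> time

# Toy 16-step, 2-DoF trajectory; reconstruction error is bounded by the bin width.
chunk = 0.5 * np.sin(np.linspace(0, 2, 16))[:, None] * np.array([[1.0, -0.6]])
recon = decode(encode(chunk))
print(np.max(np.abs(recon - chunk)))  # small quantization error
```

Because smooth trajectories concentrate energy in low frequencies, most high-frequency tokens collapse to the zero bin, which is what makes a frequency-domain discretization attractive for action sequences.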
CALVIN long-horizon results (train→test split in the Task column). Columns 1–5 report the success rate of completing that many instructions in a row; Avg. Len is the average number of consecutively completed tasks.

Method | Task | 1 | 2 | 3 | 4 | 5 | Avg. Len ↑ |
---|---|---|---|---|---|---|---|
MCIL | ABCD→D | 0.373 | 0.027 | 0.002 | 0.000 | 0.000 | 0.40 |
RT-1 | ABCD→D | 0.844 | 0.617 | 0.438 | 0.323 | 0.227 | 2.45 |
Robo-Flamingo | ABCD→D | 0.964 | 0.896 | 0.824 | 0.740 | 0.660 | 4.09 |
GR-1 | ABCD→D | 0.949 | 0.896 | 0.844 | 0.789 | 0.731 | 4.21 |
UP-VLA | ABCD→D | 0.962 | 0.921 | 0.879 | 0.842 | 0.812 | 4.42 |
RoboVLMs | ABCD→D | 0.967 | 0.930 | 0.899 | 0.865 | 0.826 | 4.49 |
UniVLA | ABCD→D | 0.985 | 0.961 | 0.931 | 0.899 | 0.851 | 4.63 |
MCIL | ABC→D | 0.304 | 0.013 | 0.002 | 0.000 | 0.000 | 0.31 |
Robo-Flamingo | ABC→D | 0.824 | 0.619 | 0.466 | 0.331 | 0.235 | 2.47 |
SuSIE | ABC→D | 0.870 | 0.690 | 0.490 | 0.380 | 0.260 | 2.69 |
GR-1 | ABC→D | 0.854 | 0.712 | 0.596 | 0.497 | 0.401 | 3.06 |
UP-VLA | ABC→D | 0.928 | 0.865 | 0.815 | 0.769 | 0.699 | 4.08 |
RoboVLMs | ABC→D | 0.980 | 0.936 | 0.854 | 0.778 | 0.704 | 4.25 |
Seer-Large | ABC→D | 0.963 | 0.916 | 0.861 | 0.803 | 0.740 | 4.28 |
UniVLA | ABC→D | 0.989 | 0.948 | 0.890 | 0.828 | 0.751 | 4.41 |
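The Avg. Len column is consistent with the expected number of consecutively completed tasks, i.e. the sum of the five per-horizon success rates (values below are the UniVLA rows from the table):

```python
# Avg. Len = expected number of tasks completed in a row
# = sum of the success rates at horizons 1..5.
univla_abcd = [0.985, 0.961, 0.931, 0.899, 0.851]
univla_abc  = [0.989, 0.948, 0.890, 0.828, 0.751]
print(round(sum(univla_abcd), 2))  # 4.63, matching the table
print(round(sum(univla_abc), 2))   # 4.41, matching the table
```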
LIBERO results: success rates on the four task suites (Spatial, Object, Goal, Long) and their average.

Method | SPATIAL | OBJECT | GOAL | LONG | Average ↑ |
---|---|---|---|---|---|
DP* | 78.3% | 92.5% | 68.3% | 50.5% | 72.4% |
Octo | 78.9% | 85.7% | 84.6% | 51.1% | 75.1% |
OpenVLA | 84.9% | 88.4% | 79.2% | 53.7% | 76.5% |
SpatialVLA | 88.2% | 89.9% | 78.6% | 55.5% | 78.1% |
CoT-VLA | 87.5% | 91.6% | 87.6% | 69.0% | 81.1% |
π₀-FAST | 96.4% | 96.8% | 88.6% | 60.2% | 85.5% |
UniVLA | 95.4% | 98.8% | 93.6% | 94.0% | 95.5% |
SimplerEnv-Bridge results on the WidowX setup: grasp and task success rates for each task; Overall Success averages the four per-task success rates.

Model | Spoon on Towel (Grasp) | Spoon on Towel (Success) | Carrot on Plate (Grasp) | Carrot on Plate (Success) | Stack Green on Yellow (Grasp) | Stack Green on Yellow (Success) | Eggplant in Basket (Grasp) | Eggplant in Basket (Success) | Overall Success ↑ |
---|---|---|---|---|---|---|---|---|---|
RT-1-X | 16.7% | 0.0% | 20.8% | 4.2% | 8.3% | 0.0% | 0.0% | 0.0% | 1.1% |
Octo-Base | 34.7% | 12.5% | 52.8% | 8.3% | 31.9% | 0.0% | 66.7% | 43.1% | 16.0% |
Octo-Small | 77.8% | 47.2% | 27.8% | 9.7% | 40.3% | 4.2% | 87.5% | 56.9% | 29.5% |
OpenVLA | 4.1% | 0.0% | 33.3% | 0.0% | 12.5% | 0.0% | 8.3% | 4.1% | 1.0% |
RoboVLMs | 70.8% | 45.8% | 33.3% | 20.8% | 54.2% | 4.2% | 91.7% | 79.2% | 37.5% |
SpatialVLA | 20.8% | 16.7% | 29.2% | 25.0% | 62.5% | 29.2% | 100% | 100% | 42.7% |
UniVLA | 83.3% | 83.3% | 74.0% | 66.7% | 95.8% | 33.3% | 100.0% | 95.8% | 69.8% |
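The Overall Success column is the mean of the four per-task Success columns (the Grasp rates are diagnostic and not averaged in), which can be checked directly against two rows of the table:

```python
# Per-task Success rates (%) for two rows of the SimplerEnv-Bridge table.
success = {
    "UniVLA":     [83.3, 66.7, 33.3, 95.8],
    "SpatialVLA": [16.7, 25.0, 29.2, 100.0],
}
overall = {m: round(sum(v) / len(v), 1) for m, v in success.items()}
print(overall)  # {'UniVLA': 69.8, 'SpatialVLA': 42.7}, matching the table
```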
Ablation on the post-training stage (T = text, I = image, A = action tokens; I₁,...,Iₜ denotes a frame sequence). LIBERO and SimplerEnv-WidowX measure generalization; LIBERO-Long and CALVIN measure long-horizon performance. The first row is the baseline without any post-training stage, and deltas are reported relative to it.

Strategy | Sequence | Supervision | LIBERO | SimplerEnv-WidowX | LIBERO-Long | CALVIN |
---|---|---|---|---|---|---|
none (baseline) | – | – | 48.5 | 0.0 | 17.4 | 1.46 |
action prediction | T, I, A | action | 43.9 (-4.6) | 0.0 | 10.6 (-6.8) | 0.52 (-0.94) |
text-to-image | T, I | vision | 69.8 (+21.3) | 6.3 (+6.3) | 55.8 (+38.4) | 3.79 (+2.33) |
video prediction | I₁,...,Iₜ | vision | 78.9 (+30.4) | 17.7 (+17.7) | 80.8 (+63.4) | 3.59 (+2.13) |
world model | T, I₁,...,Iₜ | vision | 94.2 (+45.7) | 64.6 (+64.6) | 89.2 (+71.8) | 4.61 (+3.15) |
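As a sanity check, the parenthesized deltas in the ablation table are differences against the first (no post-training) row; recomputing them for the world-model row:

```python
# Baseline (no post-training) and world-model rows from the ablation table.
baseline    = {"LIBERO": 48.5, "SimplerEnv-WidowX": 0.0,  "LIBERO-Long": 17.4, "CALVIN": 1.46}
world_model = {"LIBERO": 94.2, "SimplerEnv-WidowX": 64.6, "LIBERO-Long": 89.2, "CALVIN": 4.61}
deltas = {k: round(world_model[k] - baseline[k], 2) for k in baseline}
print(deltas)  # {'LIBERO': 45.7, 'SimplerEnv-WidowX': 64.6, 'LIBERO-Long': 71.8, 'CALVIN': 3.15}
```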