Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing π₀-FAST's 85.5%. We further demonstrate its broad applicability to real-world ALOHA manipulation and autonomous driving.
Our model unifies information from different modalities into a discrete interleaved sequence, which is modeled with an autoregressive Transformer. To enable unified modeling, images are discretized using vector-quantized (VQ) encoders, while actions are transformed into the frequency domain via the Discrete Cosine Transform (DCT) and then discretized into tokens. This causal multimodal sequence naturally preserves the temporal dynamics and causality required for real-world tasks. The model builds upon a pretrained vision-language model and follows a two-stage training strategy: (1) a post-training phase that adopts world-model training on large-scale datasets without requiring actions, and (2) a fine-tuning phase that interleaves actions into the sequence, enabling policy learning on downstream tasks.
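As an illustrative sketch of the action-tokenization idea (not the paper's exact tokenizer: the bin count, scale, and chunk length below are assumptions), a chunk of continuous actions can be mapped to discrete tokens by applying an orthonormal DCT over the time axis and uniformly quantizing the coefficients:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix; rows are frequency basis vectors."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)
    return m

def encode(chunk: np.ndarray, n_bins: int = 256, scale: float = 10.0) -> np.ndarray:
    """Map a (T, D) action chunk to discrete tokens in [0, n_bins)."""
    coeffs = dct_matrix(len(chunk)) @ chunk                       # time -> frequency
    q = np.clip(np.round(coeffs * scale), -n_bins // 2, n_bins // 2 - 1)
    return q.astype(int) + n_bins // 2

def decode(tokens: np.ndarray, n_bins: int = 256, scale: float = 10.0) -> np.ndarray:
    """Invert the quantization and the DCT."""
    coeffs = (tokens - n_bins // 2) / scale
    return dct_matrix(len(tokens)).T @ coeffs                     # frequency -> time

# Toy 16-step, 2-DoF trajectory; reconstruction error is bounded by the bin width.
chunk = 0.5 * np.sin(np.linspace(0, 2, 16))[:, None] * np.array([[1.0, -0.6]])
recon = decode(encode(chunk))
print(np.max(np.abs(recon - chunk)))  # small quantization error
```

Because smooth trajectories concentrate energy in low frequencies, most high-frequency tokens collapse to the zero bin, which is what makes a frequency-domain discretization attractive for action sequences.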
CALVIN long-horizon results (train→test split in the Task column). Columns 1–5 report the success rate of completing that many instructions in a row; Avg. Len is the average number of consecutively completed tasks.

Method | Task | 1 | 2 | 3 | 4 | 5 | Avg. Len ↑ |
---|---|---|---|---|---|---|---|
MCIL | ABCD→D | 0.373 | 0.027 | 0.002 | 0.000 | 0.000 | 0.40 |
RT-1 | ABCD→D | 0.844 | 0.617 | 0.438 | 0.323 | 0.227 | 2.45 |
Robo-Flamingo | ABCD→D | 0.964 | 0.896 | 0.824 | 0.740 | 0.660 | 4.09 |
GR-1 | ABCD→D | 0.949 | 0.896 | 0.844 | 0.789 | 0.731 | 4.21 |
UP-VLA | ABCD→D | 0.962 | 0.921 | 0.879 | 0.842 | 0.812 | 4.42 |
RoboVLMs | ABCD→D | 0.967 | 0.930 | 0.899 | 0.865 | 0.826 | 4.49 |
UniVLA | ABCD→D | 0.985 | 0.961 | 0.931 | 0.899 | 0.851 | 4.63 |
MCIL | ABC→D | 0.304 | 0.013 | 0.002 | 0.000 | 0.000 | 0.31 |
Robo-Flamingo | ABC→D | 0.824 | 0.619 | 0.466 | 0.331 | 0.235 | 2.47 |
SuSIE | ABC→D | 0.870 | 0.690 | 0.490 | 0.380 | 0.260 | 2.69 |
GR-1 | ABC→D | 0.854 | 0.712 | 0.596 | 0.497 | 0.401 | 3.06 |
UP-VLA | ABC→D | 0.928 | 0.865 | 0.815 | 0.769 | 0.699 | 4.08 |
RoboVLMs | ABC→D | 0.980 | 0.936 | 0.854 | 0.778 | 0.704 | 4.25 |
Seer-Large | ABC→D | 0.963 | 0.916 | 0.861 | 0.803 | 0.740 | 4.28 |
UniVLA | ABC→D | 0.989 | 0.948 | 0.890 | 0.828 | 0.751 | 4.41 |
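The Avg. Len column is consistent with the expected number of consecutively completed tasks, i.e. the sum of the five per-horizon success rates (values below are the UniVLA rows from the table):

```python
# Avg. Len = expected number of tasks completed in a row
# = sum of the success rates at horizons 1..5.
univla_abcd = [0.985, 0.961, 0.931, 0.899, 0.851]
univla_abc  = [0.989, 0.948, 0.890, 0.828, 0.751]
print(round(sum(univla_abcd), 2))  # 4.63, matching the table
print(round(sum(univla_abc), 2))   # 4.41, matching the table
```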
LIBERO results: success rates on the four task suites (Spatial, Object, Goal, Long) and their average.

Method | SPATIAL | OBJECT | GOAL | LONG | Average ↑ |
---|---|---|---|---|---|
DP* | 78.3% | 92.5% | 68.3% | 50.5% | 72.4% |
Octo | 78.9% | 85.7% | 84.6% | 51.1% | 75.1% |
OpenVLA | 84.9% | 88.4% | 79.2% | 53.7% | 76.5% |
SpatialVLA | 88.2% | 89.9% | 78.6% | 55.5% | 78.1% |
CoT-VLA | 87.5% | 91.6% | 87.6% | 69.0% | 81.1% |
π₀-FAST | 96.4% | 96.8% | 88.6% | 60.2% | 85.5% |
UniVLA | 95.4% | 98.8% | 93.6% | 94.0% | 95.5% |
SimplerEnv-Bridge results on the WidowX setup: grasp and task success rates for each task; Overall Success averages the four per-task success rates.

Model | Spoon on Towel (Grasp) | Spoon on Towel (Success) | Carrot on Plate (Grasp) | Carrot on Plate (Success) | Stack Green on Yellow (Grasp) | Stack Green on Yellow (Success) | Eggplant in Basket (Grasp) | Eggplant in Basket (Success) | Overall Success ↑ |
---|---|---|---|---|---|---|---|---|---|
RT-1-X | 16.7% | 0.0% | 20.8% | 4.2% | 8.3% | 0.0% | 0.0% | 0.0% | 1.1% |
Octo-Base | 34.7% | 12.5% | 52.8% | 8.3% | 31.9% | 0.0% | 66.7% | 43.1% | 16.0% |
Octo-Small | 77.8% | 47.2% | 27.8% | 9.7% | 40.3% | 4.2% | 87.5% | 56.9% | 29.5% |
OpenVLA | 4.1% | 0.0% | 33.3% | 0.0% | 12.5% | 0.0% | 8.3% | 4.1% | 1.0% |
RoboVLMs | 70.8% | 45.8% | 33.3% | 20.8% | 54.2% | 4.2% | 91.7% | 79.2% | 37.5% |
SpatialVLA | 20.8% | 16.7% | 29.2% | 25.0% | 62.5% | 29.2% | 100% | 100% | 42.7% |
UniVLA | 83.3% | 83.3% | 74.0% | 66.7% | 95.8% | 33.3% | 100.0% | 95.8% | 69.8% |
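The Overall Success column is the mean of the four per-task Success columns (the Grasp rates are diagnostic and not averaged in), which can be checked directly against two rows of the table:

```python
# Per-task Success rates (%) for two rows of the SimplerEnv-Bridge table.
success = {
    "UniVLA":     [83.3, 66.7, 33.3, 95.8],
    "SpatialVLA": [16.7, 25.0, 29.2, 100.0],
}
overall = {m: round(sum(v) / len(v), 1) for m, v in success.items()}
print(overall)  # {'UniVLA': 69.8, 'SpatialVLA': 42.7}, matching the table
```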
Ablation on the post-training stage (T = text, I = image, A = action tokens; I₁,...,Iₜ denotes a frame sequence). LIBERO and SimplerEnv-WidowX measure generalization; LIBERO-Long and CALVIN measure long-horizon performance. The first row is the baseline without any post-training stage, and deltas are reported relative to it.

Strategy | Sequence | Supervision | LIBERO | SimplerEnv-WidowX | LIBERO-Long | CALVIN |
---|---|---|---|---|---|---|
none (baseline) | – | – | 48.5 | 0.0 | 17.4 | 1.46 |
action prediction | T, I, A | action | 43.9 (-4.6) | 0.0 | 10.6 (-6.8) | 0.52 (-0.94) |
text-to-image | T, I | vision | 69.8 (+21.3) | 6.3 (+6.3) | 55.8 (+38.4) | 3.79 (+2.33) |
video prediction | I₁,...,Iₜ | vision | 78.9 (+30.4) | 17.7 (+17.7) | 80.8 (+63.4) | 3.59 (+2.13) |
world model | T, I₁,...,Iₜ | vision | 94.2 (+45.7) | 64.6 (+64.6) | 89.2 (+71.8) | 4.61 (+3.15) |
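As a sanity check, the parenthesized deltas in the ablation table are differences against the first (no post-training) row; recomputing them for the world-model row:

```python
# Baseline (no post-training) and world-model rows from the ablation table.
baseline    = {"LIBERO": 48.5, "SimplerEnv-WidowX": 0.0,  "LIBERO-Long": 17.4, "CALVIN": 1.46}
world_model = {"LIBERO": 94.2, "SimplerEnv-WidowX": 64.6, "LIBERO-Long": 89.2, "CALVIN": 4.61}
deltas = {k: round(world_model[k] - baseline[k], 2) for k in baseline}
print(deltas)  # {'LIBERO': 45.7, 'SimplerEnv-WidowX': 64.6, 'LIBERO-Long': 71.8, 'CALVIN': 3.15}
```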