Abstract
Vision-Language-Action (VLA) models benefit from chain-of-thought (CoT) reasoning, but existing approaches incur high inference overhead and rely on discrete reasoning representations that mismatch continuous perception and control. We propose Latent Reasoning VLA (LaRA-VLA), a unified VLA framework that internalizes multimodal CoT reasoning into continuous latent representations for embodied action. LaRA-VLA performs unified reasoning and prediction in latent space, eliminating explicit CoT generation at inference time and enabling efficient, action-oriented control. To realize latent embodied reasoning, we introduce a curriculum-based training paradigm that progressively transitions from explicit textual and visual CoT supervision to latent reasoning, and finally adapts latent reasoning dynamics to condition action generation. We construct two structured CoT datasets and evaluate LaRA-VLA on both simulation benchmarks and long-horizon real-robot manipulation tasks. Experimental results show that LaRA-VLA consistently outperforms state-of-the-art VLA methods while reducing inference latency by up to 90% compared to explicit CoT-based approaches, demonstrating latent reasoning as an effective and efficient paradigm for real-time embodied control.
LaRA-VLA
Training proceeds in three stages: (i) explicit CoT fine-tuning with aligned visual prediction latents and inverse-dynamics supervision for actions; (ii) a curriculum-based transition from explicit CoT to compact text latents, gradually reducing the number of text tokens while increasing reliance on latent reasoning, where the latent representations are also implicitly supervised by visual and action signals; and (iii) adaptation of latent-conditioned VLM features to an action expert for efficient action generation without explicit CoT at inference time.
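To make the stage (ii) curriculum concrete, the following is a minimal PyTorch-style sketch of one possible training step in which the number of explicit CoT text tokens is annealed toward zero while a fixed budget of latent reasoning tokens carries increasing load. The module and loss names (`cot_text_loss`, `visual_pred_loss`, `action_loss`, `num_latent`) are illustrative placeholders, not the released implementation.

```python
import torch

def curriculum_ratio(step: int, total_steps: int) -> float:
    """Fraction of explicit CoT text tokens retained at this training step."""
    return max(0.0, 1.0 - step / total_steps)  # linear anneal from 1 -> 0

def training_step(model, batch, step, total_steps, optimizer):
    ratio = curriculum_ratio(step, total_steps)
    n_text = int(ratio * batch["cot_text_ids"].shape[1])  # explicit CoT tokens kept
    cot_ids = batch["cot_text_ids"][:, :n_text]           # truncated textual CoT

    out = model(
        images=batch["images"],
        instruction_ids=batch["instruction_ids"],
        cot_text_ids=cot_ids,                # shrinking explicit supervision
        num_latent_tokens=model.num_latent,  # fixed budget of latent reasoning tokens
    )

    # The explicit CoT loss fades with the curriculum; latent tokens are supervised
    # only implicitly, through the visual-prediction and action objectives.
    loss = (
        ratio * out["cot_text_loss"]
        + out["visual_pred_loss"]
        + out["action_loss"]
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```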
Experiments
Simulation Experiments
LIBERO
On LIBERO, LaRA-VLA achieves the best overall performance with an average success rate of 97.9%, including 99.8% on the Object suite and 96.6% on the Long suite, demonstrating strong object-centric reasoning and robustness in long-horizon manipulation.
SimplerEnv
On SimplerEnv-WidowX, LaRA-VLA attains the highest average success rate of 68.8%, outperforming NoCoT, Textual CoT, and Visual CoT baselines. Across both benchmarks, LaRA-VLA consistently surpasses textual and visual CoT methods, indicating that latent reasoning provides more effective and stable guidance for action prediction and generalizes better than explicit CoT supervision.
Real-world Experiments
LaRA-VLA consistently outperforms ACT and GR00T N1.5 across all four long-horizon real-world manipulation tasks, achieving the highest average success rate. The improvements are especially pronounced on tasks requiring multi-stage reasoning and sustained temporal coordination, highlighting enhanced robustness to error accumulation over long horizons.
Analysis
Latent Collapse
Latent tokens associated with different reasoning components form well-separated and semantically coherent clusters, demonstrating clear functional specialization rather than degeneration into uniform or uninformative representations. Moreover, latent representations of language instruction tokens (gray points) remain structured and occupy a distinct subspace from reasoning latents, indicating that latent CoT does not trivially reuse language embeddings.
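One simple way to run this kind of collapse check is to project the latent reasoning tokens and instruction embeddings to 2-D and inspect whether the component groups separate. The sketch below assumes hypothetical inputs (`latents_by_component`, `instruction_embeds`) and uses off-the-shelf t-SNE; it is an illustration of the analysis, not the paper's exact procedure.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(latents_by_component: dict[str, np.ndarray],
             instruction_embeds: np.ndarray) -> dict[str, np.ndarray]:
    """latents_by_component maps a reasoning-component name (e.g. 'textual',
    'visual') to an array of latent tokens with shape (num_tokens, dim)."""
    names, arrays = zip(*latents_by_component.items())
    all_feats = np.concatenate(list(arrays) + [instruction_embeds], axis=0)
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(all_feats)

    # Split the projected points back into their groups for plotting / cluster checks.
    groups, start = {}, 0
    for name, arr in zip(names, arrays):
        groups[name] = coords[start:start + len(arr)]
        start += len(arr)
    groups["instruction"] = coords[start:]
    return groups
```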
Inference Time
LaRA-VLA significantly reduces inference latency, achieving 135 ms per rollout and outperforming all baselines by a large margin. Compared to explicit CoT methods, this yields up to a 90% reduction in inference time, demonstrating the efficiency benefits of latent reasoning without explicit CoT decoding.
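For reference, per-rollout latency of this kind is typically measured by timing the action-prediction forward pass with warm-up runs and GPU synchronization. The sketch below is a generic benchmarking harness under assumed names (`model.predict_action`, `obs`), not the evaluation code used in the paper.

```python
import time
import torch

@torch.no_grad()
def time_rollout(model, obs, n_warmup: int = 5, n_runs: int = 50) -> float:
    """Return mean latency in milliseconds for one action-prediction call."""
    for _ in range(n_warmup):                 # warm up kernels and caches
        model.predict_action(obs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.predict_action(obs)             # latent reasoning, no CoT decoding
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs * 1000.0
```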
Ablation
Ablation study of different forms of CoT supervision on SimplerEnv. TextCoT denotes explicit textual chain of thought, Latent TextCoT denotes latent textual chain of thought, and Latent VisCoT denotes latent visual chain of thought.