Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models

Shuanghao Bai* 1,2 Jing Lyu* 2,3,4 Wanqi Zhou1 Zhe Li2 Dakai Wang1 Lei Xing1 Xiaoguang Zhao3 Pengwei Wang2 Zhongyuan Wang2 Cheng Chi2 Badong Chen1 Shanghang Zhang2,5

* Equal contribution

1 Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. 2 Beijing Academy of Artificial Intelligence. 3 Institute of Automation, University of Chinese Academy of Sciences. 4 School of Artificial Intelligence, University of Chinese Academy of Sciences. 5 Peking University.

Correspondence to: Cheng Chi, Badong Chen, Shanghang Zhang.

Abstract

Vision-Language-Action (VLA) models benefit from chain-of-thought (CoT) reasoning, but existing approaches incur high inference overhead and rely on discrete reasoning representations that are mismatched with the continuous nature of perception and control. We propose Latent Reasoning VLA (LaRA-VLA), a unified VLA framework that internalizes multimodal CoT reasoning into continuous latent representations for embodied action. LaRA-VLA performs unified reasoning and prediction in latent space, eliminating explicit CoT generation at inference time and enabling efficient, action-oriented control. To realize latent embodied reasoning, we introduce a curriculum-based training paradigm that progressively transitions from explicit textual and visual CoT supervision to latent reasoning, and finally adapts latent reasoning dynamics to condition action generation. We construct two structured CoT datasets and evaluate LaRA-VLA on both simulation benchmarks and long-horizon real-robot manipulation tasks. Experimental results show that LaRA-VLA consistently outperforms state-of-the-art VLA methods while reducing inference latency by up to 90% compared to explicit CoT-based approaches, demonstrating latent reasoning as an effective and efficient paradigm for real-time embodied control.

LaRA-VLA

Model architecture

Training proceeds in three stages: (i) explicit CoT fine-tuning with aligned visual prediction latents and inverse-dynamics supervision for actions; (ii) a curriculum-based transition from explicit CoT to compact text latents, gradually reducing the number of text tokens while increasing reliance on latent reasoning, with the latent representations also implicitly supervised by visual and action signals; and (iii) adaptation of latent-conditioned VLM features to an action expert for efficient action generation without explicit CoT at inference time.
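The curriculum in stage (ii) can be pictured as a token-budget schedule that shrinks the explicit-CoT text budget over training while latent tokens take over. The sketch below is purely illustrative: the function name, the linear decay, and the budget values are assumptions for exposition, not details from the paper.

```python
# Hypothetical sketch of a stage-(ii) curriculum schedule: the number of
# explicit CoT text tokens decays over training, shifting supervision
# toward latent reasoning. All names and values here are illustrative.

def cot_token_budget(step, total_steps, max_text_tokens=64, min_text_tokens=0):
    """Linearly decay the explicit-CoT text-token budget toward zero."""
    frac = min(step / total_steps, 1.0)
    budget = round(max_text_tokens * (1.0 - frac))
    return max(budget, min_text_tokens)

# Example: over 1000 training steps, the budget shrinks from 64 tokens to 0,
# at which point reasoning is carried entirely by latent tokens.
schedule = [cot_token_budget(s, 1000) for s in (0, 250, 500, 750, 1000)]
print(schedule)  # → [64, 48, 32, 16, 0]
```

Any monotone decay (stepwise, cosine) would serve the same purpose; the essential property is that explicit text supervision vanishes by the end of the stage so that inference needs no CoT decoding.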

Experiments

Simulation Experiments

LIBERO

LIBERO results

On LIBERO, LaRA-VLA achieves the best overall performance with an average success rate of 97.9%, including 99.8% on the Object suite and 96.6% on the Long suite, demonstrating strong object-centric reasoning and robustness in long-horizon manipulation.

SimplerEnv

SimplerEnv WidowX results

On SimplerEnv-WidowX, LaRA-VLA attains the highest average success rate of 68.8%, outperforming NoCoT, Textual CoT, and Visual CoT baselines. Across both benchmarks, LaRA-VLA consistently surpasses textual and visual CoT methods, indicating that latent reasoning provides more effective and stable guidance for action prediction and generalizes better than explicit CoT supervision.

Real-world Experiments

Real-world task results

LaRA-VLA consistently outperforms ACT and GR00T N1.5 across all four long-horizon real-world manipulation tasks, achieving the highest average success rate. The improvements are especially pronounced on tasks requiring multi-stage reasoning and sustained temporal coordination, highlighting enhanced robustness to error accumulation over long horizons.

LaRA-VLA

GR00T N1.5

Analysis

Latent Collapse

Latent collapse analysis

Latent tokens associated with different reasoning components form well-separated and semantically coherent clusters, demonstrating clear functional specialization rather than degeneration into uniform or uninformative representations. Moreover, latent representations of language instruction tokens (gray points) remain structured and occupy a distinct subspace from reasoning latents, indicating that latent CoT does not trivially reuse language embeddings.
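One simple way to quantify "well-separated clusters" versus latent collapse is to compare between-cluster centroid distances against within-cluster spread. The sketch below uses synthetic data and a ratio metric of our own choosing (not the paper's analysis protocol); in practice the inputs would be the model's latent tokens and their reasoning-component labels.

```python
import numpy as np

# Illustrative check for latent collapse: compute the ratio of mean
# between-centroid distance to mean within-cluster spread. Synthetic
# data stands in for real latent tokens; the metric is an assumption.

rng = np.random.default_rng(0)
# Three synthetic "reasoning component" clusters in a 2-D latent space.
latents = np.concatenate([rng.normal(loc=c, scale=0.1, size=(50, 2))
                          for c in ([0, 0], [5, 0], [0, 5])])
labels = np.repeat([0, 1, 2], 50)

def separation_ratio(z, y):
    """Mean between-centroid distance divided by mean within-cluster spread."""
    ks = np.unique(y)
    centroids = np.stack([z[y == k].mean(axis=0) for k in ks])
    within = np.mean([np.linalg.norm(z[y == k] - centroids[i], axis=1).mean()
                      for i, k in enumerate(ks)])
    between = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                       for i in range(len(ks))
                       for j in range(i + 1, len(ks))])
    return between / within

# A ratio well above 1 indicates separated clusters; collapse into a
# single uninformative mode drives the ratio toward zero.
ratio = separation_ratio(latents, labels)
print(ratio > 1.0)
```

A silhouette score would give a similar bounded signal; the ratio above is just the minimal version of the same within/between comparison.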

Inference Time

Inference time comparison

LaRA-VLA significantly reduces inference latency, achieving 135 ms per rollout and outperforming all baselines by a large margin. Compared to explicit CoT methods, this yields up to a 90% reduction in inference time, demonstrating the efficiency benefits of latent reasoning without explicit CoT decoding.

Ablation

Ablation study

Ablation study of different forms of CoT supervision on SimplerEnv. TextCoT denotes explicit textual chain-of-thought, Latent TextCoT denotes latent textual chain-of-thought, and Latent VisCoT denotes latent visual chain-of-thought.