LARY — A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated.
We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) general visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models (LAMs); (ii) a latent visual space is fundamentally better aligned with the physical action space than a pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.
To systematically build a robust latent action model, we conduct ablations under the LAPA framework. The experiments chart a performance trajectory that bridges the gap between the baseline (LAPA) and the continuous upper bound (V-JEPA2). In this repository, we release LAPA-DINOv2 with the best hyper-parameters tuned on LARYBench. For details, please refer to the GitHub repository.
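To make the latent-action idea above concrete: LAM-style approaches derive a discrete action token from the change between consecutive frame embeddings produced by a frozen visual encoder. The sketch below is a toy illustration only, not the released model's implementation; the random-projection "encoder" and the codebook are stand-ins (in practice the encoder would be DINOv2 features and the codebook would be learned by vector quantization).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 3x8x8 "frames", 16-d latent, 32 discrete action codes.
D_IN, D_LAT, N_CODES = 3 * 8 * 8, 16, 32

# Stand-in for a frozen visual encoder (e.g. DINOv2): a fixed random projection.
PROJ = rng.normal(size=(D_LAT, D_IN)) / np.sqrt(D_IN)

# Stand-in codebook; a real LAM would learn this via vector quantization.
CODEBOOK = rng.normal(size=(N_CODES, D_LAT))

def encode(frame: np.ndarray) -> np.ndarray:
    """Map a frame to a latent feature vector."""
    return PROJ @ frame.ravel()

def latent_action(frame_t: np.ndarray, frame_t1: np.ndarray) -> int:
    """Quantize the inter-frame feature delta into a discrete latent action token."""
    delta = encode(frame_t1) - encode(frame_t)
    return int(np.argmin(np.linalg.norm(CODEBOOK - delta, axis=1)))

f0 = rng.random((3, 8, 8))
f1 = f0 + 0.05 * rng.random((3, 8, 8))  # a slightly perturbed next frame
token = latent_action(f0, f1)
print(token)
```

Because the token is just the index of the nearest codebook entry to the feature delta, it is ontology-independent: no robot-specific action labels are needed to produce it.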
License
This model is derived from DINOv2, originally developed by Meta AI Research and released under the Apache License 2.0.
Modifications
This model has been fine-tuned and/or modified based on DINOv2. All modifications are documented in the accompanying model card and/or source code repository. The original DINOv2 architecture, pre-trained weights, and associated code are used in compliance with the Apache License 2.0.
Attribution
If you use this model in your research or products, please cite the original DINOv2 work as well.
Citation
If you find this work useful, please cite:
@misc{nie2026larylatentactionrepresentation,
  title={LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment},
  author={Dujun Nie and Fengjiao Chen and Qi Lv and Jun Kuang and Xiaoyu Li and Xuezhi Cao and Xunliang Cai},
  year={2026},
  eprint={2604.11689},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.11689},
}
Model tree for AGI-Eval/LAPA-DINOv2
Base model
facebook/dinov2-large