An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges Paper • 2512.11362 • Published 22 days ago • 21
Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition Paper • 2512.15603 • Published 17 days ago • 58
LLaDA2.0: Scaling Up Diffusion Language Models to 100B Paper • 2512.15745 • Published 24 days ago • 78
DeContext as Defense: Safe Image Editing in Diffusion Transformers Paper • 2512.16625 • Published 16 days ago • 24
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling Paper • 2512.14614 • Published 18 days ago • 67
TimeLens Collection TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs • 5 items • Updated 17 days ago • 8
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows Paper • 2512.13168 • Published 19 days ago • 49
Towards Scalable Pre-training of Visual Tokenizers for Generation Paper • 2512.13687 • Published 19 days ago • 98
VTP Collection Towards Scalable Pre-training of Visual Tokenizers for Generation • 4 items • Updated 18 days ago • 39
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training Paper • 2505.17589 • Published May 23, 2025 • 5
jina-vlm Collection Jina-VLM: Small Multilingual Vision Language Model • 3 items • Updated 19 days ago • 8
OmniPSD: Layered PSD Generation with Diffusion Transformer Paper • 2512.09247 • Published 25 days ago • 46
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance Paper • 2512.08765 • Published 25 days ago • 128