AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking Paper • 2601.17645 • Published 8 days ago • 22
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces Paper • 2601.11868 • Published 16 days ago • 32
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development Paper • 2601.11077 • Published 17 days ago • 64
Deriving Character Logic from Storyline as Codified Decision Trees Paper • 2601.10080 • Published 18 days ago • 6
Lost in the Noise: How Reasoning Models Fail with Contextual Distractors Paper • 2601.07226 • Published 21 days ago • 32
OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs Paper • 2601.01592 • Published 29 days ago • 12
RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models Paper • 2601.03699 • Published 26 days ago • 6
UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement Paper • 2512.21185 • Published Dec 24, 2025 • 30
Are We on the Right Way to Assessing LLM-as-a-Judge? Paper • 2512.16041 • Published Dec 17, 2025 • 34
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management Paper • 2512.12967 • Published Dec 15, 2025 • 108