Eval Leaderboards - a andrewrreed Collection

Eval Leaderboards

updated Mar 2

Running

4.9k

Arena Leaderboard

🏆

View the LMArena leaderboard in full‑screen
Running on CPU Upgrade

14k

Open LLM Leaderboard

🏆

Track, rank and evaluate open LLMs and chatbots
Running on CPU Upgrade

7.42k

MTEB Leaderboard

🥇

Embedding Leaderboard
Running

Agents

Featured

587

LLM-Perf Leaderboard

🏆

Compare LLM hardware performance and find the best model
Running on CPU Upgrade

Agents

Featured

1.35k

Open ASR Leaderboard

🏆

Compare speech‑to‑text models across multiple benchmarks
Running

Agents

1.51k

Big Code Models Leaderboard

📈

Explore and compare code model performance on a leaderboard
Runtime error

Agents

145

Hallucinations Leaderboard

🔥

View and submit LLM evaluations
Build error

Agents

105

Enterprise Scenarios Leaderboard

🥇

Running on CPU Upgrade

Agents

93

LLM Safety Leaderboard

🥇

Search, filter and submit LLM benchmark evaluations
Running

Agents

232

AI2 WildBench Leaderboard (V2)

🦁

Display LLM performance leaderboards with customizable views
Running

Agents

176

Open Object Detection Leaderboard

🏆

Request evaluation for a new model
Running

Agents

31

Contextual Leaderboard

🐨

Submit and evaluate models for contextual understanding tasks
Running

192

Yet Another LLM Leaderboard

🌖

Launch a Streamlit web app interface
Running on CPU Upgrade

Agents

1.02k

Open VLM Leaderboard

🌎

VLMEvalKit Evaluation Results Collection
Running

Featured

561

Vision Arena (Testing VLMs side-by-side)

🖼

Explore Vision Arena visual AI demo online
Running

41

Leaderboard

🐠

View the LiveCodeBench leaderboard rankings
Runtime error

Agents

Featured

436

Open Medical-LLM Leaderboard

🥇

Explore and submit models for benchmarking
Running on CPU Upgrade

Agents

60

Open CoT Leaderboard

🥇

Track, rank and evaluate open LLMs' CoT quality
Runtime error

Agents

23

MM-UPD Leaderboard

🥇

Submit and evaluate model results on MM-UPD benchmarks
Running

Agents

230

BigCodeBench Leaderboard

🥇

Explore code-generation model leaderboards and task details
Runtime error

Agents

10

MJ Bench Leaderboard

🥇

Display and filter multimodal model leaderboard results
Running

Agents

430

Reward Bench Leaderboard

📐

Explore and compare model scores on RewardBench benchmarks
Running on CPU Upgrade

Agents

449

Agent Leaderboard

💬

Ranking of LLMs for agentic tasks
Running

145

Find a leaderboard

🔍

Explore and discover all leaderboards from the HF community

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Paper • 2506.11763 • Published Jun 13, 2025 • 75