
AI Benchmarks & Leaderboards

Resources for comparing AI model performance through independent benchmarks and community leaderboards.

lmarena.ai — the most widely used crowdsourced AI benchmarking platform, formerly LMSYS Chatbot Arena, now operated by Arena Intelligence. Users rate anonymous head-to-head model comparisons; results are aggregated into an Elo-style rating across 140+ models from over 6 million blind pairwise votes. Specialized arenas exist for coding, math, and vision tasks.
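The Elo-style aggregation of blind pairwise votes can be sketched with a toy online Elo update (illustrative only: model names, vote data, and the K-factor below are invented, and the real leaderboard reportedly fits a statistical model over all votes rather than updating sequentially):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one blind pairwise vote: winner gains what the loser loses."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

# Hypothetical vote stream between three invented models.
ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
votes = [("model-a", "model-b"), ("model-a", "model-c"),
         ("model-b", "model-c"), ("model-a", "model-b")]
for winner, loser in votes:
    update(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)  # model-a ranks first after winning all its matches
```

Because each update transfers points symmetrically, the total rating mass is conserved; only relative standings change as votes accumulate.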

artificialanalysis.ai — independent benchmarks combining quality, speed, and price comparisons for AI models across providers. Publishes the Artificial Analysis Intelligence Index (AAII), an aggregate score across ten challenging evaluations, including MMLU-Pro, Humanity’s Last Exam, GPQA Diamond, AIME, IFBench, SciCode, LiveCodeBench, Terminal-Bench Hard, and τ²-Bench Telecom.
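A composite index of this kind can be illustrated with a toy equal-weight average (the real AAII weighting and normalization are Artificial Analysis's own; the scores below are hypothetical and assume each benchmark is already on a 0–100 scale):

```python
# Illustrative only: the real index's weighting and normalization are
# defined by Artificial Analysis, not reproduced here.
def composite_index(scores: dict[str, float]) -> float:
    """Equal-weight mean over whatever benchmark scores were reported."""
    return sum(scores.values()) / len(scores)

model_scores = {  # hypothetical per-benchmark scores, percent
    "MMLU-Pro": 82.0,
    "GPQA Diamond": 70.5,
    "LiveCodeBench": 64.0,
    "AIME": 88.0,
}
print(round(composite_index(model_scores), 1))  # 76.1
```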

huggingface.co/spaces/open-llm-leaderboard — tracks open-source LLM performance using the EleutherAI LM Evaluation Harness. Evaluates models on IFEval, BBH, MATH, GPQA, MUSR, and MMLU-Pro. Focused exclusively on openly available models.

epoch.ai/benchmarks — tracks frontier model capabilities over time across a wide range of evaluations. As of March 2026, includes ARC-AGI-2, HLE, and APEX-Agents among its tracked benchmarks.

lmcouncil.ai — aggregates scores across multiple benchmarks to produce a composite ranking of frontier models, weighted by category (agentic execution, coding, reasoning, etc.).


Knowledge & Reasoning

| Benchmark | Full Name | What It Measures |
| --- | --- | --- |
| MMLU | Massive Multitask Language Understanding | Breadth of factual knowledge across 57 subjects via 15,000+ multiple-choice questions |
| MMLU-Pro | | Harder variant of MMLU with 10-choice questions and more reasoning-heavy problems; reduces noise from the original |
| GPQA Diamond | Graduate-Level Google-Proof Q&A | 198 expert-level questions in biology, physics, and chemistry where PhD holders score ~65% and non-experts ~34% |
| ARC | AI2 Reasoning Challenge | 7,000+ grade-school science questions requiring knowledge and reasoning beyond simple fact retrieval |
| BBH | BIG-Bench Hard | 23 challenging tasks covering multi-step arithmetic, logical reasoning, geometric reasoning, temporal reasoning, and language understanding |
| MUSR | Multistep Soft Reasoning | ~1,000-word algorithmic problems (murder mysteries, object placement, team allocation) requiring long-range context parsing |

Mathematics

| Benchmark | What It Measures |
| --- | --- |
| GSM8K | Grade-school math word problems (8,500 problems) with natural-language solutions; tests arithmetic reasoning |
| MATH | Competition-level mathematics problems across algebra, calculus, number theory, and combinatorics |
| AIME | American Invitational Mathematics Examination problems; extremely difficult competition math, increasingly saturated by frontier models |
| FrontierMath | Unsolved or near-unsolved research-level math problems (released Jan 2026); designed to resist saturation |

Coding

| Benchmark | What It Measures |
| --- | --- |
| HumanEval | 164 Python programming tasks evaluated by unit tests; tests code generation correctness |
| LiveCodeBench | Contamination-free code evaluation using new problems from LeetCode, AtCoder, and CodeForces; covers code generation, self-repair, and test prediction |
| SWE-bench Verified | 500 real GitHub issues from popular Python repos that an agent must resolve end-to-end; measures agentic software engineering |
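HumanEval-style sampling evaluations are commonly reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A standard unbiased estimator, given n generated samples of which c pass, can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k),
    i.e. one minus the chance that all k drawn samples fail."""
    if n - c < k:  # fewer failing samples than draws: success guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 sampled solutions pass their unit tests:
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
```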

Language Understanding & Instruction Following

| Benchmark | What It Measures |
| --- | --- |
| HellaSwag | Commonsense reasoning and natural language inference via sentence completion with adversarially filtered distractors |
| IFEval | Instruction Following Evaluation; tests strict adherence to explicit formatting and content instructions (e.g., “include keyword X”, “use format Y”) |

Frontier & Expert-Level Evaluations

| Benchmark | What It Measures |
| --- | --- |
| HLE | Humanity’s Last Exam; 2,500 expert-level questions across mathematics, sciences, and humanities written by subject-matter experts; published in Nature (2026) |
| ARC-AGI-2 | Abstract visual puzzles requiring novel pattern recognition; tests fluid intelligence and generalization beyond training data |
| FrontierScience | OpenAI benchmark for scientific reasoning; includes Olympiad-style problems with tight, verifiable constraints |

Agentic Benchmarks

| Benchmark | What It Measures |
| --- | --- |
| τ-Bench / τ²-Bench | Customer service agent simulation across domains; evaluates multi-step tool use, decision-making, and conversation handling |
| APEX-Agents | Agentic task completion across realistic multi-step scenarios; added to the Epoch index in March 2026 |
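The multi-step tool use these benchmarks score has roughly the following shape (a toy sketch: the tools, task, and hard-coded policy below are invented for illustration, whereas real harnesses like τ-Bench drive an LLM's tool calls against a simulated user and environment):

```python
# Toy agentic loop: the "agent" picks tools until the task is done.
# All tool names and the decision policy are invented for illustration.
TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "issue_refund": lambda order_id: {"order_id": order_id, "refunded": True},
}

def run_agent(task: dict, max_steps: int = 5) -> list:
    """Execute multi-step tool calls; return the trace of tools used."""
    trace, state = [], {}
    for _ in range(max_steps):
        if "status" not in state:                 # step 1: gather facts
            call = "lookup_order"
        elif task["wants_refund"] and "refunded" not in state:
            call = "issue_refund"                 # step 2: act on them
        else:
            break                                 # nothing left to do
        state.update(TOOLS[call](task["order_id"]))
        trace.append(call)
    return trace

trace = run_agent({"order_id": "A17", "wants_refund": True})
print(trace)  # ['lookup_order', 'issue_refund']
```

A benchmark harness would then grade the final environment state and the conversation, not just whether the loop terminated.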

Benchmark Saturation

Traditional benchmarks saturate as models improve: once leading models exceed human-level scores, a benchmark stops differentiating between them. Notably saturated as of early 2026:

  • MMLU — most frontier models exceed 90%
  • GSM8K — near-perfect scores common
  • HumanEval — largely replaced by LiveCodeBench for frontier comparisons
  • GPQA and AIME — increasingly saturated as models exceed human expert performance

This has pushed the field toward harder evaluations like HLE, FrontierMath, ARC-AGI-2, and agentic benchmarks.