AI Benchmarks & Leaderboards
Resources for comparing AI model performance through independent benchmarks and community leaderboards.
🏆 Leaderboards
LMArena (Chatbot Arena)
lmarena.ai — the most widely used crowdsourced AI benchmarking platform, formerly LMSYS Chatbot Arena, now operated by Arena Intelligence. Users rate anonymous head-to-head model comparisons; results are aggregated into an Elo-style rating across 140+ models from over 6 million blind pairwise votes. Specialized arenas exist for coding, math, and vision tasks.
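LMArena's published methodology is more involved than plain Elo (it fits a Bradley-Terry model with bootstrapped confidence intervals), but the core idea of turning blind pairwise votes into a ranking can be illustrated with a minimal online Elo update. The vote data, model names, and K-factor below are invented for the example:

```python
from collections import defaultdict

def elo_update(ratings, model_a, model_b, winner, k=32.0):
    """Apply one Elo update for a single blind pairwise vote."""
    ra, rb = ratings[model_a], ratings[model_b]
    # Expected score of model_a under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    score_a = 1.0 if winner == model_a else 0.5 if winner == "tie" else 0.0
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

# Hypothetical votes: (model_a, model_b, winner) triples.
votes = [
    ("model-x", "model-y", "model-x"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "model-x"),
]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same baseline
for a, b, w in votes:
    elo_update(ratings, a, b, w)

for name, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {r:.0f}")
```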
Artificial Analysis
artificialanalysis.ai — independent benchmarks combining quality, speed, and price comparisons for AI models across providers. Publishes the Artificial Analysis Intelligence Index (AAII), an aggregate score across challenging evaluations including MMLU-Pro, Humanity’s Last Exam, GPQA Diamond, AIME, IFBench, SciCode, LiveCodeBench, Terminal-Bench Hard, and τ²-Bench Telecom.
Hugging Face Open LLM Leaderboard
huggingface.co/spaces/open-llm-leaderboard — tracks open-source LLM performance using the EleutherAI LM Evaluation Harness. Evaluates models on IFEval, BBH, MATH, GPQA, MUSR, and MMLU-Pro. Focused exclusively on openly available models.
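The same lm-evaluation-harness can be run locally via its Python entry point (lm_eval.simple_evaluate). The sketch below uses a placeholder model id and task names, and argument names can differ across harness versions, so treat it as an illustration rather than the leaderboard's exact pipeline:

```python
# Sketch: evaluating a Hugging Face model on leaderboard-style tasks locally.
# Requires the EleutherAI lm-evaluation-harness: pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                               # Hugging Face transformers backend
    model_args="pretrained=my-org/my-model",  # placeholder model id
    tasks=["ifeval", "gpqa"],                 # task names vary by harness version
    batch_size=8,
)

# results["results"] maps each task name to its metric dictionary.
for task, metrics in results["results"].items():
    print(task, metrics)
```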
Epoch AI Capabilities Index
epoch.ai/benchmarks — tracks frontier model capabilities over time across a wide range of evaluations. As of March 2026, includes ARC-AGI-2, HLE, and APEX-Agents among its tracked benchmarks.
LM Council / BenchLM
lmcouncil.ai — aggregates scores across multiple benchmarks to produce a composite ranking of frontier models, weighted by category (agentic execution, coding, reasoning, etc.).
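The exact benchmark list and weights used by such composite indexes are not reproduced here; the sketch below only illustrates the general recipe of averaging per-benchmark scores within each category and then combining categories by weight. All numbers are invented for the example:

```python
# Illustrative composite ranking: per-benchmark scores (0-100) are grouped by
# category, averaged within each category, then combined with category weights.
category_weights = {"coding": 0.4, "reasoning": 0.35, "agentic": 0.25}

scores = {
    "model-x": {"coding": [72.0, 65.0], "reasoning": [81.0], "agentic": [58.0]},
    "model-y": {"coding": [68.0, 70.0], "reasoning": [77.0], "agentic": [66.0]},
}

def composite(per_category):
    """Weighted average of category means."""
    return sum(
        w * (sum(per_category[cat]) / len(per_category[cat]))
        for cat, w in category_weights.items()
    )

for model, per_cat in sorted(scores.items(), key=lambda kv: -composite(kv[1])):
    print(f"{model}: {composite(per_cat):.1f}")
```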
📐 Benchmark Tests
Knowledge & Reasoning
| Benchmark | Full Name | What It Measures |
|---|---|---|
| MMLU | Massive Multitask Language Understanding | Breadth of factual knowledge across 57 subjects via 15,000+ multiple-choice questions |
| MMLU-Pro | — | Harder variant of MMLU with 10-choice questions and more reasoning-heavy problems; reduces noise from the original |
| GPQA Diamond | Graduate-Level Google-Proof Q&A | 198 expert-level questions in biology, physics, and chemistry where PhD holders score ~65% and non-experts ~34% |
| ARC | AI2 Reasoning Challenge | 7,000+ grade-school science questions requiring knowledge and reasoning beyond simple fact retrieval |
| BBH | BIG-Bench Hard | 23 challenging tasks covering multi-step arithmetic, logical reasoning, geometric reasoning, temporal reasoning, and language understanding |
| MUSR | Multistep Soft Reasoning | Algorithmically generated ~1,000-word narrative problems (murder mysteries, object placement, team allocation) requiring long-range context reasoning |
Mathematics
| Benchmark | What It Measures |
|---|---|
| GSM8K | Grade-school math word problems (8,500 problems) with natural-language solutions; tests arithmetic reasoning |
| MATH | Competition-level mathematics problems across algebra, calculus, number theory, and combinatorics |
| AIME | American Invitational Mathematics Examination problems; extremely difficult competition math, increasingly saturated by frontier models |
| FrontierMath | Research-level mathematics problems written by expert mathematicians, with verifiable answers (released Jan 2026); designed to resist saturation |
Coding
| Benchmark | What It Measures |
|---|---|
| HumanEval | 164 Python programming tasks evaluated by unit tests; tests code generation correctness, typically reported as pass@k (see the sketch after this table) |
| LiveCodeBench | Contamination-free code evaluation using new problems from LeetCode, AtCoder, and CodeForces; covers code generation, self-repair, and test prediction |
| SWE-bench Verified | 500 real GitHub issues from popular Python repos that an agent must resolve end-to-end; measures agentic software engineering |
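HumanEval-style results are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The unbiased estimator from the original HumanEval paper, given n samples per task of which c passed, is 1 - C(n-c, k) / C(n, k). A direct implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them passed, budget of k."""
    if n - c < k:
        return 1.0  # fewer failures than k draws, so at least one pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per task, 37 of which passed the unit tests.
print(f"pass@1  = {pass_at_k(200, 37, 1):.3f}")   # 0.185
print(f"pass@10 = {pass_at_k(200, 37, 10):.3f}")
```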
Language Understanding & Instruction Following
| Benchmark | What It Measures |
|---|---|
| HellaSwag | Commonsense reasoning and natural language inference via sentence completion with adversarially filtered distractors |
| IFEval | Instruction Following Evaluation — tests strict adherence to explicit formatting and content instructions (e.g., “include keyword X”, “use format Y”) |
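IFEval works because each instruction is programmatically verifiable: a deterministic checker can decide, without a judge model, whether the response obeyed it. A toy version of two such checks (the instructions and response below are invented for the example):

```python
# Toy IFEval-style checks: each instruction maps to a deterministic verifier.
def check_keyword(response: str, keyword: str) -> bool:
    """'Include keyword X' style instruction."""
    return keyword.lower() in response.lower()

def check_bullet_count(response: str, n: int) -> bool:
    """'Answer with exactly N bullet points' style instruction."""
    return sum(line.lstrip().startswith("- ") for line in response.splitlines()) == n

response = "- Paris is the capital of France.\n- It sits on the Seine."
checks = [check_keyword(response, "Paris"), check_bullet_count(response, 2)]
print(f"instruction-level accuracy: {sum(checks) / len(checks):.2f}")
```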
Frontier / Expert-Level
| Benchmark | What It Measures |
|---|---|
| HLE | Humanity’s Last Exam — 2,500 expert-level questions across mathematics, sciences, and humanities written by subject-matter experts; published in Nature (2026) |
| ARC-AGI-2 | Abstract visual puzzles requiring novel pattern recognition; tests fluid intelligence and generalization beyond training data |
| FrontierScience | OpenAI benchmark for scientific reasoning; includes Olympiad-style problems with tight, verifiable constraints |
Agentic
| Benchmark | What It Measures |
|---|---|
| τ-Bench / τ²-Bench | Customer service agent simulation across domains; evaluates multi-step tool use, decision-making, and conversation handling |
| APEX-Agents | Agentic task completion across realistic multi-step scenarios; added to Epoch index in March 2026 |
⚠️ Benchmark Saturation
Traditional benchmarks saturate as models improve — once leading models exceed human-level scores, the benchmark stops differentiating between them. Notably saturated as of early 2026:
- MMLU — most frontier models exceed 90%
- GSM8K — near-perfect scores common
- HumanEval — largely replaced by LiveCodeBench for frontier comparisons
- GPQA and AIME — increasingly saturated as models exceed human expert performance
This has pushed the field toward harder evaluations like HLE, FrontierMath, ARC-AGI-2, and agentic benchmarks.
Sources
- Arena Leaderboard — lmarena.ai — LMArena leaderboard overview
- Artificial Analysis Intelligence Benchmarking Methodology — AAII composition
- Hugging Face Open LLM Leaderboard About — benchmark list and methodology
- Epoch AI Benchmarks — capabilities index
- LLM Benchmarks Compared — lxt.ai — benchmark descriptions
- 30 LLM evaluation benchmarks — evidentlyai.com — benchmark overviews
- Humanity’s Last Exam — HLE benchmark description
- GPQA Diamond — Epoch AI — GPQA description
- Frontier Model Benchmark Comparison — Glia.ca (March 2026) — March 2026 rankings
- OpenAI FrontierScience — BigDATAwire — FrontierScience launch