
AI Benchmarks & Leaderboards

Resources for comparing AI model performance through independent benchmarks and community leaderboards.

lmarena.ai — the most widely used crowdsourced AI benchmarking platform, formerly LMSYS Chatbot Arena, now operated by Arena Intelligence. Users rate anonymous head-to-head model comparisons; results are aggregated into an Elo-style rating across 140+ models from over 6 million blind pairwise votes. Specialized arenas exist for coding, math, and vision tasks.
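The Elo-style aggregation of blind pairwise votes can be sketched with a toy online Elo update (illustrative only: model names, vote data, and the K-factor below are invented, and the real leaderboard reportedly fits a statistical model over all votes rather than updating sequentially):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one blind pairwise vote: winner gains what the loser loses."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

# Hypothetical vote stream between three invented models.
ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
votes = [("model-a", "model-b"), ("model-a", "model-c"),
         ("model-b", "model-c"), ("model-a", "model-b")]
for winner, loser in votes:
    update(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)  # model-a ranks first after winning all its matches
```

Because each update transfers points symmetrically, the total rating mass is conserved; only relative standings change as votes accumulate.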

artificialanalysis.ai — independent benchmarks combining quality, speed, and price comparisons for AI models across providers. Publishes the Artificial Analysis Intelligence Index (AAII), an aggregate score across ten challenging evaluations, including MMLU-Pro, Humanity’s Last Exam, GPQA Diamond, AIME, IFBench, SciCode, LiveCodeBench, Terminal-Bench Hard, and τ²-Bench Telecom.
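A composite index of this kind can be illustrated with a toy equal-weight average (the real AAII weighting and normalization are Artificial Analysis's own; the scores below are hypothetical and assume each benchmark is already on a 0–100 scale):

```python
# Illustrative only: the real index's weighting and normalization are
# defined by Artificial Analysis, not reproduced here.
def composite_index(scores: dict[str, float]) -> float:
    """Equal-weight mean over whatever benchmark scores were reported."""
    return sum(scores.values()) / len(scores)

model_scores = {  # hypothetical per-benchmark scores, percent
    "MMLU-Pro": 82.0,
    "GPQA Diamond": 70.5,
    "LiveCodeBench": 64.0,
    "AIME": 88.0,
}
print(round(composite_index(model_scores), 1))  # 76.1
```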

huggingface.co/spaces/open-llm-leaderboard — tracks open-source LLM performance using the EleutherAI LM Evaluation Harness. Evaluates models on IFEval, BBH, MATH, GPQA, MUSR, and MMLU-Pro. Focused exclusively on openly available models.

epoch.ai/benchmarks — tracks frontier model capabilities over time across a wide range of evaluations. As of March 2026, includes ARC-AGI-2, HLE, and APEX-Agents among its tracked benchmarks.

lmcouncil.ai — aggregates scores across multiple benchmarks to produce a composite ranking of frontier models, weighted by category (agentic execution, coding, reasoning, etc.).


Knowledge & Reasoning

| Benchmark | Full Name | What It Measures |
| --- | --- | --- |
| MMLU | Massive Multitask Language Understanding | Breadth of factual knowledge across 57 subjects via 15,000+ multiple-choice questions |
| MMLU-Pro | | Harder variant of MMLU with 10-choice questions and more reasoning-heavy problems; reduces noise from the original |
| GPQA Diamond | Graduate-Level Google-Proof Q&A | 198 expert-level questions in biology, physics, and chemistry where PhD holders score ~65% and non-experts ~34% |
| ARC | AI2 Reasoning Challenge | 7,000+ grade-school science questions requiring knowledge and reasoning beyond simple fact retrieval |
| BBH | BIG-Bench Hard | 23 challenging tasks covering multi-step arithmetic, logical reasoning, geometric reasoning, temporal reasoning, and language understanding |
| MUSR | Multistep Soft Reasoning | ~1,000-word algorithmic problems (murder mysteries, object placement, team allocation) requiring long-range context parsing |

Mathematics

| Benchmark | What It Measures |
| --- | --- |
| GSM8K | Grade-school math word problems (8,500 problems) with natural-language solutions; tests arithmetic reasoning |
| MATH | Competition-level mathematics problems across algebra, calculus, number theory, and combinatorics |
| AIME | American Invitational Mathematics Examination problems; extremely difficult competition math, increasingly saturated by frontier models |
| FrontierMath | Unsolved or near-unsolved research-level math problems (released Jan 2026); designed to resist saturation |

Coding

| Benchmark | What It Measures |
| --- | --- |
| HumanEval | 164 Python programming tasks evaluated by unit tests; tests code generation correctness |
| LiveCodeBench | Contamination-free code evaluation using new problems from LeetCode, AtCoder, and CodeForces; covers code generation, self-repair, and test prediction |
| SWE-bench Verified | 500 real GitHub issues from popular Python repos that an agent must resolve end-to-end; measures agentic software engineering |
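HumanEval-style sampling evaluations are commonly reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A standard unbiased estimator, given n generated samples of which c pass, can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k),
    i.e. one minus the chance that all k drawn samples fail."""
    if n - c < k:  # fewer failing samples than draws: success guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 sampled solutions pass their unit tests:
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
```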

Language Understanding & Instruction Following

| Benchmark | What It Measures |
| --- | --- |
| HellaSwag | Commonsense reasoning and natural language inference via sentence completion with adversarially filtered distractors |
| IFEval | Instruction Following Evaluation; tests strict adherence to explicit formatting and content instructions (e.g., “include keyword X”, “use format Y”) |

Frontier & Expert-Level Evaluations

| Benchmark | What It Measures |
| --- | --- |
| HLE | Humanity’s Last Exam; 2,500 expert-level questions across mathematics, sciences, and humanities written by subject-matter experts; published in Nature (2026) |
| ARC-AGI-2 | Abstract visual puzzles requiring novel pattern recognition; tests fluid intelligence and generalization beyond training data |
| FrontierScience | OpenAI benchmark for scientific reasoning; includes Olympiad-style problems with tight, verifiable constraints |

Agentic Benchmarks

| Benchmark | What It Measures |
| --- | --- |
| τ-Bench / τ²-Bench | Customer service agent simulation across domains; evaluates multi-step tool use, decision-making, and conversation handling |
| APEX-Agents | Agentic task completion across realistic multi-step scenarios; added to the Epoch index in March 2026 |
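The multi-step tool use these benchmarks score has roughly the following shape (a toy sketch: the tools, task, and hard-coded policy below are invented for illustration, whereas real harnesses like τ-Bench drive an LLM's tool calls against a simulated user and environment):

```python
# Toy agentic loop: the "agent" picks tools until the task is done.
# All tool names and the decision policy are invented for illustration.
TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "issue_refund": lambda order_id: {"order_id": order_id, "refunded": True},
}

def run_agent(task: dict, max_steps: int = 5) -> list:
    """Execute multi-step tool calls; return the trace of tools used."""
    trace, state = [], {}
    for _ in range(max_steps):
        if "status" not in state:                 # step 1: gather facts
            call = "lookup_order"
        elif task["wants_refund"] and "refunded" not in state:
            call = "issue_refund"                 # step 2: act on them
        else:
            break                                 # nothing left to do
        state.update(TOOLS[call](task["order_id"]))
        trace.append(call)
    return trace

trace = run_agent({"order_id": "A17", "wants_refund": True})
print(trace)  # ['lookup_order', 'issue_refund']
```

A benchmark harness would then grade the final environment state and the conversation, not just whether the loop terminated.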

Benchmark Saturation

Traditional benchmarks saturate as models improve: once leading models exceed human-level scores, a benchmark stops differentiating between them. Notably saturated as of early 2026:

  • MMLU — most frontier models exceed 90%
  • GSM8K — near-perfect scores common
  • HumanEval — largely replaced by LiveCodeBench for frontier comparisons
  • GPQA and AIME — increasingly saturated as models exceed human expert performance

This has pushed the field toward harder evaluations like HLE, FrontierMath, ARC-AGI-2, and agentic benchmarks.