Benchmark reference

ChessBench dataset and scoring methodology

ChessBench evaluates LLMs on a fixed set of Lichess tactics puzzles and scores them by exact match against the expected UCI move line. This page is the canonical explanation of what the benchmark measures, where the data comes from, and how model scores are produced.

Dataset ID: Unavailable

Puzzle count: 150

Tracks: 5

Dataset generated: Unknown

How the benchmark is built

The benchmark uses a fixed dataset generated from Lichess tactics. Each puzzle has a position, a target line, a track label, and an optional source URL back to the original Lichess puzzle.
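A single puzzle record can be sketched as a small dataclass. The field names below are illustrative assumptions based on this description, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Puzzle:
    # Starting position, e.g. a FEN string (representation is an assumption)
    position: str
    # Expected solution as a space-separated UCI move line, e.g. "e2e4 e7e5"
    target_line: str
    # One of the benchmark's five track labels
    track: str
    # Optional link back to the original Lichess puzzle
    source_url: Optional[str] = None
```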

Models are prompted to return only the final answer as a UCI move line. Scores are based on exact match against the expected move sequence, which keeps comparisons strict and reproducible across providers.
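In spirit, strict scoring reduces to a string comparison over the returned UCI line. A minimal sketch (the whitespace/case normalization here is an assumption; the benchmark's exact rules are not specified on this page):

```python
def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so pure formatting noise
    doesn't register as a wrong answer."""
    return " ".join(line.lower().split())


def is_exact_match(model_output: str, expected_line: str) -> bool:
    """Strict scoring: the model's UCI move line must equal the
    expected move sequence exactly after normalization."""
    return normalize(model_output) == normalize(expected_line)
```

Note that under this rule a partially correct line (right first move, wrong continuation) scores zero, which is what keeps comparisons strict and reproducible across providers.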

ChessBench also records parse status, latency, token usage, and repair behavior, so benchmark pages can explain whether a miss came from reasoning, formatting, or parsing.

Scoring rules

  • Strict accuracy is the percentage of puzzles where the exact expected UCI line was returned.
  • Parsed accuracy counts exact matches only among outputs that were parseable.
  • Parse rate measures how often a model returned something that could be interpreted as a legal line.
  • Before scoring, ChessBench attempts several recovery parses: strict UCI, loose UCI extraction, and SAN-to-UCI conversion.
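The three rates relate to each other in a simple way: strict accuracy is over all puzzles, parsed accuracy is over parseable outputs only, and an unparseable output can never be a match. A minimal sketch of that arithmetic, with illustrative names and result shapes:

```python
def score(results):
    """Compute strict accuracy, parse rate, and parsed accuracy.

    `results` is a list of (parsed, exact_match) boolean pairs, one per
    puzzle; unparseable outputs are counted as misses for strict accuracy.
    """
    total = len(results)
    parsed = sum(1 for p, _ in results if p)
    matches = sum(1 for p, m in results if p and m)
    return {
        # exact matches over all puzzles
        "strict_accuracy": matches / total if total else 0.0,
        # outputs that could be interpreted as a legal line
        "parse_rate": parsed / total if total else 0.0,
        # exact matches among parseable outputs only
        "parsed_accuracy": matches / parsed if parsed else 0.0,
    }
```

By construction, parsed accuracy is always greater than or equal to strict accuracy, and the two coincide when the parse rate is 100%.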

Sources and benchmark pages

Public sources

Model result pages