Benchmark reference
ChessBench dataset and scoring methodology
ChessBench benchmarks LLMs on a fixed set of Lichess tactics and scores each answer by exact match against the expected UCI move line. This page is the canonical explanation of what the benchmark measures, where the data comes from, and how model scores are produced.
- Dataset ID: Unavailable
- Puzzle count: 150
- Tracks: 5
- Dataset generated: Unknown
How the benchmark is built
The benchmark uses a fixed dataset generated from Lichess tactics. Each puzzle has a position, a target line, a track label, and an optional source URL back to the original Lichess puzzle.
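As an illustration, one puzzle record could be modeled along these lines; the field names below (fen, expected_line, track, source_url) are assumptions for this page rather than the dataset builder's exact schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Puzzle:
    """One ChessBench puzzle (illustrative field names, not the exact schema)."""
    puzzle_id: str                     # stable identifier within the dataset
    fen: str                           # position the model is asked to solve
    expected_line: str                 # target solution as a space-separated UCI move line
    track: str                         # track label used to group puzzles
    source_url: Optional[str] = None   # optional link back to the original Lichess puzzle
```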
Models are prompted to return only the final answer as a UCI move line. Scores are based on exact match against the expected move sequence, which keeps comparisons strict and reproducible across providers.
ChessBench also records parse status, latency, token usage, and repair behavior, so benchmark pages can explain whether a miss came from reasoning, formatting, or parsing.
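The exact-match check itself is simple; the sketch below shows one way to score a single response, where the whitespace normalization and the returned fields are assumptions rather than the benchmark's actual implementation.

```python
import re

# A bare UCI move: from-square, to-square, optional promotion piece.
UCI_MOVE = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")


def score_response(raw_output: str, expected_line: str) -> dict:
    """Compare a model's raw answer against the expected UCI move line.

    Reports whether the output parsed as a bare UCI line ("parsed") and
    whether it matched the target sequence exactly ("exact_match"), which
    is the strict-accuracy criterion.
    """
    tokens = raw_output.strip().lower().split()
    parsed = bool(tokens) and all(UCI_MOVE.match(t) for t in tokens)
    expected = expected_line.strip().lower().split()
    return {"parsed": parsed, "exact_match": parsed and tokens == expected}


# A correctly formatted answer that matches the target line:
print(score_response("e2e4 e7e5 g1f3", "e2e4 e7e5 g1f3"))
# -> {'parsed': True, 'exact_match': True}
```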
Scoring rules
- Strict accuracy is the percentage of puzzles where the exact expected UCI line was returned.
- Parsed accuracy counts exact matches only among outputs that were parseable.
- Parse rate measures how often a model returned something that could be interpreted as a legal line.
- Before scoring, ChessBench attempts to recover a usable move line via strict UCI parsing, loose UCI extraction, and SAN-to-UCI conversion, as sketched below.
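The arithmetic behind these metrics is simple, and the SAN repair step can be illustrated with python-chess. The sketch below assumes per-puzzle results shaped like the output of the scorer above; the function names are hypothetical, and the loose UCI extraction step is omitted for brevity.

```python
import chess  # python-chess, used here only to illustrate SAN-to-UCI recovery


def san_line_to_uci(fen: str, text: str) -> str | None:
    """Reinterpret a SAN answer such as 'Nf3 Nc6' as a UCI move line, if legal."""
    board = chess.Board(fen)
    uci_moves = []
    for san in text.split():
        try:
            move = board.parse_san(san)   # raises ValueError on illegal/ambiguous SAN
        except ValueError:
            return None
        uci_moves.append(move.uci())
        board.push(move)
    return " ".join(uci_moves)


def aggregate(results: list[dict]) -> dict:
    """Compute the headline metrics from a non-empty list of per-puzzle results."""
    total = len(results)
    parsed = sum(r["parsed"] for r in results)
    exact = sum(r["exact_match"] for r in results)
    return {
        "strict_accuracy": exact / total,                       # exact matches over all puzzles
        "parsed_accuracy": exact / parsed if parsed else 0.0,   # exact matches among parseable outputs
        "parse_rate": parsed / total,                           # parseable outputs over all puzzles
    }
```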
Sources and benchmark pages
Public sources
- The Lichess database provides the underlying puzzle source data.
- ChessBench on GitHub contains the dataset builder, scoring code, and benchmark results.
Model result pages
- Grok 4.1 Fast benchmark page with 58.7% strict accuracy.
- Gemini 3.1 Pro Preview benchmark page with 55.3% strict accuracy.
- Grok 4.20 Beta benchmark page with 53.3% strict accuracy.
- Gemini 3 Flash Preview benchmark page with 48.7% strict accuracy.
- Gemini 3.1 Flash Image Preview benchmark page with 46.0% strict accuracy.