Benchmark reference
ChessBench dataset and scoring methodology
ChessBench benchmarks LLMs on a fixed set of Lichess tactics and scores each answer by exact match against the expected UCI move line. This page is the canonical explanation of what the benchmark measures, where the data comes from, and how model scores are produced.
- Dataset ID: Unavailable
- Puzzle count: 150
- Tracks: 5
- Dataset generated: Unknown
How the benchmark is built
The benchmark uses a fixed dataset generated from Lichess tactics. Each puzzle has a position, a target line, a track label, and an optional source URL back to the original Lichess puzzle.
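As an illustration, one puzzle record could be modeled along these lines; the field names below (fen, expected_line, track, source_url) are assumptions for this page rather than the dataset builder's exact schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Puzzle:
    """One ChessBench puzzle (illustrative field names, not the exact schema)."""
    puzzle_id: str                     # stable identifier within the dataset
    fen: str                           # position the model is asked to solve
    expected_line: str                 # target solution as a space-separated UCI move line
    track: str                         # track label used to group puzzles
    source_url: Optional[str] = None   # optional link back to the original Lichess puzzle
```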
Models are prompted to return only the final answer as a UCI move line. Scores are based on exact match against the expected move sequence, which keeps comparisons strict and reproducible across providers.
ChessBench also records parse status, latency, token usage, and repair behavior, so benchmark pages can explain whether a miss came from reasoning, formatting, or parsing.
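The exact-match check itself is simple; the sketch below shows one way to score a single response, where the whitespace normalization and the returned fields are assumptions rather than the benchmark's actual implementation.

```python
import re

# A bare UCI move: from-square, to-square, optional promotion piece.
UCI_MOVE = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")


def score_response(raw_output: str, expected_line: str) -> dict:
    """Compare a model's raw answer against the expected UCI move line.

    Reports whether the output parsed as a bare UCI line ("parsed") and
    whether it matched the target sequence exactly ("exact_match"), which
    is the strict-accuracy criterion.
    """
    tokens = raw_output.strip().lower().split()
    parsed = bool(tokens) and all(UCI_MOVE.match(t) for t in tokens)
    expected = expected_line.strip().lower().split()
    return {"parsed": parsed, "exact_match": parsed and tokens == expected}


# A correctly formatted answer that matches the target line:
print(score_response("e2e4 e7e5 g1f3", "e2e4 e7e5 g1f3"))
# -> {'parsed': True, 'exact_match': True}
```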
Scoring rules
- Strict accuracy is the percentage of puzzles where the exact expected UCI line was returned.
- Parsed accuracy counts exact matches only among outputs that were parseable.
- Parse rate measures how often a model returned something that could be interpreted as a legal line.
- Before scoring, ChessBench attempts to recover a usable move line via strict UCI parsing, loose UCI extraction, and SAN-to-UCI conversion, as sketched below.
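The arithmetic behind these metrics is simple, and the SAN repair step can be illustrated with python-chess. The sketch below assumes per-puzzle results shaped like the output of the scorer above; the function names are hypothetical, and the loose UCI extraction step is omitted for brevity.

```python
import chess  # python-chess, used here only to illustrate SAN-to-UCI recovery


def san_line_to_uci(fen: str, text: str) -> str | None:
    """Reinterpret a SAN answer such as 'Nf3 Nc6' as a UCI move line, if legal."""
    board = chess.Board(fen)
    uci_moves = []
    for san in text.split():
        try:
            move = board.parse_san(san)   # raises ValueError on illegal/ambiguous SAN
        except ValueError:
            return None
        uci_moves.append(move.uci())
        board.push(move)
    return " ".join(uci_moves)


def aggregate(results: list[dict]) -> dict:
    """Compute the headline metrics from a non-empty list of per-puzzle results."""
    total = len(results)
    parsed = sum(r["parsed"] for r in results)
    exact = sum(r["exact_match"] for r in results)
    return {
        "strict_accuracy": exact / total,                       # exact matches over all puzzles
        "parsed_accuracy": exact / parsed if parsed else 0.0,   # exact matches among parseable outputs
        "parse_rate": parsed / total,                           # parseable outputs over all puzzles
    }
```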
Sources and benchmark pages
Public sources
- The Lichess database provides the underlying puzzle source data.
- ChessBench on GitHub contains the dataset builder, scoring code, and benchmark results.
Model result pages
- Grok 4.1 Fast benchmark page with 58.7% strict accuracy.
- Gemini 3.1 Pro Preview benchmark page with 55.3% strict accuracy.
- Grok 4.20 Beta benchmark page with 53.3% strict accuracy.
- Gemini 3 Flash Preview benchmark page with 48.7% strict accuracy.
- Gemini 3.1 Flash Image Preview benchmark page with 46.0% strict accuracy.