Benchmark Commands
The benchmark command group runs reproducible extraction benchmarks across
LLM models and datasets, then renders a leaderboard. With `--local-only` set,
a run is fully local and requires no API key.
chaoscypher benchmark --help
A benchmark config names the seed, temperature, datasets, and models for a
run. Configs ship as built-ins (e.g. extraction, quick) and can be
overlaid with user configs at <data_dir>/benchmark/config/<name>.yaml.
Datasets are referenced by id and discovered from both built-in locations and
<data_dir>/benchmark/datasets/.
Run a Benchmark
Execute a named config end-to-end:
# Run the default config (`extraction`)
chaoscypher benchmark run
# Run a specific config
chaoscypher benchmark run quick
# Run a config but only on one dataset
chaoscypher benchmark run extraction --dataset war_and_peace_tiny
# Run with no commercial models — purely local Ollama
chaoscypher benchmark run extraction --local-only
# Override seed / temperature for ad-hoc comparisons
chaoscypher benchmark run extraction --seed 17 --temperature 0.2
# Preserve the per-run temp DBs for post-hoc inspection
chaoscypher benchmark run extraction --keep-db
# Write outputs somewhere specific (default: <data_dir>/benchmark/results/)
chaoscypher benchmark run extraction --out ./bench-out
Arguments:
| Argument | Description |
|---|---|
| `NAME` | Config name. Optional; defaults to `extraction`. |
Options:
| Flag | Type | Default | Description |
|---|---|---|---|
| `--dataset ID` | string | All datasets in config | Run only one dataset id from the config's datasets list. |
| `--local-only` | flag | off | Drop commercial-provider models so the run is free / API-key-free. |
| `--seed N` | int | Config value | Override the config's seed. |
| `--temperature F` | float | Config value | Override the config's temperature. |
| `--keep-db` | flag | off | Preserve per-run temp databases for inspection. |
| `--out DIR` | path | `<data_dir>/benchmark/results/` | Output directory for JSON and Markdown. |
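The `--seed` and `--out` flags combine naturally when comparing runs across seeds. The sketch below is one way to do that, not a built-in feature: `seed_sweep` and the `BENCH_CMD` variable are hypothetical names, and the `./bench-out/seed-N` layout is only an illustration. Whether `--out` creates a missing directory is not stated here, so the helper defaults to a dry run that prints each command instead of executing it.

```shell
# Hypothetical helper: run one config across several seeds,
# writing each run to its own output directory.
# BENCH_CMD defaults to "echo chaoscypher", i.e. a dry run that only
# prints the commands. Set BENCH_CMD=chaoscypher to execute them.
seed_sweep() {
  config="$1"
  shift
  for seed in "$@"; do
    # Left unquoted on purpose so the default splits into "echo" and "chaoscypher".
    ${BENCH_CMD:-echo chaoscypher} benchmark run "$config" \
      --seed "$seed" --out "./bench-out/seed-$seed"
  done
}
```

For example, `seed_sweep extraction 17 42` previews two commands, and `BENCH_CMD=chaoscypher seed_sweep extraction 17 42` would run them.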
Output:
A run writes three files into the output directory:
- `<timestamp>.json`: machine-readable result rows
- `<timestamp>.md`: rendered Markdown leaderboard
- `latest.md`: overwritten on every run for quick `cat`/preview
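Since only the Markdown has a stable `latest` name, locating the newest results JSON takes a small helper. The sketch below assumes the timestamped filenames sort chronologically as plain strings (true of ISO-style names such as `2026-04-29T1530Z.json`, but treat it as an assumption); `latest_results` is a hypothetical helper name, not part of the CLI.

```shell
# Hypothetical helper: print the newest timestamped results JSON in a directory.
# Assumes filenames sort chronologically as strings (ISO-style timestamps).
latest_results() {
  dir="${1:-.}"
  for f in "$dir"/*.json; do
    # Skip the literal pattern when the directory holds no JSON files.
    [ -e "$f" ] && printf '%s\n' "$f"
  done | sort | tail -n 1
}
```

The result can be passed straight to the documented `show` command, for example `chaoscypher benchmark show "$(latest_results ./bench-out)"`.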
List Configs and Datasets
Show every config the runner recognizes (built-in plus any user overlay) and every dataset that a config can reference:
chaoscypher benchmark list
The output is two tables:
- Configs: name, source (`builtin`/`user`), description.
- Datasets: id, source, kind, version, domain, corpus filename.
Run this before benchmark run to see what's available, and after benchmark init to confirm the new user config is picked up.
Show a Saved Leaderboard
Re-render an existing results JSON as a Markdown leaderboard, without re-running the benchmark:
# Print to stdout
chaoscypher benchmark show ./bench-out/2026-04-29T1530Z.json
# Write to a file
chaoscypher benchmark show ./bench-out/2026-04-29T1530Z.json --out leaderboard.md
Arguments:
| Argument | Description |
|---|---|
| `RESULTS_PATH` | Path to a results JSON file produced by `benchmark run`. |
Options:
| Flag | Type | Default | Description |
|---|---|---|---|
| `--out PATH` | path | stdout | Write the rendered Markdown to this path instead of stdout. |
Scaffold a User Config
Drop a starter config under <data_dir>/benchmark/config/<name>.yaml for local
editing. After init, benchmark run <name> loads the user config (which
overrides any built-in with the same name).
# Create a new config named "smoke"
chaoscypher benchmark init smoke
# Overwrite an existing user config
chaoscypher benchmark init smoke --force
Arguments:
| Argument | Description |
|---|---|
| `NAME` | Config name. Becomes the file name and the value of the `name:` field. |
Options:
| Flag | Type | Default | Description |
|---|---|---|---|
| `--force` | flag | off | Overwrite an existing user config with the same name. |
The starter config references the war_and_peace_tiny built-in dataset and a
single Ollama model, so the resulting benchmark run <name> runs end-to-end
without further setup.
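The override rule comes down to whether a file exists at the user config path. The sketch below illustrates that check by hand; `benchmark list` remains the authoritative view of each config's source. `config_source` is a hypothetical helper, and `DATA_DIR` is an assumed shell variable standing in for `<data_dir>`, whose actual location is not specified here.

```shell
# Hypothetical helper: report whether a config name resolves to a user file.
# DATA_DIR is an assumed variable standing in for <data_dir>.
config_source() {
  name="$1"
  if [ -f "${DATA_DIR:?set DATA_DIR first}/benchmark/config/$name.yaml" ]; then
    echo "user"
  else
    # No user file: the name is either a built-in or does not exist at all.
    echo "builtin-or-missing"
  fi
}
```

After `chaoscypher benchmark init smoke`, `config_source smoke` should print `user` if `DATA_DIR` points at the right directory.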
Workflow
A typical benchmarking session:
# 1. See what's already available
chaoscypher benchmark list
# 2. Smoke-test the runner on a small built-in config
chaoscypher benchmark run quick --local-only
# 3. Scaffold a user config for a custom comparison
chaoscypher benchmark init my-comparison
# 4. Edit <data_dir>/benchmark/config/my-comparison.yaml,
# add datasets / models, then run it
chaoscypher benchmark run my-comparison --local-only
# 5. Re-render the saved leaderboard later without re-running
chaoscypher benchmark show <data_dir>/benchmark/results/<timestamp>.json