
Benchmark Commands

The benchmark command group runs reproducible extraction benchmarks across LLM models and datasets, then renders a leaderboard. With --local-only set, a run uses only local models and requires no API key.

chaoscypher benchmark --help

A benchmark config names the seed, temperature, datasets, and models for a run. Configs ship as built-ins (e.g. extraction, quick) and can be overlaid with user configs at <data_dir>/benchmark/config/<name>.yaml. Datasets are referenced by id and discovered from both built-in locations and <data_dir>/benchmark/datasets/.


Run a Benchmark

Execute a named config end-to-end:

# Run the default config (`extraction`)
chaoscypher benchmark run

# Run a specific config
chaoscypher benchmark run quick

# Run a config but only on one dataset
chaoscypher benchmark run extraction --dataset war_and_peace_tiny

# Run with no commercial models — purely local Ollama
chaoscypher benchmark run extraction --local-only

# Override seed / temperature for ad-hoc comparisons
chaoscypher benchmark run extraction --seed 17 --temperature 0.2

# Preserve the per-run temp DBs for post-hoc inspection
chaoscypher benchmark run extraction --keep-db

# Write outputs somewhere specific (default: <data_dir>/benchmark/results/)
chaoscypher benchmark run extraction --out ./bench-out

Arguments:

| Argument | Description |
|----------|-------------|
| NAME | Config name. Optional; defaults to extraction. |

Options:

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --dataset ID | string | All datasets in config | Run only one dataset id from the config's datasets list. |
| --local-only | flag | off | Drop commercial-provider models so the run is free / API-key-free. |
| --seed N | int | Config value | Override the config's seed. |
| --temperature F | float | Config value | Override the config's temperature. |
| --keep-db | flag | off | Preserve per-run temp databases for inspection. |
| --out DIR | path | <data_dir>/benchmark/results/ | Output directory for JSON and Markdown. |
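
Because --seed and --out can be combined, a seed sweep can be scripted so that each seed writes to its own directory. The following is a minimal sketch, not part of the CLI: the seed_sweep function, the CHAOSCYPHER variable, and the seed-<N> directory layout are illustrative assumptions. The directory is created up front because this page does not say whether benchmark run creates a missing --out directory.

```shell
# Binary to invoke; point it at `echo` for a dry run (assumed convention, not a CLI feature).
CHAOSCYPHER="${CHAOSCYPHER:-chaoscypher}"

# seed_sweep CONFIG OUT_ROOT SEED...
# Runs CONFIG once per seed, writing each run to OUT_ROOT/seed-<SEED>.
seed_sweep() {
    config="$1"
    out_root="$2"
    shift 2
    for seed in "$@"; do
        # Create the per-seed directory ourselves rather than rely on the CLI to do it.
        mkdir -p "$out_root/seed-$seed" || return 1
        "$CHAOSCYPHER" benchmark run "$config" \
            --local-only \
            --seed "$seed" \
            --out "$out_root/seed-$seed" || return 1
    done
}
```

For example, seed_sweep extraction ./bench-out 17 18 19 would run the extraction config three times with local models only, stopping at the first failing run.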

Output:

A run writes three files into the output directory:

  • <timestamp>.json — machine-readable result rows
  • <timestamp>.md — rendered Markdown leaderboard
  • latest.md — overwritten on every run for quick cat/preview
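
To pick up the most recent machine-readable result in a script, the newest <timestamp>.json can be located by name. This sketch assumes the timestamp format seen in the examples on this page (such as 2026-04-29T1530Z), which sorts chronologically as a plain string; the latest_results helper is hypothetical and not part of the CLI.

```shell
# latest_results DIR
# Prints the path of the newest results JSON in DIR, or nothing if there is none.
latest_results() {
    results_dir="$1"
    for f in "$results_dir"/*.json; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done | LC_ALL=C sort | tail -n 1
}
```

The result can be passed straight to the show subcommand, for instance chaoscypher benchmark show "$(latest_results ./bench-out)".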

List Configs and Datasets

Show every config the runner recognizes (built-in plus any user overlay), and every dataset that configs can reference:

chaoscypher benchmark list

The output is two tables:

  • Configs — name, source (builtin / user), description.
  • Datasets — id, source, kind, version, domain, corpus filename.

Run this before benchmark run to see what's available, and after benchmark init to confirm the new user config is picked up.


Show a Saved Leaderboard

Re-render an existing results JSON as a Markdown leaderboard, without re-running the benchmark:

# Print to stdout
chaoscypher benchmark show ./bench-out/2026-04-29T1530Z.json

# Write to a file
chaoscypher benchmark show ./bench-out/2026-04-29T1530Z.json --out leaderboard.md

Arguments:

| Argument | Description |
|----------|-------------|
| RESULTS_PATH | Path to a results JSON file produced by benchmark run. |

Options:

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --out PATH | path | stdout | Write the rendered Markdown to this path instead of stdout. |

Scaffold a User Config

Drop a starter config under <data_dir>/benchmark/config/<name>.yaml for local editing. After init, benchmark run <name> loads the user config (which overrides any built-in with the same name).

# Create a new config named "smoke"
chaoscypher benchmark init smoke

# Overwrite an existing user config
chaoscypher benchmark init smoke --force

Arguments:

| Argument | Description |
|----------|-------------|
| NAME | Config name. Becomes the file name and the value of the name: field. |

Options:

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --force | flag | off | Overwrite an existing user config with the same name. |

The starter config references the war_and_peace_tiny built-in dataset and a single Ollama model, so the resulting benchmark run <name> runs end-to-end without further setup.
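
In setup scripts it can be useful to scaffold a config only when no user config exists yet, so that local edits are never overwritten. The sketch below checks the documented path before calling init. The init_once function and the CHAOSCYPHER variable are hypothetical, and the data directory must be passed in by the caller, since this page does not describe how <data_dir> is resolved.

```shell
# Binary to invoke; point it at `echo` for a dry run (assumed convention, not a CLI feature).
CHAOSCYPHER="${CHAOSCYPHER:-chaoscypher}"

# init_once NAME DATA_DIR
# Scaffolds NAME only if DATA_DIR/benchmark/config/NAME.yaml is absent.
init_once() {
    name="$1"
    data_dir="$2"
    config_path="$data_dir/benchmark/config/$name.yaml"
    if [ -f "$config_path" ]; then
        # Leave an edited config alone instead of reaching for --force.
        echo "keeping existing user config: $config_path"
    else
        "$CHAOSCYPHER" benchmark init "$name"
    fi
}
```

Calling init_once smoke /path/to/data twice should scaffold the config on the first call and leave it alone on the second, provided /path/to/data is the data directory chaoscypher actually uses.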


Workflow

A typical benchmarking session:

# 1. See what's already available
chaoscypher benchmark list

# 2. Smoke-test the runner on a small built-in config
chaoscypher benchmark run quick --local-only

# 3. Scaffold a user config for a custom comparison
chaoscypher benchmark init my-comparison

# 4. Edit <data_dir>/benchmark/config/my-comparison.yaml,
# add datasets / models, then run it
chaoscypher benchmark run my-comparison --local-only

# 5. Re-render the saved leaderboard later without re-running
chaoscypher benchmark show <data_dir>/benchmark/results/<timestamp>.json