# empirical-prompt-tuning
Iteratively improve a skill or prompt by having a fresh subagent execute it against fixed scenarios, scoring each run from two sides (executor self-report plus a frozen requirements checklist), and looping until improvements plateau.
## When to use
- After authoring or substantially revising a skill, slash command, subagent prompt, or CLAUDE.md addition.
- Before promoting a skill from draft to published.
- When a user reports unexpected skill behavior and you suspect instruction-side ambiguity.
## When not to use
- One-off throwaway prompts.
- Trivial wording changes.
- Personal taste disagreements (the method scores against frozen requirements, not preferences).
## Loop
- Description / body consistency check (no dispatch).
- Freeze 2–3 scenarios with `[critical]`-tagged requirements in `evals/<skill>/scenarios.yaml` (see the sketch after this list).
- Dispatch a fresh subagent per scenario via the `Task` tool.
- Score two-sided: executor self-report (phase-tagged) + instruction-side metrics (pass/fail, accuracy, steps, duration, retries).
- Apply the minimum fix; consult `evals/<skill>/ledger.md` first for recurring patterns.
- Re-run with a new subagent.
- Stop when two consecutive iterations plateau (three for high-importance skills).
- Append a passing entry to `evals/<skill>/ledger.md`.
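
Freezing the scenarios up front is what keeps every iteration scored against the same bar. Below is a minimal sketch of what `evals/<skill>/scenarios.yaml` could look like; the keys (`id`, `prompt`, `requirements`) are illustrative assumptions rather than the upstream schema, and only the file path and the `[critical]` tag convention come from this page:

```yaml
# Hypothetical layout for evals/<skill>/scenarios.yaml; the keys below
# are assumed for illustration, not taken from the upstream skill.
scenarios:
  - id: happy-path
    prompt: "Run the skill end to end on a typical request."
    requirements:
      - "[critical] A fresh subagent executes the scenario (no reused context)"
      - "[critical] The executor self-report is tagged per phase"
      - "Completes in 10 steps or fewer"
  - id: ambiguous-input
    prompt: "Run the skill against an underspecified request."
    requirements:
      - "[critical] Ambiguity is surfaced, not silently resolved"
```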
## Bookkeeping helpers
```bash
bash scripts/eval-skill.sh init <skill>      # scaffold evals/<skill>/
bash scripts/eval-skill.sh new-run <skill>   # create timestamped run note
bash scripts/eval-skill.sh status [<skill>]  # show iteration / last ledger entry
```
The full flow lives in Skill Evaluation. The CI quality gate (`eval-quality-gate.yml`) verifies that every modified skill has a recent passing ledger entry.
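
As a rough illustration of how such a gate could be wired, here is a hypothetical sketch of `eval-quality-gate.yml`; the `skills/<name>/` layout, the ledger-entry format, and the grep-based check are all assumptions, and only the file name and the "recent passing ledger entry" requirement come from this page:

```yaml
# Hypothetical sketch, not the actual workflow in this repo.
name: eval-quality-gate
on: [pull_request]
jobs:
  ledger-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so the base branch is diffable
      - name: Require a passing ledger entry per modified skill
        run: |
          # Assumed layout: skills under skills/<name>/, evals under
          # evals/<name>/, ledger entries as "- pass ..." bullets.
          # Recency checking is elided in this sketch.
          for skill in $(git diff --name-only "origin/${{ github.base_ref }}" \
              | grep '^skills/' | cut -d/ -f2 | sort -u); do
            grep -q '^- pass' "evals/$skill/ledger.md" \
              || { echo "::error::no passing ledger entry for $skill"; exit 1; }
          done
```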
## Attribution
The method comes from `mizchi/skills/empirical-prompt-tuning`. To install the upstream skill verbatim:
```bash
apm install -g mizchi/skills/empirical-prompt-tuning
```
## Install
```bash
gh skill install s-hiraoku/claude-harnesses empirical-prompt-tuning --scope project
# or
npx skills add s-hiraoku/claude-harnesses --skill empirical-prompt-tuning
```