# empirical-prompt-tuning
Iteratively improve a skill or prompt by having a fresh subagent execute it against fixed scenarios, scoring each run from two sides (executor self-report plus a frozen requirements checklist), and looping until improvements plateau.
## When to use
- After authoring or substantially revising a skill, slash command, subagent prompt, or CLAUDE.md addition.
- Before promoting a skill from draft to published.
- When a user reports unexpected skill behavior and you suspect instruction-side ambiguity.
## When not to use
- One-off throwaway prompts.
- Trivial wording changes.
- Personal taste disagreements (the method scores against frozen requirements, not preferences).
## Loop
- Description / body consistency check (no dispatch).
- Freeze 2–3 scenarios with `[critical]`-tagged requirements in `evals/<skill>/scenarios.yaml` (see the sketch after this list).
- Dispatch a fresh subagent per scenario via the `Task` tool.
- Score two-sided: executor self-report (phase-tagged) + instruction-side metrics (pass/fail, accuracy, steps, duration, retries).
- Apply the minimum fix; consult `evals/<skill>/ledger.md` first for recurring patterns.
- Re-run with a new subagent.
- Stop when two consecutive iterations plateau (three for high-importance skills).
- Append a passing entry to `evals/<skill>/ledger.md`.
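
Freezing the scenarios up front is what keeps every iteration scored against the same bar. Below is a minimal sketch of what `evals/<skill>/scenarios.yaml` could look like; the keys (`id`, `prompt`, `requirements`) are illustrative assumptions rather than the upstream schema, and only the file path and the `[critical]` tag convention come from this page:

```yaml
# Hypothetical layout for evals/<skill>/scenarios.yaml; the keys below
# are assumed for illustration, not taken from the upstream skill.
scenarios:
  - id: happy-path
    prompt: "Run the skill end to end on a typical request."
    requirements:
      - "[critical] A fresh subagent executes the scenario (no reused context)"
      - "[critical] The executor self-report is tagged per phase"
      - "Completes in 10 steps or fewer"
  - id: ambiguous-input
    prompt: "Run the skill against an underspecified request."
    requirements:
      - "[critical] Ambiguity is surfaced, not silently resolved"
```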
## Bookkeeping helpers
```bash
bash scripts/eval-skill.sh init <skill>      # scaffold evals/<skill>/
bash scripts/eval-skill.sh new-run <skill>   # create timestamped run note
bash scripts/eval-skill.sh status [<skill>]  # show iteration / last ledger entry
```
The full flow lives in Skill Evaluation. The CI quality gate (`eval-quality-gate.yml`) verifies that every modified skill has a recent passing ledger entry.
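
As a rough illustration of how such a gate could be wired, here is a hypothetical sketch of `eval-quality-gate.yml`; the `skills/<name>/` layout, the ledger-entry format, and the grep-based check are all assumptions, and only the file name and the "recent passing ledger entry" requirement come from this page:

```yaml
# Hypothetical sketch, not the actual workflow in this repo.
name: eval-quality-gate
on: [pull_request]
jobs:
  ledger-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so the base branch is diffable
      - name: Require a passing ledger entry per modified skill
        run: |
          # Assumed layout: skills under skills/<name>/, evals under
          # evals/<name>/, ledger entries as "- pass ..." bullets.
          # Recency checking is elided in this sketch.
          for skill in $(git diff --name-only "origin/${{ github.base_ref }}" \
              | grep '^skills/' | cut -d/ -f2 | sort -u); do
            grep -q '^- pass' "evals/$skill/ledger.md" \
              || { echo "::error::no passing ledger entry for $skill"; exit 1; }
          done
```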
## Attribution
The method comes from `mizchi/skills/empirical-prompt-tuning`. To install the upstream skill verbatim:
```bash
apm install -g mizchi/skills/empirical-prompt-tuning
```
## Install
```bash
gh skill install s-hiraoku/claude-harnesses empirical-prompt-tuning --scope project
# or
npx skills add s-hiraoku/claude-harnesses --skill empirical-prompt-tuning
```