# Skill Evaluation
Every skill in this repository is treated as a prompt that must be empirically tuned before release. This page is the practical how-to. The method itself lives in the `empirical-prompt-tuning` skill, adapted from `mizchi/skills/empirical-prompt-tuning`.
## Why bother
The author of a skill cannot judge its quality: re-reading your own draft from "the same head" will not surface the ambiguities that trip up a fresh agent. The only reliable test is to dispatch a bias-free executor, score against a frozen requirements checklist, and iterate until improvements plateau.
## When to run an evaluation

- After authoring a new skill.
- After substantially revising a skill's body or description.
- Before promoting a skill from draft to published (i.e. before merging to main).
- When a user reports unexpected behavior and you suspect instruction-side ambiguity.

Skip evaluation only for typo / link / formatting fixes — and use `[skip-eval]` in the PR description so CI knows.
## End-to-end example: evaluating the `review` skill
Concrete walkthrough of one full iteration. Every other skill follows the same shape.
### Step 1 — scaffold the eval directory

```bash
bash scripts/eval-skill.sh init review
```

Creates:

```
evals/review/
├── scenarios.yaml   # to be filled
├── runs/            # timestamped run notes will land here
└── ledger.md        # dated iteration summaries
```
### Step 2 — write `evals/review/scenarios.yaml`

Pick 2–3 scenarios (1 typical + 1 edge minimum). For each, list 3–7 requirements, with at least one tagged `[critical]`. Once a run has been recorded against this file, do not edit the checklists retroactively — that turns the eval into a vibes check.
```yaml
skill: review
scenarios:
  - id: typical-pr
    kind: typical
    prompt: |
      Review the current branch as if it were a PR adding rate limiting
      to src/api/auth.ts. The branch adds a new middleware, modifies
      src/api/server.ts to wire it in, and adds 2 tests. Produce a
      severity-ranked review.
    requirements:
      - text: Findings are listed in severity order (Critical → Info).
        critical: true
      - text: Each finding cites a file:line.
        critical: true
      - text: A one-line verdict (Ready to Merge / Needs Attention / Needs Work) is included.
        critical: true
      - text: Style nits are not raised unless they obscure correctness.
        critical: false
  - id: empty-diff
    kind: edge
    prompt: |
      Review the current branch. The branch has zero commits relative to main.
    requirements:
      - text: The skill recognizes the empty diff and reports it explicitly.
        critical: true
      - text: No invented findings.
        critical: true
```
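Because the checklist is frozen once a run records against it, it is worth sanity-checking the file before the first run. A minimal sketch of such a pre-flight check; the function name and the use of PyYAML are assumptions, not part of the repo's tooling:

```python
# Hypothetical pre-flight check -- not part of scripts/. Assumes PyYAML is installed.
import yaml

def lint_scenarios(path: str) -> list[str]:
    with open(path) as f:
        doc = yaml.safe_load(f)
    scenarios = doc.get("scenarios", [])
    problems = []
    if len(scenarios) < 2:
        problems.append("fewer than 2 scenarios (want 1 typical + 1 edge minimum)")
    if not any(s.get("kind") == "edge" for s in scenarios):
        problems.append("no edge scenario")
    for s in scenarios:
        reqs = s.get("requirements", [])
        if not any(r.get("critical") for r in reqs):
            problems.append(f"{s.get('id')}: no requirement tagged [critical]")
    return problems

print(lint_scenarios("evals/review/scenarios.yaml"))  # [] when the file is well-formed
```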
### Step 3 — create a run note

```bash
bash scripts/eval-skill.sh new-run review
```

This creates `evals/review/runs/<timestamp>.md` from the template, pre-filled with the current skill commit hash and iteration number.
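For orientation, here is a hypothetical skeleton of what a filled run note covers. The section names mirror the scoring described in Step 4; the exact layout of the real template in the repo may differ:

```markdown
# review run <timestamp>  (iteration N, skill commit <hash>)

## Scenario: typical-pr
### Executor self-report (phase-tagged)
- Understanding: ...
- Planning: ...
- Execution: ...
- Formatting: ...
### Instruction-side metrics
pass/fail | accuracy % | tool steps | duration ms | retry count
### Reflection (per unclear point)
- Issue: ... | Cause: ... | General Fix Rule: ...
```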
### Step 4 — ask Claude to run the evaluation
Open Claude Code in the repo and say:
> Use the `empirical-prompt-tuning` skill to evaluate the `review` skill for one iteration. Read `evals/review/scenarios.yaml`, dispatch one fresh subagent per scenario via the Task tool (in parallel), and fill in `evals/review/runs/<the new file>.md`. Do not paste the skill body to the subagents — they must read it cold.
Claude will:

- Read `scenarios.yaml`.
- Spawn one `general-purpose` subagent per scenario in a single message (parallel).
- Pass each subagent only the scenario `prompt` field. Not the SKILL.md content.
- Collect the subagents' returns plus `tool_uses` / `duration_ms` from the Task usage meta.
- Score two-sided into the run file:
    - Executor self-report (phase-tagged: Understanding / Planning / Execution / Formatting)
    - Instruction-side metrics: pass/fail (only if all `[critical]` items pass), accuracy %, tool steps, duration ms, retry count (see the scoring sketch after this list)
    - Structured reflection per unclear point: Issue / Cause / General Fix Rule
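The instruction-side metrics follow a mechanical rule, so a small sketch may help make it concrete. The data shapes and function below are illustrative and not part of the repo's scripts; only the pass/fail and accuracy rules come from the list above:

```python
# Illustrative only: run files are filled in by Claude, not by this code.
# It simply encodes the scoring rule described above.
from dataclasses import dataclass

@dataclass
class RequirementResult:
    text: str
    critical: bool
    met: bool

def score_scenario(results: list[RequirementResult]) -> dict:
    # Pass/fail: a scenario passes only if every [critical] requirement is met.
    passed = all(r.met for r in results if r.critical)
    # Accuracy: fraction of all requirements met, critical or not.
    accuracy = sum(r.met for r in results) / len(results)
    return {"pass": passed, "accuracy_pct": round(100 * accuracy, 1)}

# Example: one non-critical miss still passes but drops accuracy to 75%.
demo = [
    RequirementResult("severity order",   critical=True,  met=True),
    RequirementResult("file:line cited",  critical=True,  met=True),
    RequirementResult("one-line verdict", critical=True,  met=True),
    RequirementResult("no style nits",    critical=False, met=False),
]
print(score_scenario(demo))  # {'pass': True, 'accuracy_pct': 75.0}
```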
### Step 5 — apply the minimum fix
Read the filled-in run file. Before editing the SKILL.md, write down which checklist item the fix is meant to satisfy. Then make the smallest change to skills/review/SKILL.md that addresses it. One theme per iteration; unrelated cleanups go to the next iteration.
Consult evals/review/ledger.md first. If the same General Fix Rule keeps appearing, the existing fix is in the wrong place — move it before adding a new ledger entry.
### Step 6 — re-run with a NEW subagent

```bash
bash scripts/eval-skill.sh new-run review
```

Then ask Claude again with the same prompt as in Step 4. Do not reuse the previous subagent — it has already read the prior text, so the eval becomes a reading-comprehension test of the previous version.
### Step 7 — converge
Stop when:
- Two consecutive iterations produce zero new unclear points, AND
- Accuracy / step count / duration improvements drop below 5%.
For high-importance skills (e.g. anything in safety-pack or verification-pack), require three consecutive plateau iterations.
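If the per-iteration numbers are tracked, the stopping rule can be checked mechanically. A small sketch, assuming a list of per-iteration metric dicts; the key names and the helper itself are assumptions, while the 5% threshold and the consecutive-plateau counts come from the criteria above:

```python
# Hypothetical convergence check for the Step 7 stopping rule.
def has_converged(iterations: list[dict], required_plateaus: int = 2) -> bool:
    """iterations: oldest-to-newest dicts with keys
    'new_unclear_points', 'accuracy', 'steps', 'duration_ms'."""
    plateaus = 0
    for prev, cur in zip(iterations, iterations[1:]):
        no_new_issues = cur["new_unclear_points"] == 0
        small_change = all(
            abs(cur[k] - prev[k]) / max(prev[k], 1) < 0.05
            for k in ("accuracy", "steps", "duration_ms")
        )
        plateaus = plateaus + 1 if (no_new_issues and small_change) else 0
    return plateaus >= required_plateaus  # pass 3 for safety-pack / verification-pack skills
```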
### Step 8 — record in the ledger

Append one line to `evals/review/ledger.md`:

```
- 2026-05-09: iter 3, 2/2 scenarios pass, accuracy 100%, plateau confirmed (2 consecutive)
```
The CI gate looks for an entry like this.
## CI quality gate

Every PR that modifies `skills/<name>/SKILL.md` is checked by `.github/workflows/eval-quality-gate.yml`. The gate runs `scripts/check-eval-coverage.py`, which:

- Lists changed `SKILL.md` files via `git diff` against the base ref.
- For each, requires `evals/<name>/ledger.md` to contain a dated entry within the last 14 days that mentions `pass`, `plateau`, or `converged`.
- Bypasses cleanly when `[skip-eval]` appears in the PR body (use only for trivial changes — typos, link fixes — with a one-line justification).
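To make the gate's logic concrete, here is a rough sketch of the kind of check it performs. This is not the actual `scripts/check-eval-coverage.py`; the regex, date handling, and function names are assumptions:

```python
# Rough sketch only -- not the real scripts/check-eval-coverage.py.
import re
import subprocess
from datetime import date, timedelta
from pathlib import Path

LEDGER_RE = re.compile(r"-\s*(\d{4}-\d{2}-\d{2}):.*\b(pass|plateau|converged)\b")

def changed_skills(base_ref: str = "origin/main") -> list[str]:
    # Skill names whose SKILL.md changed relative to the base ref.
    out = subprocess.run(
        ["git", "diff", "--name-only", base_ref, "--", "skills/*/SKILL.md"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [Path(p).parent.name for p in out.splitlines() if p]

def has_recent_entry(skill: str, window_days: int = 14) -> bool:
    # True if the ledger has a dated pass/plateau/converged entry in the window.
    ledger = Path(f"evals/{skill}/ledger.md")
    if not ledger.exists():
        return False
    cutoff = date.today() - timedelta(days=window_days)
    return any(
        date.fromisoformat(m.group(1)) >= cutoff
        for m in LEDGER_RE.finditer(ledger.read_text())
    )
```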
If the gate fails, run another iteration, append a fresh ledger entry, and push.
## Status overview

```bash
bash scripts/eval-skill.sh status          # all skills with eval dirs
bash scripts/eval-skill.sh status review   # one specific skill
```
Output shows the run count and the most recent dated ledger entry.
## Prompt template for Claude

Drop this into a Claude Code session whenever you start an iteration. Replace `<skill>` with the skill under evaluation.

> Use the `empirical-prompt-tuning` skill to run iteration N of evaluating `<skill>`.
>
> 1. Read `evals/<skill>/scenarios.yaml`.
> 2. Use `bash scripts/eval-skill.sh new-run <skill>` to create the run file (or use the latest empty one if I already created it).
> 3. Dispatch one `general-purpose` subagent per scenario in parallel via the Task tool. Pass each subagent ONLY the scenario `prompt` field — do not paste or summarize the SKILL.md.
> 4. Capture from each return: full subagent response, `tool_uses`, `duration_ms`.
> 5. Fill the run file with two-sided scoring (executor self-report phase-tagged, instruction-side metrics, structured reflection per unclear point).
> 6. Propose the minimum fix to `skills/<skill>/SKILL.md`, citing the checklist item it addresses. Do not apply yet.
> 7. End with the proposed diff and ask me to approve before writing.
## Anti-patterns
- Re-reading your own draft and "deciding it's clear." Your head has the missing context already. Dispatch a fresh subagent.
- Editing the requirements checklist after seeing the run. Converts the eval to a vibes check.
- Reusing the same subagent across iterations. It has learned the prior text.
- Adding a new ledger entry for a recurring class of failure. That just multiplies notes. Move the existing fix to a more prominent position first.
- Pasting the SKILL.md into the subagent's prompt. It must read the skill the way a future user would — cold.
## Attribution

The method comes from `mizchi/skills/empirical-prompt-tuning`. To use the upstream skill verbatim instead of our adaptation:

```bash
apm install -g mizchi/skills/empirical-prompt-tuning
```