# autooptimise
Autonomous benchmark-driven skill optimisation for OpenClaw. Inspired by Andrej Karpathy's autoresearch — the same modify → test → score → keep/discard loop, applied to agent skill quality instead of GPU training.
## Trigger Phrases

- "optimise my weather skill"
- "run autooptimise on [skill-name]"
- "benchmark my [skill-name] skill"
- "improve my skill overnight"
## Key Files
| File | Purpose |
|---|---|
| `benchmark/tasks.json` | Test task suite (prompts + expected qualities) |
| `benchmark/scorer.md` | LLM judge scoring rubric |
| `runner/run_experiment.md` | Autonomous loop instructions (load this next) |
| `runner/experiment_log.md` | Auto-created run log (gitignored) |
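The exact schema of the task suite is defined by `benchmark/tasks.json` itself; purely as an illustration, a single entry might look like the following (all field names here are hypothetical, not the real schema):

```json
{
  "id": "weather-basic-01",
  "prompt": "What's the weather in Berlin right now?",
  "expected_tool": "get_weather",
  "expected_qualities": [
    "calls the weather tool with the correct city",
    "answers in one concise sentence"
  ]
}
```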
## How to Run

- Read `runner/run_experiment.md` — it contains the full loop instructions
- Confirm the target skill with the user if not specified
- Execute the loop (max 3 iterations)
- Present proposed changes for human approval — never auto-apply
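The modify → test → score → keep/discard loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not OpenClaw's actual implementation: `propose_variant` and `run_benchmark` are hypothetical stand-ins for the instructions in `runner/run_experiment.md`.

```python
# Hypothetical sketch of the autooptimise loop (v0.1 caps it at 3 iterations).
MAX_ITERATIONS = 3

def optimise(skill_text, run_benchmark, propose_variant):
    """Modify -> test -> score -> keep/discard, tracking the best variant."""
    best_text = skill_text
    best_score = run_benchmark(best_text)       # baseline score
    for _ in range(MAX_ITERATIONS):
        candidate = propose_variant(best_text)  # modify
        score = run_benchmark(candidate)        # test + score
        if score > best_score:                  # keep only strict improvements
            best_text, best_score = candidate, score
        # otherwise the candidate is discarded
    # Caller must still present the resulting diff for human approval.
    return best_text, best_score
```

Note that the function only *returns* the best variant; per the safety rules, nothing is applied until a human approves the diff.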
## Scoring
Use the best available LLM judge model (prefer a strong reasoning model). Score each task 0–10 on:
- Accuracy — correct answer / correct tool called
- Conciseness — no padding, no unnecessary text
- Tool usage — right tool, right parameters
- Formatting — output matches expected format
Full rubric: `benchmark/scorer.md`
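As a sketch, assuming the four dimensions are weighted equally (the authoritative weighting lives in `benchmark/scorer.md`), a task's overall score could be aggregated like this:

```python
# Hypothetical aggregation of the judge's four 0-10 dimension scores.
# Equal weighting is an assumption, not taken from benchmark/scorer.md.
DIMENSIONS = ("accuracy", "conciseness", "tool_usage", "formatting")

def task_score(judge_scores):
    """Return the mean of the per-dimension 0-10 scores for one task."""
    for dim in DIMENSIONS:
        value = judge_scores[dim]
        if not 0 <= value <= 10:
            raise ValueError(f"{dim} score {value} is outside 0-10")
    return sum(judge_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

For example, a task judged 8 / 6 / 10 / 8 across the four dimensions would average to 8.0 under this equal-weight assumption.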
## Safety Rules
- Never auto-apply changes. Always present a diff and wait for explicit human approval.
- Never modify `benchmark/tasks.json` or `benchmark/scorer.md` during a run.
- Never exceed 3 iterations per run in v0.1.
- Log every action to `runner/experiment_log.md`.