Use scripts/model_tester.py to run repeatable test prompts and compare requested vs actual model usage from OpenClaw logs.
## Run
From the skill directory (or pass absolute paths):
```shell
python3 scripts/model_tester.py --agent menial --case extract-emails
python3 scripts/model_tester.py --model openai/gpt-4.1 --case math-reasoning
python3 scripts/model_tester.py --agent chat --model openai/gpt-4.1 --case all --out /tmp/model-test.json
```
## Inputs

- `--agent <name>`: Target agent (`chat`, `menial`, `coder`, etc.)
- `--model <name>`: Requested model alias/name to test
- `--case <id|all>`: Case from `references/test-cases.json`, or `all`
- `--timeout <sec>`: Per-case timeout in seconds (default `120`)
- `--out <file>`: Optional JSON output file

At least one of `--agent` or `--model` is required.
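The "at least one of" constraint maps directly onto `argparse`; the following is a minimal sketch of how the flags above could be declared, not the script's actual argument parser:

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the documented flags.
    parser = argparse.ArgumentParser(
        description="Compare requested vs actual model usage")
    parser.add_argument("--agent", help="target agent (chat, menial, coder, ...)")
    parser.add_argument("--model", help="requested model alias/name to test")
    parser.add_argument("--case", default="all",
                        help="case id from references/test-cases.json, or 'all'")
    parser.add_argument("--timeout", type=int, default=120,
                        help="per-case timeout in seconds")
    parser.add_argument("--out", help="optional JSON output file")
    return parser

parser = build_parser()
args = parser.parse_args(["--model", "openai/gpt-4.1"])
if not (args.agent or args.model):
    # parser.error() prints usage and exits with status 2.
    parser.error("at least one of --agent or --model is required")
```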
## What the runner does

- Loads test cases from `references/test-cases.json`.
- Starts `openclaw logs --follow --json` in parallel.
- Runs `openclaw agent --json` with a bounded test prompt (the prompt asks the agent to delegate the task to a subagent).
- Parses the response plus the tailed logs.
- Emits machine-readable JSON and a short human summary.
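The run-and-parse steps above boil down to a subprocess call with a timeout whose failures become `status=error` records instead of exceptions. A simplified sketch, using a hypothetical `run_case` helper rather than the script's real internals:

```python
import json
import subprocess

def run_case(cmd, timeout=120):
    # Hypothetical helper: run a CLI command, capture its JSON stdout,
    # and convert timeouts/non-zero exits into error records.
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"status": "error", "errors": [f"timeout after {timeout}s"]}
    if proc.returncode != 0:
        return {"status": "error", "errors": [proc.stderr.strip()]}
    return {"status": "ok", "response": json.loads(proc.stdout)}
```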
## Output format

Top-level JSON fields:

- `tool`
- `timestamp`
- `agent`
- `requested_model`
- `results[]`
Each entry in `results[]` contains:

- `test_case`
- `agent`
- `requested_model`
- `actual_model` (parsed from logs when available)
- `status` (`ok` / `error`)
- `result_summary`
- `runtime_seconds`
- `tokens` (when discoverable)
- `errors[]`
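A common use of the report is routing validation: did the gateway actually serve the requested model? A small consumer sketch (field names follow the format above; the sample report and its values are illustrative):

```python
def routing_mismatches(report):
    # Return (test_case, requested, actual) for every result where the
    # actually-selected model differs from the requested one.
    mismatches = []
    for r in report.get("results", []):
        actual = r.get("actual_model")
        if actual and actual != r.get("requested_model"):
            mismatches.append((r.get("test_case"),
                               r.get("requested_model"), actual))
    return mismatches

# Illustrative report shaped like the output format described above.
report = {
    "tool": "model_tester",
    "agent": "chat",
    "requested_model": "openai/gpt-4.1",
    "results": [
        {"test_case": "math-reasoning", "requested_model": "openai/gpt-4.1",
         "actual_model": "openai/gpt-4.1", "status": "ok"},
        {"test_case": "extract-emails", "requested_model": "openai/gpt-4.1",
         "actual_model": "openai/gpt-4o-mini", "status": "ok"},
    ],
}
```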
## Privacy & Safety
The tester spawns isolated subagent tasks with predefined test prompts only — no user data is passed to models. It tails OpenClaw logs to extract:
- which model was actually selected (routing validation)
- token usage statistics
- runtime metrics
Log extraction uses regex patterns to find model/token fields. No personally identifiable information or arbitrary log content is captured — only structured fields related to the test execution.
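Extraction of this kind typically comes down to a couple of narrow regexes over each log line; an illustrative sketch (the patterns here are assumptions, not the tester's actual ones):

```python
import re

# Illustrative patterns only; real field names vary by OpenClaw/provider
# version, which is why extraction in the tester is best-effort.
MODEL_RE = re.compile(r'"model"\s*:\s*"([^"]+)"')
TOKENS_RE = re.compile(r'"(?:total_tokens|tokens)"\s*:\s*(\d+)')

def extract_fields(log_line):
    # Pull only the model name and token count; everything else in the
    # log line is ignored, so no arbitrary content is captured.
    model = MODEL_RE.search(log_line)
    tokens = TOKENS_RE.search(log_line)
    return (model.group(1) if model else None,
            int(tokens.group(1)) if tokens else None)
```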
## Notes
- Model extraction and token extraction are best-effort because log fields may vary by OpenClaw/provider version.
- If the `openclaw` config is invalid or the gateway is unavailable, the script returns `status=error` with stderr details.
- Edit `references/test-cases.json` to add custom prompts for your benchmark set.
- All test cases are generic; no workspace or user data is baked in.
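A custom entry in `references/test-cases.json` might look like the following; the field names here are illustrative guesses, so match the schema used by the entries already in the shipped file:

```json
{
  "id": "summarize-changelog",
  "prompt": "Summarize this changelog entry in one sentence: v1.2 adds an --out flag.",
  "timeout": 120
}
```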