# Multi-Model Response Comparator
Compare answers from multiple AI models for the same prompt, then summarize tradeoffs across quality, style, and likely use cases.
## When to use
- choosing between models for a workflow
- benchmarking prompt behavior
- checking whether a stronger model is worth the cost
- generating second opinions on important outputs
## Recommended runtime
This skill works with OpenAI-compatible runtimes and has been tested on Crazyrouter.
## Required output format
Always structure the final comparison with these sections (a template sketch follows the list):
- Task summary
- Models compared
- Strengths by model
- Weaknesses by model
- Best model by use case
- Cost/latency sensitivity note
- Final recommendation
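A minimal sketch of that skeleton, assuming the report is assembled as markdown; the constant name, helper function, and TODO placeholders are illustrative, not part of this skill:

```python
# Section names come from this skill; everything else here is a hypothetical helper.
REPORT_SECTIONS = [
    "Task summary",
    "Models compared",
    "Strengths by model",
    "Weaknesses by model",
    "Best model by use case",
    "Cost/latency sensitivity note",
    "Final recommendation",
]

def report_skeleton() -> str:
    """Return an empty markdown report with every required section present."""
    return "\n\n".join(f"## {name}\n\nTODO" for name in REPORT_SECTIONS)
```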
## Suggested workflow
- pick 2-4 models
- run the same prompt on each model (see the sketch after this list)
- compare structure, depth, correctness, tone, and likely latency/cost
- score or describe tradeoffs using the comparison rubric
- produce a recommendation by use case, not just one universal winner
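A minimal sketch of the first three steps, assuming an OpenAI-compatible client like the one in the Crazyrouter example below; the model identifiers and prompt are placeholders, not recommendations:

```python
from openai import OpenAI

# Any OpenAI-compatible endpoint works; see the Crazyrouter example below.
client = OpenAI(api_key="YOUR_API_KEY", base_url="https://crazyrouter.com/v1")

# Illustrative model identifiers; substitute the 2-4 models you actually want to compare.
candidate_models = ["model-a", "model-b", "model-c"]
prompt = "Draft a short support email apologizing for a delayed shipment."

responses = {}
for model in candidate_models:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],  # same prompt for every model
    )
    responses[model] = completion.choices[0].message.content

# `responses` now holds one answer per model, ready for the comparison rubric.
```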
## Comparison rules
- Use the same prompt and same success criteria for all models.
- Do not claim exact cost or latency unless the user provides them.
- If metrics are inferred, label them as likely or expected.
- Separate writing quality from factual reliability (see the record sketch after this list).
- For coding tasks, prioritize correctness, edge cases, and implementation completeness.
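One way to apply these rules is to keep a short per-model record; the field names and example values below are illustrative, not a required schema:

```python
# Hypothetical per-model record: writing quality and factual reliability are kept
# separate, and inferred metrics are worded as "likely" rather than stated as facts.
assessment = {
    "model": "model-a",  # illustrative identifier
    "writing_quality": "clear, well-structured",
    "factual_reliability": "one unverified claim flagged",
    "latency": "likely moderate (inferred, not measured)",
    "cost": "not stated by user; no exact figure claimed",
}
```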
## Example prompts
- Compare GPT, Claude, and Gemini on this support email draft.
- Run this coding prompt across three models and summarize which one is most production-ready.
- Compare low-cost vs premium models for a blog outline task.
## References
Read these when preparing the final comparison:
- references/comparison-rubric.md
- references/example-prompts.md
## Crazyrouter example
```python
from openai import OpenAI

# Point the OpenAI-compatible client at the Crazyrouter endpoint.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://crazyrouter.com/v1",
)
```
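If the runtime exposes the standard OpenAI-compatible `/v1/models` endpoint (an assumption; check your runtime), the same client can list the model identifiers available for comparison:

```python
# List model identifiers exposed by the router (assumes a standard /v1/models endpoint).
available = [m.id for m in client.models.list().data]
print(available)
```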
## Recommended artifacts
- catalog.json
- provenance.json
- market-manifest.json
- evals/evals.json