Prompt Lab
Replaces trial-and-error prompt engineering with a structured methodology: objective definition, current prompt analysis, variant generation (instruction clarity, example strategies, output format specification), evaluation rubric design, test case creation, and failure mode identification.
Reference Files
| File | Contents | Load When |
|---|---|---|
| references/prompt-patterns.md | Prompt structure catalog: zero-shot, few-shot, CoT, persona, structured output | Always |
| references/evaluation-metrics.md | Quality metrics (accuracy, format compliance, completeness), rubric design | Evaluation needed |
| references/failure-modes.md | Common prompt failure taxonomy, detection strategies, mitigations | Failure analysis requested |
| references/output-constraints.md | Techniques for constraining LLM output format, JSON mode, schema enforcement | Format control needed |
Prerequisites
- Clear objective: what should the prompt accomplish?
- Target model (GPT-4, Claude, open-source) — prompting techniques vary by model
- Current prompt (if improving) or task description (if creating)
Workflow
Phase 1: Define Objective
- Task specification — What should the LLM produce? Be specific: "Classify customer support tickets into 5 categories" not "Handle support tickets."
- Success criteria — How do you know the output is correct? Define measurable criteria before writing any prompt.
- Failure modes — What does a bad output look like? Missing information? Wrong format? Hallucinated content? Refusal to answer?
Phase 2: Analyze Current Prompt
If an existing prompt is provided:
- Structure assessment — Is the instruction clear? Are examples provided? Is the output format specified?
- Ambiguity detection — Where could the model misinterpret the instruction?
- Missing components — What's not specified that should be? (output format, tone, length constraints, edge case handling)
- Failure mode mapping: which known failure patterns (see references/failure-modes.md) apply to this prompt?
Phase 3: Generate Variants
Create 2-4 prompt variants, each testing a different hypothesis:
| Variant Type | Hypothesis | When to Use |
|---|---|---|
| Direct instruction | Clear instruction is sufficient | Simple tasks, capable models |
| Few-shot | Examples improve output consistency | Pattern-following tasks |
| Chain-of-thought | Reasoning improves accuracy | Multi-step logic, math, analysis |
| Persona/role | Role framing improves tone/expertise | Domain-specific tasks |
| Structured output | Format specification prevents errors | JSON, CSV, specific templates |
For each variant:
- State the hypothesis (why this variant might work)
- Identify the risk (what could go wrong)
- Provide the complete prompt text
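The per-variant checklist above can be captured as a small record so that an incomplete variant is caught before evaluation. This is a minimal sketch, not part of the skill itself: the `PromptVariant` class, its field names, and `validate_variants` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVariant:
    """One candidate prompt plus the reasoning behind it (hypothetical structure)."""
    name: str
    strategy: str      # e.g. "direct", "few-shot", "chain-of-thought"
    hypothesis: str    # why this variant might work
    risk: str          # what could go wrong
    prompt_text: str   # the complete prompt, not a fragment


def validate_variants(variants: list[PromptVariant]) -> list[str]:
    """Return a list of problems; an empty list means the set looks complete."""
    problems = []
    if not 2 <= len(variants) <= 4:
        problems.append(f"expected 2-4 variants, got {len(variants)}")
    for v in variants:
        for field_name in ("hypothesis", "risk", "prompt_text"):
            if not getattr(v, field_name).strip():
                problems.append(f"{v.name}: missing {field_name}")
    return problems


variants = [
    PromptVariant(
        name="A",
        strategy="direct",
        hypothesis="A clear instruction is sufficient for this task",
        risk="Output format may drift without examples",
        prompt_text="Classify the support ticket into one of 5 categories.",
    ),
    PromptVariant(
        name="B",
        strategy="few-shot",
        hypothesis="Examples improve output consistency",
        risk="",  # left blank on purpose to show the check firing
        prompt_text="Classify the support ticket. Examples: ...",
    ),
]

print(validate_variants(variants))  # ['B: missing risk']
```

Returning a list of problems rather than raising on the first one lets all gaps be reported in a single pass.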
Phase 4: Design Evaluation
- Rubric: define weighted criteria.

  | Criterion | What It Measures | Typical Weight |
  |---|---|---|
  | Correctness | Output matches expected answer | 30-50% |
  | Format compliance | Follows specified structure | 15-25% |
  | Completeness | All required elements present | 15-25% |
  | Conciseness | No unnecessary content | 5-15% |
  | Tone/style | Matches requested voice | 5-10% |

- Test cases: at least 5, covering:
- Happy path (standard input)
- Edge cases (unusual but valid input)
- Adversarial cases (inputs designed to confuse)
- Boundary cases (minimum/maximum input)
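The rubric and test-case requirements above reduce to two small checks: a weighted sum over per-criterion scores, and a coverage check on the test suite. The sketch below assumes per-criterion scores normalized to the range 0.0 to 1.0 and picks one weight from within each typical range; the names `RUBRIC`, `weighted_score`, `coverage_gaps`, and the category labels are hypothetical.

```python
# Hypothetical rubric: weights are fractions summing to 1.0, each chosen
# from within the typical range listed for that criterion.
RUBRIC = {
    "correctness": 0.40,
    "format_compliance": 0.20,
    "completeness": 0.20,
    "conciseness": 0.10,
    "tone_style": 0.10,
}

# Hypothetical labels for the four kinds of test case.
REQUIRED_CATEGORIES = {"happy_path", "edge", "adversarial", "boundary"}


def weighted_score(scores: dict[str, float], rubric: dict[str, float] = RUBRIC) -> float:
    """Combine per-criterion scores (each 0.0-1.0) into one weighted total."""
    if abs(sum(rubric.values()) - 1.0) > 1e-9:
        raise ValueError("rubric weights must sum to 1.0")
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in rubric.items())


def coverage_gaps(case_categories: list[str]) -> list[str]:
    """Report what a test suite lacks, given the category label of each case."""
    gaps = []
    if len(case_categories) < 5:
        gaps.append(f"need at least 5 test cases, got {len(case_categories)}")
    for category in sorted(REQUIRED_CATEGORIES - set(case_categories)):
        gaps.append(f"missing category: {category}")
    return gaps


scores = {
    "correctness": 1.0,
    "format_compliance": 0.5,
    "completeness": 1.0,
    "conciseness": 0.5,
    "tone_style": 1.0,
}
print(round(weighted_score(scores), 2))                  # 0.85
print(coverage_gaps(["happy_path", "edge", "adversarial"]))
```

A 0-3 scale or pass/fail scoring can be mapped onto the same function by dividing by the maximum score first.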
Phase 5: Output
Present variants, rubric, and test cases in a structured format ready for execution.
Output Format
## Prompt Lab: {Task Name}
### Objective
{What the prompt should achieve — specific and measurable}
### Success Criteria
- [ ] {Criterion 1 — measurable}
- [ ] {Criterion 2 — measurable}
### Current Prompt Analysis
{If existing prompt provided}
- **Strengths:** {what works}
- **Weaknesses:** {what fails or is ambiguous}
- **Missing:** {what's not specified}
### Variants
#### Variant A: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
#### Variant B: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
#### Variant C: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
### Evaluation Rubric
| Criterion | Weight | Scoring |
|-----------|--------|---------|
| {criterion} | {%} | {how to score: 0-3 scale or pass/fail} |
### Test Cases
| # | Input | Expected Output | Criteria Tested |
|---|-------|-----------------|---------------|
| 1 | {standard input} | {expected} | Correctness, Format |
| 2 | {edge case} | {expected} | Completeness |
| 3 | {adversarial} | {expected} | Robustness |
### Failure Modes to Monitor
- {Failure mode 1}: {detection method}
- {Failure mode 2}: {detection method}
### Recommended Next Steps
1. Run all variants against the test suite
2. Score using the rubric
3. Select the highest-scoring variant
4. Iterate on the winner with targeted improvements
Calibration Rules
- One variable per variant. Each variant should change ONE thing from the baseline. Changing instruction style AND examples AND format simultaneously makes results uninterpretable.
- Test before declaring success. A prompt that works on 3 examples may fail on the 4th. Minimum 5 diverse test cases before concluding a variant works.
- Failure modes are more valuable than successes. Understanding WHY a prompt fails guides improvement more than confirming it works.
- Model-specific optimization. A prompt optimized for GPT-4 may not work for Claude or Llama. Always note the target model.
- Simplest effective prompt wins. If a zero-shot prompt scores as well as a few-shot prompt, use the zero-shot. Fewer tokens mean lower cost and lower latency.
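The "simplest effective prompt wins" rule can be applied mechanically once every variant has a rubric score. The sketch below is one possible reading, under two assumptions that are not from the skill: scores within a `tolerance` of 0.02 count as "scoring as well", and whitespace-separated word count stands in for token count. `pick_winner` is a hypothetical helper.

```python
def pick_winner(results: dict[str, tuple[float, str]], tolerance: float = 0.02) -> str:
    """Pick the shortest prompt among variants scoring within `tolerance` of the best.

    `results` maps variant name -> (rubric score, complete prompt text).
    Word count is a rough proxy for tokens; a real tokenizer would be more exact.
    """
    best = max(score for score, _ in results.values())
    contenders = [
        name for name, (score, _) in results.items() if best - score <= tolerance
    ]
    return min(contenders, key=lambda name: len(results[name][1].split()))


results = {
    "zero-shot": (0.85, "Classify the ticket into one of 5 categories."),
    "few-shot": (
        0.86,
        "Classify the ticket into one of 5 categories. "
        "Example 1: ... Example 2: ... Example 3: ...",
    ),
    "chain-of-thought": (0.70, "Think step by step, then classify the ticket."),
}

print(pick_winner(results))                 # zero-shot
print(pick_winner(results, tolerance=0.0))  # few-shot
```

With the default tolerance, the zero-shot variant wins despite a marginally lower score because it is shorter; setting the tolerance to zero reverts to picking the strictly highest score.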
Error Handling
| Problem | Resolution |
|---|---|
| No clear objective | Ask the user to define what "good output" looks like with 2-3 examples. |
| Prompt is for a task LLMs handle unreliably (exact arithmetic, counting) | Flag the limitation. Suggest tool-augmented approaches or pre/post-processing. |
| Too many variables to test | Focus on the highest-impact variable first. Iterative refinement beats combinatorial testing. |
| No existing prompt to analyze | Start with the simplest possible prompt. The first variant IS the baseline. |
| Output format requirements are strict | Use structured output mode (JSON mode, function calling) instead of prompt-only constraints. |
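Where format requirements are strict, a post-check on the raw output is a simple way to score the format compliance criterion, whether or not a structured output mode is in use. The sketch below uses only the standard-library `json` module and calls no model API; `check_format` and the example keys `category` and `confidence` are hypothetical.

```python
import json


def check_format(raw_output: str, required_keys: set[str]) -> tuple[bool, str]:
    """Return (passed, reason) for a model output expected to be a JSON object."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value is not a JSON object"
    missing = required_keys - set(data)
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"


required = {"category", "confidence"}

print(check_format('{"category": "billing", "confidence": 0.9}', required))
print(check_format('{"category": "billing"}', required))
print(check_format("Sure! Here is the JSON you asked for.", required))
```

The reason string doubles as a detection note for the failure modes list, for example distinguishing prose-wrapped output from output that is valid JSON but incomplete.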
When NOT to Use
Push back if:
- The task doesn't need an LLM (deterministic rules, regex, SQL) — use the right tool
- The user wants prompt execution, not design — this skill designs and evaluates, it doesn't run prompts
- The prompt is for safety-critical decisions without human review — LLM output should not be the sole input