Prompt Engineer Toolkit - Production Prompt Engineering
Tier: POWERFUL
Category: Engineering
Tags: prompt engineering, chain-of-thought, few-shot, evaluation, testing, prompt versioning
Overview
Prompt Engineer Toolkit provides the complete lifecycle for production prompts: design patterns that work, testing frameworks that catch regressions, versioning systems that track changes, and evaluation rubrics that replace subjective "looks good" with measurable quality. This is not about clever tricks -- it is about treating prompts as production code with the same rigor.
Core Prompt Patterns
- System Prompt Architecture
Every production prompt has a layered structure. Order matters.
┌──────────────────────────────────────┐
│ Layer 1: Identity & Role             │
│ Who the model is                     │
│ "You are a senior code reviewer..."  │
├──────────────────────────────────────┤
│ Layer 2: Capabilities & Constraints  │
│ What it can and cannot do            │
│ "You can read files, run tests..."   │
├──────────────────────────────────────┤
│ Layer 3: Output Format               │
│ How to structure responses           │
│ "Always respond with JSON..."        │
├──────────────────────────────────────┤
│ Layer 4: Quality Standards           │
│ What good output looks like          │
│ "Include edge cases, cite sources"   │
├──────────────────────────────────────┤
│ Layer 5: Anti-Patterns               │
│ What to avoid                        │
│ "Never fabricate citations..."       │
├──────────────────────────────────────┤
│ Layer 6: Examples                    │
│ Calibration via demonstration        │
│ "Here is an example..."              │
└──────────────────────────────────────┘
Layer Design Principles
| Layer | Principle | Common Mistake |
| --- | --- | --- |
| Identity | Be specific about expertise level | "You are an AI assistant" (too generic) |
| Capabilities | Enumerate, don't imply | Assuming the model knows available tools |
| Output Format | Show exact schema | Describing format in prose instead of schema |
| Quality Standards | Quantify when possible | "Be thorough" (unquantifiable) |
| Anti-Patterns | State the actual failure mode | "Don't be wrong" (useless) |
| Examples | Show edge cases, not just happy path | Only showing trivial examples |
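One way to keep the layers reviewable is to store them as named components and assemble the final system prompt in a fixed order. A minimal sketch, assuming illustrative layer text and a hypothetical `build_system_prompt` helper (not part of this toolkit):

```python
# Sketch: assemble a system prompt from the six layers so each layer can be
# reviewed, versioned, and tested independently. Layer contents are examples.
LAYERS = {
    "identity": "You are a senior code reviewer with 10+ years of Python experience.",
    "capabilities": "You can read files and run tests. You cannot access the network.",
    "output_format": "Always respond with a JSON object matching the provided schema.",
    "quality_standards": "Cover edge cases. Cite the file and line for every finding.",
    "anti_patterns": "Never fabricate citations. Never report issues you cannot locate.",
    "examples": "Example input/output pairs follow below.",
}

def build_system_prompt(layers: dict) -> str:
    """Join layers in the fixed order; order matters for instruction priority."""
    order = ["identity", "capabilities", "output_format",
             "quality_standards", "anti_patterns", "examples"]
    return "\n\n".join(layers[name] for name in order)

print(build_system_prompt(LAYERS))
```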
- Chain-of-Thought (CoT) Patterns
Standard CoT
Think through this step by step:
- First, identify [what needs to be analyzed]
- Then, evaluate [specific criteria]
- Finally, synthesize [the conclusion]
Show your reasoning for each step.
When to use: Complex reasoning, math, multi-step logic
When NOT to use: Simple classification, formatting tasks, creative writing
Structured CoT with Scratchpad
Use the following reasoning process:
<scratchpad>
- List relevant facts
- Identify applicable rules
- Work through the logic
- Check for edge cases
</scratchpad>
Then provide your final answer outside the scratchpad tags.
Advantage: the model can reason messily inside the scratchpad while the final output stays clean.
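If you strip the scratchpad programmatically before the response reaches users, the reasoning stays internal. A minimal sketch, assuming the `<scratchpad>` tag from the prompt above:

```python
import re

def strip_scratchpad(response: str) -> str:
    """Remove <scratchpad>...</scratchpad> blocks, returning only the final answer."""
    cleaned = re.sub(r"<scratchpad>.*?</scratchpad>", "", response, flags=re.DOTALL)
    return cleaned.strip()

raw = "<scratchpad>fact A, rule B, therefore C</scratchpad>\nFinal answer: C"
print(strip_scratchpad(raw))  # -> "Final answer: C"
```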
Self-Consistency CoT
Solve this problem three different ways, then compare your answers. If all three agree, that's your answer. If they disagree, identify which approach is most reliable and explain why.
When to use: High-stakes decisions where correctness matters more than speed. Cost: 3x token usage. Use selectively.
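Self-consistency can also be run outside the prompt by sampling the model several times and voting on the extracted answers. A sketch, assuming a `call_model(prompt, temperature)` wrapper you supply for your own provider:

```python
from collections import Counter

def self_consistent_answer(prompt: str, call_model, n: int = 3, temperature: float = 0.7) -> str:
    """Sample the model n times and return the most common answer.

    `call_model(prompt, temperature)` is assumed to return the answer as a string;
    plug in your own API client here.
    """
    answers = [call_model(prompt, temperature=temperature) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    if count == 1:
        # No agreement across samples: escalate for review rather than guessing.
        raise ValueError(f"No consensus across {n} samples: {answers}")
    return answer
```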
- Few-Shot Design
Shot Selection Criteria
| Criterion | Good Example | Bad Example |
| --- | --- | --- |
| Representative | Covers typical input pattern | Only edge cases |
| Diverse | Different input types/lengths | All same structure |
| Edge-covering | Includes tricky cases | Only happy path |
| Output-calibrating | Shows desired detail level | Overly verbose or terse |
| Ordered | Simple → complex progression | Random order |
Few-Shot Template
Here are examples of the expected input and output:
Example 1 (simple case):
Input: [simple input]
Output: [simple output with annotation]

Example 2 (typical case):
Input: [typical input]
Output: [typical output with annotation]

Example 3 (edge case):
Input: [tricky input]
Output: [correct handling with annotation]

Now process this:
Input: {user_input}
Output:
Dynamic Few-Shot Selection
For production systems with thousands of candidate examples, select shots dynamically (a sketch follows the list):
- Embed all examples
- Embed the current input
- Find K nearest examples by embedding similarity
- Include those K examples as shots
- Typical K: 3-5 (diminishing returns after 5)
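A minimal sketch of the selection step, assuming you supply an `embed(text)` function and have precomputed an embedding for each example; cosine similarity over the embeddings is one common choice:

```python
import numpy as np

def select_shots(input_text: str, examples: list, embed, k: int = 3) -> list:
    """Return the k examples most similar to the input by cosine similarity.

    Each example is assumed to look like:
        {"input": ..., "output": ..., "embedding": np.ndarray}
    `embed(text)` is your embedding call (not specified by this toolkit).
    """
    query = embed(input_text)
    query = query / np.linalg.norm(query)

    def score(example):
        vec = example["embedding"] / np.linalg.norm(example["embedding"])
        return float(np.dot(query, vec))

    return sorted(examples, key=score, reverse=True)[:k]
```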
- Output Structuring Patterns
JSON Mode with Schema
Respond with a JSON object matching this exact schema:
{ "analysis": { "summary": "string - one sentence summary", "severity": "string - one of: critical, high, medium, low", "findings": [ { "issue": "string - description of the issue", "location": "string - file:line", "fix": "string - recommended fix", "confidence": "number - 0.0 to 1.0" } ], "overall_score": "number - 0 to 100" } }
Rules:
- findings array must have at least one entry
- confidence must reflect actual certainty, not optimism
- overall_score: 90-100 (excellent), 70-89 (good), 50-69 (needs work), <50 (poor)
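On the consuming side, validate the response before trusting it. A minimal sketch using only the standard library; the field names and range checks mirror the schema and rules above:

```python
import json

def validate_analysis(raw: str) -> dict:
    """Parse the model's JSON response and enforce the schema rules above."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    analysis = data["analysis"]
    assert analysis["severity"] in {"critical", "high", "medium", "low"}
    assert len(analysis["findings"]) >= 1, "findings must have at least one entry"
    for finding in analysis["findings"]:
        assert 0.0 <= finding["confidence"] <= 1.0
    assert 0 <= analysis["overall_score"] <= 100
    return analysis
```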
Structured Reasoning with Sections
Structure your response with these exact sections:
Assessment
[1-2 sentence bottom line]
Evidence
[Specific observations supporting the assessment]
Risks
[What could go wrong, with likelihood estimates]
Recommendation
[Specific actionable next steps with owners]
- Prompt Decomposition
Complex prompts that try to do everything fail. Decompose them.
Single Responsibility Prompts
| Bad (monolithic) | Good (decomposed) |
| --- | --- |
| "Review this code for bugs, style, performance, security, and suggest improvements" | Prompt 1: "Identify bugs" / Prompt 2: "Check style" / Prompt 3: "Find performance issues" / Prompt 4: "Security audit" / Prompt 5: "Synthesize findings" |
Pipeline Pattern
Prompt 1 (Extract): Input → structured data
Prompt 2 (Analyze): Structured data → findings
Prompt 3 (Synthesize): Findings → recommendation
Prompt 4 (Format): Recommendation → user-facing output
Each prompt is testable independently. A failure in Prompt 2 doesn't require re-running Prompt 1.
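A sketch of the pipeline as plain functions; the prompt constants are placeholders and `call_model(system_prompt, user_input)` is a wrapper you provide:

```python
# Hypothetical single-responsibility prompts; replace with your real prompt text.
EXTRACT_PROMPT = "Extract the structured fields from the input..."
ANALYZE_PROMPT = "Analyze the structured data and list findings..."
SYNTHESIZE_PROMPT = "Synthesize the findings into one recommendation..."
FORMAT_PROMPT = "Format the recommendation for the end user..."

def run_pipeline(raw_input: str, call_model) -> str:
    """Chain the four single-responsibility prompts.

    Each stage is independently testable; cache intermediate outputs so a
    failure in stage 2 does not require re-running stage 1.
    """
    extracted = call_model(EXTRACT_PROMPT, raw_input)           # stage 1: Extract
    findings = call_model(ANALYZE_PROMPT, extracted)            # stage 2: Analyze
    recommendation = call_model(SYNTHESIZE_PROMPT, findings)    # stage 3: Synthesize
    return call_model(FORMAT_PROMPT, recommendation)            # stage 4: Format
```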
- Calibration Techniques
Temperature Guidelines
| Task Type | Temperature | Rationale |
| --- | --- | --- |
| Code generation | 0.0-0.2 | Correctness > creativity |
| Classification | 0.0 | Deterministic output expected |
| Analysis/reasoning | 0.2-0.5 | Some flexibility in framing |
| Creative writing | 0.7-1.0 | Diversity of expression |
| Brainstorming | 0.8-1.2 | Maximum variety |
Confidence Calibration
For each finding, rate your confidence:
Confidence levels:
- VERIFIED: I can point to specific evidence in the provided context
- LIKELY: Strong inference from available information
- UNCERTAIN: Reasonable guess, but limited evidence
- SPECULATIVE: Possible but I'm reaching
Never state SPECULATIVE findings as VERIFIED.
Prompt Testing Framework
Test Case Design
Every production prompt needs a test suite.
Test Case Structure
{ "test_id": "classify-urgent-001", "input": "Server is down, customers can't access the product", "expected": { "contains": ["critical", "immediate"], "not_contains": ["low priority", "can wait"], "format_regex": "^\{.*\}$", "max_tokens": 500, "required_fields": ["severity", "category"] }, "tags": ["classification", "urgency", "happy-path"] }
Test Suite Composition
| Category | % of Suite | Purpose |
| --- | --- | --- |
| Happy path | 40% | Confirm basic functionality works |
| Edge cases | 30% | Boundary conditions, unusual inputs |
| Adversarial | 15% | Inputs designed to break the prompt |
| Regression | 15% | Cases that previously failed |
Evaluation Rubric
Automated Scoring
| Dimension | Measurement | Weight |
| --- | --- | --- |
| Adherence | Contains required elements, matches schema | 30% |
| Accuracy | Correct classification/analysis/answer | 30% |
| Safety | No forbidden content, no hallucinations | 20% |
| Format | Matches expected structure, length bounds | 10% |
| Relevance | Response addresses the actual input | 10% |
Scoring Formula
score = (adherence * 0.30) + (accuracy * 0.30) + (safety * 0.20) + (format * 0.10) + (relevance * 0.10)
Pass: score >= 0.80
Warning: 0.70 <= score < 0.80
Fail: score < 0.70
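The same formula and thresholds as a small helper; dimension scores are assumed to be normalized to the 0.0-1.0 range:

```python
WEIGHTS = {"adherence": 0.30, "accuracy": 0.30, "safety": 0.20, "format": 0.10, "relevance": 0.10}

def overall_score(dimensions: dict) -> tuple:
    """Compute the weighted score and map it to pass/warn/fail."""
    score = sum(dimensions[name] * weight for name, weight in WEIGHTS.items())
    if score >= 0.80:
        verdict = "pass"
    elif score >= 0.70:
        verdict = "warn"
    else:
        verdict = "fail"
    return score, verdict

print(overall_score({"adherence": 0.9, "accuracy": 0.85, "safety": 1.0, "format": 0.8, "relevance": 0.9}))
# -> (0.895, 'pass')
```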
Regression Testing Protocol
- Before any prompt change:
  - Run full test suite against current prompt (baseline)
  - Record scores per test case
- After the prompt change:
  - Run same test suite against new prompt (candidate)
  - Compare scores per test case
- Acceptance criteria (a checking sketch follows this list):
  - Average score: candidate >= baseline
  - No individual test case drops by more than 10%
  - Zero safety violations (any safety failure = reject)
- If criteria are met: promote the candidate
- If criteria are not met: iterate on the prompt or reject
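A sketch of the acceptance check, comparing per-test scores between the baseline and candidate runs; the score dicts map test_id to a 0.0-1.0 score and `safety_violations` is a count you supply from the candidate run:

```python
def accept_candidate(baseline: dict, candidate: dict, safety_violations: int = 0) -> tuple:
    """Apply the criteria: average >= baseline, no test drops >10%, zero safety failures."""
    reasons = []
    if safety_violations > 0:
        reasons.append(f"{safety_violations} safety violation(s): automatic reject")

    base_avg = sum(baseline.values()) / len(baseline)
    cand_avg = sum(candidate.values()) / len(candidate)
    if cand_avg < base_avg:
        reasons.append(f"average dropped: {base_avg:.3f} -> {cand_avg:.3f}")

    for test_id, base_score in baseline.items():
        cand_score = candidate.get(test_id, 0.0)
        if base_score > 0 and (base_score - cand_score) / base_score > 0.10:
            reasons.append(f"{test_id} dropped more than 10%: {base_score:.2f} -> {cand_score:.2f}")

    return (not reasons), reasons
```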
Prompt Versioning
Version Control Strategy
prompts/
├── support-classifier/
│   ├── v1.txt            # Original version
│   ├── v2.txt            # Added edge case handling
│   ├── v3.txt            # Current production
│   ├── changelog.md      # Change log with rationale
│   └── tests/
│       ├── suite.json    # Test cases
│       └── baselines/
│           ├── v1-results.json
│           ├── v2-results.json
│           └── v3-results.json
├── code-reviewer/
│   ├── v1.txt
│   └── ...
Changelog Format
v3 (2026-03-09)
Author: borghei
Change: Added explicit handling for multi-language inputs
Reason: v2 defaulted to English analysis for non-English code comments
Test results: Average score 0.87 (v2 was 0.82). No regressions.
Rollback plan: Revert to v2.txt
v2 (2026-02-15)
Author: borghei
Change: Added structured output format with JSON schema
Reason: Downstream parser needed consistent format
Test results: Average score 0.82 (v1 was 0.79). Format compliance 100% (v1 was 73%).
Prompt Diff Analysis
Before deploying a new version, always diff it against the current production version (a diff sketch follows the questions below).
Key questions for prompt diffs:
- Were any constraints removed? (Risk: safety regression)
- Were any examples changed? (Risk: calibration shift)
- Was the output format changed? (Risk: downstream parser breaks)
- Were any anti-patterns removed? (Risk: known failure modes return)
- Is the new prompt longer? (Risk: context budget impact)
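A minimal sketch for producing the diff with the standard library; the example paths follow the versioning layout above, and answering the review questions remains a human step:

```python
import difflib
from pathlib import Path

def prompt_diff(old_path: str, new_path: str) -> str:
    """Return a unified diff between two prompt versions for manual review."""
    old_lines = Path(old_path).read_text().splitlines(keepends=True)
    new_lines = Path(new_path).read_text().splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines,
                                        fromfile=old_path, tofile=new_path))

print(prompt_diff("prompts/support-classifier/v2.txt", "prompts/support-classifier/v3.txt"))
```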
Common Prompt Failure Modes
| Failure Mode | Symptom | Fix |
| --- | --- | --- |
| Instruction override | Model ignores constraints | Move constraints earlier, add "CRITICAL:" prefix |
| Format drift | Output structure varies between calls | Add JSON schema, reduce temperature |
| Sycophancy | Model agrees with wrong premise | Add "Challenge assumptions" instruction |
| Verbosity bloat | Output too long, buries the answer | Add word/token limits, "be concise" |
| Hallucination | Fabricated facts, citations, or code | Add "Only reference provided context" |
| Anchoring | First example dominates output style | Diversify examples, add "each input is independent" |
| Lost in the middle | Middle instructions get ignored | Front-load and back-load critical instructions |
Workflows
Workflow 1: Design a Production Prompt
- Define the task precisely (input type, output type, quality criteria)
- Write the system prompt using the 6-layer architecture
- Create 10+ test cases (40% happy, 30% edge, 15% adversarial, 15% regression)
- Run test suite, score results
- Iterate until passing threshold (0.80+)
- Version as v1, record baseline scores
- Deploy with monitoring
Workflow 2: Debug a Degraded Prompt
- Identify which test cases are failing
- Categorize failures (format? accuracy? safety? relevance?)
- Check: did the model change? (API version, model update)
- Check: did the input distribution change? (new edge cases)
- Check: was the prompt modified? (diff against last known good)
- Fix the root cause (not the symptom)
- Run full regression suite before deploying fix
Workflow 3: Migrate Prompt to New Model
- Run full test suite on current model (baseline)
- Run same suite on new model (no prompt changes)
- Compare: if scores are equivalent, done
- If scores drop: identify which dimensions degraded
- Adjust prompt for new model's behavior patterns
- Re-run suite until scores meet or exceed baseline
- Document model-specific adjustments in changelog
Integration Points
Skill Integration
- self-improving-agent: Prompts that degrade are a regression signal; test them
- agent-designer: Agent system prompts are the highest-stakes prompts to test
- context-engine: Context retrieval quality directly affects prompt effectiveness
- ab-test-setup: A/B test prompt variants in production with statistical rigor
References
- references/prompt-patterns-catalog.md - Complete catalog of prompting techniques with examples
- references/evaluation-rubric-templates.md - Reusable evaluation rubrics by task type
- references/model-specific-behaviors.md - Known behavior differences across model families