# Senior Prompt Engineer
Prompt engineering patterns, LLM evaluation frameworks, and agentic system design.
## Table of Contents

- Quick Start
- Tools Overview
  - Prompt Optimizer
  - RAG Evaluator
  - Agent Orchestrator
- Prompt Engineering Workflows
  - Prompt Optimization Workflow
  - Few-Shot Example Design Workflow
  - Structured Output Design Workflow
- Reference Documentation
- Common Patterns Quick Reference
- Common Commands
## Quick Start

```bash
# Analyze and optimize a prompt file
python scripts/prompt_optimizer.py prompts/my_prompt.txt --analyze

# Evaluate RAG retrieval quality
python scripts/rag_evaluator.py --contexts contexts.json --questions questions.json

# Visualize agent workflow from definition
python scripts/agent_orchestrator.py agent_config.yaml --visualize
```
## Tools Overview

### Prompt Optimizer

Analyzes prompts for token efficiency, clarity, and structure. Generates optimized versions.

**Input:** Prompt text file or string
**Output:** Analysis report with optimization suggestions
Usage:

```bash
# Analyze a prompt file
python scripts/prompt_optimizer.py prompt.txt --analyze
```

Output:

```
Token count: 847
Estimated cost: $0.0025 (GPT-4)
Clarity score: 72/100

Issues found:
- Ambiguous instruction at line 3
- Missing output format specification
- Redundant context (lines 12-15 repeat lines 5-8)

Suggestions:
1. Add explicit output format: "Respond in JSON with keys: ..."
2. Remove redundant context to save 89 tokens
3. Clarify "analyze" -> "list the top 3 issues with severity ratings"
```
```bash
# Generate optimized version
python scripts/prompt_optimizer.py prompt.txt --optimize --output optimized.txt

# Count tokens for cost estimation
python scripts/prompt_optimizer.py prompt.txt --tokens --model gpt-4

# Extract and manage few-shot examples
python scripts/prompt_optimizer.py prompt.txt --extract-examples --output examples.json
```
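For a quick sanity check outside the script, token counting can be reproduced directly with the `tiktoken` package. A minimal sketch, assuming `tiktoken` is installed; the per-1K-token rate is illustrative, not a live price:

```python
# Minimal sketch: count tokens and estimate cost with tiktoken.
# The $0.03 / 1K-token rate is illustrative, not a current price.
import tiktoken

def estimate_cost(text: str, model: str = "gpt-4", usd_per_1k: float = 0.03):
    enc = tiktoken.encoding_for_model(model)  # tokenizer matching the model
    n_tokens = len(enc.encode(text))
    return n_tokens, n_tokens / 1000 * usd_per_1k

with open("prompt.txt") as f:
    tokens, cost = estimate_cost(f.read())
print(f"Token count: {tokens}, estimated cost: ${cost:.4f}")
```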
### RAG Evaluator

Evaluates Retrieval-Augmented Generation quality by measuring context relevance and answer faithfulness.

**Input:** Retrieved contexts (JSON) and questions/answers
**Output:** Evaluation metrics and quality report

Usage:

```bash
# Evaluate retrieval quality
python scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json
```
Output:

```
=== RAG Evaluation Report ===
Questions evaluated: 50

Retrieval Metrics:
  Context Relevance: 0.78 (target: >0.80)
  Retrieval Precision@5: 0.72
  Coverage: 0.85

Generation Metrics:
  Answer Faithfulness: 0.91
  Groundedness: 0.88

Issues Found:
- 8 questions had no relevant context in top-5
- 3 answers contained information not in context

Recommendations:
1. Improve chunking strategy for technical documents
2. Add metadata filtering for date-sensitive queries
```
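On recommendation 1: a common first fix is an overlapping sliding window, so chunk boundaries don't cut a technical passage off from its context. A minimal character-based sketch; the sizes are illustrative, and production chunkers usually split on tokens or sentences:

```python
# Minimal sketch: sliding-window chunking with overlap.
# Character-based for brevity; real pipelines often split on tokens or sentences.
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    chunks, start = [], 0
    step = size - overlap  # each window re-includes the tail of the previous one
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks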
```bash
# Evaluate with custom metrics
python scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json \
    --metrics relevance,faithfulness,coverage

# Export detailed results
python scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json \
    --output report.json --verbose
```
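For reference, the Precision@5 figure reported above is straightforward to compute yourself. A minimal sketch, assuming each eval item records retrieved chunk ids and human-judged relevant ids; the field names here are hypothetical, not the script's schema:

```python
# Minimal sketch: mean Precision@k over an eval set.
# The "retrieved"/"relevant" field names are assumed, not the tool's format.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

eval_set = [
    {"retrieved": ["d1", "d7", "d3", "d9", "d2"], "relevant": {"d1", "d3"}},
    {"retrieved": ["d4", "d5", "d8", "d6", "d0"], "relevant": {"d5"}},
]
scores = [precision_at_k(q["retrieved"], q["relevant"]) for q in eval_set]
print(f"Precision@5: {sum(scores) / len(scores):.2f}")
```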
### Agent Orchestrator

Parses agent definitions and visualizes execution flows. Validates tool configurations.

**Input:** Agent configuration (YAML/JSON)
**Output:** Workflow visualization, validation report

Usage:

```bash
# Validate agent configuration
python scripts/agent_orchestrator.py agent.yaml --validate
```
Output:

```
=== Agent Validation Report ===
Agent: research_assistant
Pattern: ReAct

Tools (4 registered):
  [OK]   web_search  - API key configured
  [OK]   calculator  - No config needed
  [WARN] file_reader - Missing allowed_paths
  [OK]   summarizer  - Prompt template valid

Flow Analysis:
  Max depth: 5 iterations
  Estimated tokens/run: 2,400-4,800
  Potential infinite loop: No

Recommendations:
1. Add allowed_paths to file_reader for security
2. Consider adding an early exit condition for simple queries
```
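The "max depth" and "infinite loop" checks refer to the iteration guard on the ReAct loop. The sketch below is not the orchestrator's internal code, just the generic loop shape being validated, with a hard depth cap and an early exit:

```python
# Generic ReAct loop sketch (not the orchestrator's internals).
# A hard depth cap plus an early exit is what prevents infinite loops.
from dataclasses import dataclass

@dataclass
class Thought:
    is_final: bool
    answer: str = ""
    tool: str = ""
    tool_input: str = ""

MAX_DEPTH = 5  # the "max depth" reported above

def run_agent(query: str, think, tools: dict) -> str:
    observations = [f"question: {query}"]
    for _ in range(MAX_DEPTH):
        thought = think(observations)        # an LLM call in a real agent
        if thought.is_final:                 # early exit for simple queries
            return thought.answer
        result = tools[thought.tool](thought.tool_input)
        observations.append(f"{thought.tool} -> {result}")  # observe step
    return "stopped: max depth reached"      # loop guard

# Toy policy: call the calculator once, then answer with the observation.
def toy_think(obs):
    if len(obs) == 1:
        return Thought(is_final=False, tool="calculator", tool_input="2+2")
    return Thought(is_final=True, answer=obs[-1])

print(run_agent("what is 2+2?", toy_think,
                {"calculator": lambda e: str(eval(e))}))  # eval: toy only
```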
```bash
# Visualize agent workflow (ASCII)
python scripts/agent_orchestrator.py agent.yaml --visualize
```

Output:

```
┌─────────────────────────────────────────┐
│ research_assistant │
│ (ReAct Pattern) │
└─────────────────┬───────────────────────┘
│
┌────────▼────────┐
│ User Query │
└────────┬────────┘
│
┌────────▼────────┐
│ Think │◄──────┐
└────────┬────────┘ │
│ │
┌────────▼────────┐ │
│ Select Tool │ │
└────────┬────────┘ │
│ │
┌─────────────┼─────────────┐ │
▼ ▼ ▼ │
[web_search] [calculator] [file_reader]
│ │ │ │
└─────────────┼─────────────┘ │
│ │
┌────────▼────────┐ │
│ Observe │───────┘
└────────┬────────┘
│
┌────────▼────────┐
│ Final Answer │
└─────────────────┘
```
```bash
# Export workflow as Mermaid diagram
python scripts/agent_orchestrator.py agent.yaml --visualize --format mermaid
```
## Prompt Engineering Workflows

### Prompt Optimization Workflow

Use when improving an existing prompt's performance or reducing token costs.

**Step 1: Baseline the current prompt**

```bash
python scripts/prompt_optimizer.py current_prompt.txt --analyze --output baseline.json
```
**Step 2: Identify issues**

Review the analysis report for:

- Token waste (redundant instructions, verbose examples)
- Ambiguous instructions (unclear output format, vague verbs)
- Missing constraints (no length limits, no format specification)
**Step 3: Apply optimization patterns**

| Issue | Pattern to Apply |
|---|---|
| Ambiguous output | Add explicit format specification |
| Too verbose | Extract to few-shot examples |
| Inconsistent results | Add role/persona framing |
| Missing edge cases | Add constraint boundaries |
**Step 4: Generate an optimized version**

```bash
python scripts/prompt_optimizer.py current_prompt.txt --optimize --output optimized.txt
```

**Step 5: Compare results**

```bash
python scripts/prompt_optimizer.py optimized.txt --analyze --compare baseline.json
```

Shows token reduction, clarity improvement, and issues resolved.

**Step 6: Validate with test cases**

Run both prompts against your evaluation set and compare outputs, as in the sketch below.
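A minimal comparison harness for Step 6, assuming a `call_model` client and a task-specific `score` function that you supply; both are placeholders, not part of this toolkit:

```python
# Minimal A/B harness sketch. `call_model` and `score` are placeholders
# you supply; they are not part of this toolkit.
def compare_prompts(prompt_a: str, prompt_b: str, cases: list[dict],
                    call_model, score) -> dict:
    wins = {"a": 0, "b": 0, "tie": 0}
    for case in cases:
        out_a = call_model(prompt_a.format(**case["inputs"]))
        out_b = call_model(prompt_b.format(**case["inputs"]))
        s_a = score(out_a, case["expected"])
        s_b = score(out_b, case["expected"])
        wins["a" if s_a > s_b else "b" if s_b > s_a else "tie"] += 1
    return wins
```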
### Few-Shot Example Design Workflow

Use when creating examples for in-context learning.

**Step 1: Define the task clearly**

```
Task: Extract product entities from customer reviews
Input: Review text
Output: JSON with {product_name, sentiment, features_mentioned}
```
**Step 2: Select diverse examples (3-5 recommended)**

| Example Type | Purpose |
|---|---|
| Simple case | Shows the basic pattern |
| Edge case | Handles ambiguity |
| Complex case | Multiple entities |
| Negative case | What NOT to extract |
**Step 3: Format consistently**

```
Example 1:
Input: "Love my new iPhone 15, the camera is amazing!"
Output: {"product_name": "iPhone 15", "sentiment": "positive", "features_mentioned": ["camera"]}

Example 2:
Input: "The laptop was okay but battery life is terrible."
Output: {"product_name": "laptop", "sentiment": "mixed", "features_mentioned": ["battery life"]}
```
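Keeping the format identical across examples is easier when the block is rendered from data rather than hand-edited. A minimal sketch, assuming an examples list shaped like the `--extract-examples` output above; the exact field names are an assumption:

```python
# Minimal sketch: render consistently formatted few-shot blocks from data.
# The "input"/"output" field names are assumed, not a documented schema.
import json

def render_examples(examples: list[dict]) -> str:
    blocks = []
    for i, ex in enumerate(examples, 1):
        blocks.append(
            f"Example {i}:\n"
            f"Input: \"{ex['input']}\"\n"
            f"Output: {json.dumps(ex['output'])}"
        )
    return "\n\n".join(blocks)  # one blank line between examples
```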
**Step 4: Validate example quality**

```bash
python scripts/prompt_optimizer.py prompt_with_examples.txt --validate-examples
```

Checks consistency, coverage, and format alignment.

**Step 5: Test with held-out cases**

Ensure the model generalizes beyond your examples.
### Structured Output Design Workflow

Use when you need reliable JSON/XML/structured responses.

**Step 1: Define the schema**

```json
{
  "type": "object",
  "properties": {
    "summary": {"type": "string", "maxLength": 200},
    "sentiment": {"enum": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "required": ["summary", "sentiment"]
}
```
**Step 2: Include the schema in the prompt**

```
Respond with JSON matching this schema:
- summary (string, max 200 chars): Brief summary of the content
- sentiment (enum): One of "positive", "negative", "neutral"
- confidence (number 0-1): Your confidence in the sentiment
```
**Step 3: Add format enforcement**

```
IMPORTANT: Respond ONLY with valid JSON. No markdown, no explanation.
Start your response with { and end with }
```
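Even with enforcement, models occasionally wrap the JSON in markdown fences or add a stray preamble, so a tolerant parser on the consuming side is a cheap safeguard. A minimal sketch:

```python
# Minimal sketch: tolerant JSON extraction for model responses that ignore
# the "JSON only" instruction and add fences or a preamble.
import json
import re

def parse_json_response(text: str) -> dict:
    match = re.search(r"\{.*\}", text, re.DOTALL)  # outermost {...} span
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))
```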
**Step 4: Validate outputs**

```bash
python scripts/prompt_optimizer.py structured_prompt.txt --validate-schema schema.json
```
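Outputs can also be checked directly against the Step 1 schema with the third-party `jsonschema` package, independent of whatever validation the script performs. A minimal sketch:

```python
# Minimal sketch: validate a parsed model response against schema.json
# using the jsonschema package (pip install jsonschema).
import json
from jsonschema import ValidationError, validate

with open("schema.json") as f:
    schema = json.load(f)

output = {"summary": "Short recap.", "sentiment": "positive", "confidence": 0.9}

try:
    validate(instance=output, schema=schema)  # raises on any violation
    print("valid")
except ValidationError as err:
    print(f"invalid: {err.message}")
```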
## Reference Documentation

| File | Contains | Load when user asks about |
|---|---|---|
| `references/prompt_engineering_patterns.md` | 10 prompt patterns with input/output examples | "which pattern?", "few-shot", "chain-of-thought", "role prompting" |
| `references/llm_evaluation_frameworks.md` | Evaluation metrics, scoring methods, A/B testing | "how to evaluate?", "measure quality", "compare prompts" |
| `references/agentic_system_design.md` | Agent architectures (ReAct, Plan-Execute, Tool Use) | "build agent", "tool calling", "multi-agent" |
## Common Patterns Quick Reference

| Pattern | When to Use | Example |
|---|---|---|
| Zero-shot | Simple, well-defined tasks | "Classify this email as spam or not spam" |
| Few-shot | Complex tasks, consistent format needed | Provide 3-5 examples before the task |
| Chain-of-Thought | Reasoning, math, multi-step logic | "Think step by step..." |
| Role Prompting | Expertise needed, specific perspective | "You are an expert tax accountant..." |
| Structured Output | Need parseable JSON/XML | Include schema + format enforcement |
## Common Commands

### Prompt Analysis

```bash
python scripts/prompt_optimizer.py prompt.txt --analyze    # Full analysis
python scripts/prompt_optimizer.py prompt.txt --tokens     # Token count only
python scripts/prompt_optimizer.py prompt.txt --optimize   # Generate optimized version
```

### RAG Evaluation

```bash
python scripts/rag_evaluator.py --contexts ctx.json --questions q.json   # Evaluate
python scripts/rag_evaluator.py --contexts ctx.json --compare baseline   # Compare to baseline
```

### Agent Development

```bash
python scripts/agent_orchestrator.py agent.yaml --validate        # Validate config
python scripts/agent_orchestrator.py agent.yaml --visualize       # Show workflow
python scripts/agent_orchestrator.py agent.yaml --estimate-cost   # Token estimation
```