Testing Workflow
This skill provides tools for testing agents built with the building-agents skill.
Workflow Overview
-
mcp__agent-builder__list_tests
-
Check what tests exist
-
mcp__agent-builder__generate_constraint_tests or mcp__agent-builder__generate_success_tests
-
Get test guidelines
-
Write tests directly using the Write tool with the guidelines provided
-
mcp__agent-builder__run_tests
-
Execute tests
-
mcp__agent-builder__debug_test
-
Debug failures
How Test Generation Works
The generate_*_tests MCP tools return guidelines and templates - they do NOT generate test code via LLM. You (Claude) write the tests directly using the Write tool based on the guidelines.
Example Workflow
Step 1: Get test guidelines
result = mcp__agent-builder__generate_constraint_tests( goal_id="my-goal", goal_json='{"id": "...", "constraints": [...]}', agent_path="exports/my_agent" )
Step 2: The result contains:
- output_file: where to write tests
- file_header: imports and fixtures to use
- test_template: format for test functions
- constraints_formatted: the constraints to test
- test_guidelines: rules for writing tests
Step 3: Write tests directly using the Write tool
Write( file_path=result["output_file"], content=result["file_header"] + test_code_you_write )
Step 4: Run tests via MCP tool
mcp__agent-builder__run_tests( goal_id="my-goal", agent_path="exports/my_agent" )
Step 5: Debug failures via MCP tool
mcp__agent-builder__debug_test( goal_id="my-goal", test_name="test_constraint_foo", agent_path="exports/my_agent" )
Testing Agents with MCP Tools
Run goal-based evaluation tests for agents built with the building-agents skill.
Key Principle: MCP tools provide guidelines, Claude writes tests directly
-
✅ Get guidelines: generate_constraint_tests , generate_success_tests → returns templates and guidelines
-
✅ Write tests: Use the Write tool with the provided file_header and test_template
-
✅ Run tests: run_tests (runs pytest via subprocess)
-
✅ Debug failures: debug_test (re-runs single test with verbose output)
-
✅ List tests: list_tests (scans Python test files)
-
✅ Tests stored in exports/{agent}/tests/test_*.py
Architecture: Python Test Files
exports/my_agent/ ├── init.py ├── agent.py ← Agent to test ├── nodes/init.py ├── config.py ├── main.py └── tests/ ← Test files written by MCP tools ├── conftest.py # Shared fixtures (auto-created) ├── test_constraints.py ├── test_success_criteria.py └── test_edge_cases.py
Tests import the agent directly:
import pytest from exports.my_agent import default_agent
@pytest.mark.asyncio async def test_happy_path(mock_mode): result = await default_agent.run({"query": "test"}, mock_mode=mock_mode) assert result.success assert len(result.output) > 0
Why This Approach
-
MCP tools provide consistent test guidelines with proper imports, fixtures, and API key enforcement
-
Claude writes tests directly, eliminating circular LLM dependencies in the MCP server
-
run_tests parses pytest output into structured results for iteration
-
debug_test provides formatted output with actionable debugging info
-
File headers include conftest.py setup with proper fixtures
Quick Start
-
Check existing tests - list_tests(goal_id, agent_path)
-
Get test guidelines - generate_constraint_tests or generate_success_tests
-
Write tests - Use the Write tool with the provided file_header and guidelines
-
Run tests - run_tests(goal_id, agent_path)
-
Debug failures - debug_test(goal_id, test_name, agent_path)
-
Iterate - Repeat steps 4-5 until all pass
⚠️ Credential Requirements for Testing
CRITICAL: Testing requires ALL credentials the agent depends on. This includes both the LLM API key AND any tool-specific credentials (HubSpot, Brave Search, etc.).
Prerequisites
Before running agent tests, you MUST collect ALL required credentials from the user.
Step 1: LLM API Key (always required)
export ANTHROPIC_API_KEY="your-key-here"
Step 2: Tool-specific credentials (depends on agent's tools)
Inspect the agent's mcp_servers.json and tool configuration to determine which tools the agent uses, then check for all required credentials:
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS
creds = CredentialManager()
Determine which tools the agent uses (from agent.json or mcp_servers.json)
agent_tools = [...] # e.g., ["hubspot_search_contacts", "web_search", ...]
Find all missing credentials for those tools
missing = creds.get_missing_for_tools(agent_tools)
Common tool credentials:
Tool Env Var Help URL
HubSpot CRM HUBSPOT_ACCESS_TOKEN
https://developers.hubspot.com/docs/api/private-apps
Brave Search BRAVE_SEARCH_API_KEY
Google Search GOOGLE_SEARCH_API_KEY
- GOOGLE_SEARCH_CX
https://developers.google.com/custom-search
Why ALL credentials are required:
-
Tests need to execute the agent's LLM nodes to validate behavior
-
Tools with missing credentials will return error dicts instead of real data
-
Mock mode bypasses everything, providing no confidence in real-world performance
-
The AgentRunner.run() method validates credentials at startup and will fail fast if any are missing
Mock Mode Limitations
Mock mode (--mock flag or mock_mode=True ) is ONLY for structure validation:
✓ Validates graph structure (nodes, edges, connections) ✓ Tests that code doesn't crash on execution ✗ Does NOT test LLM message generation ✗ Does NOT test reasoning or decision-making quality ✗ Does NOT test constraint validation (length limits, format rules) ✗ Does NOT test real API integrations or tool use ✗ Does NOT test personalization or content quality
Bottom line: If you're testing whether an agent achieves its goal, you MUST use real credentials for ALL services.
Enforcing Credentials in Tests
When generating tests, ALWAYS include credential checks for ALL required services:
import os import pytest from aden_tools.credentials import CredentialManager
At the top of every test file
pytestmark = pytest.mark.skipif( not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"), reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1 for structure validation only." )
@pytest.fixture(scope="session", autouse=True) def check_credentials(): """Ensure ALL required credentials are set for real testing.""" creds = CredentialManager() mock_mode = os.environ.get("MOCK_MODE")
# Always check LLM key
if not creds.is_available("anthropic"):
if mock_mode:
print("\n⚠️ Running in MOCK MODE - structure validation only")
print(" This does NOT test LLM behavior or agent quality")
print(" Set ANTHROPIC_API_KEY for real testing\n")
else:
pytest.fail(
"\n❌ ANTHROPIC_API_KEY not set!\n\n"
"Real testing requires an API key. Choose one:\n"
"1. Set API key (RECOMMENDED):\n"
" export ANTHROPIC_API_KEY='your-key-here'\n"
"2. Run structure validation only:\n"
" MOCK_MODE=1 pytest exports/{agent}/tests/\n\n"
"Note: Mock mode does NOT validate agent behavior or quality."
)
# Check tool-specific credentials (skip in mock mode)
if not mock_mode:
# List the tools this agent uses - update per agent
agent_tools = [] # e.g., ["hubspot_search_contacts", "hubspot_get_contact"]
missing = creds.get_missing_for_tools(agent_tools)
if missing:
lines = ["\n❌ Missing tool credentials!\n"]
for name in missing:
spec = creds.specs.get(name)
if spec:
lines.append(f" {spec.env_var} - {spec.description}")
if spec.help_url:
lines.append(f" Setup: {spec.help_url}")
lines.append("\nSet the required environment variables and re-run.")
pytest.fail("\n".join(lines))
User Communication
When the user asks to test an agent, ALWAYS check for ALL credentials first — not just the LLM key:
-
Identify the agent's tools from agent.json or mcp_servers.json
-
Check ALL required credentials using CredentialManager
-
Ask the user to provide any missing credentials before proceeding
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS
creds = CredentialManager()
1. Check LLM key
missing_creds = [] if not creds.is_available("anthropic"): missing_creds.append(("ANTHROPIC_API_KEY", "Anthropic API key for LLM calls"))
2. Check tool-specific credentials
agent_tools = [...] # Determined from agent config missing_tools = creds.get_missing_for_tools(agent_tools) for name in missing_tools: spec = CREDENTIAL_SPECS.get(name) if spec: missing_creds.append((spec.env_var, spec.description))
3. Present ALL missing credentials to the user at once
if missing_creds: print("⚠️ Missing credentials required by this agent:\n") for env_var, description in missing_creds: print(f" • {env_var} — {description}") print() print("Please set the missing environment variables:") for env_var, _ in missing_creds: print(f" export {env_var}='your-value-here'") print() print("Or run in mock mode (structure validation only):") print(" MOCK_MODE=1 pytest exports/{agent}/tests/")
# Ask user to provide credentials or choose mock mode
AskUserQuestion(...)
IMPORTANT: Do NOT skip credential collection. If an agent uses HubSpot tools, the user MUST provide HUBSPOT_ACCESS_TOKEN . If it uses web search, the user MUST provide the appropriate search API key. Collect ALL missing credentials in a single prompt rather than discovering them one at a time during test failures.
The Three-Stage Flow
┌─────────────────────────────────────────────────────────────────────────┐ │ GOAL STAGE │ │ (building-agents skill) │ │ │ │ 1. User defines goal with success_criteria and constraints │ │ 2. Goal written to agent.py immediately │ │ 3. Generate CONSTRAINT TESTS → Write to tests/ → USER APPROVAL │ │ Files created: exports/{agent}/tests/test_constraints.py │ └─────────────────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────────────────┐ │ AGENT STAGE │ │ (building-agents skill) │ │ │ │ Build nodes + edges, written immediately to files │ │ Constraint tests can run during development: │ │ run_tests(goal_id, agent_path, test_types='["constraint"]') │ └─────────────────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────────────────┐ │ EVAL STAGE (this skill) │ │ │ │ 1. Generate SUCCESS_CRITERIA TESTS → Write to tests/ → USER APPROVAL │ │ Files created: exports/{agent}/tests/test_success_criteria.py │ │ 2. Run all tests: run_tests(goal_id, agent_path) │ │ 3. On failure → debug_test(goal_id, test_name, agent_path) │ │ 4. Iterate: Edit agent code → Re-run run_tests (instant feedback) │ └─────────────────────────────────────────────────────────────────────────┘
Step-by-Step: Testing an Agent
Step 1: Check Existing Tests
ALWAYS check first before generating new tests:
mcp__agent-builder__list_tests( goal_id="your-goal-id", agent_path="exports/your_agent" )
This shows what test files already exist. If tests exist:
-
Review the list to see what's covered
-
Ask user if they want to add more or run existing tests
Step 2: Get Constraint Test Guidelines (Goal Stage)
After goal is defined, get test guidelines using the MCP tool:
First, read the goal from agent.py to get the goal JSON
goal_code = Read(file_path="exports/your_agent/agent.py")
Extract the goal definition and convert to JSON
Get constraint test guidelines via MCP tool
result = mcp__agent-builder__generate_constraint_tests( goal_id="your-goal-id", goal_json='{"id": "goal-id", "name": "...", "constraints": [...]}', agent_path="exports/your_agent" )
Response includes:
-
output_file : Where to write tests (e.g., exports/your_agent/tests/test_constraints.py )
-
file_header : Imports, fixtures, and pytest setup to use at the top of the file
-
test_template : Format for test functions
-
constraints_formatted : The constraints to test
-
test_guidelines : Rules and best practices for writing tests
-
instruction : How to proceed
Write tests directly using the provided guidelines:
Write tests using the Write tool
Write( file_path=result["output_file"], content=result["file_header"] + "\n\n" + your_test_code )
Step 3: Get Success Criteria Test Guidelines (Eval Stage)
After agent is fully built, get success criteria test guidelines:
Get success criteria test guidelines via MCP tool
result = mcp__agent-builder__generate_success_tests( goal_id="your-goal-id", goal_json='{"id": "goal-id", "name": "...", "success_criteria": [...]}', node_names="analyze_request,search_web,format_results", tool_names="web_search,web_scrape", agent_path="exports/your_agent" )
Write tests directly using the provided guidelines:
Write tests using the Write tool
Write( file_path=result["output_file"], content=result["file_header"] + "\n\n" + your_test_code )
Step 4: Test Fixtures (conftest.py)
The file_header returned by the MCP tools includes proper imports and fixtures. You should also create a conftest.py file in the tests directory with shared fixtures:
Create conftest.py with the conftest template
Write( file_path="exports/your_agent/tests/conftest.py", content=conftest_content # Use PYTEST_CONFTEST_TEMPLATE format )
Step 5: Run Tests
Use the MCP tool to run tests (not pytest directly):
mcp__agent-builder__run_tests( goal_id="your-goal-id", agent_path="exports/your_agent" )
Response includes structured results:
{
"goal_id": "your-goal-id",
"overall_passed": false,
"summary": {
"total": 12,
"passed": 10,
"failed": 2,
"skipped": 0,
"errors": 0,
"pass_rate": "83.3%"
},
"test_results": [
{"file": "test_constraints.py", "test_name": "test_constraint_api_rate_limits", "status": "passed"},
{"file": "test_success_criteria.py", "test_name": "test_success_find_relevant_results", "status": "failed"}
],
"failures": [
{"test_name": "test_success_find_relevant_results", "details": "AssertionError: Expected 3-5 results..."}
]
}
Options for run_tests
:
# Run only constraint tests
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
test_types='["constraint"]'
)
# Run with parallel workers
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
parallel=4
)
# Stop on first failure
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
fail_fast=True
)
Step 6: Debug Failed Tests
Use the MCP tool to debug (not Bash/pytest directly):
mcp__agent-builder__debug_test(
goal_id="your-goal-id",
test_name="test_success_find_relevant_results",
agent_path="exports/your_agent"
)
Response includes:
- Full verbose output from the test
- Stack trace with exact line numbers
- Captured logs and prints
- Suggestions for fixing the issue
Step 7: Categorize Errors
When a test fails, categorize the error to guide iteration:
def categorize_test_failure(test_output, agent_code):
"""Categorize test failure to guide iteration."""
# Read test output and agent code
failure_info = {
"test_name": "...",
"error_message": "...",
"stack_trace": "...",
}
# Pattern-based categorization
if any(pattern in failure_info["error_message"].lower() for pattern in [
"typeerror", "attributeerror", "keyerror", "valueerror",
"null", "none", "undefined", "tool call failed"
]):
category = "IMPLEMENTATION_ERROR"
guidance = {
"stage": "Agent",
"action": "Fix the bug in agent code",
"files_to_edit": ["agent.py", "nodes/__init__.py"],
"restart_required": False,
"description": "Code bug - fix and re-run tests"
}
elif any(pattern in failure_info["error_message"].lower() for pattern in [
"assertion", "expected", "got", "should be", "success criteria"
]):
category = "LOGIC_ERROR"
guidance = {
"stage": "Goal",
"action": "Update goal definition",
"files_to_edit": ["agent.py (goal section)"],
"restart_required": True,
"description": "Goal definition is wrong - update and rebuild"
}
elif any(pattern in failure_info["error_message"].lower() for pattern in [
"timeout", "rate limit", "empty", "boundary", "edge case"
]):
category = "EDGE_CASE"
guidance = {
"stage": "Eval",
"action": "Add edge case test and fix handling",
"files_to_edit": ["agent.py", "tests/test_edge_cases.py"],
"restart_required": False,
"description": "New scenario - add test and handle it"
}
else:
category = "UNKNOWN"
guidance = {
"stage": "Unknown",
"action": "Manual investigation required",
"restart_required": False
}
return {
"category": category,
"guidance": guidance,
"failure_info": failure_info
}
Show categorization to user:
AskUserQuestion(
questions=[{
"question": f"Test failed with {category}. How would you like to proceed?",
"header": "Test Failure",
"options": [
{
"label": "Fix code directly (Recommended)" if category == "IMPLEMENTATION_ERROR" else "Update goal",
"description": guidance["description"]
},
{
"label": "Show detailed error info",
"description": "View full stack trace and logs"
},
{
"label": "Skip for now",
"description": "Continue with other tests"
}
],
"multiSelect": false
}]
)
Step 8: Iterate Based on Error Category
IMPLEMENTATION_ERROR → Fix Agent Code
# 1. Show user the exact file and line that failed
print(f"Error in: exports/{agent_name}/nodes/__init__.py:42")
print(f"Issue: 'NoneType' object has no attribute 'get'")
# 2. Read the problematic code
code = Read(file_path=f"exports/{agent_name}/nodes/__init__.py")
# 3. User can fix directly, or you suggest a fix:
Edit(
file_path=f"exports/{agent_name}/nodes/__init__.py",
old_string="if results.get('videos'):",
new_string="if results and results.get('videos'):"
)
# 4. Re-run tests immediately (instant feedback!)
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path=f"exports/{agent_name}"
)
LOGIC_ERROR → Update Goal
# 1. Show user the goal definition
goal_code = Read(file_path=f"exports/{agent_name}/agent.py")
# 2. Discuss what needs to change in success_criteria or constraints
# 3. Edit the goal
Edit(
file_path=f"exports/{agent_name}/agent.py",
old_string='target="3-5 videos"',
new_string='target="1-5 videos"' # More realistic
)
# 4. May need to regenerate agent nodes if goal changed significantly
# This requires going back to building-agents skill
EDGE_CASE → Add Test and Fix
# 1. Create new edge case test with API key enforcement
edge_case_test = '''
@pytest.mark.asyncio
async def test_edge_case_empty_results(mock_mode):
"""Test: Agent handles no results gracefully"""
result = await default_agent.run({{"query": "xyzabc123nonsense"}}, mock_mode=mock_mode)
# Should succeed with empty results, not crash
assert result.success or result.error is not None
if result.success:
assert result.output.get("message") == "No results found"
'''
# 2. Add to test file
Edit(
file_path=f"exports/{agent_name}/tests/test_edge_cases.py",
old_string="# Add edge case tests here",
new_string=edge_case_test
)
# 3. Fix agent to handle edge case
# Edit agent code to handle empty results
# 4. Re-run tests
Test File Templates (Reference Only)
⚠️ Do NOT copy-paste these templates directly. Use generate_constraint_tests
and generate_success_tests
MCP tools to create properly structured tests with correct imports and fixtures.
These templates show the structure of generated tests for reference only.
Constraint Test Template
"""Constraint tests for {agent_name}.
These tests validate that the agent respects its defined constraints.
Requires ANTHROPIC_API_KEY for real testing.
"""
import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager
# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)
@pytest.mark.asyncio
async def test_constraint_{constraint_id}():
"""Test: {constraint_description}"""
# Test implementation based on constraint type
mock_mode = bool(os.environ.get("MOCK_MODE"))
result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)
# Assert constraint is respected
assert True # Replace with actual check
Success Criteria Test Template
"""Success criteria tests for {agent_name}.
These tests validate that the agent achieves its defined success criteria.
Requires ANTHROPIC_API_KEY for real testing - mock mode cannot validate success criteria.
"""
import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager
# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)
@pytest.mark.asyncio
async def test_success_{criteria_id}():
"""Test: {criteria_description}"""
mock_mode = bool(os.environ.get("MOCK_MODE"))
result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)
assert result.success, f"Agent failed: {{result.error}}"
# Verify success criterion met
# e.g., assert metric meets target
assert True # Replace with actual check
Edge Case Test Template
"""Edge case tests for {agent_name}.
These tests validate agent behavior in unusual or boundary conditions.
Requires ANTHROPIC_API_KEY for real testing.
"""
import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager
# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)
@pytest.mark.asyncio
async def test_edge_case_{scenario_name}():
"""Test: Agent handles {scenario_description}"""
mock_mode = bool(os.environ.get("MOCK_MODE"))
result = await default_agent.run({{"edge": "case_input"}}, mock_mode=mock_mode)
# Verify graceful handling
assert result.success or result.error is not None
Interactive Build + Test Loop
During agent construction (Agent stage), you can run constraint tests incrementally:
# After adding first node
print("Added search_node. Running relevant constraint tests...")
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path=f"exports/{agent_name}",
test_types='["constraint"]'
)
# After adding second node
print("Added filter_node. Running all constraint tests...")
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path=f"exports/{agent_name}",
test_types='["constraint"]'
)
This provides immediate feedback during development, catching issues early.
Common Test Patterns
Note: All test patterns should include API key enforcement via conftest.py.
⚠️ CRITICAL: Framework Features You Must Know
OutputCleaner - Automatic I/O Cleaning (NEW!)
The framework now automatically validates and cleans node outputs using a fast LLM (Cerebras llama-3.3-70b) at edge traversal time. This prevents cascading failures from malformed output.
What OutputCleaner does:
- ✅ Validates output matches next node's input schema
- ✅ Detects JSON parsing trap (entire response in one key)
- ✅ Cleans malformed output automatically (~200-500ms, ~$0.001 per cleaning)
- ✅ Boosts success rates by 1.8-2.2x
Impact on tests: Tests should still use safe patterns because OutputCleaner may not catch all issues in test mode.
Safe Test Patterns (REQUIRED)
❌ UNSAFE (will cause test failures):
# Direct key access - can crash!
approval_decision = result.output["approval_decision"]
assert approval_decision == "APPROVED"
# Nested access without checks
category = result.output["analysis"]["category"]
# Assuming parsed JSON structure
for issue in result.output["compliance_issues"]:
...
✅ SAFE (correct patterns):
# 1. Safe dict access with .get()
output = result.output or {}
approval_decision = output.get("approval_decision", "UNKNOWN")
assert "APPROVED" in approval_decision or approval_decision == "APPROVED"
# 2. Type checking before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
category = analysis.get("category", "unknown")
# 3. Parse JSON from strings (the JSON parsing trap!)
import json
recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
try:
parsed = json.loads(recommendation)
if isinstance(parsed, dict):
approval = parsed.get("approval_decision", "UNKNOWN")
except json.JSONDecodeError:
approval = "UNKNOWN"
elif isinstance(recommendation, dict):
approval = recommendation.get("approval_decision", "UNKNOWN")
# 4. Safe iteration with type check
compliance_issues = output.get("compliance_issues", [])
if isinstance(compliance_issues, list):
for issue in compliance_issues:
...
Helper Functions for Safe Access
Add to conftest.py:
import json
import re
def _parse_json_from_output(result, key):
"""Parse JSON from agent output (framework may store full LLM response as string)."""
response_text = result.output.get(key, "")
# Remove markdown code blocks if present
json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()
try:
return json.loads(json_text)
except (json.JSONDecodeError, AttributeError, TypeError):
return result.output.get(key)
def safe_get_nested(result, key_path, default=None):
"""Safely get nested value from result.output."""
output = result.output or {}
current = output
for key in key_path:
if isinstance(current, dict):
current = current.get(key)
elif isinstance(current, str):
try:
json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
parsed = json.loads(json_text)
if isinstance(parsed, dict):
current = parsed.get(key)
else:
return default
except json.JSONDecodeError:
return default
else:
return default
return current if current is not None else default
# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested
Usage in tests:
# Use helper to parse JSON safely
parsed = pytest.parse_json_from_output(result, "recommendation")
if isinstance(parsed, dict):
approval = parsed.get("approval_decision", "UNKNOWN")
# Safe nested access
risk_score = pytest.safe_get_nested(result, ["analysis", "risk_score"], default=0.0)
Test Count Guidance
Generate 8-15 tests total, NOT 30+
- ✅ 2-3 tests per success criterion
- ✅ 1 happy path test
- ✅ 1 boundary/edge case test
- ✅ 1 error handling test (optional)
Why fewer tests?:
- Each test requires real LLM call (~3 seconds, costs money)
- 30 tests = 90 seconds, $0.30+ in costs
- 12 tests = 36 seconds, $0.12 in costs
- Focus on quality over quantity
ExecutionResult Fields (Important!)
result.success=True
means NO exception, NOT goal achieved
# ❌ WRONG - assumes goal achieved
assert result.success
# ✅ RIGHT - check success AND output
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"
All ExecutionResult fields:
- success: bool
- Execution completed without exception (NOT goal achieved!)
- output: dict
- Complete memory snapshot (may contain raw strings)
- error: str | None
- Error message if failed
- steps_executed: int
- Number of nodes executed
- total_tokens: int
- Cumulative token usage
- total_latency_ms: int
- Total execution time
- path: list[str]
- Node IDs traversed
- paused_at: str | None
- Node ID if HITL pause occurred
- session_state: dict
- State for resuming
Happy Path Test
@pytest.mark.asyncio
async def test_happy_path(mock_mode):
"""Test normal successful execution"""
result = await default_agent.run({{"query": "python tutorials"}}, mock_mode=mock_mode)
assert result.success
assert len(result.output) > 0
Boundary Condition Test
@pytest.mark.asyncio
async def test_boundary_minimum(mock_mode):
"""Test at minimum threshold"""
result = await default_agent.run({{"query": "very specific niche topic"}}, mock_mode=mock_mode)
assert result.success
assert len(result.output.get("results", [])) >= 1
Error Handling Test
@pytest.mark.asyncio
async def test_error_handling(mock_mode):
"""Test graceful error handling"""
result = await default_agent.run({{"query": ""}}, mock_mode=mock_mode) # Invalid input
assert not result.success or result.output.get("error") is not None
Performance Test
@pytest.mark.asyncio
async def test_performance_latency(mock_mode):
"""Test response time is acceptable"""
import time
start = time.time()
result = await default_agent.run({{"query": "test"}}, mock_mode=mock_mode)
duration = time.time() - start
assert duration < 5.0, f"Took {{duration}}s, expected <5s"
Integration with building-agents
Handoff Points
Scenario
From
To
Action
Agent built, ready to test
building-agents
testing-agent
Generate success tests
LOGIC_ERROR found
testing-agent
building-agents
Update goal, rebuild
IMPLEMENTATION_ERROR found
testing-agent
Direct fix
Edit agent files, re-run tests
EDGE_CASE found
testing-agent
testing-agent
Add edge case test
All tests pass
testing-agent
Done
Agent validated ✅
Iteration Speed Comparison
Scenario
Old Approach
New Approach
Bug Fix
Rebuild via MCP tools (14 min)
Edit Python file, pytest (2 min)
Add Test
Generate via MCP, export (5 min)
Write test file directly (1 min)
Debug
Read subprocess logs
pdb, breakpoints, prints
Inspect
Limited visibility
Full Python introspection
Anti-Patterns
Testing Best Practices
Don't
Do Instead
❌ Write tests without getting guidelines first
✅ Use generate_*_tests
to get proper file_header and guidelines
❌ Run pytest via Bash
✅ Use run_tests
MCP tool for structured results
❌ Debug tests with Bash pytest -vvs
✅ Use debug_test
MCP tool for formatted output
❌ Check for tests with Glob
✅ Use list_tests
MCP tool
❌ Skip the file_header from guidelines
✅ Always include the file_header for proper imports and fixtures
General Testing
Don't
Do Instead
❌ Treat all failures the same
✅ Use debug_test to categorize and iterate appropriately
❌ Rebuild entire agent for small bugs
✅ Edit code directly, re-run tests
❌ Run tests without API key
✅ Always set ANTHROPIC_API_KEY first
❌ Write tests without understanding the constraints/criteria
✅ Read the formatted constraints/criteria from guidelines
Workflow Summary
1. Check existing tests: list_tests(goal_id, agent_path)
→ Scans exports/{agent}/tests/test_*.py
↓
2. Get test guidelines: generate_constraint_tests, generate_success_tests
→ Returns file_header, test_template, constraints/criteria, guidelines
↓
3. Write tests: Use Write tool with the provided guidelines
→ Write tests to exports/{agent}/tests/test_*.py
↓
4. Run tests: run_tests(goal_id, agent_path)
→ Executes: pytest exports/{agent}/tests/ -v
↓
5. Debug failures: debug_test(goal_id, test_name, agent_path)
→ Re-runs single test with verbose output
↓
6. Fix based on category:
- IMPLEMENTATION_ERROR → Edit agent code directly
- ASSERTION_FAILURE → Fix agent logic or update test
- IMPORT_ERROR → Check package structure
- API_ERROR → Check API keys and connectivity
↓
7. Re-run tests: run_tests(goal_id, agent_path)
↓
8. Repeat until all pass ✅
MCP Tools Reference
# Check existing tests (scans Python test files)
mcp__agent-builder__list_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent"
)
# Get constraint test guidelines (returns templates and guidelines, NOT generated tests)
mcp__agent-builder__generate_constraint_tests(
goal_id="your-goal-id",
goal_json='{"id": "...", "constraints": [...]}',
agent_path="exports/your_agent"
)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines
# Get success criteria test guidelines
mcp__agent-builder__generate_success_tests(
goal_id="your-goal-id",
goal_json='{"id": "...", "success_criteria": [...]}',
node_names="node1,node2",
tool_names="tool1,tool2",
agent_path="exports/your_agent"
)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines
# Run tests via pytest subprocess
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent"
)
# Debug a failed test (re-runs with verbose output)
mcp__agent-builder__debug_test(
goal_id="your-goal-id",
test_name="test_constraint_foo",
agent_path="exports/your_agent"
)
run_tests Options
# Run only constraint tests
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
test_types='["constraint"]'
)
# Run only success criteria tests
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
test_types='["success"]'
)
# Run with pytest-xdist parallelism (requires pytest-xdist)
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
parallel=4
)
# Stop on first failure
mcp__agent-builder__run_tests(
goal_id="your-goal-id",
agent_path="exports/your_agent",
fail_fast=True
)
Direct pytest Commands
You can also run tests directly with pytest (the MCP tools use pytest internally):
# Run all tests
pytest exports/your_agent/tests/ -v
# Run specific test file
pytest exports/your_agent/tests/test_constraints.py -v
# Run specific test
pytest exports/your_agent/tests/test_constraints.py::test_constraint_foo -vvs
# Run in mock mode (structure validation only)
MOCK_MODE=1 pytest exports/your_agent/tests/ -v
MCP tools generate tests, write them to Python files, and run them via pytest.