Multi-Agent Observability Skill

Build observability interfaces for monitoring and measuring multi-agent systems.

Purpose

Guide the design and implementation of observability layers that provide real-time visibility into multi-agent execution.

When to Use

Designing monitoring for agent fleets
Building metrics dashboards
Implementing logging architecture
Creating cost tracking systems

Prerequisites

Understanding of the Three Pillars (@three-pillars-orchestration.md)
Familiarity with results-oriented patterns (@results-oriented-engineering.md)
Access to Claude Agent SDK documentation

SDK Requirement

Implementation Note: Full observability requires Claude Agent SDK with custom MCP tools and UI components. This skill provides design patterns.

The Critical Principle

"If you can't measure it, you can't improve it. If you can't measure it, you can't scale it."

What to Observe

Per-Agent Metrics

Metric Purpose How to Track

Status Know state Agent state enum

Context usage Token consumption API response

Cost Financial impact API usage data

Tool calls What it's doing Hook logging

Results Output verification Result parsing

Duration Execution time Timestamps

Aggregate Metrics

Metric Purpose Calculation

Total agents Scale Count active

Total duration End-to-end time First to last

Total cost Financial total Sum per-agent

Success rate Reliability Success / total

Coverage Scope Files touched

Observability Components

Agent Cards

Real-time status for each agent:

┌─────────────────────────────────────┐ │ scout_1 [EXECUTING] │ ├─────────────────────────────────────┤ │ Template: scout-fast │ │ Model: haiku │ │ Context: 12,500 / 100,000 tokens │ │ Cost: $0.05 │ │ Duration: 45s │ │ Tool calls: 15 │ └─────────────────────────────────────┘

Required fields:

Agent ID and template
Status (idle, executing, complete, error)
Model being used
Context usage (current / max)
Running cost
Execution duration
Tool call count

Event Stream

Real-time log of all activities:

[10:30:00] scout_1 created (template: scout-fast) [10:30:01] scout_1 commanded: "Analyze auth module" [10:30:05] scout_1 Read: src/auth/login.ts [10:30:08] scout_1 Grep: "password" in src/auth/ [10:30:15] scout_1 completed (duration: 14s) [10:30:16] scout_1 deleted

Event types:

Agent lifecycle (create, delete)
Commands sent
Tool calls
Status changes
Errors

Cost Tracking

Track spend per agent and total:

Cost Summary ──────────────────────────────────── scout_1 (haiku) $0.05 scout_2 (haiku) $0.04 builder_1 (sonnet) $0.35 reviewer_1 (sonnet) $0.12 ──────────────────────────────────── Total $0.56 Budget remaining $4.44 (89%)

Cost components:

Input tokens
Output tokens
Per-agent breakdown
Running total
Budget tracking

Result Inspector

View consumed and produced assets:

Agent: builder_1

Consumed Assets: ├── Scout report (summary) ├── src/auth/middleware.ts └── package.json

Produced Assets: ├── src/auth/rate-limit.ts (created) ├── src/auth/middleware.ts (modified) └── tests/rate-limit.test.ts (created)

Summary: "Implemented rate limiting middleware" Status: completed

Log Viewer

Filterable activity history:

Filters: [agent: all] [level: all] [tool: all]

10:30:00 INFO scout_1 Created from template 10:30:01 INFO scout_1 Received command 10:30:05 DEBUG scout_1 Read: src/auth/login.ts (1,200 tokens) 10:30:08 DEBUG scout_1 Grep: found 5 matches 10:30:12 WARN scout_1 Context at 80% capacity 10:30:15 INFO scout_1 Completed successfully

Implementation Patterns

Logging Architecture

Event types

class AgentEvent: timestamp: datetime agent_id: str event_type: str # create, command, tool, status, error details: dict

Log collector

def log_event(event: AgentEvent): # Store to database db.events.insert(event) # Emit to WebSocket ws.broadcast(event) # Update metrics metrics.update(event)

Real-Time Updates

WebSocket for live updates

async def agent_status_stream(agent_id): while agent_active(agent_id): status = get_agent_status(agent_id) yield status await asyncio.sleep(1)

Cost Calculation

def calculate_cost(usage): input_cost = usage.input_tokens * MODEL_INPUT_PRICE output_cost = usage.output_tokens * MODEL_OUTPUT_PRICE return input_cost + output_cost

UI Components

Minimal CLI View

Orchestration: Add rate limiting ──────────────────────────────────── Agents: 3 active | 2 complete | 0 error Cost: $0.56 / $5.00 budget Progress: ████████░░ 80%

[scout_1] ✓ complete (14s) [scout_2] ✓ complete (12s) [builder] ⚡ executing (45s)

Rich Dashboard View

┌─────────────────────────────────────────────────────────────┐ │ Orchestration Dashboard │ ├─────────────────────────────────────────────────────────────┤ │ Task: Add rate limiting to authentication │ │ Started: 10:30:00 | Duration: 2m 15s | Cost: $0.56 │ ├─────────────────────────────────────────────────────────────┤ │ Agent Fleet │ Event Stream │ │ ┌─────────────────────────────────┐ │ [10:32:15] builder │ │ │ scout_1 [✓ complete] │ │ Write: rate-limit │ │ │ scout_2 [✓ complete] │ │ [10:32:10] builder │ │ │ builder [⚡ executing] │ │ Read: middleware │ │ │ reviewer [○ pending] │ │ [10:30:15] scout_2 │ │ └─────────────────────────────────┘ │ completed │ ├─────────────────────────────────────────────────────────────┤ │ Cost Breakdown │ Results Summary │ │ haiku: $0.09 │ Files read: 8 │ │ sonnet: $0.47 │ Files written: 3 │ │ Total: $0.56 │ Tests: 5/5 passing │ └─────────────────────────────────────────────────────────────┘

Design Checklist

Per-agent metrics defined
Aggregate metrics calculated
Event logging implemented
Real-time updates via WebSocket
Cost tracking per agent
Result inspection available
Log filtering supported
UI components designed

Output Format

When designing observability, provide:

Observability Design

Metrics

Per-Agent: [List with tracking method]

Aggregate: [List with calculation]

Components

Agent Cards: [fields and update frequency] Event Stream: [event types and storage] Cost Tracking: [breakdown and budgets] Result Inspector: [consumed/produced format] Log Viewer: [filters and retention]

Implementation

Logging: [architecture] Real-Time: [WebSocket design] Storage: [database schema] UI: [component specifications]

Anti-Patterns

Anti-Pattern Problem Solution

No metrics Flying blind Track everything

Delayed updates Stale status Real-time WebSocket

No cost tracking Budget overruns Per-agent costs

Missing logs Can't debug Log all events

No aggregation Can't summarize Calculate totals

Cross-References

@three-pillars-orchestration.md - Observability pillar
@results-oriented-engineering.md - Result patterns
@agent-lifecycle-crud.md - Agent state tracking
@orchestrator-design skill - System architecture

Version History

v1.0.0 (2025-12-26): Initial release

Last Updated

Date: 2025-12-26 Model: claude-opus-4-5-20251101

multi-agent-observability

Safety Notice

Copy this and send it to your AI assistant to learn

Event types

Log collector

WebSocket for live updates

Observability Design

Metrics

Components

Implementation

Source Transparency

Related Skills

design-thinking

plantuml-syntax

system-prompt-engineering