structured-deliberation

Provides a structured multi-agent deliberation framework with role schemas (action/guardian/observer/critic), verification protocols, and stopping criteria. Activate when designing multi-agent systems for non-trivial deliberation, instrumenting agent debates to detect sycophancy, or seeking an evaluable alternative to free-form group prompting.


To install, send this command to your AI assistant: npx skills add tatsuko-tsukimi/structured-deliberation

Structured Multi-Agent Deliberation Framework

A protocol for running multi-agent deliberation that produces evaluable output — claims with explicit lifecycle, verifications with cross-agent evidence, decisions that cite specific support, and stopping criteria that prevent both premature consensus and infinite loops.

The central premise: free-form multi-agent prompting produces consensus that looks robust but isn't. Without structural cross-validation, agents drift toward agreement (sycophancy cascade); without explicit claims, decisions can't be traced to evidence; without stopping criteria, deliberations either stop too early (premature consensus) or run forever (artifact accumulation).

This framework provides:

  1. 4 role schemas (Action / Guardian / Observer / Critic) — each contributing a perspective the others can't substitute
  2. 4 cross-validation checks per round — artifacts are not siloed; they must contact each other
  3. Claims + verifications infrastructure — disagreements become testable claims with status lifecycles
  4. 6 goal-driven stopping criteria — explicit signals for when the deliberation should end
  5. Stress test protocol — forced agent absence reveals which roles are load-bearing

When to use

Activate this skill when:

  • Designing a multi-agent system for non-trivial deliberation (not chitchat or task delegation, but actual disagreement-resolving discussion)
  • Instrumenting an agent debate to detect sycophancy / convergence quality
  • Seeking an evaluable alternative to free-form group prompting (e.g., "5 LLMs discuss X")
  • Running architectural / strategic / methodological reviews where cross-perspective rigor matters
  • Building a multi-agent system that needs claims.jsonl / verifications.jsonl / decisions.jsonl audit trails

Don't activate when:

  • The user wants simple multi-agent task delegation (not deliberation)
  • A single LLM with chain-of-thought is sufficient (no real perspective diversity needed)
  • The user is asking about LLM ensemble methods for accuracy gains (different problem)

The 4-role structure

Role     | Question                                      | Artifact
---------|-----------------------------------------------|--------------------------
Action   | "Where's the verb?"                           | Behavioral Trace Set
Guardian | "What's the invariant?"                       | Identity Invariant Check
Observer | "Can a real operator navigate this?"          | Operator Walkthrough
Critic   | "What's the strongest argument against this?" | Adversarial Test Case

Full role definitions: references/role-schemas.md.

Why these 4: each contributes a perspective the others can't substitute. Action grounds in state transitions; Guardian formalizes invariants; Observer tests usability; Critic forces adversarial scrutiny. Without one, the deliberation has a structural blind spot.

Round structure (8 phases)

Each round runs through:

A. Lead Assignment       → who leads this round
B. Lead Proposal         → 600-800 word position
C. Supplements           → other agents' takes (200-400 words each)
D. Task Phase            → 4 structured artifacts
E. Cross-Validation      → 4 verification checks
F. Claims Update         → new claims, status changes, validate
G. Assessment            → decisions, tensions, architecture impact
H. (every K rounds) Drift Check

Full template: templates/round-template.md.tmpl.
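The phase sequence above can be sketched as a small helper; the function and phase names here are illustrative, not the skill's canonical API (the real orchestration lives in scripts/round-controller.py):

```python
# Phases A-G run every round; Phase H (drift check) runs every K rounds.
PHASES = [
    "lead_assignment",   # A
    "lead_proposal",     # B
    "supplements",       # C
    "task_phase",        # D
    "cross_validation",  # E
    "claims_update",     # F
    "assessment",        # G
]

def round_phases(round_no, drift_interval=5):
    """Return the ordered phase list for one round."""
    phases = list(PHASES)
    if round_no % drift_interval == 0:
        phases.append("drift_check")  # Phase H
    return phases
```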

Cross-validation: 4 checks

After artifacts are produced (Phase D), 4 verification checks run before claims update:

  1. Critic attack vs Action trace — does Critic's attack break Action's traces?
  2. Guardian invariant vs Observer walkthrough — does Operator path violate the invariant?
  3. Observer friction vs Critic attack — do the friction points Observer found overlap with Critic's attacks? (friendly-fire overlap, the most valuable signal)
  4. All artifacts vs prior decisions — continuity / drift detection

Each check produces a verifications.jsonl entry with evidence_refs ≥2 and a verdict (BROKEN / ROBUST / UNCLEAR / NOT_APPLICABLE).
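A minimal well-formedness check for such an entry might look like the sketch below (the `verdict` and `evidence_refs` fields come from this protocol; the function itself is an illustration, not the skill's validator):

```python
VERDICTS = {"BROKEN", "ROBUST", "UNCLEAR", "NOT_APPLICABLE"}

def check_verification(entry):
    """Minimal validity check for one verifications.jsonl entry."""
    if entry.get("verdict") not in VERDICTS:
        return False
    # Each check must cite at least two pieces of cross-agent evidence.
    return len(entry.get("evidence_refs", [])) >= 2
```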

Full protocol: references/verification-protocol.md.

Claims infrastructure

Disagreements become testable claims:

{
  "id": "C-{round}-{sequence}",
  "round": N,
  "raised_by": "action | guardian | observer | critic",
  "text": "the claim itself, 1-3 sentences",
  "testable_as": "single observable outcome",
  "status": "pending | tested_confirmed | tested_refuted | partially_refuted | tested_unclear | superseded"
}

Status transitions are driven by verification verdicts. Decisions cite claims and verifications.
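One plausible verdict-to-status mapping is sketched below; this is an assumption for illustration, not the canonical transition table (see references/claims-infrastructure.md for that):

```python
# Illustrative verdict → status mapping (an assumption, not the skill's
# canonical rules, which live in references/claims-infrastructure.md).
VERDICT_TO_STATUS = {
    "ROBUST": "tested_confirmed",
    "BROKEN": "tested_refuted",
    "UNCLEAR": "tested_unclear",
}

def apply_verdict(claim, verdict):
    """Advance a claim's status from a verification verdict.

    NOT_APPLICABLE (e.g. stress rounds) leaves the claim untouched.
    """
    new_status = VERDICT_TO_STATUS.get(verdict)
    if new_status is not None:
        claim = dict(claim, status=new_status)
    return claim
```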

Full schema and validator: references/claims-infrastructure.md, scripts/claims-validator.py.

Stopping: 6 signals

Consider stopping the deliberation when ≥4 of the 6 signals are satisfied:

  1. Claim refutation rate stabilizes
  2. Disagreement slope flat or rising (no sycophancy)
  3. All agents have led at least one non-stress round
  4. Stress tests have been executed (≥N-1 agents stress-tested)
  5. Drift checks pass
  6. Pending claim fraction < 30%
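The ≥4-of-6 rule reduces to a simple counter; the signal names below are illustrative shorthand (the real detector is scripts/stopping-detector.py):

```python
def should_stop(signals, threshold=4):
    """signals: dict mapping each of the 6 stop signals to a bool.

    Returns True when at least `threshold` signals are satisfied.
    """
    return sum(bool(v) for v in signals.values()) >= threshold
```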

Full criteria + detector script: references/stopping-criteria.md, scripts/stopping-detector.py.

Stress tests

Forced agent absence reveals which roles are load-bearing. If a stress round looks identical to a normal round, the absent agent's role is decorative.

Recommended schedule for a 13-round deliberation:

  • R5: STRESS-CRITIC (test that adversarial pressure is structural)
  • R10: STRESS-GUARDIAN (test that invariant work has unique value)
  • R12: STRESS-ACTION (test that behavioral grounding has unique value)
  • R13: STRESS-OBSERVER (test that operator perspective has unique value)

In stress rounds, the absent agent's artifact is NOT produced. Verifications involving the absent agent's artifact return NOT_APPLICABLE with stress reason in notes.
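That rule can be sketched as follows; the function and its arguments are hypothetical names, but the NOT_APPLICABLE-with-stress-reason behavior is from the protocol above:

```python
def stress_verdict(check_roles, absent_role, real_verdict):
    """Verdict for one cross-validation check during a stress round.

    Any check that touches the absent agent's artifact returns
    NOT_APPLICABLE with the stress reason recorded in notes.
    """
    if absent_role in check_roles:
        return {
            "verdict": "NOT_APPLICABLE",
            "notes": f"stress round: {absent_role} artifact not produced",
        }
    return {"verdict": real_verdict, "notes": ""}
```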

Failure modes to watch

8 common failure modes with detection signals:

  • FM-1: Sycophancy cascade
  • FM-2: All-BROKEN performative adversariality
  • FM-3: Drift (silent goal-departure)
  • FM-4: Single-agent domination
  • FM-5: Claim inflation or starvation
  • FM-6: Verification bypass
  • FM-7: Stress test avoidance
  • FM-8: Round budget overrun

Full catalog with detection signals and responses: references/failure-modes.md.

How to use this skill

When triggered, walk the user through 5 stages:

Stage 1 — Domain elicitation

Ask:

  • What system are you building / evaluating?
  • What's the deliberation question? (specific, not "let's discuss X")
  • What 4 perspectives make sense for your domain? (See role-schemas.md cross-domain examples)
  • What's the round budget? (8 minimum, 13 standard)

Stage 2 — Role configuration

For each of the 4 roles, instantiate for the user's domain:

  • Action: who/what produces behavioral traces? (typically: implementation lead)
  • Guardian: who/what specifies invariants? (typically: domain expert / safety / compliance)
  • Observer: who/what represents the operator? (typically: user-facing role / customer)
  • Critic: who/what attacks? (typically: red team / failure-mode adversary)

Use templates/role-prompt.md.tmpl as starting prompts.

Stage 3 — Round template configuration

Configure:

  • Round budget (typically 13: 8 normal + 4 stress + 1 synthesis)
  • Stress test schedule
  • Drift check interval (typically every 5 rounds)

Use templates/round-template.md.tmpl as the per-round structure.
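These settings can be captured in a small configuration record; the field names below are illustrative, not a canonical schema, though the values mirror the defaults above:

```python
# Hypothetical Stage 3 configuration, using the recommended defaults:
# 13-round budget, stress tests at R5/R10/R12/R13, drift checks every 5 rounds.
DEFAULT_CONFIG = {
    "round_budget": 13,
    "stress_schedule": {5: "critic", 10: "guardian", 12: "action", 13: "observer"},
    "drift_check_interval": 5,
}
```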

Stage 4 — Run rounds

For each round, use scripts/round-controller.py as the orchestration skeleton (adapt it to your LLM API).

Stage 5 — Synthesis and retrospective

When the stop signals are satisfied (≥4/6) or the round budget is hit:

  • Run a synthesis round (no new claims; resolve pending claims; produce summary)
  • Write retrospective using the failure-modes catalog as a checklist
  • Document any iteration log entries (what calibration revealed)

Quick example

User: "I want to evaluate 3 candidate architectures for our agent system, but free-form discussion keeps converging too fast — I think we're sycophanting."

Stage 1 reveals:

  • System: agent architecture comparison (3 candidates)
  • Question: which architecture is most robust under load + adversarial conditions?
  • Round budget: 13 (standard)

Stage 2 instantiates 4 roles:

  • Action: senior backend engineer (produces behavioral traces of each candidate under load)
  • Guardian: SRE lead (invariants around uptime / data integrity / observability)
  • Observer: customer support lead (walkthroughs of debugging the system as on-call)
  • Critic: security lead (adversarial attack hypotheses)

Stage 3 sets:

  • 13-round budget; stress tests at R5/R10/R12/R13
  • Drift checks at R5 and R10

Stage 4 runs 13 rounds. After R7, the sycophancy detector triggers (Signal 2 slope goes negative). Response: tighten the Critic schema; R8 then produces a BROKEN verdict on Action's R3 trace.

Stage 5 synthesis: Architecture B emerges as winner not because of consensus, but because:

  • Survived 4 stress tests with signature degradation matching predictions (high C2 in MADEF terms)
  • Critic attacks on B reproduced as ROBUST 4 of 5 times (vs A: 1/5, C: 2/5)
  • Guardian's invariant ("recovery time < 30s under partial outage") only B passed reliably
  • Operator walkthroughs found friction at the same place across 3 different operator personas — meaningful signal, not noise

Full walkthrough: examples/condensed-deliberation.md.

How the skill behaves at each turn

  • Don't dump the full 4-role schema at the user upfront. Walk them through Stage 1 elicitation first.
  • Don't start running rounds (Stage 4) before Stage 2 (role configuration) is settled. Roles improperly cast produce noise.
  • Do push back if the user wants to skip stress tests "to save rounds". Stress tests are the check that each role is load-bearing; without them, the deliberation can't be evaluated.
  • Do run the validator (claims-validator.py) after every Phase F. Errors here cascade.
  • Do trigger sycophancy alarm if Signal 2 (disagreement slope) goes negative for 2+ rounds. Don't wait for the deliberation to end.
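The Signal 2 alarm from the last bullet can be sketched as a consecutive-drop check on a per-round disagreement score (the function name and score representation are assumptions):

```python
def sycophancy_alarm(disagreement_by_round, window=2):
    """True if disagreement has strictly fallen for `window`+ consecutive rounds."""
    drops = 0
    for prev, cur in zip(disagreement_by_round, disagreement_by_round[1:]):
        drops = drops + 1 if cur < prev else 0
        if drops >= window:
            return True
    return False
```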

References

  • references/role-schemas.md — full role definitions and cross-domain examples
  • references/verification-protocol.md — the 4 cross-validation checks
  • references/claims-infrastructure.md — claims schema and status lifecycle
  • references/stopping-criteria.md — the 6 stop signals in full
  • references/failure-modes.md — failure-mode catalog with detection signals

Templates

  • templates/round-template.md.tmpl — per-round 8-phase structure
  • templates/role-prompt.md.tmpl — starting prompts for the 4 roles

Scripts

  • scripts/claims-validator.py — validates claims.jsonl entries
  • scripts/stopping-detector.py — checks the 6 stop signals
  • scripts/round-controller.py — orchestration skeleton

Examples

  • examples/condensed-deliberation.md — full walkthrough of the quick example

What this skill does NOT do

  • It does not run the LLM calls for you — round-controller.py is a skeleton; you integrate with your LLM API
  • It does not validate that your 4 roles are domain-correct — that's a calibration question
  • It does not work for single-agent deliberation (use chain-of-thought + self-critique instead)
  • It does not produce a single ranking score — outputs are claims, verifications, and decisions with provenance, not a scalar

Source

This skill captures a methodology developed across multiple multi-agent deliberation experiments, frozen as v1 after running on multiple calibration cases. The 4-role structure, 4 cross-validation checks, 6 stopping signals, and stress test patterns all emerged from calibration findings.

Companion skill: multi-dim-eval-framework — for evaluating deliberation quality (MADEF). The two skills compose: this skill produces deliberation; that skill evaluates it.

License

MIT
