# MassGen
Invoke MassGen for multi-agent iteration on any task — general-purpose work, evaluation, planning, or spec writing. Multiple AI agents independently work on the problem and converge on the strongest result through MassGen's checklist-gated voting system.
## When to Use
**General (default)** — get multi-agent results on any task:
- When you want multiple AI agents to independently tackle a problem
- When the task doesn't fit neatly into evaluate, plan, or spec
- Writing, research, code generation, design, analysis, or any open-ended task
**Evaluate** — get diverse, critical feedback on existing work:
- After iterating and stalling — need outside perspective
- Before submitting PRs or delivering artifacts
- When wanting diverse AI perspectives on implementation quality
**Plan** — create or refine a structured project plan:
- When starting a complex feature or project that needs task decomposition
- When an existing plan has gaps, is too vague, or needs restructuring
- When you need a valid task DAG with verification criteria
**Spec** — create or refine a requirements specification:
- When starting a feature that needs precise requirements before implementation
- When an existing spec has ambiguities, missing edge cases, or gaps
- When you need EARS-formatted requirements with acceptance criteria
## Mode Selection
| Mode | Purpose | Input | Output | Default Criteria Preset |
|---|---|---|---|---|
| general | Any task | Task description + context | Winner's deliverables in result.md + workspace files | Auto-generated |
| evaluate | Critique existing work | Artifacts to evaluate | critique_packet.md (with approach_assessment), verdict.json, next_tasks.json (with fix_tasks, evolution_tasks) | "evaluation" |
| plan | Create or refine a plan | Goal + constraints (+ existing plan) | project_plan.json (typed tasks, chunks, deps, prototypes) | "planning" |
| spec | Create or refine a spec | Problem + needs (+ existing spec) | project_spec.json (EARS requirements, chunks, rationale) | "spec" |
## FIRST: Confirm Config (do this before anything else)
Always ask the user which config to use. The config controls which models run and how many agents are spawned — this directly affects quality and cost. Never silently pick a config. The user must confirm the choice every time.
### Step A: Check what the user already specified
Scan the user's message for any config signal before searching:
| Signal in message | What to do |
|---|---|
| Explicit file path (e.g. `--config foo.yaml`, `configs/team.yaml`) | Go to Step D — verify it exists, then confirm |
| Provider/model name (e.g. "use Claude", "GPT-4 agents", "Gemini") | Note the preference; use it to rank options in Step B |
| Named config (e.g. "the teams config", "my 3-agent setup") | Search for a match in Step B, confirm before using |
| "Same as last time" / "use recent" | Find the last-used config (see Step B), confirm before using |
| Nothing about config | Proceed to Step B |
### Step B: Discover available configs and models
Run these checks and collect all found paths:

```bash
# Standard locations
ls .massgen/config.yaml 2>/dev/null && echo "PROJECT: .massgen/config.yaml"
ls ~/.config/massgen/config.yaml 2>/dev/null && echo "GLOBAL: ~/.config/massgen/config.yaml"

# Recently used (from past skill runs in this project)
ls -t .massgen/*/run_summary.json 2>/dev/null | head -5
```
If the user said "same as last time", check the most recent `run_summary.json` for a `"config"` field — that's the last-used path.
If the user mentioned a provider or model name but you need to verify what's available, run:
```bash
uv run massgen --list-backends
```
This prints all supported backends with their models, capabilities, and required API keys — useful for matching a user's stated preference to a real backend name.
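For instance, to match a stated preference like "use Claude" against the listing (a plain text filter; the exact output format may vary):

```bash
uv run massgen --list-backends | grep -i claude
```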
If no configs are found at all, do NOT create a YAML file yourself. Instead, use the headless quickstart, which auto-detects available API keys and generates a config without requiring a browser:
```bash
uv run massgen --quickstart --headless
```
This writes a config to `.massgen/config.yaml` and exits. If you need a specific backend, add `--quickstart-agent backend=claude,model=claude-opus-4-6` (repeat for multiple agents). Only fall back to `--web-quickstart` if the user explicitly wants the browser-based setup wizard.
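As an illustration, a hypothetical three-agent setup (model string copied from the example above; substitute whatever `--list-backends` reports as available):

```bash
uv run massgen --quickstart --headless \
  --quickstart-agent backend=claude,model=claude-opus-4-6 \
  --quickstart-agent backend=claude,model=claude-opus-4-6 \
  --quickstart-agent backend=claude,model=claude-opus-4-6
```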
### Step C: Ask the user to confirm
Use `AskUserQuestion` to present the options. Format the question clearly:

```
I found these MassGen configs:

1. .massgen/config.yaml — project config
2. ~/.config/massgen/config.yaml — global config

Which would you like to use? You can also paste a path to a different config, say "create new" to generate one, or tell me which provider/model you prefer.
```
Rules for presenting options:
- List every found config with its location label (project / global / path)
- If the user expressed a preference (provider name, agent count), note which option best matches and say why
- Always include "create new" as an option
- If only one config exists, still ask — just make it easy: "I found one config at `.massgen/config.yaml` — use it, or would you prefer a different one?"
### Step D: Resolve the user's answer
| User response | Resolution |
|---|---|
| Picks a number from the list | Use that config path |
| Pastes or types a file path | Verify it exists; if not, report error and stop |
| Describes preference (e.g. "the Claude one", "use Gemini") | Match to discovered list or run --list-backends to find it; confirm |
| Says "default" or presses enter with one option | Use the single discovered config |
| Says "create new" / "generate one" | Run uv run massgen --quickstart --headless from cwd, wait for exit |
| Specifies backend+model (e.g. "3 Claude agents") | Run headless quickstart with explicit --quickstart-agent flags |
Once resolved, pass the path via `--config <path>` in the `massgen_run.sh` invocation (Step 4). If the user confirmed `.massgen/config.yaml` (the implicit default), you may omit `--config`.
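For the "pastes a path" row, a minimal existence check before accepting it (plain shell; the path is a hypothetical example):

```bash
CONFIG_PATH="configs/team.yaml"   # user-supplied path
[ -f "$CONFIG_PATH" ] || { echo "Config not found: $CONFIG_PATH" >&2; exit 1; }
```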
STOP here until you have a confirmed config. Do NOT proceed to Scope or Workflow until the user has explicitly chosen a config. Do NOT write config YAML files yourself — use the headless quickstart to generate them. Do NOT search for configs in subdirectories, parent directories, or anywhere else beyond the standard locations above.
## Scope
Before starting, determine what the MassGen invocation covers. Focused invocations produce far better results than unscoped "do everything" runs.
- General: the task to accomplish, relevant context, quality expectations
- Evaluate: which files/artifacts to evaluate, what to ignore, evaluation focus
- Plan: the goal/objective, constraints, what context to include
- Spec: the problem to specify, user needs, system boundaries
If the user doesn't specify scope, ask them.
## Workflow
### Step 0: Create Working Directory
Create a timestamped subdirectory so parallel invocations don't conflict:
```bash
MODE="general"   # or "evaluate", "plan", or "spec"
WORK_DIR=".massgen/$MODE/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$WORK_DIR"
```
All artifacts (context, criteria, prompt, output, logs) go in this directory.
### Step 1: Clarify & Write Context File
Read `references/<mode>/workflow.md` (relative to this skill) for the context file template specific to your mode. Write `$WORK_DIR/context.md` using the template from the workflow file.
Key principle for all modes: provide factual context that orients the MassGen agents. Do NOT bias them with your opinions about quality — let them discover issues independently. That's the whole point of multi-agent evaluation.
- General: describe the task, relevant context, quality expectations
- Evaluate: describe what was built, scope, git info, verification evidence
- Plan: describe the goal, constraints, existing context, success criteria
- Spec: describe the problem, user needs, system boundaries, constraints
### Step 2: Generate Criteria
Each mode has a recommended criteria preset:
| Mode | Preset | How to Apply |
|---|---|---|
| general | Auto-generated | Omit --criteria-file and --criteria-preset |
| evaluate | Custom or "evaluation" | --criteria-file with JSON, or --criteria-preset evaluation |
| plan | "planning" | --criteria-preset planning |
| spec | "spec" | --criteria-preset spec |
For general mode, criteria are auto-generated from the task content — omit both flags unless you have specific quality axes to enforce.
To use custom criteria: read `references/criteria_guide.md` for the format and writing guide, then write the criteria JSON to `$WORK_DIR/criteria.json`.
If there's a specific focus area, weight your criteria toward that focus. In Claude Code: use AskUserQuestion to ask the user for focus preference. In Codex or non-interactive: default to general coverage.
```bash
cat > "$WORK_DIR/criteria.json" << 'EOF'
[
  {"text": "...", "category": "must"},
  {"text": "...", "category": "must"},
  {"text": "...", "category": "should"},
  {"text": "...", "category": "could"}
]
EOF
```
### Step 3: Construct the Prompt
1. Read the prompt template from `references/<mode>/prompt_template.md` (relative to this skill)
2. Read the context file you wrote in Step 1
3. Replace `{{CONTEXT_FILE_CONTENT}}` with the context file contents
4. Replace `{{CUSTOM_FOCUS}}` with the focus directive (or an empty string if none)
5. Write the final prompt to `$WORK_DIR/prompt.md`
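A minimal sketch of that substitution in pure bash, assuming the template contains exactly the two placeholders shown (quoting the replacement keeps any `&` in the context literal):

```bash
# SKILL_DIR is the directory containing this SKILL.md file
template=$(cat "$SKILL_DIR/references/general/prompt_template.md")
context=$(cat "$WORK_DIR/context.md")
focus=""   # or a focus directive from Step 2

template=${template//"{{CONTEXT_FILE_CONTENT}}"/"$context"}
template=${template//"{{CUSTOM_FOCUS}}"/"$focus"}
printf '%s\n' "$template" > "$WORK_DIR/prompt.md"
```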
### Step 4: Run MassGen
Use the launcher script (`scripts/massgen_run.sh`, relative to this skill) to run MassGen, launch the web viewer, and wait for completion in a single atomic command. This avoids the double-backgrounding issues that cause agents to lose track of running processes.
Run in the background using your agent's native mechanism (e.g., `run_in_background` in Claude Code):
```bash
# SKILL_DIR is the directory containing this SKILL.md file
bash "$SKILL_DIR/scripts/massgen_run.sh" \
  --work-dir "$WORK_DIR" \
  --prompt-file "$WORK_DIR/prompt.md" \
  --criteria-file "$WORK_DIR/criteria.json" \
  --viewer
```
If using default criteria (no custom criteria file), omit `--criteria-file`. For plan/spec modes, use `--criteria-preset planning` or `--criteria-preset spec` instead.
If you resolved a custom config during config confirmation (Step D above), add `--config <path>`.
The script handles everything atomically:
- Launches MassGen in `--automation` mode
- Waits for the log directory to appear (30s timeout)
- Starts the web viewer at `http://localhost:8000`
- Waits for MassGen to complete
- Writes `$WORK_DIR/run_summary.json` with exit code, duration, and log dir
After the background task completes, read the summary:

```bash
cat "$WORK_DIR/run_summary.json"
```
Script options:
- `--viewer` — launch the web viewer (opens `http://localhost:8000`)
- `--viewer-port PORT` — use a different port
- `--config FILE` — custom MassGen config YAML
- `--output-file FILE` — override the result path (default: `$WORK_DIR/result.md`)
- `--no-cwd-context` — disable read-only CWD access
- `--extra-args "..."` — pass additional massgen CLI flags
Timing: expect 15-45 minutes. Do not assume something is stuck — MassGen runs multiple agents through several rounds of iteration.
### Step 5: Parse the Output
The output depends on the mode. The winner's workspace path is shown in `$WORK_DIR/result.md` (look for "Workspace cwd"), or check `status.json` in the log directory for `workspace_paths`.
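Two quick ways to locate it (the `jq` field name comes from the note above; the log directory is recorded in `run_summary.json`):

```bash
grep -i "workspace cwd" "$WORK_DIR/result.md"
# or, given the run's log directory:
jq '.workspace_paths' "<log_dir>/status.json"
```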
**General mode:** the winner's answer is in `$WORK_DIR/result.md`. Any files the agents created are in the winner's workspace (path shown in `result.md`). Copy or reference the workspace files as needed.
**Evaluate mode:** three files — `verdict.json`, `next_tasks.json`, and `critique_packet.md`. Read `verdict.json` first to determine iterate vs. converged. See `references/evaluate/workflow.md` for the full output structure.
**Plan mode:** `project_plan.json` — a structured task list with chunks, dependencies, and verification. May include auxiliary files in `research/`, `framework/`, and `risks/` subdirectories. See `references/plan/workflow.md` for the full output structure.
**Spec mode:** `project_spec.json` — EARS requirements with chunks, rationale, and verification. May include auxiliary files in `research/`, `design/`, and `decisions/` subdirectories. See `references/spec/workflow.md` for the full output structure.
### Step 6: Ground in Your Native Task System
This is the most critical step for evaluate, plan, and spec modes. MassGen produced a structured result — now you must internalize it by entering your native task/plan mode and enumerating every task or requirement as a tracked item. Without this, the plan is just text that fades from context as you work.
For general mode, grounding is optional — it applies when the output contains a structured task list or action items, but many general tasks produce artifacts (code, documents, designs) rather than task lists.
Why this matters: agents that skip this step tend to execute the first few tasks, then drift — forgetting verification steps, skipping later tasks, or losing track of dependencies. Grounding forces you to commit to the full scope before executing anything.
For all modes:

1. Enter your task planning mode (e.g., TodoWrite in Claude Code, task tracking in Codex, or whatever native tracking your environment provides)
2. Create one tracked task per item from the MassGen output:
   - Evaluate: each task from `next_tasks.json` becomes a tracked task, preserving `implementation_guidance`, `depends_on`, and `verification`
   - Plan: each task from `project_plan.json` becomes a tracked task, preserving chunk ordering, dependencies, and verification criteria
   - Spec: each requirement from `project_spec.json` becomes a tracked task (implement + verify), preserving priority, dependencies, and acceptance criteria
3. Preserve the dependency order — don't flatten the DAG. Tasks in chunk C01 must complete before C02 tasks begin
4. Include verification as explicit tasks — don't just track "implement X"; also track "verify X meets [criteria]". Verification that isn't tracked doesn't happen
5. Mark each task's status as you work: pending → in_progress → completed
Then execute in order, updating status as you go. When you complete a task, check it off and move to the next one. This creates an execution trace that keeps you honest about what's done and what remains.
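A hedged sketch for previewing what you're about to track; the field names here are assumptions, so inspect the JSON for the real schema first:

```bash
# Evaluate mode: list fix tasks with their dependencies (keys assumed)
jq -r '.fix_tasks[]? | [(.id // "?"), ((.depends_on // []) | join(","))] | @tsv' next_tasks.json
```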
### Step 7: Execute and Iterate
**General:** read `result.md` for the winning answer. Copy deliverable files from the winner's workspace if applicable.
**Evaluate:** read `verdict.json` — if `"iterate"`, check `approach_assessment.ceiling_status` in `next_tasks.json` first:
- `ceiling_not_reached` → execute `fix_tasks`, then `evolution_tasks` as stretch
- `ceiling_approaching` → execute `fix_tasks`, then `evolution_tasks`
- `ceiling_reached` → consider re-invoking plan mode with evaluation findings (see Plan-Evaluate Loop below)

If `"converged"`, proceed to delivery.
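For example (the `approach_assessment.ceiling_status` path is from above; the top-level `verdict` key is an assumption, so check the file):

```bash
jq -r '.verdict // empty' verdict.json
jq -r '.approach_assessment.ceiling_status // empty' next_tasks.json
```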
**Plan / Spec:** store the result as a living document (see below), then execute the grounded tasks chunk by chunk. At tasks marked with `eval_checkpoint`, invoke evaluate mode to assess approach viability before continuing (see Plan-Evaluate Loop below).
## Living Document Protocol (Plan & Spec Modes)
This is the most important section for plan/spec modes — it defines how the output is used after MassGen produces it.
### Store
Adopt the MassGen output into `.massgen/plans/` using the existing PlanStorage infrastructure:
```
.massgen/plans/plan_<timestamp>/
├── workspace/             # Mutable working copy
│   ├── plan.json          # (renamed from project_plan.json) or spec.json
│   └── research/          # Auxiliary files from MassGen output
├── frozen/                # Immutable snapshot (identical to workspace/ at creation)
│   ├── plan.json          # or spec.json
│   └── research/
└── plan_metadata.json     # artifact_type, status, chunk_order, context_paths
```
Copy `project_plan.json` → `workspace/plan.json` (or `project_spec.json` → `workspace/spec.json`). Copy any auxiliary directories. Create `frozen/` as an identical snapshot.
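A sketch of the adoption, assuming `$WINNER_WS` holds the winner's workspace path from Step 5 (write `plan_metadata.json` per the tree above as well):

```bash
PLAN_DIR=".massgen/plans/plan_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$PLAN_DIR/workspace"
cp "$WINNER_WS/project_plan.json" "$PLAN_DIR/workspace/plan.json"
[ -d "$WINNER_WS/research" ] && cp -r "$WINNER_WS/research" "$PLAN_DIR/workspace/"
cp -r "$PLAN_DIR/workspace" "$PLAN_DIR/frozen"   # immutable snapshot
```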
### Read on Restart
FIRST ACTION in every new session: read `workspace/plan.json` (or `workspace/spec.json`). This is the source of truth for what's done and what's next.
### Update Continuously
As tasks complete (plan) or requirements are implemented (spec), update the workspace copy. Mark status, add notes, record discoveries. The workspace copy is a living document.
### Check Drift
Periodically compare `workspace/` against `frozen/`. The existing `PlanSession.compute_plan_diff()` returns a `divergence_score` (0.0 = no drift, 1.0 = complete rewrite). High drift means you should re-evaluate whether the plan/spec is still valid.
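If you can't call `compute_plan_diff()` directly, a crude textual proxy (not the real `divergence_score`, just a smell test):

```bash
diff -ru "$PLAN_DIR/frozen" "$PLAN_DIR/workspace" | grep -c '^[+-]'
```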
### Refine When Stuck
If the plan/spec proves wrong or incomplete, re-invoke this skill with the workspace copy as "What Already Exists" to get multi-agent refinement. This creates a new plan directory with a fresh `frozen/` snapshot.
### Don't Drift Silently
If you deviate from the plan/spec, update the workspace copy first. An outdated plan is worse than no plan.
## Mode Overviews
### General
Run any task through MassGen's multi-agent system. Agents independently produce solutions and converge through checklist-gated voting. Use this for tasks that don't fit neatly into evaluate, plan, or spec — writing, code generation, research, analysis, design, or anything where multiple perspectives improve the result.
See `references/general/workflow.md` for the context template and output handling.
### Evaluate
Critique existing work artifacts. Evaluator agents inspect your code, documents, or deliverables and produce a structured critique with machine-readable verdict, per-criterion scores, and actionable improvement tasks. The checklist-gated voting system ensures agents converge on the strongest critique.
See `references/evaluate/workflow.md` for the full context template, output structure, and examples.
### Plan
Create or refine a structured project plan. Planning agents decompose the goal into a task DAG with chunks, dependencies, verification criteria, and technology choices. Each round of MassGen iteration improves task quality — descriptions get more actionable, verification gets more specific, sequencing gets tighter.
See `references/plan/workflow.md` for the full context template, output format, and lifecycle.
### Spec
Create or refine a requirements specification. Spec agents produce EARS-formatted requirements with acceptance criteria, rationale, and verification. Iteration focuses on precision — each round eliminates ambiguities, fills gaps, and strengthens edge case coverage.
See `references/spec/workflow.md` for the full context template, output format, and lifecycle.
## Plan-Evaluate Loop
For complex or creative projects, plan and evaluate modes work together in a feedback loop:
```
Plan → Execute → Evaluate → (fix OR re-plan) → Execute → Evaluate → ...
```
### When to Use the Loop
- The task has exploratory components (visual design, creative writing, UX)
- The project is complex enough that the initial plan is partly speculative
- Quality expectations are high and "correct but adequate" isn't enough
- Prior iterations show diminishing returns
### Loop Protocol
1. **Plan:** invoke plan mode. Agents classify tasks as `deterministic` or `exploratory` and create prototypes to validate assumptions
2. **Execute:** implement the plan chunk by chunk
3. **Evaluate:** at `eval_checkpoint` tasks (or after any exploratory chunk), invoke evaluate mode
4. **Decide:** read `approach_assessment` in the evaluation output:
   - `ceiling_not_reached` → execute `fix_tasks`, continue
   - `ceiling_approaching` → execute `fix_tasks` + `evolution_tasks`, continue
   - `ceiling_reached` → re-invoke plan mode with evaluation discoveries
5. **Evolve:** if re-planning, pass `approach_assessment` and `breakthroughs` as context. The new plan amplifies what worked and avoids approaches that hit their ceiling
6. **Repeat** until evaluation returns "converged"
### What Makes This Different from Just Re-Running Eval
- Eval assesses whether the APPROACH has room to grow, not just whether the OUTPUT has defects
- When the approach is limited, the loop goes back to PLANNING, not just more implementation
- Breakthroughs discovered during execution feed FORWARD into new plans, not just into preserve lists
- The plan evolves based on evidence from execution, not speculation
### Loop Termination
- Max 3 plan mutations per chunk — if still not converging, escalate to user
- If evaluation returns "converged" with `ceiling_not_reached`, the loop is complete
- If the user provides explicit direction, follow it regardless of ceiling status
## Condensed Examples
### General: Multi-Agent Task Execution
```bash
WORK_DIR=".massgen/general/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$WORK_DIR"

cat > "$WORK_DIR/context.md" << 'EOF'
## Task
Build a responsive landing page for a developer tool that converts
CSV files to JSON. Single page with hero, features, and CTA sections.

## Context
- Target audience: developers and data engineers
- Tech stack: HTML, CSS, vanilla JS (no frameworks)
- Must work on mobile and desktop

## Quality Expectations
- Visually polished, not template-looking
- Fast load time, no external dependencies
EOF

# Build prompt from references/general/prompt_template.md, then run
# No --criteria-file — criteria auto-generated from task
bash "$SKILL_DIR/scripts/massgen_run.sh" \
  --work-dir "$WORK_DIR" \
  --prompt-file "$WORK_DIR/prompt.md" \
  --viewer
```
### Evaluate: Pre-PR Review
```bash
WORK_DIR=".massgen/evaluate/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$WORK_DIR"

# Write context (scope: specific deliverables, no quality opinions)
cat > "$WORK_DIR/context.md" << 'EOF'
## Deliverables in Scope
- `src/api/handler.ts` — API request handler
- `src/hooks/useAuth.ts` — authentication hook

## Out of Scope
- Test files, CI config

## Original Task
Add JWT authentication to the API layer

## What Was Done
Implemented JWT validation in handler and auth hook for React components.

## Verification Evidence
pytest: 24 passed, 0 failed
EOF

# Write criteria (or omit --criteria-file to use the "evaluation" preset)
cat > "$WORK_DIR/criteria.json" << 'EOF'
[
  {"text": "Auth security: JWT validation covers expiration, signature, and audience checks.", "category": "must"},
  {"text": "Error handling: invalid/expired tokens produce clear error responses.", "category": "must"},
  {"text": "Code quality: clean separation between auth logic and business logic.", "category": "should"}
]
EOF

# Build prompt from template, then run
bash "$SKILL_DIR/scripts/massgen_run.sh" \
  --work-dir "$WORK_DIR" \
  --prompt-file "$WORK_DIR/prompt.md" \
  --criteria-file "$WORK_DIR/criteria.json" \
  --viewer
```
### Plan: New Feature Planning
```bash
WORK_DIR=".massgen/plan/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$WORK_DIR"

cat > "$WORK_DIR/context.md" << 'EOF'
## Goal
Add real-time collaboration to the document editor — multiple users
editing the same document simultaneously with cursor presence.

## Constraints
- Must work with existing PostgreSQL database
- Timeline: 2 weeks
- Team: 2 engineers

## Existing Context
Express.js backend, React frontend, WebSocket already used for notifications.

## Success Criteria
Two users can edit the same document with <500ms sync latency and no data loss.
EOF

# Build prompt from references/plan/prompt_template.md, then run
bash "$SKILL_DIR/scripts/massgen_run.sh" \
  --work-dir "$WORK_DIR" \
  --prompt-file "$WORK_DIR/prompt.md" \
  --criteria-preset planning \
  --viewer
```
### Spec: Feature Specification
```bash
WORK_DIR=".massgen/spec/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$WORK_DIR"

cat > "$WORK_DIR/context.md" << 'EOF'
## Problem Statement
Users cannot recover deleted items — deletion is permanent and irreversible.

## User Needs / Personas
- End users: accidentally delete items, need easy recovery
- Admins: need to purge items for compliance after retention period

## Constraints
- PostgreSQL database, soft-delete pattern preferred
- 30-day retention before permanent purge
- Must not break existing API consumers
EOF

# Build prompt from references/spec/prompt_template.md, then run
bash "$SKILL_DIR/scripts/massgen_run.sh" \
  --work-dir "$WORK_DIR" \
  --prompt-file "$WORK_DIR/prompt.md" \
  --criteria-preset spec \
  --viewer
```
## Reference Files
- `references/general/workflow.md` — general mode context template and output handling
- `references/general/prompt_template.md` — general prompt template with placeholders
- `references/criteria_guide.md` — how to write quality criteria (format, tiers, examples)
- `references/evaluate/workflow.md` — evaluate mode context template, output structure, examples
- `references/evaluate/prompt_template.md` — evaluation prompt template with placeholders
- `references/plan/workflow.md` — plan mode context template, output format, lifecycle
- `references/plan/prompt_template.md` — planning prompt template with placeholders
- `references/spec/workflow.md` — spec mode context template, output format, lifecycle
- `references/spec/prompt_template.md` — spec prompt template with placeholders
- `massgen/subagent_types/round_evaluator/SUBAGENT.md` — source methodology for evaluation
- `massgen/skills/massgen-develops-massgen/SKILL.md` — reference pattern for `--automation`