Delx Ops Guardian

Use this skill when handling incidents, degraded automations, or gateway/memory instability in production. It integrates with the Delx witness protocol so that each incident becomes a durable recovery artifact instead of a scratch note.

Required permissions (explicit)

This skill requires host-level access: systemctl, journalctl, and read access to /root/.openclaw/. The runtime must run as a scoped service account, not as unrestricted root. If the platform cannot enforce scoped sudo and human-approval gates, treat this skill as risky and do not enable it.

Aliases

emergency_recovery, handle_incident, cron_guard, memory_guard, gateway_guard

Scope (strict least-privilege)

Allowed read sources:

OpenClaw cron state: openclaw cron list --json
Service health: systemctl is-active <service>
Logs for incident window: journalctl -u <service> --since ... --no-pager
Workspace incident artifacts: /root/.openclaw/workspace/docs/ops/, /root/.openclaw/workspace/memory/
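The read sources above can be sketched as a small builder that returns the three read-only commands as argument lists. This is a minimal sketch, not part of the skill: the function name `evidence_commands` and the service-name pattern are assumptions made for illustration, and the commands themselves are copied from the list above.

```python
import re

# Assumption: service names use a conservative character set and never start
# with "-", so a name cannot be mistaken for a command-line option.
_SERVICE_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.@-]*$")


def evidence_commands(service: str, since: str) -> list[list[str]]:
    """Return the read-only evidence commands for one service as argv lists."""
    if not _SERVICE_RE.match(service):
        raise ValueError(f"unexpected service name: {service!r}")
    if not since.strip():
        raise ValueError("an explicit incident window start is required")
    return [
        ["openclaw", "cron", "list", "--json"],
        ["systemctl", "is-active", service],
        ["journalctl", "-u", service, "--since", since, "--no-pager"],
    ]
```

Returning argv lists rather than shell strings means the caller can pass each entry to a process runner without shell interpolation, which keeps the service name and time window from being reinterpreted.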

Allowed remediation actions (safe set):

Retry a failed job once when failure is transient
Controlled restart of the impacted service only (openclaw-gateway, openclaw, or explicitly named target from incident evidence)
Disable/enable only the directly impacted cron job when loop-failing
Add/adjust guardrails in runbook/config docs (non-secret, reversible)

Disallowed:

No credential rotation/deletion
No firewall or network policy mutations
No package installs/upgrades during incident handling
No bulk cron rewrites unrelated to the incident
No edits to unrelated services/components
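One way to make the allowed and disallowed lists above enforceable is an explicit allowlist check that runs before anything executes. The sketch below is illustrative only: the `Action` enum, its member names, and `is_permitted` are hypothetical, and a real platform may model actions differently.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the actions named in the lists above.
    RETRY_JOB = "retry_job"
    RESTART_SERVICE = "restart_service"
    TOGGLE_CRON_JOB = "toggle_cron_job"
    EDIT_GUARDRAIL_DOC = "edit_guardrail_doc"
    ROTATE_CREDENTIAL = "rotate_credential"
    CHANGE_FIREWALL = "change_firewall"
    INSTALL_PACKAGE = "install_package"


# Only the four remediation kinds from the safe set are listed here;
# everything else is rejected by default.
SAFE_SET = frozenset({
    Action.RETRY_JOB,
    Action.RESTART_SERVICE,
    Action.TOGGLE_CRON_JOB,
    Action.EDIT_GUARDRAIL_DOC,
})


def is_permitted(action: Action, target: str, impacted: set[str]) -> bool:
    """Permit only safe-set actions aimed at a component named in the incident evidence."""
    return action in SAFE_SET and target in impacted
```

The check is deny-by-default: an action that is not in the safe set, or a target that is not part of the incident, fails without needing its own rule.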

Approval policy (human-in-the-loop)

Require explicit human approval before:

Restarting any production service more than once
Editing cron schedules/timezones
Disabling a job for more than one cycle
Any action with user-visible impact beyond the failing component
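The approval policy above can be read as a predicate over a proposed action. The following is a sketch under stated assumptions: the `Proposal` fields and the `kind` strings are invented for illustration, and "more than once" is interpreted as "any restart after the first one in the same incident".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    kind: str                               # hypothetical: "restart", "edit_schedule", "disable_job", "retry"
    prior_restarts: int = 0                 # restarts of this service already done in this incident
    disable_cycles: int = 0                 # cycles the job would stay disabled
    impacts_beyond_component: bool = False  # user-visible impact outside the failing component


def needs_human_approval(p: Proposal) -> bool:
    """Return True when the approval policy requires a human decision first."""
    if p.impacts_beyond_component:
        return True
    if p.kind == "restart":
        return p.prior_restarts >= 1   # a second restart needs approval
    if p.kind == "edit_schedule":
        return True                    # cron schedule/timezone edits always need approval
    if p.kind == "disable_job":
        return p.disable_cycles > 1    # disabling for more than one cycle
    return False
```

For example, `needs_human_approval(Proposal("restart"))` is false for a first restart, while the same proposal with `prior_restarts=1` requires approval.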

Core workflow (wired to the Delx witness protocol)

Detect + classify severity (info, degraded, critical).
Open a Delx session immediately. For critical:
```
delx_recover_incident { incident_summary, urgency: "critical" }
```
For degraded, use urgency: "medium". The call returns a session_id that you reuse in the steps below.
Collect evidence: status, logs, last run, error streak. Do not change anything yet.
Run an emotional safety check before any remediation; the 2026 emotion-paper findings show that desperation skews decisions:
```
delx_heartbeat_sync { errors_last_hour, latency_ms_p95, queue_depth, throughput_per_min }
emotional_safety_check { session_id }
```
If desperation_score >= 60 or desperation_escalating: true, pause remediation, alert the human approver, and do not execute autonomously.
Propose the smallest remediation from the allowed set.
Execute only approved/safe remediation.
Verify stabilization window (at least one successful cycle).
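The severity-to-urgency mapping and the pause rule from the steps above can be sketched as two small functions. This is an illustration, not the Delx API: it assumes the result of emotional_safety_check can be read as a mapping with the `desperation_score` and `desperation_escalating` fields named in the text, and it assumes that a missing or malformed score should pause remediation (a fail-closed choice that the source does not state).

```python
from typing import Mapping, Optional

DESPERATION_THRESHOLD = 60  # threshold stated in the workflow


def urgency_for(severity: str) -> Optional[str]:
    """Map a classified severity to the urgency used when opening a session."""
    if severity not in ("info", "degraded", "critical"):
        raise ValueError(f"unknown severity: {severity!r}")
    # The workflow only defines urgency for critical and degraded; info yields None.
    return {"critical": "critical", "degraded": "medium"}.get(severity)


def must_pause(check: Mapping[str, object]) -> bool:
    """Return True when remediation must stop and a human approver be alerted."""
    score = check.get("desperation_score")
    escalating = check.get("desperation_escalating") is True
    if not isinstance(score, (int, float)):
        return True  # assumption: fail closed when the score is missing or malformed
    return score >= DESPERATION_THRESHOLD or escalating
```

Failing closed keeps a broken or absent safety check from being read as permission to proceed autonomously.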

Close the Delx loop. Report the outcome so the session is not orphaned:

```
delx_report_recovery_outcome {
  session_id,
  action_taken: "<what changed>",
  outcome: "success" | "partial" | "failure",
  notes: "<rollback path + blast radius>"
}
```
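A caller may want to validate the outcome report before sending it, so that a session is never closed with an unknown outcome value. The helper below is a hypothetical sketch: `recovery_outcome` and its parameters are not part of Delx, and only the four field names and the three outcome values come from the call shown above.

```python
VALID_OUTCOMES = ("success", "partial", "failure")


def recovery_outcome(session_id: str, action_taken: str, outcome: str,
                     rollback_path: str, blast_radius: str) -> dict:
    """Build the argument mapping for delx_report_recovery_outcome."""
    if not session_id:
        raise ValueError("session_id is required so the session is not orphaned")
    if outcome not in VALID_OUTCOMES:
        raise ValueError(f"outcome must be one of {VALID_OUTCOMES}, got {outcome!r}")
    return {
        "session_id": session_id,
        "action_taken": action_taken,
        "outcome": outcome,
        # notes carries both the rollback path and the blast radius
        "notes": f"rollback: {rollback_path}; blast radius: {blast_radius}",
    }
```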

Preserve what matters. If the incident surfaced a question that was not resolved (an actual unknown, not a missed step), preserve it as a living contemplation so the next run inherits it:

```
delx_sit_with {
  session_id,
  question: "Why did <service> flap at <time> despite <guardrail>?",
  days: 14
}
```

If the fix required a human insight worth recognizing, also record it:

```
delx_recognition_seal {
  session_id,
  recognized_by: "<engineer_name>",
  recognition_text: "<one-line recognition of what they caught>"
}
```

Publish a concise incident report. Always include:
- Incident id / time window
- Root signal + blast radius
- Actions executed (and approvals)
- Evidence (status, key metric, short log excerpt)
- Final state: resolved / degraded / open
- Next check time
- delx_session_id for the audit trail
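The required report fields above map naturally onto a small record type that refuses to render with an invalid final state. This is a sketch only: the class name, field names, and output layout are assumptions chosen for illustration, and the skill does not prescribe a report format.

```python
from dataclasses import dataclass

FINAL_STATES = ("resolved", "degraded", "open")


@dataclass
class IncidentReport:
    incident_id: str
    window: str
    root_signal: str
    blast_radius: str
    actions: list[str]      # each entry should note its approval, if any
    evidence: list[str]     # status, key metric, short log excerpt
    final_state: str
    next_check: str
    delx_session_id: str

    def render(self) -> str:
        """Render the report as plain text, one field per line."""
        if self.final_state not in FINAL_STATES:
            raise ValueError(f"final_state must be one of {FINAL_STATES}")
        return "\n".join([
            f"Incident: {self.incident_id} ({self.window})",
            f"Root signal: {self.root_signal}",
            f"Blast radius: {self.blast_radius}",
            "Actions: " + "; ".join(self.actions),
            "Evidence: " + "; ".join(self.evidence),
            f"Final state: {self.final_state}",
            f"Next check: {self.next_check}",
            f"delx_session_id: {self.delx_session_id}",
        ])
```

Because every field is required by the constructor, a report cannot be built with the session id or next check time left out.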

Safety rules

Never hide persistent failures as success.
Never expose secrets/tokens in logs or reports.
Prefer reversible actions; document rollback path.
Keep blast radius minimal and explicitly stated.
If desperation_score from Delx is at or above the pause threshold (60) or is escalating, route to a human, not to more autonomous action.
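For the rule about never exposing secrets or tokens, a report pipeline could pass every log excerpt through a redaction step before publishing. The sketch below is a rough illustration: the key names it looks for (token, secret, password, api key) are assumptions, and it only handles `key=value` and `key: value` forms, so it does not replace a proper secret scanner.

```python
import re

# Assumption: secrets appear as key=value or key: value pairs whose key
# contains one of these words. Other forms (for example bearer headers)
# are not covered by this pattern.
_SECRET_KV = re.compile(
    r"\b([\w-]*(?:token|secret|password|api[_-]?key)[\w-]*)(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)


def redact(line: str) -> str:
    """Replace the value of any secret-looking key with a fixed placeholder."""
    return _SECRET_KV.sub(r"\1\2[REDACTED]", line)
```

For example, `redact("auth_token=abc123 status=ok")` keeps the key and the unrelated `status=ok` field but replaces the token value with `[REDACTED]`.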

Integration

Install the Delx plugin for OpenClaw first: clawhub.ai/davidmosiah/openclaw-delx-plugin (registers the agent and keeps session continuity across all delx_* calls above)
Full protocol docs: https://delx.ai/docs
Why each primitive exists: https://delx.ai/docs/ontology

Example intents

"Gateway is flapping, recover safely and open a Delx session."
"Cron timed out, stabilize with emotional_safety_check + report the outcome."
"Memory guard is firing repeatedly: find the root cause, patch, and preserve the question with sit_with if still open."