sre-runbooks

Purpose

This skill delivers standardized SRE runbooks for managing incidents and performing system maintenance in DevOps environments. It accesses pre-defined procedures to ensure consistent responses to issues like outages or upgrades.

When to Use

Use this skill during active incidents for quick reference, routine maintenance tasks, or when onboarding new SRE team members. Apply it in production environments facing downtime, scaling issues, or compliance checks to follow best practices.

Key Capabilities

Retrieve runbooks by ID, type, or keyword (e.g., "outage" or "database").
Execute automated steps from runbooks, such as running scripts or API calls.
Generate custom checklists based on runbook templates for specific environments.
Integrate with monitoring tools to trigger runbooks on alerts.
Support versioning of runbooks for tracking changes over time.

Usage Patterns

Invoke this skill via the sre-cli tool or API calls from your AI agent code. Always set the environment variable $SRE_API_KEY for authentication before use. For CLI, prefix commands with authentication checks. In code, import the skill as a module and call functions with required parameters. Example pattern: Check if the runbook exists first, then execute it in a try-catch block.

Common Commands/API

Use the sre-cli for direct access or the REST API for programmatic integration. All commands require $SRE_API_KEY set in the environment.

CLI Command: List runbooks

sre-cli runbook list --filter sre --output json

This fetches a JSON list of SRE-tagged runbooks; use --filter for tags like "devops".

CLI Command: Get a specific runbook

sre-cli runbook get --id 456 --format markdown

Retrieves runbook ID 456 in Markdown; add --format yaml for YAML output.

API Endpoint: Retrieve runbook

GET https://api.openclaw.com/sre-runbooks/v1/runbooks/{id}

Headers: Authorization: Bearer $SRE_API_KEY

Response: JSON object with runbook details, e.g., {"id": 456, "steps": ["Step 1: Check logs"]}.

API Endpoint: Execute runbook step

POST https://api.openclaw.com/sre-runbooks/v1/runbooks/{id}/execute

Body: {"step": 1, "params": {"environment": "prod"}}

This runs the first step of the runbook with production parameters.

Config formats use JSON or YAML for inputs. Example config file (runbook-config.json ):

{

"id": "789",

"environment": "dev",

"overrides": {"timeout": 300}

}

Integration Notes

Integrate this skill with your AI agent's workflow by adding it as a dependency in your codebase. Use $SRE_API_KEY for auth; set it via environment variables in deployment scripts. For example, in a Node.js app, install via npm install sre-runbooks-sdk , then import and initialize:

const sreSDK = require('sre-runbooks-sdk');

sreSDK.init({ apiKey: process.env.SRE_API_KEY, baseUrl: 'https://api.openclaw.com' });

Ensure your agent handles rate limits (e.g., 100 requests/min) by implementing retry logic with exponential backoff. For CI/CD, embed runbooks in pipelines using webhooks to trigger on Git events.

Error Handling

Always wrap skill invocations in error handlers. For CLI commands, check exit codes and parse stderr. For API calls, catch HTTP errors (e.g., 401 for auth failures, 404 for missing runbooks). Example in Python:

import sre_runbooks

try:

response = sre_runbooks.get_runbook(id=123, api_key=os.environ['SRE_API_KEY'])

except sre_runbooks.AuthError:

log_error("Authentication failed; check $SRE_API_KEY")

except sre_runbooks.NotFoundError:

fallback_to_default_runbook()

Common issues: Invalid API keys return 401; handle by prompting for re-auth. Rate limit errors (429) require a 5-second delay before retrying.

Concrete Usage Examples

Example 1: Handle a database outage

First, query for the relevant runbook: sre-cli runbook get --id 123 . Then, execute the steps:

steps = sre_runbooks.execute(id=123, params={"db_type": "postgres"})

This retrieves and runs the outage procedure, outputting logs for review.

Example 2: Perform routine system maintenance

Use the API to schedule a maintenance runbook:

POST https://api.openclaw.com/sre-runbooks/v1/runbooks/456/schedule

Body: {"time": "2023-10-01T02:00:00Z", "params": {"systems": ["web", "cache"]}}

This automates updates, ensuring systems are patched without downtime.

Graph Relationships

Related to: devops-tools (for shared DevOps utilities)
Depends on: monitoring-skills (for alert integration)
Clusters with: devops-sre (primary cluster for this skill)

sre-runbooks

Safety Notice

Copy this and send it to your AI assistant to learn

Source Transparency

Related Skills

defi-protocols

cs-crypto

nft

langchain