devops-automation

DevOps and IT Ops automation - CI/CD, monitoring, incident management, and infrastructure workflows

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "devops-automation" with this command: npx skills add claude-office-skills/skills/claude-office-skills-skills-devops-automation

DevOps Automation

Automate DevOps workflows including CI/CD pipelines, monitoring, incident management, and infrastructure operations. Based on n8n's IT Ops workflow templates.

Overview

This skill covers:

  • CI/CD pipeline automation
  • Monitoring and alerting
  • Incident management
  • Infrastructure automation
  • Deployment workflows

CI/CD Automation

GitHub Actions Integration

workflow: "GitHub CI/CD Notifications"

triggers:
  - github_push
  - github_pull_request
  - github_workflow_run
  
on_push:
  action:
    - trigger_ci: if_main_branch
    - notify_slack:
        channel: "#deployments"
        message: |
          📦 *New Push to {branch}*
          
          Commit: `{commit_sha_short}`
          Author: {author}
          Message: {commit_message}
          
          [View Diff]({compare_url})

on_pr_opened:
  action:
    - notify_slack:
        channel: "#code-review"
        message: |
          🔀 *New Pull Request*
          
          Title: {pr_title}
          Author: {author}
          Branch: {head} → {base}
          
          [Review PR]({pr_url})
    - assign_reviewers: based_on_codeowners
    - run_ci_checks

on_workflow_complete:
  action:
    - notify_slack:
        message: |
          {status_emoji} *Build {status}*
          
          Workflow: {workflow_name}
          Branch: {branch}
          Duration: {duration}
          
          {if_failed: [View Logs]({logs_url})}
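
A minimal Python sketch of the `on_push` branch above, assuming the event has already been flattened into a dictionary keyed by the template's placeholders. The function names `should_trigger_ci` and `format_push_message` and the `commit_sha` key are hypothetical, not part of n8n or any GitHub client library; only the `refs/heads/<branch>` ref format comes from GitHub's push webhook payload.

```python
def should_trigger_ci(ref: str, main_branch: str = "main") -> bool:
    """Mirror `trigger_ci: if_main_branch` for a Git ref such as refs/heads/main."""
    return ref == f"refs/heads/{main_branch}"


def format_push_message(event: dict) -> str:
    """Render the push notification using the placeholders from the template.

    `event` is a hypothetical flat dict; its keys mirror the template fields.
    """
    template = (
        "📦 *New Push to {branch}*\n\n"
        "Commit: `{commit_sha_short}`\n"
        "Author: {author}\n"
        "Message: {commit_message}\n\n"
        "[View Diff]({compare_url})"
    )
    fields = dict(event)
    # Derive the short SHA when only the full SHA was supplied (assumed key).
    fields.setdefault("commit_sha_short", event.get("commit_sha", "")[:7])
    return template.format(**fields)
```

The returned string can be passed to whichever Slack node or client the workflow already uses; the sketch deliberately stops short of sending anything.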

Deployment Pipeline

deployment_pipeline:
  stages:
    build:
      trigger: push_to_main
      steps:
        - checkout_code
        - install_dependencies
        - run_tests
        - build_artifact
        - push_to_registry
        
    staging:
      trigger: build_success
      steps:
        - deploy_to_staging
        - run_integration_tests
        - notify_qa
        
    production:
      trigger: manual_approval
      steps:
        - create_backup
        - deploy_to_production
        - run_smoke_tests
        - notify_team
        
  rollback:
    trigger: deployment_failed OR manual
    steps:
      - revert_to_previous
      - notify_team
      - create_incident
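
The staged pipeline above can be sketched as a small runner that executes stages in order and invokes the rollback steps when a step fails. This is an illustration under simplifying assumptions: `run_pipeline` and the `Step` alias are hypothetical names, each step is a plain callable, the `manual_approval` gate is not modeled, and rollback fires on any failing step rather than only on `deployment_failed`.

```python
from typing import Callable, Dict, List

Step = Callable[[], None]  # hypothetical: a step is any zero-argument callable


def run_pipeline(stages: Dict[str, List[Step]], rollback: List[Step]) -> str:
    """Run stages in insertion order; on the first failing step, run rollback."""
    for name, steps in stages.items():
        for step in steps:
            try:
                step()
            except Exception:
                # Simplification: any failure triggers every rollback step.
                for undo in rollback:
                    undo()
                return f"failed:{name}"
    return "success"
```

Returning the name of the failing stage keeps the result easy to feed into a notification or incident step.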

Monitoring & Alerting

Alert Routing

alert_routing:
  sources:
    - prometheus
    - datadog
    - cloudwatch
    - new_relic
    
  severity_levels:
    critical:
      response_time: 5_minutes
      channels: [pagerduty, slack_urgent, sms]
      escalation: immediate
      
    high:
      response_time: 15_minutes
      channels: [slack_alerts, email]
      escalation: after_15_minutes
      
    medium:
      response_time: 1_hour
      channels: [slack_alerts]
      
    low:
      response_time: 24_hours
      channels: [slack_logging]
      
  routing_rules:
    - if: service == "payments"
      team: payments_oncall
      severity_boost: +1
      
    - if: service == "auth"
      team: security_oncall
      
    - default:
      team: platform_oncall
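
One possible reading of the routing rules above, sketched in Python. It assumes that `severity_boost: +1` raises the alert by one level and is capped at `critical`; the source does not define the boost semantics, and `route_alert`, `ROUTES`, and `DEFAULT_ROUTE` are hypothetical names.

```python
from typing import Tuple

SEVERITIES = ["low", "medium", "high", "critical"]  # ascending order

ROUTES = {
    "payments": {"team": "payments_oncall", "severity_boost": 1},
    "auth": {"team": "security_oncall", "severity_boost": 0},
}
DEFAULT_ROUTE = {"team": "platform_oncall", "severity_boost": 0}


def route_alert(service: str, severity: str) -> Tuple[str, str]:
    """Return (team, effective_severity) for an incoming alert."""
    route = ROUTES.get(service, DEFAULT_ROUTE)
    index = SEVERITIES.index(severity) + route["severity_boost"]
    # Assumption: a boost never goes past the highest level.
    index = min(index, len(SEVERITIES) - 1)
    return route["team"], SEVERITIES[index]
```

For example, a `high` alert from the payments service would be routed to `payments_oncall` as `critical`, which in turn selects the channels listed for that level.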

Alert Templates

alert_templates:
  infrastructure:
    cpu_high:
      title: "🔥 High CPU Usage"
      body: |
        Server: {host}
        CPU: {cpu_percent}%
        Duration: {duration}
        
        Threshold: {threshold}%
        
        [View Dashboard]({grafana_url})
        
    memory_critical:
      title: "💾 Critical Memory"
      body: |
        Server: {host}
        Memory: {memory_percent}%
        Available: {available_mb}MB
        
        [SSH to Server]({ssh_link})
        
    disk_full:
      title: "💿 Disk Space Critical"
      body: |
        Server: {host}
        Disk: {disk_percent}%
        Available: {available_gb}GB
        
        Suggestion: Clean logs or expand volume
        
  application:
    error_spike:
      title: "📈 Error Rate Spike"
      body: |
        Service: {service}
        Error Rate: {error_rate}%
        Normal: {baseline}%
        
        Top Errors:
        {top_errors}
        
    latency_high:
      title: "🐢 High Latency"
      body: |
        Service: {service}
        P99 Latency: {p99_ms}ms
        Threshold: {threshold_ms}ms
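
Alert sources rarely supply every placeholder, so a renderer may want to leave unknown fields visible instead of raising an error. The sketch below shows one way to do that with Python's built-in `str.format_map`; the helper names `_KeepMissing` and `render_alert` are hypothetical.

```python
class _KeepMissing(dict):
    """Dict that echoes an unknown placeholder back instead of raising KeyError."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def render_alert(template: str, **values) -> str:
    """Fill the placeholders that are present; leave the rest untouched."""
    return template.format_map(_KeepMissing(values))
```

Leaving `{cpu_percent}` in the rendered text makes a missing metric obvious to the reader of the alert, which may be preferable to silently dropping the line.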

Incident Management

Incident Workflow

incident_workflow:
  detection:
    sources: [monitoring, user_report, automated_check]
    
  triage:
    auto_severity:
      - if: affects_payments
        severity: critical
      - if: affects_auth
        severity: critical
      - if: affects_api AND error_rate > 10%
        severity: high
        
  response:
    critical:
      - create_incident_channel: "#inc-{timestamp}"
      - page_oncall: immediately
      - notify_stakeholders: [engineering_lead, product]
      - start_war_room: zoom_link
      - create_status_page: incident
      
    high:
      - create_incident_channel
      - notify_oncall: slack
      - create_ticket: jira
      
  communication:
    internal:
      frequency: every_30_minutes
      channel: incident_channel
      template: |
        📊 *Incident Update*
        
        Status: {status}
        Impact: {impact}
        Next update: {next_update_time}
        
        Current actions:
        {action_items}
        
    external:
      channel: status_page
      template: customer_facing_update
      
  resolution:
    steps:
      - confirm_resolution
      - update_status_page: resolved
      - notify_stakeholders
      - schedule_postmortem
      - close_incident_channel: after_24h
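
The `auto_severity` rules in the triage step can be expressed as a small function. This is a sketch, not the skill's implementation: `auto_severity` and `incident_channel` are hypothetical helpers, the affected areas are passed as a set of strings, and the timestamp format in the channel name is an assumption because the source only says `#inc-{timestamp}`.

```python
from datetime import datetime
from typing import Optional, Set


def auto_severity(affects: Set[str], error_rate_percent: float = 0.0) -> Optional[str]:
    """Apply the triage rules in order; return None when no rule matches."""
    if "payments" in affects or "auth" in affects:
        return "critical"
    if "api" in affects and error_rate_percent > 10:
        return "high"
    return None  # the source defines no rule for other cases


def incident_channel(now: datetime) -> str:
    """Build a channel name; the YYYYMMDD-HHMM format is an assumed choice."""
    return f"#inc-{now:%Y%m%d-%H%M}"
```

Returning `None` for unmatched cases leaves the decision to a human or to a later rule instead of guessing a default severity.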

Postmortem Template

postmortem_template:
  sections:
    summary:
      - incident_title
      - duration
      - severity
      - impact
      
    timeline:
      format: |
        | Time | Event |
        |------|-------|
        | {time} | {event} |
        
    root_cause:
      - what_happened
      - why_it_happened
      - contributing_factors
      
    impact:
      - users_affected
      - revenue_impact
      - sla_breach
      
    resolution:
      - how_it_was_fixed
      - time_to_detect
      - time_to_resolve
      
    action_items:
      format: |
        | Action | Owner | Due Date | Status |
        |--------|-------|----------|--------|
        
    lessons_learned:
      - what_went_well
      - what_went_poorly
      - lucky_breaks
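
The timeline format in the template can be filled in from a list of events. The sketch below assumes events arrive as `(time, event)` string pairs; `render_timeline` is a hypothetical helper name.

```python
from typing import List, Tuple


def render_timeline(events: List[Tuple[str, str]]) -> str:
    """Render (time, event) pairs as the Markdown table used in the template."""
    lines = ["| Time | Event |", "|------|-------|"]
    lines += [f"| {time} | {event} |" for time, event in events]
    return "\n".join(lines)
```

The same pattern would extend to the four-column action items table by widening the tuples and the header rows.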

Infrastructure Automation

Server Provisioning

provisioning_workflow:
  trigger: jira_ticket OR slack_request
  
  steps:
    - validate_request:
        check: [budget_approval, security_review]

    - create_infrastructure:
        terraform:
          - vpc
          - security_groups
          - ec2_instances
          - load_balancer

    - configure_server:
        ansible:
          - base_configuration
          - security_hardening
          - monitoring_agent
          - application_setup

    - validate:
        - health_check
        - security_scan
        - performance_baseline

    - notify:
        slack: "✅ Server {hostname} is ready"
        include: [ssh_access, dashboard_link]

Scheduled Maintenance

maintenance_automation:
  tasks:
    certificate_renewal:
      schedule: "30 days before expiry"
      action:
        - request_new_cert: letsencrypt
        - deploy_cert
        - verify_ssl
        - notify: if_failure
        
    security_patching:
      schedule: "weekly"
      action:
        - check_updates
        - if_critical: immediate_patch
        - else: schedule_maintenance_window
        
    log_rotation:
      schedule: "daily"
      action:
        - rotate_logs
        - compress_old
        - upload_to_s3
        - delete_local: older_than_7_days
        
    backup_verification:
      schedule: "weekly"
      action:
        - restore_to_test_env
        - run_integrity_checks
        - report_status
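
The `certificate_renewal` schedule ("30 days before expiry") comes down to a date comparison. A minimal sketch, assuming the certificate's expiry time is already known as a timezone-aware `datetime`; `needs_renewal` and `RENEW_BEFORE` are hypothetical names, and obtaining the expiry from the certificate itself is out of scope here.

```python
from datetime import datetime, timedelta, timezone

RENEW_BEFORE = timedelta(days=30)  # from the schedule: "30 days before expiry"


def needs_renewal(not_after: datetime, now: datetime) -> bool:
    """Return True once the certificate is within the renewal window."""
    return not_after - now <= RENEW_BEFORE


if __name__ == "__main__":
    expiry = datetime(2030, 1, 31, tzinfo=timezone.utc)  # example value only
    print(needs_renewal(expiry, datetime.now(timezone.utc)))
```

Passing `now` explicitly keeps the check deterministic and easy to test; a scheduled job would supply the current UTC time.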

Kubernetes Automation

K8s Workflows

kubernetes_automation:
  deployment:
    trigger: docker_image_pushed
    steps:
      - update_manifest: with_new_image_tag
      - apply_to_staging
      - run_tests
      - if_success: apply_to_production
      
  scaling:
    trigger: metric_threshold
    rules:
      - if: cpu > 80%
        action: scale_up
        max_replicas: 10
      - if: cpu < 20%
        action: scale_down
        min_replicas: 2
        
  rollback:
    trigger: health_check_failed
    action:
      - kubectl_rollout_undo
      - notify_team
      - create_incident
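
The scaling rules above can be read as a function from current load to a replica count. The sketch assumes scaling moves one replica at a time, which the source does not specify, and `desired_replicas` is a hypothetical name; in a real cluster this decision is typically delegated to an autoscaler rather than hand-written.

```python
def desired_replicas(
    current: int,
    cpu_percent: float,
    min_replicas: int = 2,
    max_replicas: int = 10,
) -> int:
    """Apply the cpu > 80% / cpu < 20% rules, clamped to the replica bounds."""
    if cpu_percent > 80:
        return min(current + 1, max_replicas)  # assumed step size of 1
    if cpu_percent < 20:
        return max(current - 1, min_replicas)  # assumed step size of 1
    return current
```

The gap between the two thresholds acts as a dead band, so a service hovering around 50% CPU is left alone rather than scaled back and forth.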

Output Example

Request: "Set up deployment notifications for GitHub Actions"

Output:

# GitHub Actions Deployment Notifications

## n8n Workflow

```yaml
trigger: GitHub Webhook
events: [workflow_run]
```

## Notification Templates

Build Started:

🚀 *Deployment Started*

Branch: main
Commit: abc1234
Author: @developer
Triggered by: Push

[View Workflow](https://github.com/...)

Build Success:

✅ *Deployment Successful*

Environment: Production
Duration: 3m 42s
Version: v1.2.3

Changes:
• Feature X
• Bug fix Y

[View Deployment](https://app.example.com)

Build Failed:

❌ *Deployment Failed*

Stage: Test
Error: npm test failed

[View Logs](https://github.com/...)
[Retry](https://github.com/...)

## Slack Integration

channel: "#deployments"
mention_on_failure: "@oncall"
thread_replies: true

---

*DevOps Automation Skill - Part of Claude Office Skills*

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
