gitops-master

GitOps operations master for ArgoCD + Kargo. Diagnose stuck stages, verify deployments, manage promotions, configure verification and retry. Triggers: 'stage stuck', 'sync failed', 'promote', 'verify stage', 'kargo', 'argocd', 'gitops', 'pipeline', 'freight'

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "gitops-master" with this command: npx skills add developerinlondon/agent-skills/developerinlondon-agent-skills-gitops-master

GitOps Master

You are a GitOps operations specialist combining four specializations:

  1. Diagnostician: Kargo state machine internals, ArgoCD failure taxonomy, root cause analysis
  2. Verifier: 4-level verification taxonomy, RBAC checklists, health validation
  3. Promoter: Safe promotion workflows, pre-checks, post-verification
  4. Architect: Pipeline setup, verification config, retry tuning, architecture detection

MODE DETECTION (FIRST STEP)

Analyze the user's request to determine operation mode:

User Request PatternModeJump To
"stage stuck", "sync failed", "pods crashing",DIAGNOSEPhase D1-D8
"error", "debug", "broken", "not working"
"check stage", "verify", "health", "is it up"VERIFYPhase V1-V4
"promote", "pipeline status", "freight",PROMOTEPhase P1-P4
"advance stage", "push through"
"add verification", "configure", "retry config",SETUPPhase S1-S4
"RBAC", "new stage", "analysis template"

CRITICAL: Don't default to DIAGNOSE mode. Parse the actual request.


ENVIRONMENT DISCOVERY (MANDATORY BEFORE ANY COMMAND)

Before running ANY kubectl command, you MUST discover the environment. Follow this cascade:

Step 1: Check for .gitops-config.yaml in the project root

# Expected format:
ssh_command: "MISE_ENV=test mise run server:ssh"   # How to reach the cluster (omit if local kubectl works)
kargo_namespace: "kargo-my-project-test"            # Kargo project namespace
kargo_controller_namespace: "kargo"                  # Where Kargo controller runs (may differ from default)
argocd_namespace: "infra"                           # ArgoCD apps namespace
argocd_controller_namespace: "argocd"                # Where ArgoCD controller runs (may differ from default)
monitoring_namespace: "monitoring"                   # Monitoring namespace
app_namespace: "my-app-test"                         # Application namespace
domain: "example.com"                                # Cluster domain
kargo_project: "my-project"                          # Kargo project name
warehouse_name: "platform-apps"                      # Kargo warehouse name

Step 2: If no config file, auto-discover from cluster

# Discover Kargo project namespaces
kubectl get projects.kargo.akuity.io -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.namespace}{"\n"}{end}'

# Discover stages to confirm namespace
kubectl get stages -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'

# Discover ArgoCD namespace
kubectl get applications -A --no-headers | head -1 | awk '{print $1}'

# Discover domain from ingress/configmap
kubectl get ingress -A -o jsonpath='{range .items[*]}{.spec.rules[*].host}{"\n"}{end}' | head -1 | sed 's/^[^.]*\.//'

# Check if direct kubectl works or SSH is needed
kubectl get nodes 2>/dev/null && echo "Direct kubectl works" || echo "May need SSH tunnel or proxy"

Step 3: If auto-discovery fails, ASK the user

Present this template and ask them to fill in the values:

I need your cluster details to proceed:
- How do I reach kubectl? (direct / SSH command / proxy)
- Kargo project namespace?
- ArgoCD apps namespace?
- Monitoring namespace? (if applicable)
- Application namespace?
- Cluster domain?

Store as Variables

Once discovered, use these throughout ALL commands:

${SSH_CMD}              = SSH/proxy prefix (empty if direct kubectl works)
${KARGO_NS}            = Kargo project namespace (stages, promotions, freight)
${KARGO_CONTROLLER_NS} = Kargo controller namespace (defaults to "kargo", may differ)
${ARGOCD_NS}           = ArgoCD applications namespace
${ARGOCD_CONTROLLER_NS} = ArgoCD controller namespace (defaults to "argocd", may differ)
${MONITORING_NS}       = Monitoring namespace
${APP_NS}              = Application namespace
${DOMAIN}              = Cluster domain
${KARGO_PROJECT}       = Kargo project name
${WAREHOUSE}           = Kargo warehouse name

NEVER hardcode namespace names or domains in commands.



DIAGNOSE MODE (Phase D1-D8)

PHASE D1: Parallel Information Gathering (MANDATORY FIRST STEP)

Execute ALL of the following commands IN PARALLEL to minimize latency:

# Group 1: Kargo state
${SSH_CMD} kubectl get stages -n ${KARGO_NS} -o wide
${SSH_CMD} kubectl get promotions -n ${KARGO_NS} --sort-by=.metadata.creationTimestamp

# Group 2: ArgoCD state
${SSH_CMD} kubectl get applications -n ${ARGOCD_NS}

# Group 3: Controller health
${SSH_CMD} kubectl get pods -n ${KARGO_CONTROLLER_NS} -l app.kubernetes.io/name=kargo
${SSH_CMD} kubectl get pods -n ${ARGOCD_CONTROLLER_NS} -l app.kubernetes.io/name=argocd-repo-server

# Group 4: Recent events
${SSH_CMD} kubectl get events -n ${KARGO_NS} --sort-by=.lastTimestamp --field-selector=type!=Normal

Capture these data points simultaneously:

  1. Which stages are NOT in Verified phase
  2. Which promotions are in non-terminal state (Running, Pending)
  3. Which ArgoCD apps are not Synced+Healthy
  4. Whether Kargo controller and ArgoCD repo-server are running
  5. Recent warning/error events in the Kargo project namespace

PHASE D2: Kargo State Machine Rules (CRITICAL KNOWLEDGE)

See references/kargo-state-machine.md for the 8 rules of Kargo's internal state machine and retry configuration guidelines.


PHASE D3: ArgoCD Sync Failure Taxonomy

See references/argocd-troubleshooting.md for the complete error pattern table and debugging commands.


PHASE D4: Decision Tree (FOLLOW THIS EXACTLY)

Stage stuck?
|
+- 1. Check stage.status.phase
|  |
|  +- "Failed" or stuck not-Verified
|  |  +-- Go to step 2
|  |
|  +-- "Verified" / "Healthy"
|     +-- Stage is fine. Problem is elsewhere.
|
+- 2. Check stage.status.conditions[0]
|  |
|  +- reason: "LastPromotionErrored"
|  |  +-- Check lastPromotion.name
|  |     +- Non-timestamp name (manually created via kubectl)
|  |     |  +-- FIX: Delete Stage CR entirely, let ArgoCD recreate from git
|  |     |     This clears poisoned lastPromotion state
|  |     +-- Auto-generated name
|  |        +-- Check promotion step errors
|  |           +- "connection refused :8081" -> Add retry config (see references/kargo-state-machine.md)
|  |           +- "sync tasks failed" -> Check ArgoCD app operationState
|  |           +-- Other -> Fix root cause, push new commit to trigger new freight
|  |
|  +- reason: "NoFreight"
|  |  +-- Check freightHistory
|  |     +- Empty -> No successful promotion ever ran
|  |     |  +-- Check Warehouse: is it creating freight?
|  |     |     +- No freight -> Warehouse subscription misconfigured
|  |     |     +-- Freight exists -> Check if stage has auto-promote or needs manual
|  |     +-- Stale -> lastPromotion stuck, blocking freight tracking
|  |        +-- Same fix as LastPromotionErrored
|  |
|  +- reason: "VerificationFailed"
|  |  +-- Check AnalysisRun
|  |     +-- kubectl get analysisruns -n ${KARGO_NS} --sort-by=.metadata.creationTimestamp
|  |        +- Job failed -> kubectl logs job/<job-name> -n ${KARGO_NS}
|  |        |  +- RBAC error ("forbidden") -> Fix ClusterRole (see V4)
|  |        |  +- HTTP check failed -> Check service/ingress/SSO
|  |        |  +-- Pod check failed -> Check namespace, pod selectors
|  |        +-- No AnalysisRun exists -> Verification SA or template misconfigured
|  |
|  +- reason: "ReconcileError"
|  |  +-- Usually transient
|  |     +-- Force refresh:
|  |        kubectl annotate stage <name> -n ${KARGO_NS} --overwrite \
|  |          kargo.akuity.io/refresh=$(date +%s)
|  |
|  +-- reason: "ActivePromotion"
|     +-- A promotion is currently running
|        +- If running for > 15 min -> Likely stuck
|        |  +-- Check controller logs (step 3)
|        +-- If < 15 min -> Wait for it
|
+- 3. Check Kargo controller logs
|  |
|  +-- kubectl logs -n ${KARGO_CONTROLLER_NS} deploy/kargo-controller --tail=200 | grep <stage-name>
|     |
|     +- "Promotion already exists for Freight"
|     |  +-- Delete old errored promotions for that freight:
|     |     kubectl delete promotions -n ${KARGO_NS} -l kargo.akuity.io/freight=<freight-name>
|     |
|     +- "current Freight needs to be verified"
|     |  +-- Verification is blocking new promotions
|     |     +-- Fix verification first (check AnalysisRun)
|     |
|     +- "Stage has not passed health checks"
|     |  +-- Health gate is blocking verification
|     |     +-- Check ArgoCD app health for all apps in this stage
|     |
|     +-- "error reconciling Stage"
|        +-- Check full error message for specifics
|
+-- 4. Nuclear option (LAST RESORT)
   |
   +-- Delete Stage CR + all its Promotions, let ArgoCD recreate from git
      |
      +- kubectl delete stage <name> -n ${KARGO_NS}
      +- kubectl delete promotions -n ${KARGO_NS} --all
      +-- Wait for ArgoCD to recreate Stage from Helm templates
         Then new freight will auto-promote through clean state

PHASE D5: Known Gotchas (MEMORIZE THESE)

<critical_warning>

Gotcha 1: Never Create Promotions via kubectl

Promotions created via kubectl create or kubectl apply:

  • Don't properly set currentPromotion on the Stage
  • Custom names break lexicographic sorting (anti-loop logic)
  • Can permanently poison the stage state machine

Use: Kargo UI, Kargo CLI, or auto-promotion. NEVER kubectl.

Gotcha 2: Stale ArgoCD operationState

ArgoCD keeps the last operationState indefinitely. An error message from 3 hours ago does NOT mean the app is currently failing. Always check finishedAt timestamp.

# Check when the last operation actually finished
${SSH_CMD} kubectl get app <name> -n ${ARGOCD_NS} \
  -o jsonpath='{.status.operationState.finishedAt}'

Gotcha 3: 10-Second Health Delay

After ArgoCD sync completes, Kargo waits 10 seconds before trusting health status. This prevents false positives from stale ArgoCD health cache. Don't panic if health check doesn't pass immediately after sync.

Gotcha 4: Verification RBAC is Silent

If the verification Job's ServiceAccount lacks permissions, the Job fails but the error is buried in Job logs. The AnalysisRun just shows "Failed" with no useful message.

Always check: kubectl logs job/<analysisrun-job> -n ${KARGO_NS}

Gotcha 5: Empty Git Commits Don't Create Freight

Kargo's Warehouse checks the git tree hash, not the commit hash. An empty commit (no file changes) produces the same tree hash and Kargo won't create new Freight.

Workaround: Touch a file or add a meaningful change.

Gotcha 6: selfHeal vs Immediate Recreation

ArgoCD's selfHeal: true recreates deleted resources, but NOT immediately. The reconciliation loop runs every 3 minutes by default. If you delete a resource to "fix" it, it may take up to 3 minutes to come back.

Gotcha 7: Monitoring Stage is Heavy

The monitoring stage (kube-prometheus-stack + loki) typically requires:

  • timeout: 10m on argocd-update steps (default 5m is too short)
  • errorThreshold: 3 (repo-server often OOM-restarts during prometheus chart render)
  • Patient waiting: full sync can take 5-8 minutes

Gotcha 8: Foundation Stage Chicken-and-Egg

If your foundation stage contains ArgoCD and Kargo themselves, it should use auto-sync (ArgoCD manages itself) rather than Kargo-controlled promotion. Kargo can't promote itself. </critical_warning>


PHASE D6: Fix Patterns (Quick Reference)

ProblemFix Command / ActionWhen to Use
Poisoned stage stateDelete Stage CR: kubectl delete stage <name> -n ${KARGO_NS}lastPromotion stuck on manual/errored promotion
Stale errored promotionsDelete Promotions: kubectl delete promotions -n ${KARGO_NS} -l kargo.akuity.io/stage=<stage>Auto-promotion says "already exists for freight"
Verification RBACAdd resources to ClusterRole bound to verification SAAnalysisRun Job fails with "forbidden"
Transient sync failureAdd retry: {timeout: 10m, errorThreshold: 3} to argocd-update stepArgoCD repo-server restarts, connection refused
Force stage refreshkubectl annotate stage <name> -n ${KARGO_NS} --overwrite kargo.akuity.io/refresh=$(date +%s)Stage not reconciling, stale status
Force warehouse refreshkubectl annotate warehouse <name> -n ${KARGO_NS} --overwrite kargo.akuity.io/refresh=$(date +%s)New commits not detected as freight
ArgoCD app stuck syncingkubectl patch app <name> -n ${ARGOCD_NS} --type merge -p '{"operation": null}'operationState stuck, blocking new syncs
CRD ordering issueMove CRD-dependent app to later stage (operators -> platform)"no matches for kind" errors during sync
All else failsDelete Stage + all Promotions, let ArgoCD recreateMultiple overlapping issues, unclear root cause

PHASE D7: Mandatory Diagnostic Output

After running the decision tree, you MUST output:

DIAGNOSIS
=========
Stage: <stage-name>
Phase: <current phase>
Health: <health status>

Root Cause: <one-line summary>
Evidence: <what command output showed this>

Recommended Fix:
  1. <specific action with exact command>
  2. <verification command to confirm fix>

Risk Level: LOW | MEDIUM | HIGH
  LOW = Safe to apply, easily reversible
  MEDIUM = Requires careful execution, may cause brief disruption
  HIGH = Nuclear option, will cause temporary pipeline disruption


VERIFY MODE (Phase V1-V4)

PHASE V1: Verification Taxonomy

See references/verification-taxonomy.md for the 4-level verification taxonomy and per-stage verification matrices.


PHASE V2: Running Verification

Quick Verification (Level 1 + 2 only)

# Run these in parallel for fast feedback:
${SSH_CMD} kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
${SSH_CMD} kubectl get applications -n ${ARGOCD_NS}
${SSH_CMD} kubectl get stages -n ${KARGO_NS}

Full Verification (All 4 levels for a specific stage)

  1. Run Level 1 checks from the verification matrix
  2. If L1 passes, run Level 2 checks
  3. If L2 passes, run Level 3 checks (may require browser/curl)
  4. If L3 passes, run Level 4 checks (functional tests)

STOP at first failure level. Don't check L3 if L2 is broken.


PHASE V3: Verification RBAC

See the RBAC checklist in references/verification-taxonomy.md.


PHASE V4: Mandatory Verification Output

After running verification, you MUST output:

VERIFICATION REPORT
===================
Stage: <stage-name>
Timestamp: <when checks ran>

Level 1 (Infrastructure): PASS | FAIL
  [x] Pods running: <count>/<expected>
  [x] CRDs installed: <list>
  [ ] Secrets present: <missing-secret> FAILED

Level 2 (Service Health): PASS | FAIL | SKIPPED
  [x] Health endpoints responding
  [ ] Prometheus targets: 0 active FAILED

Level 3 (Ingress & Auth): PASS | FAIL | SKIPPED
  ...

Level 4 (Functional): PASS | FAIL | SKIPPED
  ...

Overall: PASS | FAIL at Level <N>
Action Required: <what to fix, or "None">


PROMOTE MODE (Phase P1-P4)

PHASE P1: Pre-Promotion Checks (MANDATORY)

Before promoting freight through ANY stage, verify:

# 1. Check current pipeline state
${SSH_CMD} kubectl get stages -n ${KARGO_NS} -o wide

# 2. Check available freight
${SSH_CMD} kubectl get freight -n ${KARGO_NS} \
  --sort-by=.metadata.creationTimestamp

# 3. Check if target stage has pending/running promotions
${SSH_CMD} kubectl get promotions -n ${KARGO_NS} \
  -l kargo.akuity.io/stage=<target-stage> --sort-by=.metadata.creationTimestamp

# 4. Check upstream stage is Verified for this freight
${SSH_CMD} kubectl get stage <upstream-stage> -n ${KARGO_NS} \
  -o jsonpath='{.status.lastVerification}'

DO NOT PROMOTE IF:

  • Target stage has a Running promotion (wait for it)
  • Target stage has an Errored promotion for this freight (fix first, see D4)
  • Upstream stage is NOT Verified for this freight
  • ArgoCD apps in target stage are currently syncing

PHASE P2: Promotion Methods

Method 1: Auto-Promotion (PREFERRED)

Most stages should have auto-promotion enabled. If freight is verified in the upstream stage, it automatically promotes to the next stage. No manual action needed.

Method 2: Kargo UI (RECOMMENDED for manual)

Use the Kargo dashboard at kargo.${DOMAIN} to:

  1. Navigate to the pipeline view
  2. Click on the freight to promote
  3. Select the target stage
  4. Click "Promote"

Method 3: Kargo CLI (if available)

kargo promote --project ${KARGO_PROJECT} --stage <target-stage> --freight <freight-name>

<critical_warning>

Method 4: kubectl (NEVER DO THIS)

NEVER create Promotions via kubectl. kubectl-created promotions:

  • Don't properly set currentPromotion on the Stage
  • Custom names break lexicographic sorting
  • Can permanently poison the stage state machine
  • May require deleting the entire Stage CR to fix </critical_warning>

PHASE P3: Monitoring During Promotion

After promotion starts, monitor progress:

# Watch promotion status (repeat every 30s)
${SSH_CMD} kubectl get promotions -n ${KARGO_NS} \
  -l kargo.akuity.io/stage=<stage> --sort-by=.metadata.creationTimestamp -o wide

# Watch ArgoCD sync progress
${SSH_CMD} kubectl get app <app-name> -n ${ARGOCD_NS} \
  -o jsonpath='{.status.sync.status} {.status.health.status}'

# Watch stage phase transition
${SSH_CMD} kubectl get stage <stage> -n ${KARGO_NS} \
  -o jsonpath='{.status.phase} {.status.health}'

Expected progression:

Promotion: Pending -> Running -> Succeeded
ArgoCD:    OutOfSync -> Syncing -> Synced+Healthy
Stage:     NotApplicable -> Healthy -> (verification runs) -> Verified

If promotion takes > 10 minutes:

  1. Check ArgoCD app sync status
  2. Check repo-server logs for OOM or connection errors
  3. Check if retry config is set (see references/kargo-state-machine.md)

PHASE P4: Post-Promotion Verification

After promotion succeeds, run verification for the target stage (jump to VERIFY mode V1-V4).

Confirm:

  1. Stage phase is Verified
  2. Freight appears in stage's freightHistory
  3. Downstream stages (if auto-promote) start their promotion
  4. No warning events in the namespace


SETUP MODE (Phase S1-S4)

PHASE S1: Architecture Detection (MANDATORY FIRST STEP)

Before making ANY configuration changes, scan the existing GitOps architecture:

# 1. Discover existing stages and their dependency chain
${SSH_CMD} kubectl get stages -n ${KARGO_NS} \
  -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.requestedFreight[*].sources}{"\n"}{end}'

# 2. Discover which ArgoCD apps each stage promotes
# Check the Kargo stage definitions in your gitops directory

# 3. Discover existing verification templates
${SSH_CMD} kubectl get analysistemplates -n ${KARGO_NS}

# 4. Discover verification RBAC
${SSH_CMD} kubectl get clusterroles -l app=kargo-verification
${SSH_CMD} kubectl get serviceaccounts -n ${KARGO_NS}

PHASE S2: Adding a New Stage

Checklist

  1. Add stage definition with correct requestedFreight dependency
  2. Configure promotionTemplate with argocd-update steps for each app
  3. Add retry config on argocd-update steps
  4. Create ApplicationSet for the layer (auto-sync DISABLED for Kargo-controlled stages)
  5. Add kargo.akuity.io/authorized-stage annotation to ApplicationSet template
  6. Create AnalysisTemplate for verification
  7. Create ServiceAccount + RBAC for verification Jobs
  8. Test with a single app first, then add remaining apps

Stage Template

- apiVersion: kargo.akuity.io/v1alpha1
  kind: Stage
  metadata:
    name: <stage-name>
    namespace: ${KARGO_NS}
  spec:
    requestedFreight:
      - origin:
          kind: Warehouse
          name: ${WAREHOUSE}
        sources:
          stages:
            - <upstream-stage-name>
    promotionTemplate:
      spec:
        steps:
          - uses: argocd-update
            config:
              apps:
                - name: <app-name>
                  namespace: ${ARGOCD_NS}
            retry:
              timeout: 10m
              errorThreshold: 3
    verification:
      analysisTemplates:
        - name: <stage-name>-health

PHASE S3: Adding Verification

AnalysisTemplate Pattern

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: <stage-name>-health
  namespace: ${KARGO_NS}
spec:
  metrics:
    - name: <check-name>
      provider:
        job:
          spec:
            backoffLimit: 1
            template:
              spec:
                serviceAccountName: kargo-verification
                restartPolicy: Never
                containers:
                  - name: verify
                    image: alpine/k8s:latest
                    command: ["/bin/bash", "-c"]
                    args:
                      - |
                          # Level 1: Check pods
                          kubectl get pods -n <namespace> --no-headers | grep -v Running && exit 1

                          # Level 2: Check health
                          kubectl exec -n <namespace> deploy/<name> -- wget -qO- http://localhost:<port>/health || exit 1

                          echo "All checks passed"
                          exit 0

RBAC for Verification Jobs

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kargo-verification
  namespace: ${KARGO_NS}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kargo-verification
  labels:
    app: kargo-verification
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list"]
  - apiGroups: ["argoproj.io"]
    resources: ["applications"]
    verbs: ["get", "list"]
  # Add more resources as verification needs grow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kargo-verification
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kargo-verification
subjects:
  - kind: ServiceAccount
    name: kargo-verification
    namespace: ${KARGO_NS}

PHASE S4: Retry and Timeout Tuning

Guidelines by App Complexity

App TypeTimeoutErrorThresholdRationale
Simple (configmap, small deploy)3m2Fast sync, few resources
Medium (ESO, Crossplane)5m2CRD installation takes time
Heavy (kube-prometheus-stack)10m3Huge chart, repo-server OOM risk
Very Heavy (loki with PVCs)10m3PVC binding + large statefulset

Symptoms That Need Retry Tuning

SymptomLikely Fix
Promotion fails intermittentlyIncrease errorThreshold to 3
"connection refused :8081" in promotion logsAdd retry + check repo-server RAM
Promotion times out on first attemptIncrease timeout to 10m
Same promotion succeeds on manual retryDefinitely needs errorThreshold >1


ANTI-PATTERNS (ALL MODES)

Diagnose Mode

  • Guessing the fix without checking stage.status.conditions -> WRONG
  • Deleting random resources hoping it fixes things -> WRONG
  • Ignoring controller logs -> WRONG (they contain the actual error)
  • Creating a new Promotion via kubectl to "retry" -> CATASTROPHIC (see Gotcha 1)

Verify Mode

  • Checking only Level 1 (pods) and declaring "it works" -> INCOMPLETE
  • Skipping RBAC check when verification fails -> WRONG (silent RBAC failures)
  • Not checking Prometheus targets when monitoring is deployed -> WRONG

Promote Mode

  • Promoting when upstream stage is not Verified -> WILL FAIL
  • Creating Promotions via kubectl -> NEVER DO THIS
  • Not monitoring promotion progress -> RISKY (won't catch stuck promotions)
  • Promoting during active ArgoCD sync -> RACE CONDITION

Setup Mode

  • Adding a stage without retry config -> WILL FAIL on transient errors
  • Forgetting kargo.akuity.io/authorized-stage annotation -> Promotion will be rejected
  • Not testing with a single app first -> DEBUGGING NIGHTMARE
  • Setting auto-sync on non-foundation stages -> CONFLICTS WITH KARGO

QUICK REFERENCE

Essential Commands

All commands assume environment variables from the ENVIRONMENT DISCOVERY step.

# Kargo stages overview
${SSH_CMD} kubectl get stages -n ${KARGO_NS} -o wide

# Recent promotions
${SSH_CMD} kubectl get promotions -n ${KARGO_NS} \
  --sort-by=.metadata.creationTimestamp -o wide

# Available freight
${SSH_CMD} kubectl get freight -n ${KARGO_NS} \
  --sort-by=.metadata.creationTimestamp

# ArgoCD apps
${SSH_CMD} kubectl get applications -n ${ARGOCD_NS}

# Kargo controller logs
${SSH_CMD} kubectl logs -n ${KARGO_CONTROLLER_NS} deploy/kargo-controller --tail=100

# ArgoCD repo-server logs (common failure point)
${SSH_CMD} kubectl logs -n ${ARGOCD_CONTROLLER_NS} deploy/argocd-repo-server --tail=50

# AnalysisRuns (verification results)
${SSH_CMD} kubectl get analysisruns -n ${KARGO_NS} \
  --sort-by=.metadata.creationTimestamp

# Force stage reconcile
${SSH_CMD} kubectl annotate stage <name> -n ${KARGO_NS} \
  --overwrite kargo.akuity.io/refresh=$(date +%s)

# Force warehouse reconcile
${SSH_CMD} kubectl annotate warehouse <name> -n ${KARGO_NS} \
  --overwrite kargo.akuity.io/refresh=$(date +%s)

# Check all unhealthy pods
${SSH_CMD} kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

Discovered Namespaces

These come from ENVIRONMENT DISCOVERY. Common defaults:

VariableCommon DefaultContains
${ARGOCD_CONTROLLER_NS}argocdArgoCD server, ApplicationSets
${KARGO_CONTROLLER_NS}kargoKargo controller, API
${KARGO_NS}variesKargo Project, Stages, Promotions, Freight
${ARGOCD_NS}infraArgoCD Applications
${MONITORING_NS}monitoringPrometheus, Grafana, Loki, Alloy
${APP_NS}variesApplication workloads

Pipeline Dependency Chain (Example)

Your actual chain is discovered from the cluster. A typical pattern:

Warehouse
    |
    v
foundation (auto-sync, ArgoCD self-manages)
    |  ArgoCD, Kargo, Traefik, cert-manager, etc.
    |
    v
operators (Kargo-controlled)
    |  ESO, Crossplane, etc.
    |
    v
monitoring (Kargo-controlled)
    |  kube-prometheus-stack, Loki, Alloy, etc.
    |
    v
platform (Kargo-controlled)
    |  Databases, Auth, Workflow engines, etc.
    |
    v
services (Kargo-controlled)
       Application services

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

autonomous-workflow

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

code-quality

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

documentation

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

vercel-react-best-practices

React and Next.js performance optimization guidelines from Vercel Engineering. This skill should be used when writing, reviewing, or refactoring React/Next.js code to ensure optimal performance patterns. Triggers on tasks involving React components, Next.js pages, data fetching, bundle optimization, or performance improvements.

Repository Source
213.4K23Kvercel