gcloud-usage

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running.

Install skill "gcloud-usage" with this command: npx skills add fcakyon/claude-codex-settings/fcakyon-claude-codex-settings-gcloud-usage

GCP Observability Best Practices

Structured Logging

JSON Log Format

Use structured JSON logging for better queryability:

{ "severity": "ERROR", "message": "Payment failed", "httpRequest": { "requestMethod": "POST", "requestUrl": "/api/payment" }, "labels": { "user_id": "123", "transaction_id": "abc" }, "timestamp": "2025-01-15T10:30:00Z" }

Severity Levels

Use appropriate severity for filtering:

  • DEBUG: Detailed diagnostic info

  • INFO: Normal operations, milestones

  • NOTICE: Normal but significant events

  • WARNING: Potential issues, degraded performance

  • ERROR: Failures that don't stop the service

  • CRITICAL: Failures requiring immediate action

  • ALERT: A person must take action immediately

  • EMERGENCY: System is unusable
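
These levels can be exercised directly from the CLI; a sketch assuming a scratch log named severity-demo:

gcloud logging write severity-demo "Normal operation milestone" --severity=INFO
gcloud logging write severity-demo "Degraded performance detected" --severity=WARNING
gcloud logging write severity-demo "Failure requiring immediate action" --severity=CRITICAL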

Log Filtering Queries

Common Filters

By severity

severity >= WARNING

By resource

resource.type="cloud_run_revision" resource.labels.service_name="my-service"

By time

timestamp >= "2025-01-15T00:00:00Z"

By text content

textPayload =~ "error.*timeout"

By JSON field

jsonPayload.user_id = "123"

Combined

severity >= ERROR AND resource.labels.service_name="api"

Advanced Queries

Regex matching

textPayload =~ "status=[45][0-9]{2}"

Substring search

textPayload : "connection refused"

Multiple values

severity = (ERROR OR CRITICAL)
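
The same syntax works from the CLI; a sketch pairing the regex above with newest-first ordering:

gcloud logging read 'textPayload =~ "status=[45][0-9]{2}"' --order=desc --limit=50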

Metrics vs Logs vs Traces

When to Use Each

Metrics: Aggregated numeric data over time

  • Request counts, latency percentiles

  • Resource utilization (CPU, memory)

  • Business KPIs (orders/minute)

Logs: Detailed event records

  • Error details and stack traces

  • Audit trails

  • Debugging specific requests

Traces: Request flow across services

  • Latency breakdown by service

  • Identifying bottlenecks

  • Distributed system debugging
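
Log-based metrics bridge the first two categories by turning matching log entries into a countable time series; a sketch with an assumed metric name of error_count:

gcloud logging metrics create error_count \
  --description="Count of ERROR-and-above log entries" \
  --log-filter='severity >= ERROR'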

Alert Policy Design

Alert Best Practices

  • Avoid alert fatigue: Only alert on actionable issues

  • Use multi-condition alerts: Reduce noise from transient spikes

  • Set appropriate windows: 5-15 min for most metrics

  • Include runbook links: Help responders act quickly

Common Alert Patterns

Error rate:

  • Condition: Error rate > 1% for 5 minutes

  • Good for: Service health monitoring

Latency:

  • Condition: P99 latency > 2s for 10 minutes

  • Good for: Performance degradation detection (expressed as a policy file after this list)

Resource exhaustion:

  • Condition: Memory > 90% for 5 minutes

  • Good for: Capacity planning triggers
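
The latency pattern above, expressed as an AlertPolicy JSON document for the alpha Monitoring CLI. This is a sketch: the metric run.googleapis.com/request_latencies and its millisecond units are Cloud Run assumptions, so adapt the filter and threshold to your service.

{
  "displayName": "P99 latency > 2s for 10 minutes",
  "combiner": "AND",
  "conditions": [
    {
      "displayName": "High P99 request latency",
      "conditionThreshold": {
        "filter": "metric.type=\"run.googleapis.com/request_latencies\" AND resource.type=\"cloud_run_revision\"",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_PERCENTILE_99" }
        ],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 2000,
        "duration": "600s"
      }
    }
  ]
}

Saved as latency-policy.json, it can be applied with:

gcloud alpha monitoring policies create --policy-from-file=latency-policy.json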

Cost Optimization

Reducing Log Costs

  • Exclusion filters: Drop verbose logs at ingestion

  • Sampling: Log only a percentage of high-volume events

  • Shorter retention: Reduce the default 30-day retention

  • Downgrade logs: Route to cheaper storage buckets

Exclusion Filter Examples

Exclude health checks

resource.type="cloud_run_revision" AND httpRequest.requestUrl="/health"

Exclude debug logs in production

severity = DEBUG
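
Exclusions attach to sinks; updating the built-in _Default sink keeps matching entries out of the _Default bucket. A sketch with an assumed exclusion name; for sampling rather than dropping outright, an exclusion filter like severity = DEBUG AND sample(insertId, 0.9) discards a random 90% of DEBUG entries and keeps the rest.

gcloud logging sinks update _Default \
  --add-exclusion='name=drop-health-checks,filter=resource.type="cloud_run_revision" AND httpRequest.requestUrl="/health"'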

Debugging Workflow

  1. Start with metrics: Identify when issues started

  2. Correlate with logs: Filter logs around the problem window (example after this list)

  3. Use traces: Follow specific requests across services

  4. Check resource logs: Look for infrastructure issues

  5. Compare baselines: Check against known-good periods
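
Step 2 in practice: a sketch pulling ERROR-and-above entries for a suspect half-hour window, with placeholder timestamps:

gcloud logging read 'severity >= ERROR AND timestamp >= "2025-01-15T10:00:00Z" AND timestamp <= "2025-01-15T10:30:00Z"' \
  --format=json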
