# SLOs, SLIs, and Error Budgets
Patterns and practices for defining service level objectives, selecting meaningful indicators, and managing reliability through error budgets.
## When to Use This Skill

- Defining SLOs for services
- Selecting appropriate SLIs
- Implementing error budget policies
- Balancing reliability and velocity
- Setting up SLO-based alerting
## Core Concepts

### SLI (Service Level Indicator)
SLI = Quantitative measure of service level
What to measure:
- Availability: % of successful requests
- Latency: % of requests faster than threshold
- Throughput: Requests per second
- Error rate: % of failed requests
Formula: SLI = (good events / total events) × 100%

Example:

    Availability SLI = (successful requests / total requests) × 100%
                     = (99,500 / 100,000) × 100%
                     = 99.5%
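As a minimal sketch, the same calculation in Python. The counts are the illustrative figures from the example, and `availability_sli` is a hypothetical helper, not a standard API:

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI = (good events / total events) × 100%."""
    if total_events == 0:
        # No traffic in the window; conventionally treated as meeting the SLI.
        return 100.0
    return good_events / total_events * 100

# The worked example above: 99,500 successful out of 100,000 requests.
print(f"{availability_sli(99_500, 100_000):.1f}%")  # -> 99.5%
```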
### SLO (Service Level Objective)
SLO = Target value for an SLI
Format: SLI >= Target over Time Window
Examples:
- 99.9% of requests successful over 30 days
- 95% of requests complete in <200ms over 7 days
- 99.95% availability measured monthly
Components:

```
┌─────────────────────────────────────────┐
│ SLO = SLI + Target + Time Window        │
│                                         │
│ "99.9% of HTTP requests return non-5xx  │
│  over a rolling 30-day window"          │
└─────────────────────────────────────────┘
```
### Error Budget
Error Budget = Allowed unreliability
If SLO = 99.9% availability:

    Error Budget = 100% - 99.9% = 0.1%

Over 30 days:

    Total minutes = 30 × 24 × 60 = 43,200
    Error budget  = 43,200 × 0.001 = 43.2 minutes

Or in requests (assuming 1M requests/month):

    Error budget = 1,000,000 × 0.001 = 1,000 failed requests

Budget consumption:

```
┌──────────────────────────────────────┐
│ Error Budget Remaining: 65%          │
│ ████████████████████░░░░░░░░░░       │
│ Consumed: 35% (15 min of 43.2 min)   │
│ Days remaining in window: 12         │
└──────────────────────────────────────┘
```
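A hedged sketch tying the pieces together: an `Slo` value object (a hypothetical name, not a library type) that derives its error budget in minutes and in requests, reproducing the arithmetic above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slo:
    target: float     # e.g. 0.999 for a 99.9% SLO
    window_days: int  # e.g. 30 for a rolling 30-day window

    @property
    def budget_fraction(self) -> float:
        # Error budget = allowed unreliability = 1 - target
        return 1.0 - self.target

    def budget_minutes(self) -> float:
        return self.window_days * 24 * 60 * self.budget_fraction

    def budget_requests(self, requests_in_window: int) -> float:
        return requests_in_window * self.budget_fraction

slo = Slo(target=0.999, window_days=30)
print(f"{slo.budget_minutes():.1f} minutes")             # 43.2 minutes
print(f"{slo.budget_requests(1_000_000):.0f} requests")  # 1000 requests
```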
## Selecting SLIs

### SLI Categories

- Request-based SLIs:
  - Availability (success rate)
  - Latency (response time)
  - Quality (correct responses)
- Processing-based SLIs:
  - Throughput
  - Freshness (data staleness)
  - Coverage (% of data processed)
- Storage-based SLIs:
  - Durability
  - Availability of data
### SLI Selection Framework

For each user journey:

- Identify critical interactions
  └── What does the user care about?
- Map to measurable signals
  └── What can we measure?
- Define good vs bad
  └── What's acceptable?
- Validate with users/stakeholders
  └── Does this match expectations?
### Good SLI Characteristics
✅ Good SLIs:
- Directly reflect user experience
- Measurable and observable
- Simple to understand
- Actionable when violated
❌ Bad SLIs:
- Internal metrics (CPU, memory)
- Too complex to explain
- Can't be measured reliably
- No clear good/bad threshold
### SLI Examples by Service Type
API Service:
- Availability: % non-5xx responses
- Latency: % requests < 200ms
- Quality: % valid responses
Data Pipeline:
- Freshness: % data < 10 min old
- Coverage: % records processed
- Correctness: % matching expected
Storage Service:
- Durability: % objects not lost
- Availability: % successful reads
- Latency: % reads < 50ms
Web Application:
- Page Load: % pages < 3 seconds
- Interactivity: % interactions < 100ms
- Core Web Vitals: LCP, FID, CLS
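For the threshold-latency SLIs above ("% of requests faster than X"), a small illustrative sketch; the sample latencies are made up:

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Percentage of requests that completed faster than the threshold."""
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms) * 100

# Ten sample response times in milliseconds (illustrative only).
samples = [120, 85, 240, 95, 180, 310, 150, 60, 199, 205]
print(f"{latency_sli(samples, 200):.0f}% of requests under 200ms")  # 70%
```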
## Setting SLO Targets

### Target Selection Approach

- Measure current performance: What's the baseline?
- Understand user expectations: What do users need?
- Consider business constraints: What's the cost of reliability?
- Start conservative: Better to exceed than miss
- Iterate based on data: Adjust as you learn
### SLO Target Guidelines
| Service Type | Availability | Latency (p99) |
|---|---|---|
| Internal APIs | 99.5% | 500ms |
| External APIs | 99.9% | 200ms |
| Payment systems | 99.99% | 300ms |
| Static content | 99.95% | 100ms |
| Batch processing | 99% | - |
The "nines" scale: 99% = 7.2 hours/month downtime 99.9% = 43.8 minutes/month 99.99% = 4.38 minutes/month
### Time Windows
Rolling vs Calendar:
Rolling (recommended):
- 30-day rolling window
- Smooth, no cliff effects
- Always relevant
Calendar:
- Monthly reset
- Aligns with business cycles
- Creates "budget reset" behavior
Window selection:

- Short (7 days): More sensitive, faster feedback
- Long (30 days): More stable, smoother trends
## Error Budget Policies

### Policy Components

An error budget policy defines:

- When to take action
  └── Budget thresholds (75%, 50%, 25%, 0%)
- What actions to take
  └── Freeze features, focus on reliability
- Who decides
  └── Team, management, escalation path
- How to recover
  └── Steps to restore budget
### Example Policy

Error Budget Policy for OrderService:

| Budget Remaining | Actions Required |
|---|---|
| > 50% | Normal development, deploy freely |
| 25-50% | Review deployments, increase testing |
| 10-25% | Freeze non-critical features |
| < 10% | All hands on reliability, no new features |
| 0% (exhausted) | Postmortem required, leadership review |
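One way to make such a policy mechanical is to encode the thresholds in code. A sketch under the assumption that budget remaining is tracked as a fraction; names like `policy_action` are hypothetical:

```python
# Thresholds mirror the table: the first threshold the budget exceeds wins.
POLICY = [
    (0.50, "Normal development, deploy freely"),
    (0.25, "Review deployments, increase testing"),
    (0.10, "Freeze non-critical features"),
    (0.00, "All hands on reliability, no new features"),
]

def policy_action(budget_remaining: float) -> str:
    if budget_remaining <= 0:
        return "Budget exhausted: postmortem required, leadership review"
    for threshold, action in POLICY:
        if budget_remaining > threshold:
            return action
    return POLICY[-1][1]  # unreachable for budget_remaining > 0

print(policy_action(0.55))  # Normal development, deploy freely
print(policy_action(0.30))  # Review deployments, increase testing
print(policy_action(0.05))  # All hands on reliability, no new features
```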
Escalation:
- Budget < 25%: Alert team lead
- Budget < 10%: Alert engineering manager
- Budget exhausted: Incident declared
### Budget Recovery
When budget exhausted:
- Stop non-critical deployments
- Focus on stability improvements
- Conduct thorough postmortems
- Implement preventive measures
- Resume normal work when budget recovers
Budget recovers through:
- Time passing (rolling window)
- Improved reliability
- SLO adjustment (if appropriate)
## Multi-Window SLOs

### Why Multiple Windows?
Single window problems:
- Long window: Slow to detect issues
- Short window: Too sensitive to spikes
Solution: Multiple windows
Fast burn: Short window (1 hour)
- Detects major outages quickly
- High urgency alerts
Slow burn: Long window (30 days)
- Detects gradual degradation
- Lower urgency, more context
### Multi-Window Configuration
Alert configuration:
Fast burn (page immediately):
- 2% of 30-day budget burned in 1 hour
- 5% of 30-day budget burned in 6 hours
Slow burn (ticket):
- 10% of 30-day budget burned in 3 days
- 20% of 30-day budget burned in 7 days
Calculation:

    If 30-day budget = 43.2 minutes:
    2% in 1 hour = 0.864 minutes ≈ 52 seconds of errors
    → Significant outage, page immediately
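A sketch of that check, assuming uniform traffic so that an error rate sustained over a window maps directly onto budget consumed. The rule values mirror the configuration above; the function names are illustrative:

```python
SLO_TARGET = 0.999
SLO_WINDOW_HOURS = 30 * 24  # rolling 30-day window

# (lookback window in hours, max fraction of budget burned, severity)
RULES = [
    (1,   0.02, "page"),
    (6,   0.05, "page"),
    (72,  0.10, "ticket"),
    (168, 0.20, "ticket"),
]

def budget_fraction_burned(error_rate: float, window_hours: float) -> float:
    """Fraction of the whole 30-day budget consumed during the window."""
    budget = 1 - SLO_TARGET
    return error_rate * window_hours / (budget * SLO_WINDOW_HOURS)

def firing_alerts(error_rates: dict[int, float]) -> list[str]:
    """error_rates maps window (hours) -> observed error rate in that window."""
    alerts = []
    for window, max_fraction, severity in RULES:
        burned = budget_fraction_burned(error_rates.get(window, 0.0), window)
        if burned >= max_fraction:
            alerts.append(f"{severity}: {burned:.1%} of budget in {window}h")
    return alerts

# A 1.5% error rate sustained for the last hour burns ~2.1% of the budget:
print(firing_alerts({1: 0.015}))  # ['page: 2.1% of budget in 1h']
```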
## SLO-Based Alerting

### Alert Design

Traditional alerting:

- CPU > 80% → Alert
- Error rate > 1% → Alert
- Latency > 500ms → Alert

→ Often noisy, may not reflect user impact

SLO-based alerting:

- Error budget burn rate too high → Alert

→ Directly tied to user impact
→ Fewer, more meaningful alerts
### Burn Rate Calculation
Burn rate = Rate of budget consumption
If budget should last 30 days:

    Normal burn rate = 1x (consuming 3.33%/day)

    Fast burn rate = 14.4x
    → Burning 48%/day → 0 in ~2 days
    → PAGE: Major incident

    Slow burn rate = 3x
    → Burning 10%/day → 0 in ~10 days
    → TICKET: Needs attention
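The same arithmetic as a sketch; `burn_rate` and `days_to_exhaustion` are hypothetical helpers, with the 14.4x example reproduced:

```python
def burn_rate(error_rate: float, slo_target: float = 0.999) -> float:
    """Multiple of the sustainable consumption rate (1x lasts the window)."""
    return error_rate / (1 - slo_target)

def days_to_exhaustion(rate: float, window_days: int = 30,
                       budget_remaining: float = 1.0) -> float:
    """Days until the remaining budget hits zero at the current burn rate."""
    return budget_remaining * window_days / rate

rate = burn_rate(0.0144)  # a sustained 1.44% error rate against a 99.9% SLO
print(f"burn rate: {rate:.1f}x")                              # 14.4x
print(f"budget gone in {days_to_exhaustion(rate):.1f} days")  # 2.1 days
```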
## Dashboard Design

### Key Metrics to Display

SLO dashboard components:

- Current SLI value
  └── "99.85% availability (target: 99.9%)"
- Error budget remaining
  └── Bar chart with thresholds
- Burn rate trend
  └── Line chart over time
- Time to budget exhaustion
  └── "At current rate: 15 days"
- Historical SLO compliance
  └── How often have we met the SLO?
- Key error contributors
  └── What's consuming budget?
### Visualization Example

```
┌─────────────────────────────────────────────────────┐
│ OrderService SLO Dashboard                          │
├─────────────────────────────────────────────────────┤
│ Availability SLI: 99.87%   Target: 99.9%  ⚠️        │
│ ██████████████████████████████████████░░░░ 99.87%   │
│                                                     │
│ Error Budget (30 day):                              │
│ ████████████████░░░░░░░░░░░░░░ 55% remaining        │
│ Consumed: 19.4 min / 43.2 min                       │
│                                                     │
│ Burn Rate: 1.3x (slight overage)                    │
│ ────────────────────────────────                    │
│                          ↑ now                      │
│                                                     │
│ Top Budget Consumers:                               │
│ 1. Database timeouts (8.2 min)                      │
│ 2. Payment gateway errors (5.1 min)                 │
│ 3. Rate limiting (3.8 min)                          │
└─────────────────────────────────────────────────────┘
```
## Implementation Checklist

### Getting Started
- Identify critical user journeys
- Define SLIs for each journey
- Set initial SLO targets (conservative)
- Implement SLI measurement
- Create error budget tracking
- Set up burn rate alerts
- Create SLO dashboard
- Define error budget policy
- Socialize with stakeholders
- Iterate based on learnings
## Common Pitfalls

- Too many SLOs → Focus on 3-5 critical SLOs
- Unrealistic targets → Start achievable, tighten over time
- Internal metrics as SLIs → Use user-facing metrics
- No error budget policy → Policy makes SLOs actionable
- Alert on SLI directly → Alert on burn rate instead
## Best Practices

- User-centric SLIs: Measure what users experience
- Conservative initial targets: Better to exceed than miss
- Documented error budget policy: Everyone knows the rules
- Regular SLO reviews: Quarterly review and adjustment
- Blameless culture: Focus on learning, not blame
- Automated tracking: SLI/SLO calculation must be reliable
## Related Skills

- observability-patterns: Metrics and monitoring
- distributed-tracing: Trace-based SLIs
- incident-response: Using SLOs in incidents