Service Mesh Observability
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
When to Use This Skill
-
Setting up distributed tracing across services
-
Implementing service mesh metrics and dashboards
-
Debugging latency and error issues
-
Defining SLOs for service communication
-
Visualizing service dependencies
-
Troubleshooting mesh connectivity
Core Concepts
- Three Pillars of Observability
┌─────────────────────────────────────────────────────┐ │ Observability │ ├─────────────────┬─────────────────┬─────────────────┤ │ Metrics │ Traces │ Logs │ │ │ │ │ │ • Request rate │ • Span context │ • Access logs │ │ • Error rate │ • Latency │ • Error details │ │ • Latency P50 │ • Dependencies │ • Debug info │ │ • Saturation │ • Bottlenecks │ • Audit trail │ └─────────────────┴─────────────────┴─────────────────┘
- Golden Signals for Mesh
Signal Description Alert Threshold
Latency Request duration P50, P99 P99 > 500ms
Traffic Requests per second Anomaly detection
Errors 5xx error rate
1%
Saturation Resource utilization
80%
Templates
Template 1: Istio with Prometheus & Grafana
Install Prometheus
apiVersion: v1 kind: ConfigMap metadata: name: prometheus namespace: istio-system data: prometheus.yml: | global: scrape_interval: 15s scrape_configs: - job_name: 'istio-mesh' kubernetes_sd_configs: - role: endpoints namespaces: names: - istio-system relabel_configs: - source_labels: [__meta_kubernetes_service_name] action: keep regex: istio-telemetry
ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: istio-mesh namespace: istio-system spec: selector: matchLabels: app: istiod endpoints: - port: http-monitoring interval: 15s
Template 2: Key Istio Metrics Queries
Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
P99 latency
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service_name))
TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
Request size
histogram_quantile(0.99, sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m])) by (le, destination_service_name))
Template 3: Jaeger Distributed Tracing
Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1 kind: IstioOperator spec: meshConfig: enableTracing: true defaultConfig: tracing: sampling: 100.0 # 100% in dev, lower in prod zipkin: address: jaeger-collector.istio-system:9411
Jaeger deployment
apiVersion: apps/v1 kind: Deployment metadata: name: jaeger namespace: istio-system spec: selector: matchLabels: app: jaeger template: metadata: labels: app: jaeger spec: containers: - name: jaeger image: jaegertracing/all-in-one:1.50 ports: - containerPort: 5775 # UDP - containerPort: 6831 # Thrift - containerPort: 6832 # Thrift - containerPort: 5778 # Config - containerPort: 16686 # UI - containerPort: 14268 # HTTP - containerPort: 14250 # gRPC - containerPort: 9411 # Zipkin env: - name: COLLECTOR_ZIPKIN_HOST_PORT value: ":9411"
Template 4: Linkerd Viz Dashboard
Install Linkerd viz extension
linkerd viz install | kubectl apply -f -
Access dashboard
linkerd viz dashboard
CLI commands for observability
Top requests
linkerd viz top deploy/my-app
Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend
Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend
Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
Template 5: Grafana Dashboard JSON
{ "dashboard": { "title": "Service Mesh Overview", "panels": [ { "title": "Request Rate", "type": "graph", "targets": [ { "expr": "sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)", "legendFormat": "{{destination_service_name}}" } ] }, { "title": "Error Rate", "type": "gauge", "targets": [ { "expr": "sum(rate(istio_requests_total{response_code=~"5.."}[5m])) / sum(rate(istio_requests_total[5m])) * 100" } ], "fieldConfig": { "defaults": { "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 1, "color": "yellow" }, { "value": 5, "color": "red" } ] } } } }, { "title": "P99 Latency", "type": "graph", "targets": [ { "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service_name))", "legendFormat": "{{destination_service_name}}" } ] }, { "title": "Service Topology", "type": "nodeGraph", "targets": [ { "expr": "sum(rate(istio_requests_total{reporter="destination"}[5m])) by (source_workload, destination_service_name)" } ] } ] } }
Template 6: Kiali Service Mesh Visualization
Kiali installation
apiVersion: kiali.io/v1alpha1 kind: Kiali metadata: name: kiali namespace: istio-system spec: auth: strategy: anonymous # or openid, token deployment: accessible_namespaces: - "**" external_services: prometheus: url: http://prometheus.istio-system:9090 tracing: url: http://jaeger-query.istio-system:16686 grafana: url: http://grafana.istio-system:3000
Template 7: OpenTelemetry Integration
OpenTelemetry Collector for mesh
apiVersion: v1 kind: ConfigMap metadata: name: otel-collector-config data: config.yaml: | receivers: otlp: protocols: grpc: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 zipkin: endpoint: 0.0.0.0:9411
processors:
batch:
timeout: 10s
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp, zipkin]
processors: [batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1 kind: Telemetry metadata: name: mesh-default namespace: istio-system spec: tracing: - providers: - name: otel randomSamplingPercentage: 10
Alerting Rules
apiVersion: monitoring.coreos.com/v1 kind: PrometheusRule metadata: name: mesh-alerts namespace: istio-system spec: groups: - name: mesh.rules rules: - alert: HighErrorRate expr: | sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name) / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05 for: 5m labels: severity: critical annotations: summary: "High error rate for {{ $labels.destination_service_name }}"
- alert: HighLatency
expr: |
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
by (le, destination_service_name)) > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency for {{ $labels.destination_service_name }}"
- alert: MeshCertExpiring
expr: |
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
labels:
severity: warning
annotations:
summary: "Mesh certificate expiring in less than 7 days"
Best Practices
Do's
-
Sample appropriately - 100% in dev, 1-10% in prod
-
Use trace context - Propagate headers consistently
-
Set up alerts - For golden signals
-
Correlate metrics/traces - Use exemplars
-
Retain strategically - Hot/cold storage tiers
Don'ts
-
Don't over-sample - Storage costs add up
-
Don't ignore cardinality - Limit label values
-
Don't skip dashboards - Visualize dependencies
-
Don't forget costs - Monitor observability costs