QA Observability and Performance Engineering
Use telemetry (logs, metrics, traces, profiles) as a QA signal and a debugging substrate.
Core references (see data/sources.json): OpenTelemetry, W3C Trace Context, and SLO practices (Google SRE).
Quick Start (Default)
If key context is missing, ask for: critical user journeys, service/dependency inventory, environments (local/staging/prod), current telemetry stack, and current SLO/SLA commitments (if any).
- Establish the minimum bar: correlation IDs + structured logs + traces + golden metrics (latency, traffic, errors, saturation).
- Verify propagation: confirm traceparent (and your request ID) flow across boundaries end-to-end.
- Make failures diagnosable: every test failure captures a trace link (or trace ID) plus the correlated logs.
- Define SLIs/SLOs and error budget policy; wire burn-rate alerts (prefer multi-window burn rates).
- Produce artifacts: a readiness checklist plus an SLO definition and alert rules (use assets/checklists/template-observability-readiness-checklist.md and assets/monitoring/slo/*).
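The propagation check in the steps above can be sketched in a few lines of Python. This is a minimal sketch, assuming the four-field `traceparent` layout from W3C Trace Context (version, trace ID, parent ID, flags, all lowercase hex); it does not handle future header versions that append extra fields. The function names `parse_traceparent`, `new_traceparent`, and `same_trace` are hypothetical helpers, not part of any library.

```python
import re
import secrets
from typing import Optional

# Four-field traceparent layout: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Version ff and all-zero IDs are invalid per the spec.
    if fields["version"] == "ff":
        return None
    if fields["trace_id"] == "0" * 32 or fields["parent_id"] == "0" * 16:
        return None
    # Lowest flag bit marks the trace as sampled.
    fields["sampled"] = bool(int(fields["flags"], 16) & 0x01)
    return fields


def new_traceparent(sampled: bool = True) -> str:
    """Build a fresh header so a test can send a trace ID it already knows."""
    trace_id = secrets.token_hex(16)   # 32 hex characters
    parent_id = secrets.token_hex(8)   # 16 hex characters
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"


def same_trace(sent: str, received: str) -> bool:
    """True when both headers are valid and share a trace ID.

    The parent ID is expected to change at every hop, so it is not compared.
    """
    a, b = parse_traceparent(sent), parse_traceparent(received)
    return a is not None and b is not None and a["trace_id"] == b["trace_id"]
```

One way to use this in an integration test: generate the header with `new_traceparent()`, send it with the request, and on failure attach the known trace ID to the test report so the trace and its correlated logs can be looked up directly.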
Default QA stance
- Treat telemetry as part of acceptance criteria (especially for integration/E2E tests).
- Require correlation: request_id + trace_id (carried in the traceparent header) across boundaries.
- Prefer SLO-based release gating and burn-rate alerting over raw infra thresholds.
- Budget overhead: sampling, cardinality, retention, and cost are quality constraints.
- Redact PII/secrets by default (logs and attributes).
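The correlation and redaction points above can be combined in one structured-log formatter. The following is a minimal sketch using only the Python standard library; the key list in `SENSITIVE_KEYS`, the email pattern, and the `fields` attribute are assumptions made for illustration, and a real deployment would need a redaction policy that matches its own data.

```python
import json
import logging
import re

# Hypothetical deny-list of field names whose values are never logged.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "secret"}
# Deliberately simple email pattern; an assumption, not a complete PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(fields: dict) -> dict:
    """Mask sensitive keys entirely and scrub emails from string values."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
        else:
            clean[key] = value
    return clean


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, always carrying the correlation IDs."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Set via logger.info(..., extra={"request_id": ..., "trace_id": ...}).
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        # Optional structured fields, also passed through `extra`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(redact(payload), sort_keys=True)
```

Redacting inside the formatter means every handler that uses it gets the same treatment, at the cost of scanning each record. A test can assert on the formatted output to confirm that `request_id` and `trace_id` are present and that known secrets never appear.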
Core workflows
- Establish the minimum bar (logs + metrics + traces + correlation).
- Instrument with OpenTelemetry (auto-instrument first, then add manual spans for key paths).
- Verify context propagation across service boundaries (traceparent in/out).
- Define SLIs/SLOs and error budget policy; wire burn-rate alerts.
- Make failures diagnosable: capture a trace link + key logs on every test failure.
- Profile and load test only after telemetry is reliable; validate against baselines.
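The burn-rate step in the workflow above comes down to simple arithmetic: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below illustrates that calculation and a two-window check. The function names are hypothetical, and the example figures (2% of a 30-day budget consumed in 1 hour, giving a threshold near 14.4) follow the commonly cited example from the Google SRE material; treat them as a starting point, not a recommendation for any particular service.

```python
from typing import Tuple

Window = Tuple[int, int]  # (error_count, total_count) observed in one window


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    if total == 0:
        return 0.0  # no traffic, so nothing was burned
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def threshold_for(budget_fraction: float, window_hours: float, period_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * period_hours / window_hours


def should_page(long_window: Window, short_window: Window,
                slo_target: float, threshold: float) -> bool:
    """Multi-window rule: fire only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which lets the alert reset quickly after recovery.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


# Example: 99.9% SLO over 30 days, page when 2% of the budget burns in 1 hour.
threshold = threshold_for(budget_fraction=0.02, window_hours=1, period_hours=30 * 24)
paging = should_page(long_window=(20, 1000), short_window=(5, 100),
                     slo_target=0.999, threshold=threshold)
```

The same functions can back a release gate: compute the burn rate for the canary window and block promotion when it stays above the chosen threshold.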
Quick reference
| Task | Recommended default | Notes |
|---|---|---|
| Tracing | OpenTelemetry + Jaeger/Tempo | Prefer OTLP exporters via Collector when possible |
| Metrics | Prometheus + Grafana | Use histograms for latency; watch cardinality |
| Logging | Structured JSON + correlation IDs | Never log secrets/PII; redact aggressively |
| Reliability gates | SLOs + error budgets + burn-rate alerts | Gate releases on sustained burn/regressions |
| Performance | Profiling + load tests + budgets | Add continuous profiling for intermittent issues |
| Zero-code visibility | eBPF (OpenTelemetry zero-code) + continuous profiling (Parca/Pyroscope) | Use when code changes are not feasible |
Navigation
Open these guides when needed:
| If the user needs... | Read | Also use |
|---|---|---|
| A minimal, production-ready baseline | references/core-observability-patterns.md | assets/checklists/template-observability-readiness-checklist.md |
| Node/Python instrumentation setup | references/opentelemetry-best-practices.md | assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md, assets/opentelemetry/python/opentelemetry-python-setup.md |
| Working trace propagation across services | references/distributed-tracing-patterns.md | assets/checklists/template-observability-readiness-checklist.md |
| SLOs, burn-rate alerts, and release gates | references/slo-design-guide.md | assets/monitoring/slo/slo-definition.yaml, assets/monitoring/slo/prometheus-alert-rules.yaml |
| Profiling/load testing with evidence | references/performance-profiling-guide.md | assets/load-testing/load-testing-k6.js, assets/load-testing/template-load-test-artillery.yaml |
| A maturity model and roadmap | references/observability-maturity-model.md | assets/checklists/template-observability-readiness-checklist.md |
| What to avoid and how to fix it | references/anti-patterns-best-practices.md | assets/checklists/template-observability-readiness-checklist.md |
| Alert design and fatigue reduction | references/alerting-strategies.md | assets/monitoring/slo/prometheus-alert-rules.yaml |
| Dashboard hierarchy and layout | references/dashboard-design-patterns.md | assets/monitoring/grafana/template-grafana-dashboard-observability.json |
| Structured logging and cost control | references/log-aggregation-patterns.md | assets/observability/template-logging-setup.md |
Implementation guides (deep dives):
- references/core-observability-patterns.md
- references/opentelemetry-best-practices.md
- references/distributed-tracing-patterns.md
- references/slo-design-guide.md
- references/performance-profiling-guide.md
- references/observability-maturity-model.md
- references/anti-patterns-best-practices.md
- references/alerting-strategies.md
- references/dashboard-design-patterns.md
- references/log-aggregation-patterns.md
Templates (copy/paste):
- assets/checklists/template-observability-readiness-checklist.md
- assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md
- assets/opentelemetry/python/opentelemetry-python-setup.md
- assets/monitoring/slo/slo-definition.yaml
- assets/monitoring/slo/prometheus-alert-rules.yaml
- assets/monitoring/grafana/grafana-dashboard-slo.json
- assets/monitoring/grafana/template-grafana-dashboard-observability.json
- assets/load-testing/load-testing-k6.js
- assets/load-testing/template-load-test-artillery.yaml
- assets/performance/frontend/template-lighthouse-ci.json
- assets/performance/backend/template-nodejs-profiling-config.js
Curated sources:
- data/sources.json
Scope boundaries (handoffs)
- Pure infrastructure monitoring (Kubernetes, Docker, CI/CD): ../ops-devops-platform/SKILL.md
- Database query optimization (SQL tuning, indexing): ../data-sql-optimization/SKILL.md
- Application-level debugging (stack traces, breakpoints): ../qa-debugging/SKILL.md
- Test strategy design (coverage, test pyramids): ../qa-testing-strategy/SKILL.md
- Resilience patterns (retries, circuit breakers): ../qa-resilience/SKILL.md
- Architecture decisions (microservices, event-driven): ../software-architecture-design/SKILL.md
Tool selection notes (2026)
- Default to OpenTelemetry + OTLP + Collector where possible.
- Prefer burn-rate alerting against SLOs over alerting on raw infra metrics.
- Treat sampling, cardinality, and retention as part of quality (not an afterthought).
- When asked to pick vendors/tools, start from data/sources.json and validate time-sensitive claims with current docs/releases if the environment allows it.
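To make the cardinality point above concrete: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below is a rough budgeting aid under that assumption. The label names, the counts, and the 10,000-series budget are invented for illustration, and real series counts are usually lower because not every label combination occurs.

```python
from math import prod
from typing import Mapping


def max_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


def within_budget(label_cardinalities: Mapping[str, int], budget: int) -> bool:
    """True when the worst-case series count fits the agreed budget."""
    return max_series(label_cardinalities) <= budget


# Hypothetical HTTP latency metric with bounded labels.
bounded = {"method": 5, "status": 6, "route": 40}
# Same metric after someone adds an unbounded per-user label.
unbounded = {**bounded, "user_id": 10_000}

bounded_total = max_series(bounded)      # 5 * 6 * 40
unbounded_total = max_series(unbounded)  # the above * 10_000
```

A check like this can run in CI against the declared labels of new metrics, which turns the cardinality concern into a reviewable gate instead of a surprise on the metrics bill.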
Fact-Checking
- Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
- Prefer primary sources; report source links and dates for volatile information.
- If web access is unavailable, state the limitation and mark guidance as unverified.