# Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request-flow visibility across microservices.
## Do not use this skill when

- The task is unrelated to distributed tracing
- You need a different domain or tool outside this scope
## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open resources/implementation-playbook.md.
## Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.
## Use this skill when

- Debugging latency issues
- Understanding service dependencies
- Identifying bottlenecks
- Tracing error propagation
- Analyzing request paths
## Distributed Tracing Concepts

### Trace Structure

```
Trace (Request ID: abc123)
  ↓
  Span (frontend) [100ms]
    ↓
    Span (api-gateway) [80ms]
      ├→ Span (auth-service) [10ms]
      └→ Span (user-service) [60ms]
           └→ Span (database) [40ms]
```
### Key Components

- **Trace** - End-to-end request journey
- **Span** - Single operation within a trace
- **Context** - Metadata propagated between services
- **Tags** - Key-value pairs for filtering
- **Logs** - Timestamped events within a span
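The relationships between these components can be sketched with a toy in-memory model (a hypothetical `Span` class for illustration, not any real tracing SDK):

```python
import time

class Span:
    """Toy span: a single operation within a trace, carrying tags and timestamped logs."""

    def __init__(self, name, trace_id, parent=None):
        self.name = name
        self.trace_id = trace_id   # shared by every span in the same trace
        self.parent = parent       # None for the root span
        self.tags = {}             # key-value pairs used for filtering
        self.logs = []             # timestamped events within the span
        self.start = time.time()
        self.end = None

    def set_tag(self, key, value):
        self.tags[key] = value

    def log_event(self, event):
        self.logs.append((time.time(), event))

    def finish(self):
        self.end = time.time()

# Mirror the top of the trace structure diagram: frontend -> api-gateway
root = Span("frontend", trace_id="abc123")
child = Span("api-gateway", trace_id=root.trace_id, parent=root)
child.set_tag("http.status_code", 200)
child.log_event("cache miss")
child.finish()
root.finish()
```

The key invariant: every span in a request shares one trace ID, while parent links and timestamps reconstruct the call tree and its latencies.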
## Jaeger Setup

### Kubernetes Deployment

Deploy the Jaeger Operator:

```bash
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability
```
Deploy a Jaeger instance:

```bash
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
```
### Docker Compose

```yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"   # UI
      - "14268:14268"   # Collector
      - "14250:14250"   # gRPC
      - "9411:9411"     # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
```
Reference: See references/jaeger-setup.md
## Application Instrumentation

### OpenTelemetry (Recommended)

#### Python (Flask)

```python
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("user.count", 100)
        # Business logic
        users = fetch_users_from_db()
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query
        return query_database()
```
#### Node.js (Express)

```javascript
const { trace } = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: { attributes: { 'service.name': 'my-service' } }
});

const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');
  try {
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});
```
#### Go

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
	))
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func getUsers(ctx context.Context) ([]User, error) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "get_users")
	defer span.End()

	span.SetAttributes(attribute.String("user.filter", "active"))

	users, err := fetchUsersFromDB(ctx)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}

	span.SetAttributes(attribute.Int("user.count", len(users)))
	return users, nil
}
```
Reference: See references/instrumentation.md
## Context Propagation

### HTTP Headers

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
```
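The `traceparent` value packs four dash-separated fields defined by the W3C Trace Context format. A small helper (hypothetical function name, for illustration) can unpack it:

```python
def parse_traceparent(header):
    """Unpack the four dash-separated fields of a W3C traceparent header."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,      # "00" is the only version defined so far
        "trace_id": trace_id,    # 128-bit trace ID as 32 hex characters
        "parent_id": parent_id,  # 64-bit span ID as 16 hex characters
        "sampled": bool(int(flags, 16) & 0x01),  # sampled bit of trace-flags
    }

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
```

In practice the OpenTelemetry propagators handle this parsing for you; seeing the layout helps when inspecting raw headers during debugging.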
### Propagation in HTTP Requests

**Python**

```python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects trace context

response = requests.get('http://downstream-service/api', headers=headers)
```
**Node.js**

```javascript
const { context, propagation } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);

axios.get('http://downstream-service/api', { headers });
```
## Tempo Setup (Grafana)

### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args:
            - -config.file=/etc/tempo/tempo.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
```
Reference: See assets/jaeger-config.yaml.template
## Sampling Strategies

### Probabilistic Sampling

Sample 1% of traces:

```yaml
sampler:
  type: probabilistic
  param: 0.01
```

### Rate Limiting Sampling

Sample at most 100 traces per second:

```yaml
sampler:
  type: ratelimiting
  param: 100
```
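Conceptually, rate-limiting sampling is a token bucket: tokens refill at the configured rate, and a trace is kept only if a token is available. A minimal Python sketch (hypothetical class, not the Jaeger client's actual implementation):

```python
import time

class RateLimitingSampler:
    """Token-bucket sketch: admit at most max_per_second new traces each second."""

    def __init__(self, max_per_second):
        self.max_per_second = max_per_second
        self.tokens = float(max_per_second)  # start with a full bucket
        self.last = time.monotonic()

    def should_sample(self):
        # Refill tokens in proportion to elapsed time, capped at the bucket size
        now = time.monotonic()
        self.tokens = min(self.max_per_second,
                          self.tokens + (now - self.last) * self.max_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Under a burst, the first `max_per_second` traces are admitted and the rest are dropped until tokens refill, which caps tracing cost independently of traffic volume.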
### Adaptive Sampling

Sample based on trace ID (deterministic):

```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.01))
```
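`TraceIdRatioBased` is deterministic because the keep/drop decision is a pure function of the trace ID, so every service that sees the same ID reaches the same verdict without coordination. A simplified sketch (assuming the common lower-64-bits comparison; exact details vary by SDK version):

```python
def sample_by_trace_id(trace_id_hex, ratio):
    """Deterministic head sampling: keep the trace iff its ID falls below the bound."""
    # Compare the lower 64 bits of the 128-bit trace ID against ratio * 2^64
    low_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_bits < int(ratio * (1 << 64))

# Any service evaluating the same trace ID makes the same decision
keep = sample_by_trace_id("0af7651916cd43dd8448eb211c80319c", 0.01)
```

Wrapping the ratio sampler in `ParentBased` means child spans simply follow the root's decision, so traces are never half-sampled.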
## Trace Analysis

### Finding Slow Requests

Jaeger query:

```
service=my-service duration > 1s
```

### Finding Errors

Jaeger query:

```
service=my-service error=true tags.http.status_code >= 500
```
### Service Dependency Graph

Jaeger automatically generates service dependency graphs showing:

- Service relationships
- Request rates
- Error rates
- Average latencies
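The same graph can be derived offline from exported spans: every parent/child span pair that crosses a service boundary is an edge. A sketch over a hypothetical flattened span export (field names are illustrative, not Jaeger's wire format):

```python
from collections import Counter

def dependency_edges(spans):
    """Count service-to-service call edges implied by parent/child span links."""
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for s in spans:
        parent = by_id.get(s.get("parent_id"))
        # An edge exists only when the call crosses a service boundary
        if parent and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return edges

# One trace shaped like the diagram earlier in this document
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "api-gateway"},
    {"span_id": "c", "parent_id": "b", "service": "auth-service"},
    {"span_id": "d", "parent_id": "b", "service": "user-service"},
]
edges = dependency_edges(spans)
```

Aggregating these counts across many traces is essentially what backends do to render the dependency view, with rates and latencies layered on top.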
## Best Practices

- Sample appropriately (1-10% in production)
- Add meaningful tags (user_id, request_id)
- Propagate context across all service boundaries
- Log exceptions in spans
- Use consistent naming for operations
- Monitor tracing overhead (<1% CPU impact)
- Set up alerts for trace errors
- Implement distributed context (baggage)
- Use span events for important milestones
- Document instrumentation standards
## Integration with Logging

### Correlated Logs

```python
import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id

    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
```
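If passing `extra=` at every call site is too invasive, a `logging.Filter` can stamp every record automatically. A stdlib-only sketch, with a contextvar standing in for the OpenTelemetry span context (names are hypothetical):

```python
import contextvars
import io
import logging

# Stand-in for the active span context carried by the tracing SDK
current_trace_id = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record so logs can be joined to traces."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

# Demo: route output to a string buffer with the trace ID in the format string
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))

logger = logging.getLogger("traced-service")
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace_id.set("0af7651916cd43dd8448eb211c80319c")
logger.info("processing request")
```

In a real service you would read the ID from `trace.get_current_span().get_span_context()` inside the filter instead of a contextvar; the pattern is otherwise the same.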
## Troubleshooting

**No traces appearing:**

- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs

**High latency overhead:**

- Reduce sampling rate
- Use batch span processor
- Check exporter configuration
## Reference Files

- references/jaeger-setup.md - Jaeger installation
- references/instrumentation.md - Instrumentation patterns
- assets/jaeger-config.yaml.template - Jaeger configuration
## Related Skills

- prometheus-configuration - For metrics
- grafana-dashboards - For visualization
- slo-implementation - For latency SLOs