# Datadog Observability

## Overview
Datadog is a SaaS observability platform providing unified monitoring across infrastructure, applications, logs, and user experience. It offers AI-powered anomaly detection, 1000+ integrations, and OpenTelemetry compatibility.
**Core Capabilities:**

- **APM**: Distributed tracing with automatic instrumentation for 8+ languages
- **Infrastructure**: Host, container, and cloud service monitoring
- **Logs**: Centralized collection with processing pipelines and 15-month retention
- **Metrics**: Custom metrics via DogStatsD with cardinality management
- **Synthetics**: Proactive API and browser testing from 29+ global locations
- **RUM**: Frontend performance monitoring with Core Web Vitals and session replay
## When to Use This Skill

**Activate when:**

- Setting up production monitoring and observability
- Implementing distributed tracing across microservices
- Configuring log aggregation and analysis pipelines
- Creating custom metrics and dashboards
- Setting up alerting and anomaly detection
- Optimizing Datadog costs
**Do not use when:**

- Building with an open-source stack (use Prometheus/Grafana instead)
- Cost is the primary concern and the budget is limited
- You need maximum customization rather than a managed solution
## Quick Start

### 1. Install the Datadog Agent

**Docker (simplest):**

```bash
docker run -d --name dd-agent \
  -e DD_API_KEY=<YOUR_API_KEY> \
  -e DD_SITE="datadoghq.com" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  gcr.io/datadoghq/agent:7
```

**Kubernetes (Helm):**

```bash
helm repo add datadog https://helm.datadoghq.com
helm install datadog-agent datadog/datadog \
  --set datadog.apiKey=<YOUR_API_KEY> \
  --set datadog.apm.enabled=true \
  --set datadog.logs.enabled=true
```
### 2. Instrument Your Application

**Python:**

```python
from ddtrace import tracer, patch_all

# Automatic instrumentation for common libraries
patch_all()

# Manual span for custom operations
with tracer.trace("custom.operation", service="my-service") as span:
    span.set_tag("user.id", user_id)
    # your code here
```

**Node.js:**

```javascript
// Must be the first import
const tracer = require('dd-trace').init({
  service: 'my-service',
  env: 'production',
  version: '1.0.0',
});
```
### 3. Verify in the Datadog UI

- Go to Infrastructure > Host Map to verify the Agent is reporting
- Go to APM > Services to see traced services
- Go to Logs > Search to verify log collection
## Core Concepts

### Tagging Strategy

Tags enable filtering, aggregation, and cost attribution. Use consistent tags across all telemetry.

**Required Tags:**

| Tag       | Purpose            | Example               |
|-----------|--------------------|-----------------------|
| `env`     | Environment        | `env:production`      |
| `service` | Service name       | `service:api-gateway` |
| `version` | Deployment version | `version:1.2.3`       |
| `team`    | Owning team        | `team:platform`       |
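In practice, `env`, `service`, and `version` are usually supplied through the `DD_ENV`, `DD_SERVICE`, and `DD_VERSION` environment variables so traces, metrics, and logs all agree. A minimal sketch of attaching an extra low-cardinality tag such as `team` with the `ddtrace` Python tracer (the tag value is illustrative):

```python
from ddtrace import tracer

# Applied to every span this tracer emits; combine with DD_ENV / DD_SERVICE /
# DD_VERSION set in the deployment environment for unified service tagging.
tracer.set_tags({"team": "platform"})
```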
**Avoid High-Cardinality Tags:**

- User IDs, request IDs, timestamps
- Pod IDs in Kubernetes
- Build numbers, commit hashes
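A short sketch of why this matters, using the DogStatsD client from the `datadog` Python package (metric names and tag values are illustrative):

```python
from datadog import statsd

# Avoid: a per-user tag creates one timeseries per user and explodes cardinality.
# statsd.increment("checkout.completed", tags=[f"user_id:{user_id}"])

# Prefer: keep metric tags low-cardinality; leave per-request detail to traces and logs.
statsd.increment("checkout.completed", tags=["env:production", "service:checkout"])
```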
### Unified Observability

Datadog correlates metrics, traces, and logs automatically:

- Traces include span tags that link to metrics
- Logs inject trace IDs for correlation (see the sketch below)
- Dashboards combine all data sources
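A minimal sketch of log/trace correlation using `ddtrace` log injection in Python; the same behavior is available with zero code changes by setting `DD_LOGS_INJECTION=true`, and the service name and log format below are illustrative:

```python
import logging

from ddtrace import patch, tracer

patch(logging=True)  # injects dd.trace_id / dd.span_id into log records

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s [dd.service=%(dd.service)s "
           "dd.trace_id=%(dd.trace_id)s dd.span_id=%(dd.span_id)s] %(message)s",
)
log = logging.getLogger(__name__)

with tracer.trace("orders.lookup", service="my-service"):
    log.info("fetching order")  # this log line carries the active trace ID
```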
## Best Practices

### Start Simple

- Install the Agent with basic configuration
- Enable automatic instrumentation
- Verify data in the Datadog UI
- Add custom spans/metrics as needed

### Progressive Enhancement

Basic → APM tracing → Custom spans → Custom metrics → Profiling → RUM
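For the custom-metrics step, a brief sketch of the main DogStatsD metric types via the `datadog` Python package (metric names, values, and tags are illustrative):

```python
from datadog import statsd

tags = ["env:production", "service:orders"]

statsd.increment("orders.created", tags=tags)                     # counter: how often something happened
statsd.gauge("orders.queue_depth", 42, tags=tags)                 # gauge: a point-in-time value
statsd.histogram("orders.processing_time_ms", 135.0, tags=tags)   # histogram: distribution (avg, max, p95, ...)
```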
### Key Instrumentation Points

- HTTP entry/exit points
- Database queries
- External service calls
- Message queue operations
- Business-critical flows
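For example, an external service call can be wrapped in a manual span so it appears under a business-level operation name in APM. A sketch with `ddtrace`, where the billing endpoint and names are hypothetical (and `requests` is already auto-instrumented by `patch_all()`):

```python
import requests
from ddtrace import tracer

def fetch_invoice(invoice_id: str) -> dict:
    # Custom span on top of the auto-instrumented requests span
    with tracer.trace("billing.fetch_invoice", service="my-service") as span:
        span.set_tag("billing.operation", "fetch_invoice")
        resp = requests.get(f"https://billing.internal/invoices/{invoice_id}", timeout=5)
        resp.raise_for_status()
        return resp.json()
```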
## Common Mistakes

- **High-cardinality tags**: Using user IDs or request IDs as tags creates millions of unique metrics
- **Missing log index quotas**: Leads to unexpected bills from log volume spikes
- **Over-alerting**: Creates alert fatigue; alert on symptoms, not causes
- **Missing service tags**: Prevents correlation between metrics, traces, and logs
- **No sampling for high-volume traces**: Ingesting every trace causes cost explosion (see the sampling sketch below)
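A hedged sketch of head-based sampling with the `ddtrace` Python tracer; the environment variable names come from Datadog's tracing library documentation and the 10% rate is only an example, so verify the exact settings against your tracer version:

```python
import os

# Set before the tracer starts; normally configured in the deployment
# environment rather than in code.
os.environ.setdefault("DD_TRACE_SAMPLE_RATE", "0.1")  # keep ~10% of traces
os.environ.setdefault("DD_TRACE_RATE_LIMIT", "100")   # cap kept traces per second

from ddtrace import patch_all  # noqa: E402

patch_all()
```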
## Navigation

For detailed implementation:

- **Agent Installation**: Docker, Kubernetes, Linux, Windows, and cloud-specific setup
- **APM Instrumentation**: Python, Node.js, Go, and Java instrumentation with code examples
- **Log Management**: Pipelines, Grok parsing, standard attributes, archives
- **Custom Metrics**: DogStatsD patterns, metric types, tagging best practices
- **Alerting**: Monitor types, anomaly detection, alert hygiene
- **Cost Optimization**: Metrics without Limits, sampling, index quotas
- **Kubernetes**: DaemonSet, Cluster Agent, autodiscovery
## Complementary Skills

When using this skill, consider these related skills (if deployed):

- `docker`: Container instrumentation patterns
- `kubernetes`: K8s-native monitoring patterns
- `python`/`nodejs`/`go`: Language-specific APM setup
## Resources

**Official Documentation:**

- Metrics: https://docs.datadoghq.com/metrics/

**Cost Management:**