Elasticsearch Analysis
Authentication
IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for ELASTICSEARCH_URL, ES_USER, or ES_PASSWORD in environment variables - they won't be visible to you. Just run the scripts directly; authentication is handled transparently.
MANDATORY: Statistics-First Investigation
NEVER dump raw logs. Always follow this pattern:
STATISTICS → SAMPLE → PATTERNS → CORRELATE
- Statistics First - Know volume, error rate, and top patterns before sampling
- Strategic Sampling - Choose the right strategy based on statistics
- Pattern Extraction - Cluster similar errors to find root causes
- Context Correlation - Investigate around anomaly timestamps
Available Scripts
All scripts are in .claude/skills/observability-elasticsearch/scripts/
PRIMARY INVESTIGATION SCRIPTS
get_statistics.py - ALWAYS START HERE
Comprehensive statistics with pattern extraction.
python .claude/skills/observability-elasticsearch/scripts/get_statistics.py [--index INDEX] [--time-range MINUTES]
Examples:
python .claude/skills/observability-elasticsearch/scripts/get_statistics.py --time-range 60
python .claude/skills/observability-elasticsearch/scripts/get_statistics.py --index logs-production
Output includes:
- Total count, error count, error rate percentage
- Status distribution (info, warn, error)
- Top services/sources by log volume
- Top error patterns (crucial for quick triage)
- Actionable recommendation
sample_logs.py - Strategic Sampling
Choose the right sampling strategy based on statistics.
python .claude/skills/observability-elasticsearch/scripts/sample_logs.py --strategy STRATEGY [--index INDEX] [--limit N]
Strategies:
errors_only - Only error logs (default for incidents)
warnings_up - Warning and error logs
around_time - Logs around a specific timestamp
all - All log levels
Examples:
python .claude/skills/observability-elasticsearch/scripts/sample_logs.py --strategy errors_only --index logs-production
python .claude/skills/observability-elasticsearch/scripts/sample_logs.py --strategy around_time --timestamp "2026-01-27T05:00:00Z" --window 5
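When sample_logs.py is driven from another script, the command line can be assembled programmatically. The following is a minimal sketch, not part of the skill itself: `build_sample_command` is a hypothetical helper, it only uses the flags that appear in the examples above, and the rule that `around_time` requires a timestamp is an assumption.

```python
SCRIPTS_DIR = ".claude/skills/observability-elasticsearch/scripts"
STRATEGIES = {"errors_only", "warnings_up", "around_time", "all"}


def build_sample_command(strategy, index=None, limit=None, timestamp=None, window=None):
    """Build the argv list for sample_logs.py (hypothetical helper)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Assumption: around_time is only meaningful with a timestamp.
    if strategy == "around_time" and timestamp is None:
        raise ValueError("around_time needs a timestamp")

    cmd = ["python", f"{SCRIPTS_DIR}/sample_logs.py", "--strategy", strategy]
    # Append optional flags only when a value was supplied.
    for flag, value in (("--index", index), ("--limit", limit),
                        ("--timestamp", timestamp), ("--window", window)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with timestamps.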
Lucene Query Syntax
Basic Searches
Simple term
error
Phrase
"connection refused"
Field search
level:ERROR
Wildcard
message:timeout*
Multiple terms (implicit OR)
error warning
Required term (AND)
+error +timeout
Field Queries
Exact match
level:ERROR
Wildcard
host:web-*
Range (numeric)
status:[400 TO 599]
Range (dates)
@timestamp:[2024-01-15T10:00:00 TO 2024-01-15T11:00:00]
Exists
_exists_:error.stack_trace
Boolean Operators
AND
error AND timeout
OR
error OR warning
NOT
error NOT debug
Grouping
(error OR warning) AND service:api
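Because characters such as `:`, `-`, `(` and `*` are operators in this syntax, values taken from logs (hostnames, paths, IDs) usually need escaping before they are placed in a query string. A minimal sketch, assuming the reserved-character list from the Elasticsearch query string syntax; the helper name `escape_lucene` is hypothetical.

```python
import re

# Characters the query string syntax treats as operators.
_RESERVED = re.compile(r'([+\-=!(){}\[\]^"~*?:\\/&|])')


def escape_lucene(value: str) -> str:
    """Escape a literal value for use as an unquoted term (hypothetical helper)."""
    # '<' and '>' cannot be escaped at all, so they are removed.
    value = value.replace("<", "").replace(">", "")
    # Prefix every remaining reserved character with a backslash.
    return _RESERVED.sub(r"\\\1", value)


if __name__ == "__main__":
    # Builds a field query for a host value that contains '-' and ':'.
    print("host:" + escape_lucene("web-01:8080"))
```

Escaping `*` and `?` turns them into literals, so apply this only to values that are not meant to be wildcards.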
Query DSL (JSON)
Match Query
{ "query": { "match": { "message": "connection error" } } }
Term Query (Exact Match)
{ "query": { "term": { "level": "ERROR" } } }
Bool Query (Compound)
{
  "query": {
    "bool": {
      "must": [
        {"term": {"level": "ERROR"}},
        {"match": {"message": "timeout"}}
      ],
      "must_not": [
        {"term": {"service": "healthcheck"}}
      ],
      "filter": [
        {"range": {"@timestamp": {"gte": "now-1h"}}}
      ]
    }
  }
}
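The same bool query can be assembled in Python when the message text, excluded service, or time window vary between investigations. This is a sketch only: `build_error_query` and its parameters are hypothetical names, and the field names (`level`, `message`, `service`, `@timestamp`) are taken from the example above and may differ in your index.

```python
def build_error_query(message_text, exclude_service=None, since="now-1h", level="ERROR"):
    """Return a bool query body as a dict (hypothetical helper)."""
    bool_clause = {
        # must: clauses that have to match and that contribute to the score.
        "must": [
            {"term": {"level": level}},
            {"match": {"message": message_text}},
        ],
        # filter: has to match, but is not scored.
        "filter": [
            {"range": {"@timestamp": {"gte": since}}},
        ],
    }
    if exclude_service:
        # must_not: documents matching this clause are excluded.
        bool_clause["must_not"] = [{"term": {"service": exclude_service}}]
    return {"query": {"bool": bool_clause}}
```

Keeping the time range in `filter` rather than `must` means it restricts results without affecting relevance scoring.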
Aggregations
{
  "size": 0,
  "aggs": {
    "errors_by_service": {
      "terms": {
        "field": "service.keyword",
        "size": 10
      }
    }
  }
}
Investigation Workflow
Standard Incident Investigation
┌─────────────────────────────────────────────────────────────┐
│ 1. STATISTICS FIRST (mandatory)                             │
│    python get_statistics.py --index <index>                 │
│    → Know volume, error rate, top patterns                  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
                       High Error Rate?
                 ┌─────────────┴─────────────┐
                 │                           │
             YES (>5%)                      NO
                 │                           │
                 ▼                           ▼
┌─────────────────────────────┐ ┌───────────────────────────────────────────┐
│ 2. FAST PATH                │ │ 2. TARGETED INVESTIGATION                 │
│    Sample errors directly   │ │    Filter by specific criteria            │
│    python sample_logs.py    │ │    python sample_logs.py --strategy all   │
│    --strategy errors_only   │ │    → Look for anomalies                   │
└─────────────────────────────┘ └───────────────────────────────────────────┘
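The branch in the workflow above can be written as a small function. This is an illustrative sketch: `choose_strategy` is a hypothetical name, and it assumes the total and error counts have already been read from the get_statistics.py output, whose exact format is not shown here.

```python
def choose_strategy(total_count: int, error_count: int, threshold_pct: float = 5.0) -> str:
    """Pick a sample_logs.py strategy from the error rate (hypothetical helper)."""
    if total_count <= 0:
        # No logs in the window, so no error rate; fall back to a broad sample.
        return "all"
    error_rate_pct = 100.0 * error_count / total_count
    # Above the threshold take the fast path, otherwise sample all levels.
    return "errors_only" if error_rate_pct > threshold_pct else "all"
```

For example, 80 errors in 1000 logs is an 8% error rate, so the function returns `errors_only`.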
Quick Commands Reference
Goal                  Command
Start investigation   get_statistics.py --index X
Sample errors only    sample_logs.py --strategy errors_only --index X
Investigate spike     sample_logs.py --strategy around_time --timestamp T
All logs              sample_logs.py --strategy all --index X --limit 20
Common Aggregation Patterns
Errors Over Time
{
  "size": 0,
  "query": {"term": {"level": "ERROR"}},
  "aggs": {
    "errors_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m"
      }
    }
  }
}
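Once this aggregation has run, the bucket with the most errors gives a timestamp to investigate. The sketch below assumes the standard search response shape (`aggregations.<name>.buckets`, each with `key`, `key_as_string` and `doc_count`); `find_peak_bucket` is a hypothetical helper name.

```python
def find_peak_bucket(response, agg_name="errors_over_time"):
    """Return (timestamp, doc_count) of the busiest bucket, or None if there are none."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    if not buckets:
        return None
    # max() keeps the earliest bucket when several share the highest count.
    peak = max(buckets, key=lambda bucket: bucket["doc_count"])
    # Prefer the formatted timestamp, fall back to the epoch-millis key.
    return peak.get("key_as_string", peak["key"]), peak["doc_count"]
```

The returned timestamp can then be passed to sample_logs.py with `--strategy around_time --timestamp`.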
Top Error Messages
{
  "size": 0,
  "query": {"term": {"level": "ERROR"}},
  "aggs": {
    "top_errors": {
      "terms": {
        "field": "message.keyword",
        "size": 10
      }
    }
  }
}
Nested Aggregation (Errors by Service, then by Message)
{
  "size": 0,
  "aggs": {
    "by_service": {
      "terms": {"field": "service.keyword", "size": 10},
      "aggs": {
        "by_message": {
          "terms": {"field": "message.keyword", "size": 5}
        }
      }
    }
  }
}
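The nested response is easier to scan once flattened into rows. A minimal sketch, assuming the aggregation names `by_service` and `by_message` from the query above and the standard bucket shape; `flatten_service_messages` is a hypothetical helper name.

```python
def flatten_service_messages(response):
    """Turn nested terms buckets into (service, message, count) rows, largest first."""
    rows = []
    for service_bucket in response["aggregations"]["by_service"]["buckets"]:
        # Each service bucket carries its own sub-aggregation buckets.
        for message_bucket in service_bucket["by_message"]["buckets"]:
            rows.append((service_bucket["key"], message_bucket["key"], message_bucket["doc_count"]))
    # Sort by count so the dominant service/message pairs come first.
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Each row pairs a service with one of its top messages, which makes it quick to see whether a single service dominates the error volume.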
Field Types
Keyword vs Text
- keyword: Exact match, aggregatable (service.keyword)
- text: Full-text search, not aggregatable (message)
// For aggregation, use .keyword suffix
"terms": {"field": "service.keyword"}
// For full-text search, use text field
"match": {"message": "connection error"}
Anti-Patterns to Avoid
- ❌ NEVER skip statistics - get_statistics.py is the MANDATORY first step
- ❌ Unbounded queries - Always specify time ranges and limits
- ❌ Fetching all logs - Use sampling strategies, not unbounded searches
- ❌ Ignoring error rate - High error rate means immediate investigation
- ❌ Text field in aggregation - Use .keyword suffix for terms aggs
- ❌ Wildcard prefix - *error is expensive; prefer error* or an exact match
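Two of these anti-patterns can be checked mechanically before a Query DSL body is sent. The following is a rough sketch under stated assumptions: `lint_search_body` is a hypothetical helper, it only recognises a `range` clause on `@timestamp` and leading wildcards inside `wildcard` clauses, and it does not inspect Lucene query strings.

```python
def lint_search_body(body):
    """Return a list of warnings for a Query DSL body (hypothetical helper)."""
    found = {"time_range": False, "leading_wildcard": False}

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "range" and isinstance(value, dict) and "@timestamp" in value:
                    found["time_range"] = True
                if key == "wildcard" and isinstance(value, dict):
                    for pattern in value.values():
                        # A wildcard clause is either a string or {"value": "..."}.
                        if isinstance(pattern, dict):
                            pattern = pattern.get("value", "")
                        if isinstance(pattern, str) and pattern.startswith(("*", "?")):
                            found["leading_wildcard"] = True
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(body)
    issues = []
    if not found["time_range"]:
        issues.append("no @timestamp range: query is unbounded in time")
    if found["leading_wildcard"]:
        issues.append("leading wildcard: expensive pattern")
    return issues
```

An empty list means neither problem was detected; it does not guarantee the query is cheap.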