Redis Cluster Analyzer
Analyze Redis Sentinel and Cluster configurations for high availability, performance, and memory efficiency. Reviews sentinel topology, cluster slot distribution, replication health, memory policies, persistence settings, connection pooling, and key design patterns. Acts as a senior infrastructure engineer auditing your Redis deployment for production readiness.
Usage
Invoke this skill when you need to review Redis configurations, validate HA setup, optimize memory usage, or troubleshoot failover issues.
Basic invocation:
- Analyze the Redis configuration files in /etc/redis/
- Review this Redis Sentinel setup for high availability
- Audit Redis Cluster configuration for production readiness
Focused analysis:
- Check memory policies and eviction strategy
- Audit sentinel failover configuration
- Review cluster slot distribution for hotspots
- Analyze connection pooling settings
The agent reads Redis configuration files, sentinel configs, cluster node definitions, and application connection code, then produces a comprehensive quality report.
How It Works
Step 1: Discover and Parse Redis Configuration
The agent locates all Redis-related configuration:
# Find Redis configuration files
find /etc/redis/ -name "*.conf" -type f
find /path/to/project/ -name "redis*.conf" -o -name "sentinel*.conf"
# Find application Redis connection code
grep -rl "Redis\|redis\|ioredis\|redis-py\|RedisCluster" /path/to/app/ --include="*.py" --include="*.ts" --include="*.js" --include="*.go"
# Check running Redis processes
redis-cli INFO server 2>/dev/null
redis-cli CLUSTER INFO 2>/dev/null
redis-cli -p 26379 SENTINEL masters 2>/dev/null
The agent parses each configuration to extract the following (see the parsing sketch after this list):
- Server configuration (bind, port, requirepass, maxclients)
- Memory settings (maxmemory, maxmemory-policy, lazyfree)
- Persistence (RDB snapshots, AOF, hybrid)
- Replication (replica-of, repl-backlog-size, min-replicas)
- Sentinel topology (masters, quorum, failover-timeout)
- Cluster settings (cluster-enabled, node-timeout, migration-barrier)
- Connection pool config (from application code)
- Key patterns (TTL, naming conventions, data structures)
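To make the parsing step concrete, here is a minimal sketch using only the standard library. The file path and the set of directives it extracts are illustrative, not a fixed list:

```python
from pathlib import Path

# Illustrative subset of directives worth extracting; extend as needed
INTERESTING = {
    "bind", "port", "requirepass", "maxclients",
    "maxmemory", "maxmemory-policy", "appendonly", "save",
    "replicaof", "repl-backlog-size", "min-replicas-to-write",
    "cluster-enabled", "cluster-node-timeout",
}

def parse_redis_conf(path: str) -> dict[str, list[str]]:
    """Parse a redis.conf-style file into {directive: [arguments, ...]}.
    Repeatable directives such as 'save' keep every occurrence."""
    config: dict[str, list[str]] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        directive, _, args = line.partition(" ")
        directive = directive.lower()
        if directive in INTERESTING:
            config.setdefault(directive, []).append(args.strip())
    return config

# Hypothetical usage:
# print(parse_redis_conf("/etc/redis/redis.conf"))
```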
Step 2: Audit Sentinel Configuration
For Redis Sentinel deployments, the agent checks HA topology:
Sentinel Topology Analysis:
Sentinels: 3 nodes
sentinel-1: 10.0.1.10:26379
sentinel-2: 10.0.1.11:26379
sentinel-3: 10.0.1.12:26379
Monitored Masters: 2
mymaster: 10.0.1.20:6379 (2 replicas)
cachemaster: 10.0.1.30:6379 (1 replica)
PASS: 3 sentinels — meets minimum for quorum (need N/2 + 1)
PASS: Sentinels on separate hosts — survives single-node failure
FAIL: Master "cachemaster" has only 1 replica
RISK: If replica fails, no failover target available
During maintenance on the single replica, master has zero redundancy
FIX: Add at least 1 more replica for cachemaster
FAIL: Sentinels co-located with Redis nodes
sentinel-1 (10.0.1.10) hosts both sentinel and Redis replica
RISK: Node failure takes out both sentinel and data node
FIX: Run sentinels on independent infrastructure
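The same topology facts can be pulled programmatically from a sentinel with redis-py; a rough sketch, assuming redis-py 4+ and placeholder addresses (field names follow the SENTINEL MASTERS output and may vary by version):

```python
from redis import Redis

# Connect to one sentinel (26379 is the conventional sentinel port)
sentinel = Redis(host="10.0.1.10", port=26379, decode_responses=True)

for name, master in sentinel.sentinel_masters().items():
    replicas = sentinel.sentinel_slaves(name)
    peers = sentinel.sentinel_sentinels(name)
    print(f"master {name}: {master['ip']}:{master['port']} quorum={master['quorum']}")
    print(f"  replicas={len(replicas)} sentinels={len(peers) + 1}")  # +1: the sentinel queried
    if len(replicas) < 2:
        print("  WARN: fewer than 2 replicas — limited failover margin")
```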
Sentinel configuration audit:
Sentinel Config Analysis:
Master "mymaster":
sentinel monitor mymaster 10.0.1.20 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
FAIL: down-after-milliseconds = 5000 (5 seconds)
Too aggressive — network blips trigger unnecessary failovers
Each failover causes ~10s of write unavailability
FIX: sentinel down-after-milliseconds mymaster 30000 (30 seconds)
Balances detection speed vs. false positive failovers
WARN: failover-timeout = 60000 (60 seconds)
If failover fails, retry waits 120 seconds (2x timeout)
Total downtime in worst case: 3+ minutes
CONSIDER: failover-timeout 180000 (3 min) for complex resyncs
Prevents premature failover abort during large dataset sync
WARN: parallel-syncs = 1
Only 1 replica syncs from new master at a time after failover
With 3 replicas, full sync takes 3x single replica sync time
FIX: parallel-syncs = 2 (if replicas can handle sync load)
Tradeoff: Faster recovery vs. higher load on new master during sync
FAIL: No sentinel auth-pass configured
Sentinels connect to master without authentication
RISK: Unauthorized sentinel can trigger failover
FIX: sentinel auth-pass mymaster <password>
FAIL: No sentinel notification-script configured
No alerting on failover events
FIX: sentinel notification-script mymaster /opt/redis/notify.sh
Script receives: event-type, event-description
Hook into PagerDuty/Slack for operational awareness
FAIL: No sentinel client-reconfig-script configured
Application does not know about master change
FIX: sentinel client-reconfig-script mymaster /opt/redis/reconfig.sh
OR: Use a Sentinel-aware client library (recommended):
redis-py: Sentinel([...]).master_for("mymaster")
ioredis: new Redis({ sentinels: [...], name: "mymaster" })
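For the Sentinel-aware client route, a minimal redis-py sketch looks roughly like this (hosts, master name, and password are placeholders):

```python
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("10.0.1.10", 26379), ("10.0.1.11", 26379), ("10.0.1.12", 26379)],
    socket_timeout=0.5,  # fail fast when a sentinel is unreachable
)

# master_for() re-resolves the current master, so the app follows failovers
master = sentinel.master_for("mymaster", socket_timeout=0.5, password="secret")
replica = sentinel.slave_for("mymaster", socket_timeout=0.5, password="secret")

master.set("app:cache:greeting", "hello", ex=300)
print(replica.get("app:cache:greeting"))
```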
Step 3: Analyze Cluster Configuration
For Redis Cluster deployments, the agent checks slot distribution and node health:
Cluster Topology Analysis:
Nodes: 6 (3 masters, 3 replicas)
master-1: 10.0.1.50:6379 — slots 0-5460 (5461 slots)
master-2: 10.0.1.51:6379 — slots 5461-10922 (5462 slots)
master-3: 10.0.1.52:6379 — slots 10923-16383 (5461 slots)
replica-1: 10.0.1.60:6379 — replicates master-1
replica-2: 10.0.1.61:6379 — replicates master-2
replica-3: 10.0.1.62:6379 — replicates master-3
PASS: Slots evenly distributed (5461/5462/5461)
PASS: Each master has at least 1 replica
PASS: Replicas on different hosts than their masters
Cluster configuration audit:
Cluster Config Analysis:
FAIL: cluster-node-timeout = 15000 (15 seconds)
Too aggressive for cross-AZ deployments
Network latency spikes between AZs can trigger false failovers
FIX: cluster-node-timeout 30000 for cross-AZ
Keep 15000 for single-AZ deployments
FAIL: cluster-migration-barrier = 1 (default)
A master must keep at least 1 replica, so a master with a single replica never donates it to an orphaned master
If a master loses all its replicas, automatic migration only happens from masters that still have 2+ replicas
FIX: cluster-migration-barrier 0 — allow replica migration even from single-replica masters
NOTE: Barrier 0 can leave the donating master unprotected; provisioning 2+ replicas per master is the safer option
FAIL: cluster-require-full-coverage = yes (default)
If any slot range has no master, ENTIRE cluster stops accepting writes
RISK: Single master failure can take down whole cluster
FIX: cluster-require-full-coverage no
Allows cluster to serve keys in available slot ranges
WARN: cluster-allow-reads-when-down = no (default)
Cluster rejects all operations when marked as down
FIX: cluster-allow-reads-when-down yes
Allows read operations during partial failures (stale reads possible)
WARN: No cluster-announce-ip configured
In Docker/NAT environments, nodes advertise internal IPs
Clients outside the network cannot connect
FIX: Set cluster-announce-ip to external/routable IP
FAIL: All nodes in same availability zone
master-1, master-2, master-3 all in us-east-1a
RISK: AZ failure takes down entire cluster
FIX: Distribute across 3 AZs:
AZ-a: master-1, replica-2
AZ-b: master-2, replica-3
AZ-c: master-3, replica-1
Key distribution analysis:
Slot Hotspot Analysis:
WARN: Uneven key distribution detected
master-1 (slots 0-5460): 2.1M keys, 1.8 GB
master-2 (slots 5461-10922): 850K keys, 600 MB
master-3 (slots 10923-16383): 3.2M keys, 2.4 GB
FAIL: master-3 has 3.8x more keys than master-2
Likely cause: a shared hash tag pins a large group of keys to one slot range
Common pattern: every key carries the same literal tag, e.g. "{user}:...", so all of them hash to a single slot
FIX: Use hash tags to control distribution:
user:{12345} — hashes on "12345", distributed
{user}:12345 — hashes on "user", all same slot (BAD)
OR: Reshard slots to balance memory across masters
WARN: Large key detected
Key "cache:product_catalog" — 450 MB (hash with 100K fields)
RISK: Migration of this key's slot blocks cluster operations
FIX: Split into smaller keys using key prefixing:
cache:product_catalog:{category_id}
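One way to sanity-check how a naming scheme maps onto slots is to ask a cluster node directly via CLUSTER KEYSLOT; a small sketch (node address and key names are illustrative):

```python
from redis import Redis

# Any cluster node can answer CLUSTER KEYSLOT; the address is a placeholder
node = Redis(host="10.0.1.50", port=6379, decode_responses=True)

keys = [
    "user:{12345}:profile",   # tag is the id  -> keys spread across slots
    "user:{67890}:profile",
    "{user}:12345:profile",   # tag is "user"  -> every key lands in one slot
    "{user}:67890:profile",
]
for key in keys:
    slot = node.execute_command("CLUSTER KEYSLOT", key)
    print(f"{key:>24} -> slot {slot}")
```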
Step 4: Review Memory Configuration
The agent audits memory settings:
Memory Configuration Analysis:
FAIL: maxmemory not set
Redis will use all available system memory
RISK: OOM killer terminates Redis process — data loss
FIX: maxmemory 4gb (set to ~75% of available RAM)
Reserve 25% for fork operations (RDB/AOF), OS, and buffer
FAIL: maxmemory-policy = noeviction (default)
When maxmemory reached, all write operations return OOM error
RISK: Application crashes when cache is full
FIX: Choose policy based on workload:
allkeys-lru — cache workload, evict least recently used
volatile-lru — mixed workload, only evict keys with TTL
allkeys-lfu — frequency-based, better hit rate than LRU
volatile-ttl — evict keys closest to expiry
WARN: maxmemory-samples = 5 (default)
LRU/LFU approximation uses 5 samples — may evict suboptimally
FIX: maxmemory-samples 10 — better eviction accuracy, minimal CPU cost
FAIL: lazyfree-lazy-eviction = no
Eviction blocks the main thread — large key eviction causes latency spike
FIX: lazyfree-lazy-eviction yes
AND: lazyfree-lazy-expire yes
AND: lazyfree-lazy-server-del yes
AND: lazyfree-lazy-user-del yes
Lazy-free delegates memory reclamation to background thread
Memory Efficiency:
Used memory: 3.2 GB
Peak memory: 3.8 GB
Fragmentation ratio: 1.42
WARN: Fragmentation ratio > 1.2 — 42% wasted memory
FIX: activedefrag yes (Redis 4.0+, requires the default jemalloc allocator)
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-cycle-min 1
active-defrag-cycle-max 25
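A short runtime check for the memory findings above, using INFO memory and CONFIG GET through redis-py (connection details and thresholds are illustrative):

```python
from redis import Redis

r = Redis(host="10.0.1.20", port=6379, password="secret", decode_responses=True)

mem = r.info("memory")
maxmemory = int(r.config_get("maxmemory")["maxmemory"])
policy = r.config_get("maxmemory-policy")["maxmemory-policy"]

if maxmemory == 0:
    print("FAIL: maxmemory not set — Redis may consume all system RAM")
if policy == "noeviction":
    print("FAIL: maxmemory-policy is noeviction — writes error out when memory is full")

frag = mem.get("mem_fragmentation_ratio", 1.0)
if frag > 1.2:
    print(f"WARN: fragmentation ratio {frag} — consider activedefrag yes")
print(f"used={mem['used_memory_human']} peak={mem['used_memory_peak_human']}")
```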
Step 5: Audit Persistence Settings
The agent checks data durability configuration:
Persistence Analysis:
RDB Snapshots:
save 900 1 — snapshot if 1 change in 15 min
save 300 10 — snapshot if 10 changes in 5 min
save 60 10000 — snapshot if 10000 changes in 1 min
WARN: RDB snapshot frequency may be too aggressive
save 60 10000 causes fork every 60 seconds under write load
Fork on 4 GB dataset copies page tables — 100-500ms freeze
FIX: For cache-only workloads, disable RDB:
save ""
For durability, prefer AOF over frequent RDB
AOF Configuration:
appendonly no
FAIL: AOF disabled — data loss window = RDB interval
With save 300 10, up to 5 minutes of data lost on crash
FIX: appendonly yes
appendfsync everysec (good balance of performance and durability)
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
FAIL: No hybrid persistence (RDB + AOF)
Redis 4.0+ supports aof-use-rdb-preamble for faster restart
FIX: aof-use-rdb-preamble yes
AOF file starts with RDB snapshot, followed by append-only log
Combines fast restart (RDB) with minimal data loss (AOF)
WARN: stop-writes-on-bgsave-error = yes
If RDB snapshot fails, Redis stops accepting writes
Correct for primary data store, too strict for cache
FIX: For cache workloads: stop-writes-on-bgsave-error no
Replica Persistence:
FAIL: Replicas have RDB enabled (same schedule as master)
RISK: Fork storm — master and all replicas fork simultaneously
FIX: Disable RDB on replicas, rely on replication + master RDB:
replica: save ""
Exception: Enable on ONE replica for backup purposes
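The persistence posture can be spot-checked the same way; a sketch against a running instance (placeholder connection details; field names follow INFO persistence):

```python
from redis import Redis

r = Redis(host="10.0.1.20", port=6379, password="secret", decode_responses=True)

save = r.config_get("save")["save"]                    # e.g. "900 1 300 10 60 10000"
appendonly = r.config_get("appendonly")["appendonly"]  # "yes" or "no"
persistence = r.info("persistence")

if appendonly == "no":
    if save:
        print("WARN: AOF disabled — data-loss window equals the RDB snapshot interval")
    else:
        print("NOTE: no persistence at all — acceptable only for pure cache workloads")

if persistence.get("rdb_last_bgsave_status") != "ok":
    print("FAIL: last RDB background save failed — check disk space and permissions")
if appendonly == "yes" and persistence.get("aof_last_write_status") != "ok":
    print("FAIL: last AOF write failed")
```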
Step 6: Review Connection and Client Settings
The agent checks connection configuration:
Connection Analysis:
FAIL: maxclients = 10000 (default)
System file descriptor limit: 1024 (ulimit -n)
Redis needs: maxclients + 32 (internal FDs) = 10032
But only 1024 FDs available — effective maxclients = 992
FIX: Increase system limit:
/etc/security/limits.conf: redis soft nofile 65535
OR: Reduce maxclients to match available FDs
FAIL: timeout = 0 (no idle timeout)
Idle connections never close — accumulate until maxclients reached
RISK: Connection leak exhausts available connections
FIX: timeout 300 (close idle connections after 5 minutes)
WARN: tcp-keepalive = 300 (default)
Dead connections detected after 5 minutes of silence
For latency-sensitive apps, reduce to detect failures faster
FIX: tcp-keepalive 60
FAIL: No requirepass configured
Redis accessible without authentication
RISK: Data exposure, unauthorized access, crypto mining attacks
FIX: requirepass <strong-password>
AND: For Redis 6+, use ACL system for fine-grained access:
user app on >password ~app:* +@read +@write -@admin
Application Connection Pool:
Python (redis-py):
pool = redis.ConnectionPool(max_connections=50)
WARN: max_connections = 50 per application instance
With 10 app instances = 500 connections to Redis
Plus Sentinel connections = ~515 total
Verify: 515 < maxclients (992 effective)
PASS: Within limits
FAIL: No connection timeout configured
ConnectionPool(max_connections=50)
FIX: ConnectionPool(
max_connections=50,
socket_timeout=5,
socket_connect_timeout=2,
retry_on_timeout=True,
health_check_interval=30,
)
FAIL: No retry strategy for connection failures
Single connection failure raises exception to application
FIX: Use retry decorator or Retry class:
from redis.retry import Retry
from redis.backoff import ExponentialBackoff
retry = Retry(ExponentialBackoff(), 3)
Redis(retry=retry, retry_on_error=[ConnectionError, TimeoutError])
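Putting the pool, timeout, and retry fixes together, a hardened redis-py client might look roughly like this; the specific values are starting points, not prescriptions:

```python
from redis import Redis
from redis.backoff import ExponentialBackoff
from redis.exceptions import ConnectionError, TimeoutError
from redis.retry import Retry

client = Redis(
    host="10.0.1.20",            # placeholder — or resolve via Sentinel.master_for()
    port=6379,
    password="secret",           # placeholder
    max_connections=50,          # per app instance; multiply by instance count
    socket_timeout=5,            # cap time spent waiting on any single command
    socket_connect_timeout=2,    # fail fast on unreachable nodes
    health_check_interval=30,    # ping idle connections before reuse
    retry=Retry(ExponentialBackoff(), 3),
    retry_on_error=[ConnectionError, TimeoutError],
)
```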
Step 7: Analyze Key Design Patterns
The agent reviews key naming and TTL strategy:
Key Pattern Analysis:
FAIL: No consistent key naming convention
Found patterns: "user:123", "USER_123", "cache-user-123", "u.123"
FIX: Standardize on colon-delimited hierarchy:
{service}:{entity}:{id}:{field}
app:user:123:profile
app:session:abc-def
app:cache:products:category:5
FAIL: 45% of keys have no TTL set
Total keys: 2.1M, keys without TTL: 945K
RISK: Memory grows unbounded — keys never evicted (if policy=volatile-*)
FIX: Set TTL on all cache keys:
SET key value EX 3600 (1 hour)
For session data: match session expiry
For cache: match data freshness requirements
WARN: TTL distribution skewed
Keys with TTL < 60s: 12%
Keys with TTL 60s-1h: 8%
Keys with TTL 1h-24h: 15%
Keys with TTL > 24h: 20%
Keys with no TTL: 45%
RECOMMEND: Review keys with TTL > 24h — do they need to persist that long?
FAIL: Large keys detected (> 1 MB)
cache:all_products — 12 MB (JSON string)
RISK: Large key operations block event loop (single-threaded)
GET on 12 MB key takes ~6ms — blocks all other clients
FIX: Break into smaller keys or use Redis Hash with HSCAN
OR: Compress before storage: zlib.compress(json.dumps(data))
WARN: Hot key detected
cache:homepage_feed — 50K reads/sec from monitoring
RISK: Single master handles all reads for this key
FIX: Read from replicas for hot keys:
RedisCluster(read_from_replicas=True) (redis-py cluster client)
OR: Implement client-side caching (Redis 6+ client tracking)
FAIL: Key pattern "lock:*" without TTL safety
Distributed locks without TTL — if holder crashes, lock held forever
FIX: Always set TTL on lock keys:
SET lock:resource value NX EX 30
Use Redlock algorithm for distributed locking across nodes
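A sampling script for the TTL and large-key checks above; SCAN is cursor-based so it is safe to run against production, though the sample caps and size threshold here are illustrative:

```python
from redis import Redis

r = Redis(host="10.0.1.20", port=6379, password="secret", decode_responses=True)

no_ttl, large, sampled = 0, [], 0
for key in r.scan_iter(count=1000):
    sampled += 1
    if r.ttl(key) == -1:              # -1: key exists but has no TTL
        no_ttl += 1
    size = r.memory_usage(key) or 0   # MEMORY USAGE, Redis 4.0+
    if size > 1_000_000:              # flag keys larger than ~1 MB
        large.append((key, size))
    if sampled >= 100_000:            # cap the sample on huge keyspaces
        break

print(f"{no_ttl}/{sampled} sampled keys have no TTL")
for key, size in sorted(large, key=lambda kv: -kv[1])[:10]:
    print(f"large key: {key} ({size / 1_048_576:.1f} MB)")
```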
Step 8: Review Replication Health
The agent checks replication configuration:
Replication Analysis:
Master: 10.0.1.20:6379
Connected replicas: 2
Replication backlog: 1 MB (default)
Min replicas to write: 0
FAIL: repl-backlog-size = 1mb (default)
If replica disconnects for > backlog duration, full resync required
Full resync on 4 GB dataset: ~30 seconds of high CPU + network
FIX: repl-backlog-size 256mb
Size = write_rate_bytes_per_sec * max_acceptable_disconnect_seconds
At 1 MB/s writes: 256 MB covers 256 seconds of disconnect
FAIL: min-replicas-to-write = 0 (default)
Master accepts writes even if ALL replicas are down
RISK: Data only on master — if master fails, data lost
FIX: min-replicas-to-write 1
AND: min-replicas-max-lag 10
Master rejects writes if no replica acknowledged within 10 seconds
WARN: Replication lag detected
replica-1: lag = 0 bytes (healthy)
replica-2: lag = 45000 bytes (45 KB behind)
Possible causes: slow network, disk I/O on replica, large key writes
Monitor: redis-cli INFO replication | grep lag
WARN: replica-read-only = yes but no replica routing
Replicas accept read queries but application only connects to master
FIX: Route reads to replicas to reduce master load:
Python: Sentinel([...]).slave_for("mymaster") (redis-py sentinel client)
Node: new Redis({ sentinels: [...], name: "mymaster", role: "slave" }) (ioredis)
FAIL: replica-lazy-flush = no
Full resync flushes replica synchronously — blocks for seconds on large DB
FIX: replica-lazy-flush yes — flush in background thread
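A quick replication-health probe matching the checks above (master address is a placeholder; field names follow redis-py's parsed INFO replication output, where each slaveN entry becomes a dict):

```python
from redis import Redis

master = Redis(host="10.0.1.20", port=6379, password="secret", decode_responses=True)
repl = master.info("replication")

if repl.get("connected_slaves", 0) == 0:
    print("FAIL: no connected replicas — master is a single point of failure")

master_offset = repl["master_repl_offset"]
for i in range(repl.get("connected_slaves", 0)):
    replica = repl[f"slave{i}"]       # {'ip': ..., 'port': ..., 'offset': ..., 'lag': ...}
    behind = master_offset - replica["offset"]
    state = "in sync" if behind == 0 else f"{behind} bytes behind"
    print(f"replica {replica['ip']}:{replica['port']} — {state} (lag={replica['lag']}s)")
```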
Step 9: Security Audit
The agent evaluates security configuration:
Security Analysis:
FAIL: protected-mode = no
Redis accessible from any network interface
FIX: protected-mode yes — refuses non-local connections when no bind/password is configured
OR: Ensure bind 127.0.0.1 or specific internal IPs
FAIL: bind 0.0.0.0
Listening on all interfaces including public
RISK: Internet-accessible Redis — common crypto mining target
FIX: bind 127.0.0.1 10.0.1.20 (localhost + internal network only)
FAIL: No TLS configured
Data in transit is unencrypted — visible to network sniffers
FIX: For Redis 6+:
tls-port 6380
tls-cert-file /etc/redis/tls/redis.crt
tls-key-file /etc/redis/tls/redis.key
tls-ca-cert-file /etc/redis/tls/ca.crt
port 0 (disable non-TLS port)
FAIL: Using single password (requirepass) instead of ACL
All clients share one password with full access
FIX: Use Redis ACL (6.0+) for least-privilege access:
user app-read on >readpass ~app:cache:* -@all +@read
user app-write on >writepass ~app:* +@read +@write -@admin
user admin on >adminpass ~* +@all
WARN: rename-command used for security
rename-command FLUSHALL ""
rename-command CONFIG ""
NOTE: rename-command is deprecated in Redis 7+
FIX: Use ACL to restrict dangerous commands instead:
user default on >pass -FLUSHALL -FLUSHDB -CONFIG -DEBUG
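A final sketch that spot-checks these security settings on a running instance (connection details are placeholders and assume you can already authenticate; ACL LIST requires Redis 6+):

```python
from redis import Redis

r = Redis(host="10.0.1.20", port=6379, password="secret", decode_responses=True)

def conf(name: str) -> str:
    return r.config_get(name).get(name, "")

acl_users = r.acl_list()              # includes the built-in default user
if not conf("requirepass") and len(acl_users) <= 1:
    print("FAIL: no requirepass and no custom ACL users — unauthenticated access")
if conf("bind") == "" or "0.0.0.0" in conf("bind"):
    print("FAIL: listening on all interfaces — bind to internal IPs only")
if conf("protected-mode") == "no":
    print("WARN: protected-mode disabled")
if conf("tls-port") in ("", "0"):
    print("WARN: TLS not enabled (Redis 6+: tls-port, tls-cert-file, tls-key-file)")
```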
Step 10: Produce the Analysis Report
The agent generates a comprehensive report:
# Redis Configuration Analysis Report
# Deployment: Sentinel | Date: April 30, 2026
## Overview
Deployment type: Sentinel (3 sentinels, 1 master, 2 replicas)
Redis version: 7.2
Total memory: 3.2 GB used / 4 GB max
Total keys: 2.1M
Connected clients: 127
## Overall Health Score: 48/100
## Category Scores
Sentinel/Cluster Config: 5/10 (aggressive timeouts, no alerting)
Memory Management: 4/10 (no eviction policy, fragmentation)
Persistence: 4/10 (AOF disabled, fork storm risk)
Connection Settings: 5/10 (no timeout, no auth, no TLS)
Replication: 5/10 (small backlog, no min-replicas)
Key Design: 4/10 (no TTL, large keys, hot keys)
Security: 3/10 (no TLS, no ACL, bound to 0.0.0.0)
Performance: 6/10 (lazy-free disabled, no defrag)
## Critical Issues
1. No authentication — Redis accessible without password
2. Bound to 0.0.0.0 — exposed to public network
3. maxmemory-policy noeviction — writes fail when memory full
4. AOF disabled — up to 5 minutes of data loss on crash
5. Replication backlog 1 MB — full resync on brief disconnect
## Recommendations Summary
Estimated effort: 2-3 days for critical + high priority fixes
Expected improvement: 48 -> 82 health score
Risk reduction: Eliminates security exposure and data loss scenarios
Output
The agent produces:
- Health score: 0-100 overall Redis configuration quality rating
- Category scores: granular ratings for each quality dimension
- Topology diagram: text-based visualization of Sentinel/Cluster layout
- Critical issues: problems that pose availability or security risk
- Memory analysis: usage, fragmentation, eviction, and key distribution
- Persistence review: RDB/AOF configuration with durability assessment
- Replication health: lag, backlog, and failover readiness
- Security audit: authentication, encryption, and access control
- Remediation config: exact redis.conf directives to fix each issue
- Priority matrix: issues ranked by risk and effort
Deployment Type Support
| Feature | Standalone | Sentinel | Cluster |
|---|---|---|---|
| HA analysis | N/A | Full sentinel audit | Slot + node analysis |
| Failover review | N/A | Quorum, timeouts | Node-timeout, migration |
| Memory analysis | Single node | Master + replicas | Per-shard distribution |
| Key distribution | N/A | N/A | Slot hotspot detection |
| Scaling advice | Vertical only | Add replicas | Reshard + add nodes |
Tips for Best Results
- Provide both redis.conf and sentinel.conf for complete analysis
- Include application connection code for pool configuration review
- Share Redis INFO output for runtime metrics correlation
- For Cluster deployments, provide all node configurations
- Run during peak traffic hours for realistic hotspot detection
- Combine with slow log analysis (SLOWLOG GET) for performance correlation