
Kafka Engineer

Purpose

Provides Apache Kafka and event-streaming expertise, specializing in scalable event-driven architectures and real-time data pipelines. Builds fault-tolerant streaming platforms with exactly-once processing, Kafka Connect pipelines, and Schema Registry management.

When to Use

  • Designing event-driven microservices architectures

  • Setting up Kafka Connect pipelines (CDC, S3 Sink)

  • Writing stream processing apps (Kafka Streams / ksqlDB)

  • Debugging consumer lag, rebalancing storms, or broker performance

  • Designing schemas (Avro/Protobuf) with Schema Registry

  • Configuring ACLs and mTLS security

  1. Decision Framework

Architecture Selection

```
What is the use case?
│
├─ Data Integration (ETL)
│   ├─ DB to DB/Data Lake? → Kafka Connect (zero code)
│   └─ Complex transformations? → Kafka Streams
│
├─ Real-Time Analytics
│   ├─ SQL-like queries? → ksqlDB (quick aggregation)
│   └─ Complex stateful logic? → Kafka Streams / Flink
│
└─ Microservices Comm
    ├─ Event Notification? → Standard Producer/Consumer
    └─ Event Sourcing? → State Stores (RocksDB)
```

Config Tuning (The "Big 3")

  • Throughput: batch.size, linger.ms, compression.type=lz4.

  • Latency: linger.ms=0, acks=1.

  • Durability: acks=all, min.insync.replicas=2, replication.factor=3 (see the sketch below).
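
A minimal Java sketch of these three profiles; the bootstrap address is a placeholder and the exact values should be tuned per workload:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

// Hypothetical helper: one Properties object per tuning profile.
// Values mirror the "Big 3" bullets above.
public class TuningProfiles {

    public static Properties throughputProfile() {
        Properties props = base();
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);        // larger batches
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);            // wait to fill batches
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // cheap compression
        return props;
    }

    public static Properties latencyProfile() {
        Properties props = base();
        props.put(ProducerConfig.LINGER_MS_CONFIG, 0);  // send immediately
        props.put(ProducerConfig.ACKS_CONFIG, "1");     // leader ack only
        return props;
    }

    public static Properties durabilityProfile() {
        Properties props = base();
        props.put(ProducerConfig.ACKS_CONFIG, "all");              // wait for all ISRs
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // no duplicates on retry
        // min.insync.replicas=2 and replication.factor=3 are topic/broker settings,
        // configured server-side rather than on the producer.
        return props;
    }

    private static Properties base() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        return props;
    }
}
```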

Red Flags → Escalate to sre-engineer:

  • "Unclean leader election" enabled (Data loss risk)

  • Zookeeper dependency in new clusters (Use KRaft mode)

  • Disk usage > 80% on brokers

  • Consumer lag constantly increasing (Capacity mismatch)

  2. Core Workflows

Workflow 1: Kafka Connect (CDC)

Goal: Stream changes from PostgreSQL to S3.

Steps:

Source Config (postgres-source.json)

```json
{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db-host",
    "database.dbname": "mydb",
    "database.user": "kafka",
    "plugin.name": "pgoutput"
  }
}
```

Sink Config (s3-sink.json)

```json
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "s3.bucket.name": "my-datalake",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "1000"
  }
}
```

Deploy
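
The deploy command itself is missing from the upstream text. The standard approach is to POST each config to the Kafka Connect REST API; a sketch, assuming a Connect worker at the default port 8083:

```bash
# Assumes a Connect worker reachable at connect-worker:8083
curl -X POST -H "Content-Type: application/json" \
  --data @postgres-source.json http://connect-worker:8083/connectors

curl -X POST -H "Content-Type: application/json" \
  --data @s3-sink.json http://connect-worker:8083/connectors
```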

Workflow 3: Schema Registry Integration

Goal: Enforce schema compatibility.

Steps:

Define Schema (user.avsc)

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}
```

Producer (Java)
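
The Java snippet is missing upstream. A minimal sketch using Confluent's KafkaAvroSerializer; the broker address, registry URL, and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");    // placeholder

        Schema schema = new Schema.Parser().parse(new java.io.File("user.avsc"));
        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 42);
        user.put("name", "Alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer registers/validates the schema with the registry,
            // enforcing the configured compatibility mode on every send.
            producer.send(new ProducerRecord<>("users", "42", user)).get();
        }
    }
}
```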

  3. Anti-Patterns & Gotchas

❌ Anti-Pattern 1: Large Messages

What it looks like:

  • Sending a 10 MB image payload in a Kafka message.

Why it fails:

  • Kafka is optimized for small messages (< 1 MB). Large messages block broker request threads.

Correct approach:

  • Store the image in S3.

  • Send a reference URL in the Kafka message (the claim-check pattern), as sketched below.
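
A minimal Java sketch of this claim-check pattern, assuming the AWS SDK v2; the bucket, key prefix, and topic names are illustrative:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClaimCheckProducer {
    private final S3Client s3 = S3Client.create();
    private final KafkaProducer<String, String> producer;

    public ClaimCheckProducer(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void sendImage(String imageId, byte[] imageBytes) {
        // 1. Park the large payload in object storage (bucket name is a placeholder).
        String key = "images/" + imageId;
        s3.putObject(PutObjectRequest.builder()
                        .bucket("my-datalake")
                        .key(key)
                        .build(),
                RequestBody.fromBytes(imageBytes));

        // 2. Publish only a small reference; consumers fetch the object on demand.
        String reference = "s3://my-datalake/" + key;
        producer.send(new ProducerRecord<>("image-events", imageId, reference));
    }
}
```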

❌ Anti-Pattern 2: Too Many Partitions

What it looks like:

  • Creating 10,000 partitions on a small cluster.

Why it fails:

  • Slow leader election (Zookeeper overhead).

  • High file handle usage.

Correct approach:

  • Limit partitions per broker (~4000). Use fewer topics or larger clusters.

❌ Anti-Pattern 3: Blocking Consumer

What it looks like:

  • Consumer making a heavy HTTP call (30 s) for each message.

Why it fails:

  • Rebalance storm: the consumer exceeds max.poll.interval.ms, is ejected from the group, and triggers repeated rebalances.

Correct approach:

  • Async processing: move the work to a thread pool.

  • Pause/resume: call consumer.pause() while the local buffer drains, then consumer.resume() (sketched below).
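
A sketch of that pattern, with placeholder topic name and pool sizing:

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PauseResumeLoop {
    public static void run(KafkaConsumer<String, String> consumer) {
        // Bounded pool; CallerRunsPolicy degrades gracefully if the queue overflows
        // between the capacity check and the submit.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                8, 8, 0, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());

        consumer.subscribe(List.of("orders")); // placeholder topic
        boolean paused = false;

        while (true) {
            // poll() keeps being called so the consumer stays in the group.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
            for (ConsumerRecord<String, String> record : records) {
                pool.execute(() -> process(record)); // hand off the slow work
            }

            boolean bufferFull = pool.getQueue().remainingCapacity() == 0;
            if (bufferFull && !paused) {
                consumer.pause(consumer.assignment()); // stop fetching, keep heartbeating
                paused = true;
            } else if (!bufferFull && paused) {
                consumer.resume(consumer.assignment());
                paused = false;
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // e.g. the slow HTTP call lives here, off the polling thread
    }
}
```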

  4. Quality Checklist

Configuration:

  • Replication: Factor 3 for production.

  • min.insync.replicas: 2 (prevents data loss when acks=all).

  • Retention: Configured correctly (Time vs Size).

Observability:

  • Lag: Consumer Lag monitored (Burrow/Prometheus).

  • Under-replicated: Alert on under-replicated partitions (>0).

  • JMX: Metrics exported.

Security:

  • Encryption: TLS enabled for client-broker and inter-broker traffic.

  • Auth: SASL/SCRAM or mTLS enabled.

  • ACLs: Principle of least privilege (topic read/write).

Examples

Example 1: Real-Time Fraud Detection Pipeline

Scenario: A financial services company needs real-time fraud detection using Kafka streaming.

Architecture Implementation:

  • Event Ingestion: Kafka Connect CDC from PostgreSQL transaction database

  • Stream Processing: Kafka Streams application for real-time pattern detection

  • Alert System: Producer to alert topic triggering notifications

  • Storage: S3 sink for historical analysis and compliance

Pipeline Configuration:

| Component   | Configuration                      | Purpose                 |
|-------------|------------------------------------|-------------------------|
| Topics      | 3 (transactions, alerts, enriched) | Data organization       |
| Partitions  | 12 (3 brokers × 4)                 | Parallelism             |
| Replication | 3                                  | High availability       |
| Compression | LZ4                                | Throughput optimization |

Key Logic:

  • Detects velocity patterns (5+ transactions in 1 minute; sketched after this list)

  • Identifies geographic anomalies (impossible travel)

  • Flags high-risk merchant categories
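
A hedged Kafka Streams sketch of the velocity rule; the topic names and serdes are assumptions, and only the 5-per-minute threshold comes from the description above:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class VelocityRule {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("transactions", Consumed.with(Serdes.String(), Serdes.String()))
               // key = card/account ID, so counts are per account
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // tumbling 1-minute window per the velocity rule above
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .count()
               .toStream()
               // 5+ transactions within the window → raise an alert
               .filter((windowedKey, count) -> count >= 5)
               .map((windowedKey, count) ->
                       KeyValue.pair(windowedKey.key(), "velocity:" + count))
               .to("alerts", Produced.with(Serdes.String(), Serdes.String()));

        return builder;
    }
}
```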

Results:

  • 99.7% of fraud detected in under 100ms

  • False positive rate reduced from 5% to 0.3%

  • Compliance audit passed with zero findings

Example 2: E-Commerce Order Processing System

Scenario: Build a resilient order processing system with Kafka for high reliability.

System Design:

  • Order Events: Topic for order lifecycle events

  • Inventory Service: Consumes orders, updates stock

  • Payment Service: Processes payments, publishes results

  • Notification Service: Sends confirmations via email/SMS

Resilience Patterns:

  • Dead Letter Queue for failed processing (sketched after this list)

  • Idempotent producers for exactly-once semantics

  • Consumer groups with manual offset management

  • Retries with exponential backoff
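
A minimal dead-letter sketch, assuming String records and a hypothetical orders.dlq topic:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DeadLetterHandler {
    private final KafkaProducer<String, String> producer;

    public DeadLetterHandler(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void handle(ConsumerRecord<String, String> record, Exception error) {
        // Forward the poison message unchanged, tagging it with failure context
        // so operators can inspect and replay it later.
        ProducerRecord<String, String> dead =
                new ProducerRecord<>("orders.dlq", record.key(), record.value());
        dead.headers().add("x-error", String.valueOf(error.getMessage()).getBytes());
        dead.headers().add("x-source-topic", record.topic().getBytes());
        producer.send(dead);
    }
}
```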

Configuration:

Producer Configuration

```yaml
acks: all
retries: 3
enable.idempotence: true
```

Consumer Configuration

```yaml
auto.offset.reset: earliest
enable.auto.commit: false
max.poll.records: 500
```

Results:

  • 99.99% message delivery reliability

  • Zero duplicate orders in 6 months

  • Peak processing: 10,000 orders/second

Example 3: IoT Telemetry Platform

Scenario: Process millions of IoT device telemetry messages with Kafka.

Platform Architecture:

  • Device Gateway: MQTT to Kafka proxy

  • Data Enrichment: Stream processing adds device metadata

  • Time-Series Storage: S3 sink partitioned by device_id/date

  • Real-Time Alerts: Threshold-based alerting for anomalies

Scalability Configuration:

  • 50 partitions for parallel processing

  • Compression enabled for cost optimization

  • Retention: 7 days hot, 1 year cold in S3

  • Schema Registry for data contracts

Performance Metrics:

| Metric             | Value                          |
|--------------------|--------------------------------|
| Throughput         | 500,000 messages/sec           |
| Latency (P99)      | 50 ms                          |
| Consumer lag       | < 1 second                     |
| Storage efficiency | 60% reduction with compression |

Best Practices

Topic Design

  • Naming Conventions: Use clear, hierarchical topic names (domain.entity.event)

  • Partition Strategy: Plan for future growth (3x expected throughput)

  • Retention Policies: Match retention to business requirements

  • Cleanup Policies: Use delete for time-based, compact for state

  • Schema Management: Enforce schemas via Schema Registry

Producer Optimization

  • Batching: Increase batch.size and linger.ms for throughput

  • Compression: Use LZ4 for balance of speed and size

  • Acks Configuration: Use all for reliability, 1 for latency

  • Retry Strategy: Implement retries with backoff

  • Idempotence: Enable for exactly-once semantics in critical paths

Consumer Best Practices

  • Offset Management: Use manual commit for critical processing (see the sketch after this list)

  • Batch Processing: Increase max.poll.records for efficiency

  • Rebalance Handling: Implement graceful shutdown

  • Error Handling: Dead letter queues for poison messages

  • Monitoring: Track consumer lag and processing time
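
A minimal at-least-once loop with manual commits; the topic name is a placeholder, and enable.auto.commit=false is assumed as configured above:

```java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitLoop {
    public static void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("orders")); // placeholder topic
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this throws, the offset is never committed
                }
                // Commit only after the whole batch succeeded: at-least-once delivery.
                consumer.commitSync();
            }
        } finally {
            consumer.close(); // graceful shutdown triggers a clean rebalance
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // business logic
    }
}
```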

Security Configuration

  • Encryption: TLS for all client-broker communication

  • Authentication: SASL/SCRAM or mTLS for production (client config sketched after this list)

  • Authorization: ACLs with least privilege principle

  • Quotas: Implement client quotas to prevent abuse

  • Audit Logging: Log all access and configuration changes
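
A hedged client-side sketch of SASL/SCRAM over TLS; the broker address, credentials, and truststore path are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.common.config.SslConfigs;

public class SecureClientConfig {
    public static Properties scramOverTls() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker:9093"); // placeholder
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"app-user\" password=\"app-secret\";"); // placeholders
        // Trust the cluster CA so the broker certificate can be verified.
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "changeit");
        return props;
    }
}
```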

Performance Tuning

  • Broker Configuration: Optimize for workload type (throughput vs latency)

  • JVM Tuning: Heap size and garbage collector selection

  • OS Tuning: File descriptor limits, network settings

  • Monitoring: Metrics for throughput, latency, and errors

  • Capacity Planning: Regular review and scaling assessment


