Kafka Infrastructure as Code (IaC) Deployment
Expert guidance for deploying Apache Kafka using Terraform across multiple platforms.
When to Use This Skill
I activate when you need help with:
- Terraform deployments: "Deploy Kafka with Terraform", "provision Kafka cluster"
- Platform selection: "Should I use AWS MSK or self-hosted Kafka?", "compare Kafka platforms"
- Infrastructure planning: "How to size Kafka infrastructure", "Kafka on AWS vs Azure"
- IaC automation: "Automate Kafka deployment", "CI/CD for Kafka infrastructure"
What I Know
Available Terraform Modules
This plugin provides 3 production-ready Terraform modules:
- Apache Kafka (Self-Hosted, KRaft Mode)
  - Location: plugins/specweave-kafka/terraform/apache-kafka/
  - Platform: AWS EC2 (can adapt to other clouds)
  - Architecture: KRaft mode (no ZooKeeper dependency)
  - Features:
    - Multi-broker cluster (3-5 brokers recommended)
    - Security groups with SASL_SSL
    - IAM roles for S3 backups
    - CloudWatch metrics and alarms
    - Auto-scaling group support
    - Custom VPC and subnet configuration
  - Use When:
    - ✅ You need full control over Kafka configuration
    - ✅ Running Kafka 3.6+ (KRaft mode)
    - ✅ Want to avoid ZooKeeper operational overhead
    - ✅ Multi-cloud or hybrid deployments
  - Variables:

        module "kafka" {
          source = "../../plugins/specweave-kafka/terraform/apache-kafka"

          environment       = "production"
          broker_count      = 3
          kafka_version     = "3.7.0"
          instance_type     = "m5.xlarge"
          vpc_id            = var.vpc_id
          subnet_ids        = var.subnet_ids
          domain            = "example.com"
          enable_s3_backups = true
          enable_monitoring = true
        }
- AWS MSK (Managed Streaming for Kafka)
  - Location: plugins/specweave-kafka/terraform/aws-msk/
  - Platform: AWS Managed Service
  - Features:
    - Fully managed Kafka service
    - IAM authentication + SASL/SCRAM
    - Auto-scaling (provisioned throughput)
    - Built-in monitoring (CloudWatch)
    - Multi-AZ deployment
    - Encryption in transit and at rest
  - Use When:
    - ✅ You want AWS to manage Kafka operations
    - ✅ Need tight AWS integration (IAM, KMS, CloudWatch)
    - ✅ Prefer operational simplicity over cost
    - ✅ Running in AWS VPC
  - Variables:

        module "msk" {
          source = "../../plugins/specweave-kafka/terraform/aws-msk"

          cluster_name              = "my-kafka-cluster"
          kafka_version             = "3.6.0"
          number_of_broker_nodes    = 3
          broker_node_instance_type = "kafka.m5.large"

          vpc_id     = var.vpc_id
          subnet_ids = var.private_subnet_ids

          enable_iam_auth     = true
          enable_scram_auth   = false
          enable_auto_scaling = true
        }
- Azure Event Hubs (Kafka API)
  - Location: plugins/specweave-kafka/terraform/azure-event-hubs/
  - Platform: Azure Managed Service
  - Features:
    - Kafka 1.0+ protocol support
    - Auto-inflate (elastic scaling)
    - Premium SKU for high throughput
    - Zone redundancy
    - Private endpoints (VNet integration)
    - Event capture to Azure Storage
  - Use When:
    - ✅ Running on Azure cloud
    - ✅ Need Kafka-compatible API without Kafka operations
    - ✅ Want serverless scaling (auto-inflate)
    - ✅ Integrating with Azure ecosystem
  - Variables:

        module "event_hubs" {
          source = "../../plugins/specweave-kafka/terraform/azure-event-hubs"

          namespace_name      = "my-event-hub-ns"
          resource_group_name = var.resource_group_name
          location            = "eastus"

          sku                      = "Premium"
          capacity                 = 1
          kafka_enabled            = true
          auto_inflate_enabled     = true
          maximum_throughput_units = 20
        }
Platform Selection Decision Tree
Need Kafka deployment? START HERE:

    ├─ Running on AWS?
    │  ├─ YES → Want managed service?
    │  │  ├─ YES → Use AWS MSK module (terraform/aws-msk)
    │  │  └─ NO → Use Apache Kafka module (terraform/apache-kafka)
    │  └─ NO → Continue...
    │
    ├─ Running on Azure?
    │  ├─ YES → Use Azure Event Hubs module (terraform/azure-event-hubs)
    │  └─ NO → Continue...
    │
    ├─ Multi-cloud or hybrid?
    │  └─ YES → Use Apache Kafka module (most portable)
    │
    ├─ Need maximum control?
    │  └─ YES → Use Apache Kafka module
    │
    └─ Default → Use Apache Kafka module (self-hosted, KRaft mode)

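The decision tree can also be written as a small lookup for use in scripts. This is a minimal sketch: `select_kafka_module` is a hypothetical helper, not part of the plugin, and it returns the module directories named in the tree.

```shell
# Hypothetical helper: maps the decision tree to a module directory.
#   $1 cloud:   aws | azure | anything else (multi-cloud, hybrid, on-premises)
#   $2 managed: yes | no (only consulted when cloud is aws; defaults to no)
select_kafka_module() {
  cloud="$1"
  managed="${2:-no}"
  case "$cloud" in
    aws)
      if [ "$managed" = "yes" ]; then
        echo "terraform/aws-msk"
      else
        echo "terraform/apache-kafka"
      fi
      ;;
    azure)
      echo "terraform/azure-event-hubs"
      ;;
    *)
      # Multi-cloud, hybrid, maximum control, and the default branch
      echo "terraform/apache-kafka"
      ;;
  esac
}
```

For example, `select_kafka_module aws yes` selects the MSK module, while any unrecognized cloud falls through to the self-hosted module.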
Deployment Workflows
Workflow 1: Deploy Self-Hosted Kafka (Apache Kafka Module)
Scenario: You want full control over Kafka on AWS EC2

    # 1. Create Terraform configuration
    cat > main.tf <<EOF
    module "kafka_cluster" {
      source = "../../plugins/specweave-kafka/terraform/apache-kafka"

      environment   = "production"
      broker_count  = 3
      kafka_version = "3.7.0"
      instance_type = "m5.xlarge"

      vpc_id     = "vpc-12345678"
      subnet_ids = ["subnet-abc", "subnet-def", "subnet-ghi"]
      domain     = "kafka.example.com"

      enable_s3_backups = true
      enable_monitoring = true

      tags = {
        Project     = "MyApp"
        Environment = "Production"
      }
    }

    output "broker_endpoints" {
      value = module.kafka_cluster.broker_endpoints
    }
    EOF

    # 2. Initialize Terraform
    terraform init

    # 3. Plan deployment (review what will be created)
    terraform plan

    # 4. Apply (create infrastructure)
    terraform apply

    # 5. Get broker endpoints
    terraform output broker_endpoints
    # Output: ["kafka-0.kafka.example.com:9093", "kafka-1.kafka.example.com:9093", ...]

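Kafka clients expect `bootstrap.servers` as a comma-separated string rather than a list. The sketch below flattens the JSON form of the `broker_endpoints` output; `to_bootstrap_servers` is a hypothetical helper, and it assumes the output is a flat array of plain `host:port` strings.

```shell
# Hypothetical helper: reads a JSON array of "host:port" strings on stdin and
# prints them as one comma-separated line (no trailing newline).
# Assumes the endpoints contain no brackets, quotes, or whitespace of their own.
to_bootstrap_servers() {
  tr -d '[]" \t\n'
}

# Usage (assumes the broker_endpoints output defined in the workflow above):
#   terraform output -json broker_endpoints | to_bootstrap_servers
```

Stripping characters with `tr` keeps the sketch dependency-free; a JSON-aware tool is a safer choice if the output shape ever changes.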
Workflow 2: Deploy AWS MSK (Managed Service)
Scenario: You want AWS to manage Kafka operations

    # 1. Create Terraform configuration
    cat > main.tf <<EOF
    # Input variables referenced by the module block below
    variable "vpc_id" {
      type = string
    }

    variable "private_subnet_ids" {
      type = list(string)
    }

    module "msk_cluster" {
      source = "../../plugins/specweave-kafka/terraform/aws-msk"

      cluster_name              = "my-msk-cluster"
      kafka_version             = "3.6.0"
      number_of_broker_nodes    = 3
      broker_node_instance_type = "kafka.m5.large"

      vpc_id     = var.vpc_id
      subnet_ids = var.private_subnet_ids

      enable_iam_auth     = true
      enable_auto_scaling = true

      tags = {
        Project = "MyApp"
      }
    }

    output "bootstrap_brokers" {
      value = module.msk_cluster.bootstrap_brokers_sasl_iam
    }
    EOF

    # 2. Deploy
    terraform init && terraform apply

    # 3. Configure IAM authentication
    # (module outputs IAM policy, attach to your application role)

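On the client side, IAM authentication also needs matching Kafka client settings. The following sketch writes a properties file for a Java client; `write_msk_iam_properties` is a hypothetical helper, and it assumes the `aws-msk-iam-auth` library (which provides the classes named below) is on the client's classpath.

```shell
# Hypothetical helper: writes Kafka client properties for MSK IAM authentication.
#   $1 bootstrap brokers string (for example from
#      `terraform output -raw bootstrap_brokers`)
#   $2 output path (defaults to client.properties)
write_msk_iam_properties() {
  brokers="$1"
  out="${2:-client.properties}"
  cat > "$out" <<EOF
bootstrap.servers=${brokers}
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
EOF
}
```

The application's IAM role still needs the policy mentioned in step 3; the properties file only tells the client how to present those credentials.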
Workflow 3: Deploy Azure Event Hubs (Kafka API)
Scenario: You're on Azure and want Kafka-compatible API

    # 1. Create Terraform configuration
    cat > main.tf <<EOF
    module "event_hubs" {
      source = "../../plugins/specweave-kafka/terraform/azure-event-hubs"

      namespace_name      = "my-kafka-namespace"
      resource_group_name = "my-resource-group"
      location            = "eastus"

      sku                      = "Premium"
      capacity                 = 1
      kafka_enabled            = true
      auto_inflate_enabled     = true
      maximum_throughput_units = 20

      # Create hubs (topics) for your use case
      hubs = [
        { name = "user-events", partitions = 12 },
        { name = "order-events", partitions = 6 },
        { name = "payment-events", partitions = 3 }
      ]
    }

    output "connection_string" {
      value     = module.event_hubs.connection_string
      sensitive = true
    }
    EOF

    # 2. Deploy
    terraform init && terraform apply

    # 3. Get connection details
    terraform output connection_string

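Kafka clients reach Event Hubs through the namespace's Kafka endpoint on port 9093, using SASL PLAIN with the literal username `$ConnectionString` and the connection string as the password. The sketch below turns the connection string into a client properties file; `write_event_hubs_properties` is a hypothetical helper.

```shell
# Hypothetical helper: writes Kafka client properties for an Event Hubs namespace.
#   $1 namespace name
#   $2 namespace connection string (for example from
#      `terraform output -raw connection_string`)
#   $3 output path (defaults to eventhubs.properties)
write_event_hubs_properties() {
  namespace="$1"
  conn="$2"
  out="${3:-eventhubs.properties}"
  {
    printf 'bootstrap.servers=%s.servicebus.windows.net:9093\n' "$namespace"
    printf 'security.protocol=SASL_SSL\n'
    printf 'sasl.mechanism=PLAIN\n'
    # The username is the literal text $ConnectionString, not a shell variable,
    # which is why the format string is single-quoted.
    printf 'sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="%s";\n' "$conn"
  } > "$out"
  chmod 600 "$out"  # the file contains a secret
}
```

Because the file embeds a credential, it is better generated at deploy time than committed to version control.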
Infrastructure Sizing Recommendations
Small Environment (Dev/Test)
Self-hosted: 1 broker, m5.large

    broker_count  = 1
    instance_type = "m5.large"

AWS MSK: 1 broker per AZ (3 brokers across 3 AZs), kafka.m5.large

    number_of_broker_nodes    = 3
    broker_node_instance_type = "kafka.m5.large"

Azure Event Hubs: Standard SKU (the Basic tier does not support the Kafka endpoint)

    sku      = "Standard"
    capacity = 1

Medium Environment (Staging/Production)
Self-hosted: 3 brokers, m5.xlarge

    broker_count  = 3
    instance_type = "m5.xlarge"

AWS MSK: 3 brokers, kafka.m5.xlarge

    number_of_broker_nodes    = 3
    broker_node_instance_type = "kafka.m5.xlarge"

Azure Event Hubs: Standard SKU with auto-inflate

    sku                      = "Standard"
    capacity                 = 2
    auto_inflate_enabled     = true
    maximum_throughput_units = 10

Large Environment (High-Throughput Production)
Self-hosted: 5+ brokers, m5.2xlarge or m5.4xlarge

    broker_count  = 5
    instance_type = "m5.2xlarge"

AWS MSK: 6+ brokers, kafka.m5.2xlarge, auto-scaling

    number_of_broker_nodes    = 6
    broker_node_instance_type = "kafka.m5.2xlarge"
    enable_auto_scaling       = true

Azure Event Hubs: Premium SKU with zone redundancy

    sku                      = "Premium"
    capacity                 = 4
    zone_redundant           = true
    maximum_throughput_units = 20

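One way to keep these tiers switchable is to generate a variables file per environment. The sketch below covers only the self-hosted module; `self_hosted_tfvars` is a hypothetical helper, and it assumes your root configuration declares `broker_count` and `instance_type` as input variables and passes them through to the module instead of hard-coding them.

```shell
# Hypothetical helper: prints tfvars for the self-hosted module by environment size.
# Values mirror the small, medium, and large recommendations above.
self_hosted_tfvars() {
  case "$1" in
    small)  printf 'broker_count  = 1\ninstance_type = "m5.large"\n' ;;
    medium) printf 'broker_count  = 3\ninstance_type = "m5.xlarge"\n' ;;
    large)  printf 'broker_count  = 5\ninstance_type = "m5.2xlarge"\n' ;;
    *)
      echo "unknown size: $1 (expected small, medium, or large)" >&2
      return 1
      ;;
  esac
}

# Usage: Terraform loads *.auto.tfvars files automatically.
#   self_hosted_tfvars medium > sizing.auto.tfvars
```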
Best Practices
Security Best Practices
Always use encryption in transit
- Self-hosted: Enable SASL_SSL listener
- AWS MSK: Set encryption_in_transit_client_broker = "TLS"
- Azure Event Hubs: HTTPS/TLS enabled by default
Use IAM authentication (when possible)
- AWS MSK: enable_iam_auth = true
- Azure Event Hubs: Managed identities
Network isolation
- Deploy in private subnets
- Use security groups/NSGs restrictively
- Azure: Enable private endpoints for Premium SKU
High Availability Best Practices
Multi-AZ deployment
- Self-hosted: Distribute brokers across 3+ AZs
- AWS MSK: Automatically multi-AZ
- Azure Event Hubs: Enable zone_redundant = true (Premium)
Replication factor = 3
- Self-hosted: default.replication.factor=3
- AWS MSK: Configured automatically
- Azure Event Hubs: N/A (fully managed)
min.insync.replicas = 2
- Ensures durability even if 1 broker fails
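The relationship between these two settings is simple arithmetic: with `acks=all`, a partition keeps accepting writes as long as at least `min.insync.replicas` replicas are in sync, so the number of broker failures it tolerates is the replication factor minus `min.insync.replicas`. A minimal sketch, with `tolerated_broker_failures` as a hypothetical helper:

```shell
# Hypothetical helper: broker failures a partition tolerates while still
# accepting acks=all writes.
#   $1 replication factor   $2 min.insync.replicas
tolerated_broker_failures() {
  rf="$1"
  min_isr="$2"
  if [ "$min_isr" -gt "$rf" ]; then
    echo "min.insync.replicas ($min_isr) cannot exceed replication factor ($rf)" >&2
    return 1
  fi
  echo $((rf - min_isr))
}
```

With the recommended values, `tolerated_broker_failures 3 2` prints 1, which matches the single-broker failure described above. Setting both values to 3 would print 0, meaning any broker outage blocks `acks=all` producers.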
Cost Optimization
Right-size instances
- Use ClusterSizingCalculator utility (in kafka-architecture skill)
- Start small, scale up based on metrics
Auto-scaling (where available)
- AWS MSK: enable_auto_scaling = true
- Azure Event Hubs: auto_inflate_enabled = true
Retention policies
- Set log.retention.hours based on actual needs (default: 168 hours = 7 days)
- Shorter retention = lower storage costs
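To see how much retention affects storage, a rough estimate multiplies ingest rate by retention time and replication factor. The sketch below is a back-of-the-envelope calculation only; `estimate_storage_gb` is a hypothetical helper that takes whole numbers and ignores compression, index files, and headroom.

```shell
# Hypothetical helper: rough cluster-wide log storage estimate in GB
# (MB divided by 1024, rounded down).
#   $1 average ingest in MB/s (integer)
#   $2 log.retention.hours (integer)
#   $3 replication factor (integer)
estimate_storage_gb() {
  mb_per_sec="$1"
  retention_hours="$2"
  rf="$3"
  echo $((mb_per_sec * 3600 * retention_hours * rf / 1024))
}
```

For 5 MB/s at replication factor 3, `estimate_storage_gb 5 168 3` prints 8859, while cutting retention to 24 hours with `estimate_storage_gb 5 24 3` prints 1265.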
Monitoring Integration
All modules integrate with monitoring:
Self-Hosted Kafka
- CloudWatch metrics (via JMX Exporter)
- Prometheus + Grafana dashboards (see kafka-observability skill)
- Custom CloudWatch alarms
AWS MSK
- Built-in CloudWatch metrics
- Enhanced monitoring available
- Integration with CloudWatch Alarms
Azure Event Hubs
- Built-in Azure Monitor metrics
- Diagnostic logs to Log Analytics
- Integration with Azure Alerts
Troubleshooting
"Terraform destroy fails on security groups"
Cause: Resources using security groups still exist
Fix:

    # 1. Find dependent resources
    aws ec2 describe-network-interfaces --filters "Name=group-id,Values=sg-12345678"

    # 2. Delete dependent resources first
    # 3. Retry terraform destroy

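The full `describe-network-interfaces` response is verbose. The sketch below narrows it to the fields that usually identify the blocking resource; `list_sg_dependents` is a hypothetical wrapper, and the `--query` expression assumes the standard response shape of that command.

```shell
# Hypothetical wrapper: lists network interfaces still attached to a security group.
# Prints one line per interface: ID, status, description.
#   $1 security group ID
list_sg_dependents() {
  sg_id="$1"
  aws ec2 describe-network-interfaces \
    --filters "Name=group-id,Values=${sg_id}" \
    --query 'NetworkInterfaces[].[NetworkInterfaceId,Status,Description]' \
    --output text
}
```

The description column often names the owning service (for example a load balancer or a Lambda function), which points to the resource to delete first.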
"AWS MSK cluster takes 20+ minutes to create"
Cause: MSK provisioning is inherently slow (AWS behavior)
Fix: This is expected and needs no fix. For unattended runs, skip the interactive approval prompt with -auto-approve:

    terraform apply -auto-approve

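Terraform itself blocks until the cluster is created, but a separate pipeline step (for example one that deploys applications) may need to wait for the cluster independently. The following is a minimal polling sketch; `wait_for_msk_active` is a hypothetical helper, and the attempt count and interval defaults are arbitrary assumptions.

```shell
# Hypothetical helper: polls an MSK cluster until it reports ACTIVE.
#   $1 cluster ARN
#   $2 maximum attempts (default 60)
#   $3 seconds between attempts (default 30)
wait_for_msk_active() {
  arn="$1"
  max_attempts="${2:-60}"
  interval="${3:-30}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    state="$(aws kafka describe-cluster --cluster-arn "$arn" \
      --query 'ClusterInfo.State' --output text)"
    case "$state" in
      ACTIVE) echo "cluster is ACTIVE"; return 0 ;;
      FAILED) echo "cluster entered FAILED state" >&2; return 1 ;;
    esac
    echo "attempt $attempt/$max_attempts: state=$state"
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "timed out waiting for cluster to become ACTIVE" >&2
  return 1
}
```

With the defaults this waits up to 30 minutes, which leaves some margin over the 20-plus minutes that creation typically takes.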
"Azure Event Hubs: Connection refused"
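Cause: Kafka protocol not enabled OR incorrect connection settings
Fix:
- Verify kafka_enabled = true in Terraform
- Connect to the Kafka endpoint (<namespace>.servicebus.windows.net:9093) using SASL_SSL with the PLAIN mechanism, the literal username $ConnectionString, and the namespace connection string as the password
- Check firewall rules (Premium SKU supports private endpoints)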
Integration with Other Skills
- kafka-architecture: For cluster sizing and partitioning strategy
- kafka-observability: For Prometheus + Grafana setup after deployment
- kafka-kubernetes: For deploying Kafka on Kubernetes (alternative to Terraform)
- kafka-cli-tools: For testing deployed clusters with kcat
Quick Reference Commands

    # Terraform workflow
    terraform init      # Initialize modules
    terraform plan      # Preview changes
    terraform apply     # Create infrastructure
    terraform output    # Get outputs (endpoints, etc.)
    terraform destroy   # Delete infrastructure

    # AWS MSK specific
    aws kafka list-clusters                          # List MSK clusters
    aws kafka describe-cluster --cluster-arn <arn>   # Get cluster details

    # Azure Event Hubs specific
    az eventhubs namespace list   # List namespaces
    az eventhubs eventhub list --namespace-name <name> --resource-group <rg>   # List hubs

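For CI/CD pipelines, a common variation on the Terraform workflow is to save the plan to a file and apply exactly that plan, so the reviewed changes are the ones that get created. A minimal sketch, with `plan_and_apply` as a hypothetical helper and `tfplan` as an arbitrary default file name:

```shell
# Hypothetical helper: non-interactive init, plan to a file, then apply that plan.
#   $1 plan file path (defaults to tfplan)
plan_and_apply() {
  plan_file="${1:-tfplan}"
  terraform init -input=false &&
    terraform plan -input=false -out="$plan_file" &&
    terraform apply -input=false "$plan_file"
}
```

Applying a saved plan does not prompt for approval, so this form needs no `-auto-approve`; in a real pipeline the plan and apply steps are often split so a reviewer can inspect the plan in between.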
Next Steps After Deployment:
- Use kafka-observability skill to set up Prometheus + Grafana monitoring
- Use kafka-cli-tools skill to test cluster with kcat
- Deploy your producer/consumer applications
- Monitor cluster health and performance