Terragrunt Infrastructure Skill
Manage bare-metal Kubernetes infrastructure from PXE boot to running clusters.
For architecture overview (units vs modules, config centralization), see infrastructure/CLAUDE.md. For detailed unit patterns, see infrastructure/units/CLAUDE.md.
Task Commands (Always Use These)
Validation (run in order)
    task tg:fmt               # Format HCL files
    task tg:test-<module>     # Test specific module (e.g., task tg:test-config)
    task tg:validate-<stack>  # Validate stack (e.g., task tg:validate-integration)
Operations
    task tg:list           # List available stacks
    task tg:plan-<stack>   # Plan (e.g., task tg:plan-integration)
    task tg:apply-<stack>  # Apply (REQUIRES HUMAN APPROVAL)
    task tg:gen-<stack>    # Generate stack files
    task tg:clean-<stack>  # Clean generated files
NEVER run terragrunt or tofu directly—always use task commands.
How to Add a Machine
- Edit inventory.hcl:

      node50 = {
        cluster = "live"
        type    = "worker"
        install = {
          selector     = "disk.model == 'Samsung'"
          architecture = "amd64"
        }
        interfaces = [{
          id           = "eth0"
          hardwareAddr = "aa:bb:cc:dd:ee:ff"            # VERIFY correct
          addresses    = [{ ip = "192.168.10.50" }]     # VERIFY available
        }]
      }

- Run task tg:plan-live
- Review the plan; the config module auto-includes machines where cluster == "live"
- Request human approval before apply
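The two VERIFY comments above can be partly backed by a quick duplicate check before planning. The following is a minimal sketch, not part of the repository's tooling: the helper name already_in_inventory and the sample file are hypothetical, and the check only confirms whether a value already appears in an inventory file, not whether the address is actually free on the network.

```shell
#!/bin/sh
# Hypothetical pre-check: succeeds when the quoted value already appears
# in the given inventory file. Matching the surrounding quotes keeps
# "192.168.10.5" from matching "192.168.10.50".
already_in_inventory() {
  # $1 = MAC or IP to look for, $2 = path to the inventory file
  grep -qiF -- "\"$1\"" "$2"
}

# Demo against a throwaway sample instead of the real inventory.hcl.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
node50 = {
  interfaces = [{
    hardwareAddr = "aa:bb:cc:dd:ee:ff"
    addresses    = [{ ip = "192.168.10.50" }]
  }]
}
EOF

for candidate in "AA:BB:CC:DD:EE:FF" "192.168.10.50" "192.168.10.5" "192.168.10.51"; do
  if already_in_inventory "$candidate" "$sample"; then
    echo "$candidate: already used"
  else
    echo "$candidate: not found in inventory"
  fi
done
```

Against the real file, the call would presumably look like already_in_inventory "192.168.10.50" inventory.hcl, run from the directory that holds inventory.hcl.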
How to Add a Feature Flag
- Add version to versions.hcl if needed
- Add feature detection in modules/config/main.tf:

      locals {
        new_feature_enabled = contains(var.features, "new-feature")
      }

- Enable in stack's features list:

      features = ["gateway-api", "longhorn", "new-feature"]
How to Create a New Unit
- Create units/new-unit/terragrunt.hcl:

      include "root" {
        path = find_in_parent_folders("root.hcl")
      }

      terraform {
        source = "../../../.././/modules/new-unit"
      }

      dependency "config" {
        config_path = "../config"

        mock_outputs = {
          new_unit = {}
        }
      }

      inputs = dependency.config.outputs.new_unit

- Create corresponding modules/new-unit/ with variables.tf, main.tf, outputs.tf, versions.tf
- Add output from config module
- Add unit block to stacks that need it
How to Write Module Tests
Tests use OpenTofu native testing in modules/<name>/tests/*.tftest.hcl:

    # Top-level variables set defaults for ALL run blocks
    variables {
      name     = "test-cluster"
      features = ["gateway-api"]
      machines = {
        node1 = {
          cluster = "test-cluster"
          type    = "controlplane"
          # ... complete machine definition
        }
      }
    }

    run "feature_enabled" {
      command = plan

      variables {
        features = ["prometheus"] # Only override what differs
      }

      assert {
        condition     = output.prometheus_enabled == true
        error_message = "Prometheus should be enabled"
      }
    }
Run with task tg:test-config or task tg:test for all modules.
Safety Rules
- NEVER run apply without explicit human approval
- NEVER use --auto-approve flags
- NEVER guess MAC addresses or IPs; verify against inventory.hcl
- NEVER commit .terragrunt-cache/ or .terragrunt-stack/
- NEVER manually edit Terraform state
State Operations
When removing state entries with indexed resources (e.g., this["rpi4"]), xargs strips the quotes, causing errors. Use a while loop instead:

    # WRONG - xargs mangles quotes in resource names
    terragrunt state list | xargs -n 1 terragrunt state rm

    # CORRECT - while loop preserves quotes
    terragrunt state list | while read -r resource; do
      terragrunt state rm "$resource"
    done

This applies to any state operation on resources with map keys like data.talos_machine_configuration.this["rpi4"].
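The difference can be reproduced safely without touching any state. In this sketch, list_resources is a hypothetical stand-in for terragrunt state list and echo stands in for terragrunt state rm, so nothing is removed; it only shows what each pipeline would hand to the command.

```shell
#!/bin/sh
# Hypothetical stand-in for `terragrunt state list`; the address is illustrative.
list_resources() {
  printf '%s\n' 'data.talos_machine_configuration.this["rpi4"]'
}

# xargs applies its own quote processing to its input, so the double
# quotes around the map key are consumed before echo ever sees them.
via_xargs=$(list_resources | xargs -n 1 echo)

# read -r hands each line over unchanged, and quoting "$resource"
# keeps it as a single argument.
via_loop=$(list_resources | while read -r resource; do echo "$resource"; done)

echo "xargs: $via_xargs"
echo "loop:  $via_loop"
```

The xargs line prints the address with the quotes missing (this[rpi4]), which is not a valid resource address, while the loop prints it intact.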
Validation Checklist
Before requesting apply approval:
- task tg:fmt passes
- task tg:test passes (if module tests exist)
- task tg:validate passes for ALL stacks
- task tg:plan-<stack> reviewed
- No unexpected destroys in plan
- Network changes won't break connectivity
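The "no unexpected destroys" item can be supported by a small filter over saved plan output. This is a rough sketch under assumptions: count_destroys is a hypothetical helper, and it assumes the task output contains the usual OpenTofu summary lines of the form "Plan: N to add, N to change, N to destroy." (one per unit). It is an aid to the review, not a replacement for reading the plan.

```shell
#!/bin/sh
# Hypothetical helper: sums the "N to destroy" figure across every
# plan summary line read from stdin and prints the total.
count_destroys() {
  sed -n 's/.*Plan: .* \([0-9][0-9]*\) to destroy.*/\1/p' |
    awk '{ total += $1 } END { print total + 0 }'
}

# Demo with canned summary lines rather than a real plan.
sample_plan='Plan: 2 to add, 0 to change, 1 to destroy.
Plan: 1 to add, 0 to change, 0 to destroy.'

destroys=$(printf '%s\n' "$sample_plan" | count_destroys)
if [ "$destroys" -gt 0 ]; then
  echo "Plan destroys $destroys resource(s): review before requesting approval"
else
  echo "No destroys reported"
fi
```

In practice the input might come from something like task tg:plan-integration 2>&1 | tee plan.log followed by count_destroys < plan.log; the log file name is illustrative.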
References
- stacks.md - Detailed Terragrunt stacks documentation
- units.md - Detailed Terragrunt units documentation