architecture-review

Architecture Evaluation Framework


Install with: `npx skills add ionfury/homelab/ionfury-homelab-architecture-review`

Current Technology Stack

| Layer | Technology | Purpose |
|---|---|---|
| OS | Talos Linux | Immutable, API-driven Kubernetes OS |
| GitOps | Flux + ResourceSets | Declarative cluster state reconciliation |
| CNI/Network | Cilium | eBPF networking, network policies, Hubble observability |
| Storage | Longhorn | Distributed block storage with S3 backup |
| Object Storage | Garage | S3-compatible distributed object storage |
| Database | CNPG (CloudNativePG) | PostgreSQL operator with HA and backups |
| Cache/KV | Dragonfly | Redis-compatible in-memory store |
| Monitoring | kube-prometheus-stack | Prometheus + Grafana + Alertmanager |
| Logging | Alloy → Loki | Log collection pipeline |
| Certificates | cert-manager | Automated TLS certificate management |
| Secrets | ESO + AWS SSM | External Secrets Operator with Parameter Store |
| Upgrades | Tuppr | Declarative Talos/Kubernetes/Cilium upgrades |
| Infrastructure | Terragrunt + OpenTofu | Infrastructure as Code for bare-metal provisioning |
| CI/CD | GitHub Actions + OCI | Artifact-based promotion pipeline |

Evaluation Criteria

When evaluating any proposed technology addition or architecture change, assess against these criteria:

  1. Principle Alignment

Score the proposal against each core principle (Strong/Weak/Neutral):

  • Enterprise at Home: Does it reflect production-grade patterns?

  • Everything as Code: Can it be fully represented in git?

  • Automation is Key: Does it reduce or increase manual toil?

  • Learning First: Does it teach valuable enterprise skills?

  • DRY and Code Reuse: Does it leverage existing patterns or create duplication?

  • Continuous Improvement: Does it make the system more maintainable?

  2. Stack Fit
  • Does this overlap with existing tools? (e.g., adding Redis when Dragonfly exists)

  • Does it integrate with the GitOps workflow? (Must be Flux-deployable)

  • Does it work on bare-metal? (No cloud-only services)

  • Does it support the multi-cluster model? (dev → integration → live)

  3. Operational Cost
  • How is it monitored? (Must integrate with kube-prometheus-stack)

  • How is it backed up? (Must have a recovery story)

  • How does it handle upgrades? (Must be declarative, ideally via Renovate)

  • What's the failure blast radius? (Isolated > cluster-wide)

  4. Complexity Budget
  • Is the complexity justified by the learning value?

  • Could a simpler existing tool solve the same problem?

  • What's the maintenance burden over 12 months?

  5. Alternative Analysis
  • What existing stack components could solve this? (Always check first)

  • What are the top 2-3 alternatives in the ecosystem?

  • What do other production homelabs use? (kubesearch research)

  6. Failure Modes
  • What happens when this component is unavailable?

  • How does it interact with network policies? (Default deny)

  • What's the recovery procedure? (Must be documented in a runbook)

  • Can it self-heal? (Strong preference for self-healing)
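Under a default-deny posture, every new component needs an explicit allow rule before anything can reach it. As a hedged sketch of what that interaction looks like with Cilium — the app name, namespace, and port here are illustrative assumptions, not taken from the repository:

```yaml
# Hypothetical allow rule letting Prometheus scrape a workload
# in a default-deny namespace. All names are illustrative.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: example-app
spec:
  endpointSelector:
    matchLabels:
      app: example-app
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: monitoring
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
```

If a proposal cannot express its traffic needs as a policy like this, that is itself a failure-mode finding.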

Common Design Patterns

New Application

  • HelmRelease via ResourceSet (flux-gitops pattern)

  • Namespace with network-policy profile label

  • ExternalSecret for credentials

  • ServiceMonitor + PrometheusRule for observability

  • GarageBucketClaim if S3 storage needed

  • CNPG Cluster if database needed
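A minimal sketch of the Flux deliverable behind this checklist — the chart name, version, and HelmRepository are placeholder assumptions, and in this repository the release would be templated through a ResourceSet rather than committed directly:

```yaml
# Illustrative HelmRelease; names and versions are assumptions.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: example-app
  namespace: example-app
spec:
  interval: 30m
  chart:
    spec:
      chart: example-app
      version: "1.2.3"        # pinned so Renovate can manage upgrades
      sourceRef:
        kind: HelmRepository
        name: example-charts
        namespace: flux-system
  values:
    metrics:
      enabled: true           # feeds the ServiceMonitor requirement
```

The ExternalSecret, ServiceMonitor, and storage claims listed above ship alongside this release in the same git path, so the whole application reconciles as one unit.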

New Infrastructure Component

  • OpenTofu module in infrastructure/modules/

  • Unit in appropriate stack under infrastructure/units/

  • Test coverage in .tftest.hcl files

  • Version pinned in versions.env if applicable

New Secret

  • Store in AWS SSM Parameter Store

  • Reference via ExternalSecret CR

  • Never commit to git, not even encrypted
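The pattern above can be sketched as a single ExternalSecret — the store name and SSM parameter path are illustrative assumptions:

```yaml
# Illustrative ExternalSecret; store name and parameter path are assumptions.
apiVersion: external-secrets.io/v1beta1   # v1 on recent ESO releases
kind: ExternalSecret
metadata:
  name: example-app-credentials
  namespace: example-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-parameter-store             # assumed store name
  target:
    name: example-app-credentials         # Kubernetes Secret ESO creates
  data:
    - secretKey: password
      remoteRef:
        key: /example-app/password        # SSM parameter path, illustrative
```

Only this reference lives in git; the secret value stays in Parameter Store and ESO materializes it in-cluster.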

New Storage

  • Longhorn PVC for block storage (default)

  • GarageBucketClaim for object storage (S3-compatible)

  • Never use hostPath or emptyDir for persistent data
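For the default block-storage case, a sketch of the claim — name and size are illustrative; the GarageBucketClaim CRD is repository-specific, so check its schema upstream rather than guessing it:

```yaml
# Illustrative Longhorn-backed PVC; name and size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-app-data
  namespace: example-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn   # distributed replicas + S3 backup story
  resources:
    requests:
      storage: 10Gi
```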

New Database

  • CNPG Cluster CR for PostgreSQL

  • Automated backups to Garage S3

  • Connection pooling via PgBouncer (CNPG-managed)
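A hedged sketch of a CNPG Cluster wired to Garage for backups — the bucket, endpoint, and credential Secret names are assumptions; pooling would be a separate CNPG `Pooler` resource pointing at this cluster:

```yaml
# Illustrative CNPG Cluster; endpoint, bucket, and Secret names are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-app-db
  namespace: example-app
spec:
  instances: 3                  # HA: one primary, two replicas
  storage:
    size: 5Gi
    storageClass: longhorn
  backup:
    barmanObjectStore:
      destinationPath: s3://example-backups/example-app-db
      endpointURL: http://garage.storage.svc:3900   # assumed Garage endpoint
      s3Credentials:
        accessKeyId:
          name: garage-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: garage-credentials
          key: SECRET_ACCESS_KEY
```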

New Network Exposure

  • HTTPRoute for HTTP/HTTPS traffic (Gateway API)

  • Appropriate network-policy profile label

  • cert-manager Certificate for TLS

  • Internal gateway for internal-only services
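The exposure checklist above can be sketched as an HTTPRoute plus a cert-manager Certificate — the gateway name, hostname, and issuer are illustrative assumptions:

```yaml
# Illustrative Gateway API route; gateway and hostname are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-app
  namespace: example-app
spec:
  parentRefs:
    - name: internal            # assumed internal-only Gateway
      namespace: network
  hostnames:
    - app.example.internal
  rules:
    - backendRefs:
        - name: example-app
          port: 80
---
# Illustrative Certificate terminating TLS at the Gateway; issuer is assumed.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-example-internal
  namespace: network
spec:
  secretName: app-example-internal-tls
  dnsNames:
    - app.example.internal
  issuerRef:
    kind: ClusterIssuer
    name: internal-ca           # assumed issuer name
```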

Anti-Patterns to Challenge

| Anti-Pattern | Why It's Wrong | Correct Approach |
|---|---|---|
| "Just run a container" without monitoring | Invisible failures, no alerting | ServiceMonitor + PrometheusRule required |
| Adding a new tool when existing ones suffice | Stack bloat, maintenance burden | Evaluate existing stack first |
| Skipping observability "for now" | Technical debt that never gets paid | Monitoring is day-1, not day-2 |
| Manual operational steps | Drift, inconsistency, bus factor | Everything declarative via GitOps |
| Cloud-only services | Vendor lock-in, can't run on bare-metal | Self-hosted alternatives preferred |
| Single-instance without HA story | Single point of failure | At minimum, document recovery procedure |
| Storing state outside git | Shadow configuration, drift | Git is the source of truth |
