Production-grade infrastructure-as-code repository demonstrating enterprise DevOps practices, GitOps workflows, and cloud-native architectures implemented in a homelab environment.
OpenTofu-Driven Infrastructure
- Modular two-tier architecture: base modules and application modules promoting composability and reusability
- S3-compatible remote state backend (s3.toolz.homelabz.eu) with workspace isolation per environment
- Automated state backup to Oracle Cloud Object Storage via CronJob for disaster recovery
- YAML-driven VM provisioning using dynamic
for_eachloops for declarative infrastructure definitions
Configuration Management
- Ansible playbooks for VM configuration (K3s, vanilla Kubernetes, HAProxy, Talos Linux)
- Dynamic inventory auto-generated from Terraform outputs and automatically committed to Git
- Integration with HashiCorp Vault for centralized kubeconfig and secrets management
- Idempotent playbook design for reliable repeated execution
Git as Single Source of Truth
- All infrastructure changes submitted via pull requests with automated validation
- Commit message parsing for workflow automation triggers
- ArgoCD implementation with ApplicationSets and cluster generator for multi-environment deployments
- Sync waves and hooks for ordered, controlled deployments
- Self-healing enabled with automatic drift detection and remediation
Automated CI/CD Pipelines
7 GitHub Actions workflows provide comprehensive automation, complemented by local pre-commit hooks for shift-left security:
| Workflow | Purpose | Trigger |
|---|---|---|
| opentofu.yml | OpenTofu workflow, plans on PR and apply on merge | PR/Merge to main |
| ansible.yml | VM provisioning via [ansible PLAYBOOK] commit tag |
Commit tag detection |
| build.yml | Docker image builds on Dockerfile changes | File path changes |
| sec-trivy.yml | Container and IaC vulnerability scanning | Pull request / Push |
| sec-trufflehog.yml | Secret leak detection in commits | Pull request / Push |
| conventional-commits.yml | Commit message validation | Pull request |
| release.yml | Semantic versioning and changelog generation | Merge to main |
OpenTofu Workflow Deep Dive (opentofu.yml):
Plan Phase (on PR):
- Detects changes in
clusters/**andsecrets/** - Runs
make planfor all environments - Posts truncated plan summary to PR (65KB limit) with full plan in artifacts
Apply Phase (on merge to main):
- Capture Before State: Stores cluster names from
proxmox_cluster_namesoutput - Apply Changes: Runs
make applyfor all environments - Capture After State: Compares cluster outputs to detect changes
- Cluster Change Detection: Identifies NEW/DELETED clusters across clustermgmt/toolz/observability workspaces
- Wait for Cluster Availability:
- Monitors Cluster API objects on management cluster (clustermgmt)
- Waits up to 30 minutes with 10-second polling for cluster to reach "Available" status
- Checks
cluster.cluster.x-k8s.ioCRD status condition
- Update Kubeconfigs in SOPS:
- Calls clusters/scripts/update_kubeconfig_sops.sh
- Extracts kubeconfig from Cluster API secret:
<cluster>-kubeconfigin namespace<cluster> - Merges with existing kubeconfig in SOPS file
- Updates encrypted secrets/common/cluster-secret-store/secrets/KUBECONFIG.yaml
- Commits with message: "chore: Update KUBECONFIG with Cluster API clusters"
- Sync to Vault:
- Runs
make apply ENV=toolz TARGET='module.vault[0]' - External Secrets Operator syncs to Kubernetes secrets in namespaces with
cluster-secrets=truelabel
- Runs
- Error Handling:
- Kubeconfig update failures post warning comment with manual recovery steps
- Workflow continues on error to prevent blocking deployments
External Secrets Automation:
- When secrets are added/removed in secrets/ directory, OpenTofu automatically updates ExternalSecret objects
- Changes propagate to all clusters via Vault synchronization
- Namespaces with label
cluster-secrets=truereceive updated secrets automatically on 'cluster-secrets' secret - 'cluster-secrets' secret automatically mounted and available as environment variables on github runners
Progressive Delivery with Argo Rollouts
Blue-Green deployment strategy with automated E2E testing and production promotion:
- Build Phase: GitHub Actions (build.yml) builds and pushes container image to Harbor registry (
registry.toolz.homelabz.eu) - Dev Deployment: Pipeline updates dev kustomization with new image tag, ArgoCD syncs to dev cluster
- E2E Testing: Argo Rollouts triggers prePromotionAnalysis running Cypress tests against dev environment
- Auto-Promotion: On test success, postPromotionAnalysis job automatically promotes to prod by updating prod kustomization
- Prod Deployment: ArgoCD syncs prod overlay, Argo Rollouts performs Blue-Green switch
Build → Push Image → Update Dev Tag → ArgoCD Sync → Cypress Tests → Update Prod Tag → ArgoCD Sync → Live
ArgoCD ApplicationSet Pattern:
ApplicationSets use matrix generators combined with the ArgoCD cluster generator to deploy applications across registered clusters:
generators:
- matrix:
generators:
- list: # Applications
elements:
- app: cks-backend
repoURL: https://github.com/homelabz-eu/cks-backend
- app: cks-frontend
repoURL: https://github.com/homelabz-eu/cks-frontend
- clusters: # Environments resolved from registered clusters
selector:
matchExpressions:
- key: environment
operator: In
values: [dev, prod]Bootstrap Architecture:
- IaC creates an
argocd-bootstrapApplication that recursively syncs the argocd-apps/ directory - ApplicationSets generate Applications per app/cluster combination
- Standalone Applications for apps that don't fit the matrix pattern
- Auto-sync with pruning and self-healing enabled
- Retry policy: 5 attempts with exponential backoff
Application Repositories:
- cks-backend - Backend APIs with Kustomize overlays
- cks-frontend - Web UI with Kustomize overlays
- cks-terminal-mgmt - Terminal management microservice (toolz)
Key components:
- Per-app AnalysisTemplates for Cypress test execution (e.g.,
cypress-tests-cks-backend) - Per-app AnalysisTemplates for production promotion via git commit automation (e.g.,
promote-cks-backend-to-prod) autoPromotionEnabled: truefor fully automated pipeline- Prod overlay removes prePromotionAnalysis (tests only run in dev)
- Kustomize overlays for environment-specific configuration
Self-Hosted Runner Infrastructure
- Actions Runner Controller (ARC) deployed on toolz cluster
- Two runner scale sets:
self-hosted(general-purpose, unprivileged) andself-hosted-buildkit(container builds via remote BuildKit daemon) - Dedicated BuildKit daemon (single privileged pod) serving all build runners over TCP — eliminates per-runner DinD sidecars
- Runner pods run fully unprivileged (
runAsNonRoot,allowPrivilegeEscalation: false) - Custom runner image with kubectl, Helm, Terraform, SOPS, buildctl, and cloud provider tools
- GitLab CI runners for multi-platform pipeline support
- Centralized reusable workflows in dedicated pipelines repository
Key Data Flows:
- Infrastructure: Git → CI/CD → clustermgmt (CAPI) → Workload Clusters
- Secrets: SOPS → CI/CD → Vault → External Secrets → K8s Secrets
- Applications: External Repos → ArgoCD → Deployments
- Progressive Delivery: Build → Dev → Tests → Prod (automated)
Hub-and-Spoke Architecture
Central observability hub on dedicated cluster:
- Prometheus (kube-prometheus-stack v79.0.1) for metrics aggregation
- Grafana for unified multi-cluster dashboards
- Jaeger v2.57.0 for distributed tracing
- Loki v6.28.0 for log aggregation
- OpenTelemetry Collector v0.33.0 for telemetry ingestion
Edge collectors on all workload clusters:
- Fluent Bit v0.48.9 for log forwarding
- Prometheus with remote write capability
- OpenTelemetry Collector for traces and metrics
- Automatic cluster labeling for multi-cluster aggregation
Application Instrumentation
- OpenTelemetry SDK integration in Go microservices
- Structured JSON logging with trace context correlation
- ServiceMonitor CRDs for automatic Prometheus scraping
- Custom dashboards for PostgreSQL, Redis, NATS with predefined alerting rules
clustermgmt cluster serves as Cluster API management cluster, deploying and managing workload clusters on Proxmox infrastructure:
Cluster Provisioning Workflow:
- Define Cluster: Add cluster configuration to
clusters/variables.tfunderclustermgmtworkspace (kubernetes-clusterlist) - CI/CD Apply: GitHub Actions runs
make apply ENV=clustermgmt, creating Cluster API manifests - Cluster API Provisioning: Cluster API operator provisions VMs on Proxmox and bootstraps Kubernetes
- Automated Kubeconfig Management:
- CI/CD detects cluster changes via output comparison (before/after apply)
- Waits up to 30 minutes for cluster to reach "Available" status
- Extracts kubeconfig from Cluster API secret (
<cluster>-kubeconfigin namespace<cluster>) - Updates SOPS-encrypted KUBECONFIG.yaml
- Commits changes automatically and syncs to Vault
- External Secrets Operator propagates to all clusters with
cluster-secrets=truelabel
- Immediate Availability: Cluster ready for OpenTofu and CI/CD pipelines
Cluster API Components:
- Management Cluster: clustermgmt (K3s on NODE02)
- Cluster API Operator: v1.12.0 with CAPMOX v0.7.5 (Proxmox provider)
- Supported Distributions:
- Talos Linux: Immutable infrastructure with declarative configuration (dev cluster)
- kubeadm: Standard Kubernetes with HA support (prod cluster)
- K3s, K0s, RKE2: Additional distributions supported
- Polymorphic Module: modules/apps/kubernetes-cluster supports all cluster types
Key Script: clusters/scripts/update_kubeconfig_sops.sh handles kubeconfig extraction, merging, SOPS encryption, and git commits.
Automatic per-PR test environments for application repositories, providing isolated Kubernetes clusters for each pull request:
Architecture:
- Cluster Provisioning: K3s clusters via Cluster API on clustermgmt cluster
- Infrastructure: OpenTofu with workspace-per-PR isolation (ephemeral-clusters/opentofu/)
- IP Management: Vault-based IP pool allocation (192.168.1.140-149, 2 IPs per cluster: VIP + node)
- DNS: Pi-hole for internal resolution (
pr-<number>-<repo>.ephemeral.homelabz.eu) - Capacity: 5 concurrent ephemeral clusters maximum
Workflow Lifecycle (example: cks-backend/.github/workflows/ephemeral.yml):
-
PR Opened:
- Check cluster existence via Cluster API
- Allocate 2 IPs from pool using ip_pool_manager.sh
- Render and apply K3s Cluster API manifest
- Wait for cluster "Available" status (~2 min)
- Extract kubeconfig on-demand (not stored in Vault)
- Apply infrastructure via
make ephemeral-apply(4 phases):- Phase 1: Base operators without CRDs (cert-manager, external-dns, external-secrets)
- Phase 2: Base operators with CRDs (ClusterIssuer, DNSEndpoint, ExternalSecret)
- Phase 3: Apps without postgres CRDs
- Phase 4: Apps with postgres CRDs
- Build and push Docker image (
registry.toolz.homelabz.eu/library/<app>:pr-<number>) - Deploy app with
kubectl apply -k kustomize/overlays/ephemeral/ - Run Cypress E2E tests in container
- Post PR comment with environment URL
-
PR Updated (new commits):
- Detect existing cluster (skip provisioning)
- Build new image with same PR tag
- Redeploy application
- Run tests
-
PR Closed:
- Destroy infrastructure with
make ephemeral-destroy - Delete Cluster API cluster object
- Release IPs back to pool
- Complete (images remain in Harbor for audit)
- Destroy infrastructure with
Key Components:
- IP Pool Manager: clusters/scripts/ip_pool_manager.sh - Vault-based allocation with CAS (Check-And-Set) for concurrency
- Cluster Template: ephemeral-clusters/cluster-api/k3s-cluster.yaml.tpl - Single-node K3s with external control plane VIP
- OpenTofu Modules: Reuse production-proven modules from modules/apps/
- Makefile Commands:
ephemeral-init,ephemeral-apply,ephemeral-destroy,ephemeral-workspace
Benefits:
- Isolated test environment per PR
- Automatic provisioning and cleanup
- Full production-like infrastructure (DNS, TLS, secrets)
- No manual intervention required
- Parallel PR testing (up to 5 concurrent)
The [ansible PLAYBOOK] pattern remains supported for existing K3s clusters (clustermgmt, home, observability):
git commit -m "feat(proxmox): add k8s-observability VM [ansible k8s-observability]"Automated Pipeline Chain:
- Release workflow generates semantic version tag
- OpenTofu creates VM from YAML definition in init/vms/
- OpenTofu updates Ansible inventory, preserving
[ansible]tag in commit message - Ansible workflow triggers on tag detection
- Ansible installs and configures Kubernetes (K3s, kubeadm, vanilla, Talos)
- Python script extracts kubeconfig, updates IP and context name
- Kubeconfig merged into central Vault KV store
- Cluster immediately available to OpenTofu and CI/CD pipelines
This method provisioned all current clusters but is being phased out in favor of Cluster API for new deployments.
OpenTofu automatically deploys platform services to clusters based on workspace configuration in clusters/variables.tf:
Core Infrastructure (most clusters):
- cert-manager for automated TLS (self-signed or Let's Encrypt via ACME)
- External-DNS for dynamic DNS record management (Cloudflare)
- External Secrets for Vault → Kubernetes secret synchronization
- Metrics Server for resource metrics
Ingress & Service Mesh:
- MetalLB for LoadBalancer service type (dev, prod)
- Istio service mesh with mTLS (dev, prod)
- Ingress-NGINX (alternative ingress controller)
GitOps & CI/CD:
- ArgoCD for GitOps application delivery (dev, prod, toolz)
- GitHub Actions Runner Controller (toolz)
- GitLab CI runners (toolz)
Observability:
- Observability-box (edge collector) on workload clusters
- Full observability stack (Prometheus, Grafana, Jaeger, Loki) on observability cluster
Storage:
- Local Path Provisioner for dynamic local storage (dev, prod)
- Longhorn distributed storage (toolz for snapshot capability)
Data Services (toolz cluster):
- CloudNativePG for PostgreSQL
- Redis for caching
- NATS for messaging
- MinIO for S3-compatible object storage
Cluster Management (clustermgmt cluster only):
- Cluster API Operator v1.12.0 with CAPMOX v0.7.5
- Cluster Autoscaler
Platform Services (toolz cluster):
- HashiCorp Vault for centralized secrets
- Harbor container registry
- Teleport agent for secure access
Configuration is workspace-specific via workload variable - modules only deploy when listed in workspace's workload array.
Automated Backup Strategy
- Terraform state backed up daily to Oracle Cloud Object Storage via CronJob
Recovery Capabilities
- Complete infrastructure reproducible from Git repository alone
- Terraform state restoration from Oracle Cloud backups
Multi-Cluster Architecture
- 6 environment-isolated Kubernetes clusters (dev, prod, clustermgmt, toolz, home, observability)
- Production workloads distributed across multiple replicas via Kustomize overlays
- HAProxy load balancer for vanilla Kubernetes traffic distribution
- MetalLB for LoadBalancer service type support on bare metal
Resource Management
- Resource limits and requests prevent resource exhaustion
- Namespace quotas for multi-tenant isolation
- Anti-affinity rules for pod distribution (where configured)
Multi-Layered Defense
-
Encryption at Rest: SOPS with age encryption for all secrets in Git
- Age public key:
age15vvdhaj90s3nru2zw4p2a9yvdrv6alfg0d6ea9zxpx3eagyqfqlsgdytsp - Automated scripts:
secret_new.sh,secret_edit.sh,secret_view.sh
- Age public key:
-
Runtime Secret Storage: HashiCorp Vault deployed on toolz cluster
- KV v2 engine for versioned secrets
- Kubernetes authentication backend
- Dynamic policy creation via Terraform
- Accessible at vault.toolz.homelabz.eu
-
Kubernetes Integration: External Secrets Operator
- ClusterSecretStore for multi-namespace secret distribution
- Automatic synchronization from Vault to Kubernetes secrets
- Support for secret rotation with namespace selector
cluster-secrets=true
Secret Lifecycle
Git (SOPS encrypted) → CI/CD (decrypt) → Vault (runtime) → External Secrets → K8s Secrets → Pods
Automated TLS
- cert-manager with configurable ClusterIssuer (
letsencrypt-prod) - Supports self-signed mode (
issuer_type = "selfsigned") or ACME/Let's Encrypt (issuer_type = "acme") - ACME mode uses Cloudflare DNS-01 challenge for wildcard certificate support
- Istio Gateway integration for TLS termination
- Certificate validation monitoring
- Environment-specific gateway DNS names configured via OpenTofu variables:
- Dev:
dev.app.homelabz.eupattern - Prod:
app.homelabz.eupattern (no prefix)
- Dev:
Service Mesh Implementation
- Istio deployed on dev cluster for traffic encryption and observability
- Mutual TLS (mTLS) capability between services
- VirtualServices for fine-grained routing control
- Gateway resources for ingress traffic management
- SNI-based routing for TLS passthrough (PostgreSQL example)
DNS Security
- External-DNS with TXT record ownership verification
- Cloudflare integration for public DNS with WAF protection
- Pi-hole for internal DNS with ad-blocking
- Automatic DNS record lifecycle management
Kubernetes RBAC
- ServiceAccounts with minimal permissions for all components
- ClusterRoles for platform services (External-DNS, External Secrets)
- Namespace-based isolation for tenant workloads
- Vault policies for least-privilege access
CI/CD Security
- Self-hosted runners in isolated network environment
- Secret injection only in authorized workflows
- No secrets embedded in container images
- Container scanning with Trivy before deployment
Pre-Commit Hooks (.pre-commit-config.yaml):
- SOPS encryption guard preventing unencrypted secrets from being committed
detect-secretsfor credential leak detection across all file typesdetect-private-keyfor SSH/PGP key detection- OpenTofu formatting enforcement via
terraform_fmt - Shell script validation via
shellcheck - File hygiene checks (trailing whitespace, merge conflicts, large files, YAML/JSON validation)
Setup: make pre-commit-install | Run manually: make pre-commit-run
Continuous Vulnerability Assessment
- Trivy scanning for containers and IaC on every pull request
- TruffleHog secret leak detection in commit history
- SARIF output integration with GitHub Security tab
- Configurable blocking on critical vulnerabilities
Secure Container Practices
- Multi-stage Docker builds minimizing attack surface
- Non-root user execution enforced
- Minimal base images (Alpine, distroless)
- Regular base image updates via Renovate/Dependabot
infra/
├── clusters/ # Kubernetes workload definitions (OpenTofu)
│ ├── variables.tf # Workspace configurations and Cluster API cluster definitions
│ ├── modules.tf # All module invocations with conditional logic
│ ├── load_secrets.py # SOPS decryption to JSON
│ └── scripts/
│ └── update_kubeconfig_sops.sh # Kubeconfig extraction and SOPS management
├── modules/
│ ├── base/ # 10 foundational modules (helm, namespace, ingress, monitoring, credentials, persistence, istio-gateway, istio-virtualservice, cnpg-database, values-template)
│ └── apps/ # 33 application modules categorized as:
│ ├── Cluster: kubernetes-cluster, clusterapi-operator
│ ├── Infrastructure: istio, metallb, external-secrets, cert-manager, externaldns, ingress-nginx, kubelet-csr-approver, local-path-provisioner, metrics-server
│ ├── Data: cloudnative-postgres, cloudnative-postgres-operator, redis, nats, minio
│ ├── Observability: observability, observability-box
│ ├── CI/CD: argocd, github-runner, gitlab-runner
│ ├── Security: vault, teleport-agent, authentik, falco
│ ├── Storage: longhorn, local-path-provisioner
│ ├── Virtualization: kubevirt, kubevirt-operator
│ └── Other: harbor, immich, registry, cluster-autoscaler, oracle-backup
├── init/
│ ├── vms/ # YAML VM definitions for declarative provisioning (legacy)
│ ├── playbooks/ # Ansible configuration playbooks (legacy K3s clusters)
│ └── scripts/ # Automation scripts (Talos, kubeconfig management)
├── argocd-apps/ # GitOps application manifests
│ ├── cks-apps-applicationset.yaml # CKS platform apps (matrix + cluster generator)
│ ├── cks-terminal-mgmt-toolz.yaml # Standalone toolz application
│ └── clusters/ # Cluster registration and repo secrets
├── secrets/ # SOPS-encrypted secrets (age encryption)
│ └── common/cluster-secret-store/secrets/ # Cluster-wide secrets synced via External Secrets
├── .github/workflows/ # CI/CD automation pipelines
├── Makefile # Development commands (plan, apply, init, fmt, validate)
└── docs/ # Technical documentation
Compute Resources
- NODE01: Acer Nitro (i7-4710HQ, 16GB RAM)
- NODE02: HP ED800 G3 Mini (i7-7700T, 32GB RAM)
- NODE03: X99 dual Xeon E5-2699-V3 18-Core, 128GB RAM
Virtualization: Proxmox VE managing 10+ VMs across 3 physical hosts
| Cluster | Type | Provisioning | Purpose | Node(s) | Key Workloads |
|---|---|---|---|---|---|
| clustermgmt | K3s | Legacy Ansible | Cluster API management cluster (management plane only) | k8s-tools (single node, NODE02) | Cluster API operator, Cluster Autoscaler |
| toolz | RKE2 | Cluster API | Platform services, workloads & CKS platform | 1 CP + 2 workers (NODE03) | CloudNativePG, Redis, NATS, CI/CD runners (GitHub/GitLab), Vault, Harbor, MinIO, ArgoCD, Falco, KubeVirt, Longhorn, cks-terminal-mgmt |
| dev | Talos | Cluster API | Development environment | 1 CP + 2 workers | Development services, Istio service mesh, ArgoCD |
| prod | kubeadm | Cluster API | Production environment | 1 CP + 2 workers | Production services, Istio service mesh, ArgoCD |
| home | K3s | Legacy Ansible | Home automation | k8s-home (single node) | Immich photo management, External Secrets |
| observability | K3s | Legacy Ansible | Central monitoring hub | k8s-observability (single node) | Prometheus (kube-prometheus-stack), Grafana, Jaeger, Loki, OpenTelemetry Collector |
Infrastructure Layer
- OpenTofu 1.11.3+
- Ansible (legacy cluster provisioning)
- Proxmox VE (hypervisor)
- Cluster API v1.12.0 with CAPMOX v0.7.5 (Proxmox provider)
Kubernetes Distributions
- K3s (lightweight, single-node)
- kubeadm (standard multi-node)
- Talos Linux (immutable infrastructure)
Platform Services
- ArgoCD 7.7.12 (GitOps)
- HashiCorp Vault (secrets management)
- cert-manager (TLS automation)
- External Secrets Operator (Vault → K8s sync)
- Istio (service mesh)
- MetalLB (LoadBalancer)
Observability
- Prometheus (kube-prometheus-stack v79.0.1)
- Grafana (dashboards)
- Jaeger v2.57.0 (distributed tracing)
- Loki v6.28.0 (log aggregation)
- OpenTelemetry Collector v0.33.0 (telemetry ingestion)
- Fluent Bit v0.48.9 (log forwarding)
Data Services
- CloudNativePG operator (PostgreSQL 15+ with pgvector, automated backups, managed roles/databases)
- Redis (Bitnami)
- NATS with JetStream (messaging)
- MinIO (S3-compatible object storage)
CI/CD
- GitHub Actions with Actions Runner Controller (ARC)
- GitLab CI runners
- Argo Rollouts (Blue-Green deployments with automated E2E testing via Cypress)
- Harbor (container registry)
- Custom runner images with kubectl, Helm, OpenTofu, SOPS, buildctl
Security
- SOPS with age encryption
- Pre-commit hooks (SOPS guard, detect-secrets, detect-private-key, shellcheck)
- Trivy (vulnerability scanning)
- TruffleHog (secret leak detection)
- Falco (runtime security monitoring via eBPF)
- Istio mTLS (service mesh security)
- cert-manager (self-signed/Let's Encrypt)
- Teleport agent (secure access)
- Authentik (identity provider)
Clusters are accessible via the cluster-secrets Kubernetes secret deployed to namespaces with label cluster-secrets=true. This secret contains:
KUBECONFIG: Combined kubeconfig for all clusters- All other secrets from secrets/common/cluster-secret-store/secrets/
Synchronization Flow:
- SOPS-encrypted secrets stored in secrets/ directory
make plan/applyruns clusters/load_secrets.py to decrypt secrets toclusters/tmp/secrets.json- OpenTofu processes secrets via
locals.tfand passes to External Secrets module - External Secrets module creates ExternalSecret objects referencing Vault paths
- External Secrets Operator syncs from Vault to Kubernetes secrets
- Namespaces with
cluster-secrets=truelabel automatically receivecluster-secretssecret
Automatic Secret Updates: When secrets are added/removed in code:
- OpenTofu detects changes and updates ExternalSecret objects
- External Secrets Operator automatically propagates changes to all clusters
- No manual intervention required - fully automated across all environments
Automated Workflow (triggered on cluster changes):
- CI/CD detects cluster changes via output comparison (before/after apply)
- Waits up to 30 minutes for Cluster API cluster to reach "Available" status
- clusters/scripts/update_kubeconfig_sops.sh extracts kubeconfig from Cluster API secret
- Merges with existing kubeconfig in secrets/common/cluster-secret-store/secrets/KUBECONFIG.yaml
- Git commit: "chore: Update KUBECONFIG with Cluster API clusters"
- Syncs to Vault via
make apply ENV=toolz TARGET='module.vault[0]' - External Secrets propagates to all clusters
Manual Recovery (if automation fails):
./clusters/scripts/update_kubeconfig_sops.sh \
--cluster-name <cluster> \
--namespace <cluster> \
--management-context clustermgmt- Add cluster configuration to clusters/variables.tf under
clustermgmtworkspace:
clustermgmt = {
kubernetes-cluster = [
{
cluster_type = "talos" # talos | kubeadm | rke2 | k3s
name = "new-cluster"
kubernetes_version = "v1.33.0"
control_plane_endpoint_ip = "192.168.1.80"
ip_range_start = "192.168.1.81"
ip_range_end = "192.168.1.89"
gateway = "192.168.1.1"
prefix = 24
dns_servers = ["192.168.1.3", "8.8.4.4"]
source_node = "node03"
template_id = 9005
allowed_nodes = ["node03"]
cp_replicas = 1
wk_replicas = 2
cp_disk_size = 20
cp_memory = 4096
cp_cores = 4
wk_disk_size = 30
wk_memory = 8192
wk_cores = 8
}
]
}- Commit and push to trigger CI/CD
- CI/CD automatically provisions cluster and updates kubeconfig in Vault
- Cluster immediately available to OpenTofu and CI/CD pipelines
- Add module name to workspace's
workloadlist in clusters/variables.tf:
variable "workload" {
default = {
dev = ["existing-module", "new-module"]
}
}- Add module configuration under workspace in
configvariable:
variable "config" {
default = {
dev = {
new-module = {
# module-specific settings
}
}
}
}- Verify module exists in clusters/modules.tf with conditional count:
module "new_module" {
count = contains(local.workload, "new-module") ? 1 : 0
source = "../modules/apps/new-module"
}- Run
make plan ENV=devto preview, thenmake apply ENV=devto deploy
The CKS (Certified Kubernetes Security) training platform runs on the toolz cluster:
Architecture:
- KubeVirt for VM virtualization on Kubernetes
- Longhorn for distributed storage with snapshot capability
- Pool of standby VMs ready for instant provisioning
Workflow:
- CKS maintains pool of standby VMs (running on KubeVirt)
- User starts a CKS scenario
- VM instantly provisioned from pool (no wait time)
- User opens terminal tabs (multi-terminal support via cks-terminal-mgmt + ttyd)
- User completes scenario exercises
- CKS triggers Longhorn snapshot restore
- VM reset to clean state and returned to pool
- Rapid reset enables high-throughput scenario execution
Terminal Access:
- cks-terminal-mgmt runs on toolz alongside KubeVirt VMs
- Spawns ttyd processes on-demand for SSH connections to VMs
- Frontend embeds terminals via iframe with multi-tab support (multiple terminals per VM)
This architecture provides instant scenario availability and eliminates waiting for VM provisioning/cleanup.