# ADR 019: Hybrid GitOps and Multi-Cluster Architecture
## Context

Phase 10 established GitOps maturity on the local K3s platform, and Phase 11 moved Cloudflare controls into IaC.
The next maturity step is a hybrid Hub-and-Spoke model:

- Keep the local ArgoCD instance on Proxmox as the Hub control plane.
- Manage both local `in-cluster` and remote `aws-eks-prod` as destinations from one GitOps control plane.
- Prove multi-cluster workload delivery by deploying `status-api` with cloud-specific values on EKS.
## Decision

We adopt a Hub-and-Spoke GitOps architecture:

- Hub: the on-prem ArgoCD instance (Proxmox K3s) remains the management cluster.
- Spokes: `in-cluster` (local workloads) and `aws-eks-prod` (remote cloud workloads).
- ArgoCD Applications keep environment-specific destinations and values overlays while sharing one source-of-truth repository.
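The decision above can be sketched as an ArgoCD `Application` manifest targeting the cloud spoke. This is a minimal sketch: the repository URL, chart path, and values file name are illustrative assumptions, not values taken from this ADR.

```yaml
# Hypothetical Application for the remote spoke. Repo URL, path, and
# values file are placeholders; only the destination name and app name
# correspond to identifiers used in this ADR.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: status-api-cloud
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-gitops.git  # shared source of truth (assumed)
    path: apps/status-api                                    # assumed chart path
    targetRevision: main
    helm:
      valueFiles:
        - values-aws-eks-prod.yaml   # environment-specific overlay (assumed name)
  destination:
    name: aws-eks-prod               # registered spoke cluster
    namespace: status-api
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

The local twin would differ only in `destination` (`in-cluster`) and the values overlay, which is what keeps both environments in one repository.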
## Feynman Check

### Why does ArgoCD run on-prem and not inside EKS?
- Cost: a dedicated EKS management cluster would add fixed control-plane cost before any workload value is delivered.
- Management Cluster Pattern: the control plane is isolated from workload clusters to reduce coupling.
- Blast Radius: an EKS incident is contained to the cloud spoke and does not take down the reconciliation engine itself.
- Disaster Recovery: if AWS is degraded, the on-prem hub remains reachable and able to manage local services.
### What happens to GitOps when the cloud is down?
- Local GitOps continues: `in-cluster` reconciliation remains active because ArgoCD and the K3s control plane are local.
- Remote GitOps pauses: `aws-eks-prod` sync attempts fail or become degraded until the cloud API recovers.
- Platform continuity is preserved for local services because the Hub does not depend on EKS runtime availability.
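One way to make the paused remote spoke self-recovering rather than noisy is a retry policy on its Applications. A sketch of the relevant `syncPolicy` fragment follows; the specific limits and intervals are illustrative assumptions, not values from this ADR:

```yaml
# Hypothetical syncPolicy fragment for aws-eks-prod Applications.
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  retry:
    limit: 5              # bounded retries per failed sync attempt
    backoff:
      duration: 30s       # first retry after 30 seconds...
      factor: 2           # ...then doubling each attempt
      maxDuration: 10m    # capped, so recovery is picked up within minutes
```

With backoff in place, failed syncs during a cloud outage settle into periodic probes, and the spoke converges on its own once the EKS API is reachable again.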
## Service Classification

| Service | Classification | Exposure Model | Primary Destination |
|---|---|---|---|
| `status-api-lab` | Public | Ingress + DNS (`lab.northlift.net`) | `in-cluster` |
| `status-api-cloud` | Public | Ingress + DNS (`aws.northlift.net`) | `aws-eks-prod` |
| `argocd-server` | Tunnel-only | Cloudflare Tunnel + Access | `in-cluster` |
| `redis` | Internal-only | ClusterIP only, no external ingress | `in-cluster` / `aws-eks-prod` |
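The Internal-only classification for `redis` amounts to a plain `ClusterIP` Service with no accompanying Ingress resource. A minimal sketch, with port and selector values assumed rather than taken from this ADR:

```yaml
# Hypothetical Service for the internal-only redis; labels and port
# are illustrative. ClusterIP (the default type) keeps it reachable
# only from inside the cluster network.
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  type: ClusterIP      # no NodePort/LoadBalancer, and no Ingress is defined
  selector:
    app: redis
  ports:
    - port: 6379
      targetPort: 6379
```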
## Architecture Diagram

```mermaid
graph LR
    Git[(Git Repository)] --> Hub[ArgoCD Hub\nProxmox K3s]
    Hub --> Local[in-cluster\nLocal K3s]
    Hub --> EKS[aws-eks-prod\nAWS EKS]
    Local --> LocalApps[Platform + Lab Apps]
    EKS --> CloudApps[Platform + Cloud Apps]
```
## Operational Registration Flow

```shell
# Fetch EKS credentials into the local kubeconfig under a stable alias
aws eks update-kubeconfig \
  --name infrastructure-lab-eks-prod \
  --region eu-central-1 \
  --alias aws-eks-prod

# Register that kubeconfig context as an ArgoCD destination cluster
argocd cluster add aws-eks-prod --name aws-eks-prod --grpc-web

# Verify the new spoke appears alongside in-cluster
argocd cluster list
```
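For reference, `argocd cluster add` persists the remote credentials as a cluster Secret (labeled `argocd.argoproj.io/secret-type: cluster`) in the `argocd` namespace, which is what the Negative consequence about credential maintenance refers to. A rough sketch of its shape, with the server URL and token as placeholders:

```yaml
# Approximate shape of the cluster Secret ArgoCD maintains for the
# spoke; server endpoint, token, and CA data are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-aws-eks-prod
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: aws-eks-prod
  server: https://<eks-api-endpoint>.eu-central-1.eks.amazonaws.com
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": { "caData": "<base64-ca-bundle>" }
    }
```

Rotating or revoking spoke access is therefore a matter of updating or deleting this Secret, which a teardown runbook should cover explicitly.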
## Consequences

### Positive
- One control plane governs two Kubernetes targets with consistent Git workflows.
- Cloud rollout does not require moving or rebuilding the existing ArgoCD control plane.
- Local operations remain available during cloud outages.
- Cluster-specific policy and capacity can diverge without splitting repositories.
### Negative
- Cross-environment release coordination becomes more complex (different health states per destination).
- ArgoCD must maintain secure credentials for a remote cluster.
- Remote destination failures can increase noise in ArgoCD alerts.
- A documented bootstrap/runbook is mandatory for cluster registration and teardown.
## Status

Accepted and implemented in Phase 12.