# ADR 022: Observability Strategy for Hybrid GitOps

## Context
The platform now spans local K3s and AWS EKS with GitOps reconciliation, Zero Trust access, and IaC-driven lifecycle automation.
Despite this maturity, the platform lacked a complete observability picture:
- No central metrics trend visibility for cluster and workload health.
- No centralized log aggregation across pods.
- No external uptime validation from outside our own infrastructure.
The primary constraint is hardware: local K3s runs on a Proxmox VM with limited memory.
## Decision
We adopt a two-layer observability model:
- Whitebox (internal): Prometheus, Loki, and Grafana on local K3s, deployed via ArgoCD.
- Blackbox (external): CloudWatch Synthetics canary on AWS, managed by OpenTofu.
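The whitebox half of this decision can be sketched as an ArgoCD `Application` manifest. This is an illustrative assumption, not the project's actual manifest: the chart (`kube-prometheus-stack`), target namespace, `targetRevision` placeholder, and memory figures are hypothetical, chosen to show how the stack stays Git-versioned and memory-bounded.

```yaml
# Hypothetical ArgoCD Application for the whitebox stack.
# Chart choice, revision, and resource figures are illustrative assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: "x.y.z"   # pin a real chart version in the repo
    helm:
      values: |
        prometheus:
          prometheusSpec:
            resources:
              requests:
                memory: 512Mi   # keeps the stack inside the ~650 Mi baseline
              limits:
                memory: 1Gi     # hard ceiling on the memory-constrained VM
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Declaring explicit requests and limits in the Helm values is what keeps the whitebox telemetry bounded on the 6 Gi VM; without them, Prometheus query spikes could evict co-located workloads.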
### Capacity Rationale (4 GB to 6 GB)
Before deploying observability workloads, the K3s VM memory is increased from 4096 MB to 6144 MB.
Sizing rationale:
| Component | Estimated Memory |
|---|---|
| K3s platform baseline | ~1.6 Gi |
| Observability stack baseline | ~650 Mi |
| OS + safety buffer | ~400 Mi |
| Total required baseline | ~2.65 Gi |
This leaves approximately 3.35 Gi of headroom on the 6 Gi VM for query spikes and maintenance operations while preserving the project's Efficiency First principle.
### Lifecycle Coupling Decision
The CloudWatch canary lifecycle is coupled to EKS lifecycle:
- Canary resources are created only when `enable_eks = true`.
- Canary resources are destroyed when EKS is disabled or destroyed.
- Canary alerts are wired into the existing `infrastructure-lab-budget-alerts` SNS topic for unified FinOps and reliability signal handling.
This avoids idle monitoring spend during EKS teardown windows and keeps cost control aligned with Phase 12 lifecycle operations.
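The coupling above can be sketched in OpenTofu using a conditional `count`. This is a minimal illustration, not the project's actual module: the variable, alarm name, dimension value, and threshold are assumptions; only the SNS topic name comes from this ADR.

```hcl
# Illustrative sketch: resource names and thresholds are assumptions,
# not the project's actual OpenTofu module layout.
variable "enable_eks" {
  description = "Master switch for the EKS lifecycle; canary resources follow it."
  type        = bool
  default     = false
}

# Existing FinOps SNS topic, looked up rather than created here.
data "aws_sns_topic" "budget_alerts" {
  name = "infrastructure-lab-budget-alerts"
}

# The alarm exists only while EKS does; `count` ties both lifecycles together,
# so `tofu destroy` of the EKS stack also removes the monitoring spend.
resource "aws_cloudwatch_metric_alarm" "canary_failed" {
  count               = var.enable_eks ? 1 : 0
  alarm_name          = "synthetics-canary-failed"
  namespace           = "CloudWatchSynthetics"
  metric_name         = "SuccessPercent"
  dimensions = {
    CanaryName = "uptime-canary"   # hypothetical canary name
  }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  threshold           = 100
  comparison_operator = "LessThanThreshold"
  alarm_actions       = [data.aws_sns_topic.budget_alerts.arn]
}
```

Routing `alarm_actions` to the existing topic is what unifies reliability and FinOps signals in one notification channel, as described above.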
## Consequences

### Positive
- Internal and external observability become explicit, versioned, and reproducible.
- Resource-constrained whitebox telemetry remains bounded by strict requests and limits.
- External uptime failures generate centralized notifications in existing SNS alert channels.
- Lifecycle coupling prevents orphaned canary spend during off periods.
### Negative
- Additional operational complexity across Kubernetes, Cloudflare, and AWS observability services.
- Two monitoring surfaces require documentation and routine verification.
- A single canary validates both endpoints per run, so isolating which endpoint caused a failure may require inspecting the canary's logs.
## Status
Accepted and implemented in Phase 13.