Memory Overcommit Alerting Thresholds
Rationale
Memory overcommit monitoring detects when the cluster is consuming too much memory relative to available capacity, which risks OOMKills or evictions if a node fails. Three-tier thresholds (info / warning / critical) let us respond proportionally; the table below defines the warning and critical tiers.
Threshold Definitions
| Level | Trigger | Env var key | Default | Rationale |
|---|---|---|---|---|
| Warning | cluster memory utilisation > 70% | WARNING_THRESHOLD | 0.70 | Early signal that burst capacity is shrinking. No immediate action required but triggers a warning alert to hermes-alerts. |
| Critical | cluster memory utilisation > 80% of single-node failure tolerance | CRITICAL_THRESHOLD | 0.80 | If one node dies, remaining nodes may not have enough capacity to evacuate workloads. Triggers critical Slack + task creation. |
The critical threshold uses a “single-node failure tolerance” model:
critical_limit = (total_memory - largest_node_memory) * 0.80
For the current cluster (3 nodes × ~96 GB, largest node = 96 GB):
critical_limit = (288 - 96) * 0.80 = 153.6 GB ≈ 53% utilisation
The env var CRITICAL_THRESHOLD=0.80 is the fraction of the capacity remaining after the largest node is lost at which the critical alert fires. The monitoring script combines this fraction with live node topology to compute the actual byte threshold at runtime.
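The single-node failure tolerance model can be sketched in Python as follows. This is a minimal illustration, not the actual check script: the function names critical_limit_gb and classify, the "ok" return value, and the hard-coded node list are assumptions made for the example, and a real script would read node capacities from the cluster at runtime.

```python
import os
from typing import Sequence


def critical_limit_gb(node_memory_gb: Sequence[float], critical_pct: float) -> float:
    """Memory (GB) above which critical fires, assuming the largest node is lost."""
    if not node_memory_gb:
        raise ValueError("node_memory_gb must not be empty")
    total = sum(node_memory_gb)
    # Capacity that survives the loss of the single largest node.
    surviving = total - max(node_memory_gb)
    return surviving * critical_pct


def classify(
    used_gb: float,
    node_memory_gb: Sequence[float],
    warning_pct: float,
    critical_pct: float,
) -> str:
    """Return "critical", "warning", or "ok" (hypothetical level names)."""
    total = sum(node_memory_gb)
    # Critical is checked first so it takes precedence when both conditions hold.
    if used_gb > critical_limit_gb(node_memory_gb, critical_pct):
        return "critical"
    if used_gb > total * warning_pct:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Same env var keys and defaults as listed in the threshold table.
    warning_pct = float(os.environ.get("WARNING_THRESHOLD", "0.70"))
    critical_pct = float(os.environ.get("CRITICAL_THRESHOLD", "0.80"))
    nodes = [96.0, 96.0, 96.0]  # assumed topology: 3 nodes of ~96 GB each
    limit = critical_limit_gb(nodes, critical_pct)
    print(f"critical limit: {limit:.1f} GB ({limit / sum(nodes):.1%} of total)")
```

Because the critical limit depends on the size of the largest node, it moves whenever nodes are added, removed, or resized, even if CRITICAL_THRESHOLD itself is unchanged.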
Configuration Storage
Thresholds are stored in a Kubernetes ConfigMap: memory-overcommit-thresholds (namespace: hermes).
- File: /opt/data/scripts/k8s/memory-overcommit-configmap.yaml
- K8s resource: configmap/hermes/memory-overcommit-thresholds
- Applied via: kubectl apply -f /opt/data/scripts/k8s/memory-overcommit-configmap.yaml
- Consumed by: the hermes-memory-overcommit-check CronJob env vars
How to update thresholds
- Edit the ConfigMap YAML file locally.
- Apply: kubectl apply -f /opt/data/scripts/k8s/memory-overcommit-configmap.yaml
- The next cron tick (every 30 min) picks up the new values; no restart is needed.
Monitoring Script Access
The CronJob reads thresholds from env vars injected by the ConfigMap reference in the container spec. The check script resolves them as:
import os

# Fall back to the documented defaults when a variable is unset.
warning_pct = float(os.environ.get("WARNING_THRESHOLD", "0.70"))
critical_pct = float(os.environ.get("CRITICAL_THRESHOLD", "0.80"))
cooldown_sec = int(os.environ.get("ALERT_COOLDOWN_SECONDS", "300"))
Related
- scripts/k8s/memory-overcommit-configmap.yaml: ConfigMap for thresholds
- scripts/k8s/memory-overcommit-cronjob.yaml: CronJob definition
- Task queue keys: memory-overcommit-threshold-config, memory-overcommit-slack-alert, memory-overcommit-deploy-cronjob