Memory Overcommit Alerting Thresholds

Rationale

Memory overcommit monitoring detects when the cluster is consuming too much memory relative to available capacity, risking OOMKill or eviction under a node failure scenario. Three-tier thresholds (info / warning / critical) let us respond proportionally.

Threshold Definitions

| Level | Trigger | Env var key | Default | Rationale |
|---|---|---|---|---|
| Warning | cluster memory utilisation > 70% | WARNING_THRESHOLD | 0.70 | Early signal that burst capacity is shrinking. No immediate action required but triggers a warning alert to hermes-alerts. |
| Critical | cluster memory utilisation > 80% of single-node failure tolerance | CRITICAL_THRESHOLD | 0.80 | If one node dies, remaining nodes may not have enough capacity to evacuate workloads. Triggers critical Slack + task creation. |

The critical threshold uses a “single-node failure tolerance” model:

critical_limit = (total_memory - largest_node_memory) * 0.80

For the current cluster (3 nodes × ~96 GB, largest node = 96 GB):

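critical_limit = (288 - 96) * 0.80 = 153.6 GB ≈ 53% utilisation

On this topology the critical limit (≈ 53% of total memory) sits below the 70% warning threshold, so the critical alert fires first.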

The env var CRITICAL_THRESHOLD=0.80 is the fraction of the capacity remaining after the largest node is lost at which the critical alert fires; it is not a fraction of total cluster memory. The monitoring script combines this value with live node topology to compute the actual byte threshold at runtime.
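The single-node failure tolerance formula can be sketched in a few lines of Python. This is an illustration only, not the monitoring script itself: the function name critical_limit_gb and the use of GB rather than bytes are assumptions made for readability.

```python
def critical_limit_gb(node_memory_gb, critical_fraction=0.80):
    """Return the critical threshold in GB under the single-node failure model.

    node_memory_gb: memory capacity of each node, in GB.
    critical_fraction: the CRITICAL_THRESHOLD value (default 0.80).
    """
    if not node_memory_gb:
        raise ValueError("need at least one node")
    total = sum(node_memory_gb)
    # Capacity that remains if the largest node fails.
    remaining = total - max(node_memory_gb)
    return remaining * critical_fraction


nodes = [96, 96, 96]  # GB per node, matching the 3-node cluster described above
limit = critical_limit_gb(nodes)
share = limit / sum(nodes)
print(f"critical_limit = {limit:.1f} GB ({share:.0%} of total memory)")
```

Because the limit is derived from the node list, adding or resizing nodes changes the effective threshold without any change to CRITICAL_THRESHOLD.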

Configuration Storage

Thresholds are stored in a Kubernetes ConfigMap: memory-overcommit-thresholds (namespace: hermes).

  • File: /opt/data/scripts/k8s/memory-overcommit-configmap.yaml
  • K8s resource: configmap/hermes/memory-overcommit-thresholds
  • Applied via: kubectl apply -f /opt/data/scripts/k8s/memory-overcommit-configmap.yaml
  • Consumed by: the hermes-memory-overcommit-check CronJob env vars
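
The YAML file itself is not reproduced in this document. As a rough sketch of the shape such a ConfigMap could take, the Python snippet below builds an equivalent manifest as a dict and prints it as JSON. The resource name and namespace come from the text above; the assumption that the data keys match the env var names one-to-one is ours and may not reflect the real file.

```python
import json

# Hypothetical reconstruction: the real memory-overcommit-configmap.yaml
# is not shown here, so the "data" layout below is an assumption.
manifest = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {
        "name": "memory-overcommit-thresholds",
        "namespace": "hermes",
    },
    # ConfigMap data values must be strings, so the fractions are quoted.
    "data": {
        "WARNING_THRESHOLD": "0.70",
        "CRITICAL_THRESHOLD": "0.80",
    },
}

print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so output like this could in principle be piped to kubectl apply -f - for a quick experiment in a test namespace.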

How to update thresholds

  1. Edit the ConfigMap YAML file locally.
  2. Apply: kubectl apply -f /opt/data/scripts/k8s/memory-overcommit-configmap.yaml
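  3. The next cron tick (every 30 min) picks up the new values: each CronJob run starts a fresh pod, which reads its env vars from the ConfigMap at start-up, so no restart is needed.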

Monitoring Script Access

The CronJob reads thresholds from env vars injected by the ConfigMap reference in the container spec. The check script resolves them as:

import os

# Each value falls back to the documented default when the env var is unset.
warning_pct = float(os.environ.get("WARNING_THRESHOLD", "0.70"))
critical_pct = float(os.environ.get("CRITICAL_THRESHOLD", "0.80"))
cooldown_sec = int(os.environ.get("ALERT_COOLDOWN_SECONDS", "300"))
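
Once the fractions are resolved, they have to be turned into an alert level. The sketch below shows one way this could look. The helper names load_thresholds and classify are hypothetical, as are the range check and the choice to test the critical limit before the warning limit; the actual check script may differ.

```python
import os


def load_thresholds(env=os.environ):
    """Read the threshold fractions, using the documented defaults as fallbacks."""
    warning = float(env.get("WARNING_THRESHOLD", "0.70"))
    critical = float(env.get("CRITICAL_THRESHOLD", "0.80"))
    for name, value in (("WARNING_THRESHOLD", warning),
                        ("CRITICAL_THRESHOLD", critical)):
        # Assumed sanity check: thresholds are fractions, not percentages.
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    return warning, critical


def classify(used_gb, total_gb, critical_limit_gb, warning_fraction):
    """Map current memory usage to an alert level."""
    # Critical is checked first so it wins when both limits are exceeded.
    if used_gb > critical_limit_gb:
        return "critical"
    if used_gb > total_gb * warning_fraction:
        return "warning"
    return "ok"


warning, critical = load_thresholds()
total_gb = 288.0                               # 3 nodes x 96 GB
critical_limit_gb = (total_gb - 96.0) * critical  # single-node failure model
print(classify(120.0, total_gb, critical_limit_gb, warning))
```

Rejecting values outside (0, 1] guards against a common slip, such as setting WARNING_THRESHOLD=70 instead of 0.70.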

Related Files and Tasks

  • scripts/k8s/memory-overcommit-configmap.yaml: ConfigMap for thresholds
  • scripts/k8s/memory-overcommit-cronjob.yaml: CronJob definition
  • Task queue keys: memory-overcommit-threshold-config, memory-overcommit-slack-alert, memory-overcommit-deploy-cronjob