Cluster Health — Architecture & Status Review
What the project is
Continuous monitoring + auto-remediation for the hermes K8s namespace. Detect and remediate:
- CrashLoopBackOff pods
- OOMKilled pods
- Disk / Longhorn volume faults
- Stale jobs (completed, failed, running past deadline)
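The first two conditions can be read directly from pod status. The following is a minimal sketch, assuming the input is a single pod object parsed from `kubectl get pods -o json`; `classify_pod` is a hypothetical helper name, not existing project code.

```python
from typing import Any


def classify_pod(pod: dict[str, Any]) -> list[str]:
    """Return the problem conditions found in one pod object (parsed JSON)."""
    problems: set[str] = set()
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        # A crash-looping container sits in the "waiting" state with this reason.
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            problems.add("CrashLoopBackOff")
        # An OOM kill is recorded on the terminated state, usually under
        # lastState once the container has been restarted.
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                problems.add("OOMKilled")
    return sorted(problems)


# Illustrative pod object; the name and values are made up for the example.
sample_pod = {
    "metadata": {"name": "worker-1"},
    "status": {
        "containerStatuses": [
            {
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled"}},
            }
        ]
    },
}
print(classify_pod(sample_pod))
```

Disk/Longhorn faults and stale jobs would need their own checks against different resource types, so they are left out of this sketch.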
Current state
| Area | Status | Notes |
|---|---|---|
| Scope defined | ✅ Done | index.md created Jun 18 |
| First scan run | ✅ Done | Session 2026-06-09 — clean: 0 problem pods, 2/2 nodes Ready |
| Automated checks | ❌ Not started | t-002 “Decide which checks to automate” still todo |
| Remediation logic | ❌ None | No auto-remediation exists yet; manual scan only |
| Alerting / escalation | ❌ None | No Slack/Mercury integration for findings |
| Resource pressure trends | ❌ Not tracked | Suggested in last session but not implemented |
Gaps & risks
- No automation: health scans are one-off manual runs (via autopilot session). A pod could CrashLoopBackOff for hours without detection.
- Overlap with ops-janitor: ops-janitor handles stale-job cleanup and queue hygiene. The boundary between “health check” and “janitorial task” is fuzzy, which risks duplicate effort or conflicting remediation actions.
- No remediation logic: even if a problem is detected, nothing performs the actual fix (restart pod, scale down, delete stale job, etc.). In practice this makes it a monitoring-only project.
- No alerting: findings stay in session logs. There is no way to escalate to pvs or trigger Slack notifications when critical issues surface.
- Limited scope: only the `hermes` namespace is checked. gitlab, monitoring, and other namespaces where problems also matter are not covered.
Recommended approach
Phase 1: Design & boundary clarity (metis)
- Define clear separation from ops-janitor: cluster-health = detection + alerting; ops-janitor = janitorial cleanup actions.
- Decide on check intervals: crash-loop/OOMKilled every 5 min, disk/Longhorn every 30 min, stale jobs every hour.
- Draft remediation playbook: what action to take for each condition type (e.g., CrashLoopBackOff → restart pod + alert; OOMKilled → alert only; disk fault → alert).
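The intervals and playbook above could be captured as plain data so that the scanner and the alerting layer read from one place. This is only a sketch: the `Rule` type, the action names, and the `DiskFault` / `StaleJob` keys are illustrative assumptions, and the stale-job action is left empty because the draft does not specify one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    interval_minutes: int          # how often the check should run
    actions: tuple[str, ...]       # what to do when the condition is found


# Values taken from the Phase 1 bullets; names are hypothetical.
PLAYBOOK: dict[str, Rule] = {
    "CrashLoopBackOff": Rule(5, ("restart_pod", "alert")),
    "OOMKilled": Rule(5, ("alert",)),
    "DiskFault": Rule(30, ("alert",)),
    "StaleJob": Rule(60, ()),      # action not decided in the draft
}


def due_checks(minute: int) -> list[str]:
    """Return the conditions whose interval divides the given minute counter."""
    return sorted(
        name for name, rule in PLAYBOOK.items()
        if minute % rule.interval_minutes == 0
    )
```

With a scheduler that ticks once a minute, `due_checks(30)` would select the crash-loop, OOM, and disk checks but not the hourly stale-job check.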
Phase 2: Implement cron-based scanner (apollo)
- Write a Python script that runs `kubectl` checks across configured namespaces.
- Structured output (JSON) per check: condition, affected resources, severity.
- Integrate with Mercury for task creation on critical findings.
- Deploy as CronJob or wire into existing hermes-autopilot tick.
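A rough sketch of what the Phase 2 scanner might look like, covering only the crash-loop check. It assumes `kubectl` is on the PATH with access to the target namespaces. The `Finding` fields mirror the structured output listed above; the function names and the `"high"` severity label are assumptions. The Mercury integration is omitted because its API is not described here.

```python
import json
import subprocess
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    condition: str      # e.g. "CrashLoopBackOff"
    namespace: str
    resources: list     # names of affected pods
    severity: str       # assumed label; the real scale is still undecided


def kubectl_json(args: list) -> dict:
    """Run kubectl with JSON output and return the parsed document."""
    result = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def scan_crashloops(
    namespace: str, fetch: Callable[[list], dict] = kubectl_json
) -> Optional[Finding]:
    """Return one Finding listing crash-looping pods, or None if there are none."""
    pods = fetch(["get", "pods", "-n", namespace])
    affected = []
    for pod in pods.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses") or []:
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                affected.append(pod["metadata"]["name"])
                break  # one hit per pod is enough
    if not affected:
        return None
    return Finding("CrashLoopBackOff", namespace, affected, "high")


def scan(namespaces: list, fetch: Callable[[list], dict] = kubectl_json) -> str:
    """Scan every namespace and return the findings as a JSON array."""
    findings = []
    for ns in namespaces:
        finding = scan_crashloops(ns, fetch)
        if finding is not None:
            findings.append(asdict(finding))
    return json.dumps(findings, indent=2)
```

Passing `fetch` as a parameter keeps the check testable without a cluster; a CronJob entry point would simply call `scan(["hermes", ...])` and print or forward the result.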
Phase 3: Alerting & escalation (mercury)
- Configure Slack notifications via `@slack_bot` for Critical + High severity.
- Wire to Mercury `wiki_task_create` for persistent tracking.
- Add dedup: don’t re-alert on same issue within 4h window.
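The 4h dedup rule can be sketched as a small in-memory filter. The class name and the key format are assumptions. Because a CronJob starts a fresh process on each run, a real deployment would likely need to persist the timestamps (for example in a file or ConfigMap) rather than keep them in memory as done here.

```python
import time
from typing import Callable


class AlertDeduper:
    """Suppress repeat alerts for the same issue inside a time window."""

    def __init__(
        self,
        window_seconds: float = 4 * 3600,          # 4h window from the plan
        clock: Callable[[], float] = time.time,    # injectable for testing
    ) -> None:
        self._window = window_seconds
        self._clock = clock
        self._last_sent: dict[str, float] = {}

    def should_alert(self, key: str) -> bool:
        """Return True and record the time if no alert was sent for key recently."""
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._window:
            return False
        self._last_sent[key] = now
        return True


# Hypothetical key format: namespace/resource/condition.
deduper = AlertDeduper()
if deduper.should_alert("hermes/worker-1/CrashLoopBackOff"):
    print("send alert")
```

The window is measured from the last alert actually sent, so an issue that persists is re-alerted once every four hours rather than silenced indefinitely.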