Cluster Health — Architecture & Status Review

What the project is

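Continuous monitoring and auto-remediation for the hermes K8s namespace. The goal is to detect and remediate problem conditions such as crash-looping or OOMKilled pods, disk/Longhorn faults, and stale jobs.

Current status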

Area	Status	Notes
Scope defined	✅ Done	index.md created Jun 18
First scan run	✅ Done	Session 2026-06-09 — clean: 0 problem pods, 2/2 nodes Ready
Automated checks	❌ Not started	t-002 “Decide which checks to automate” still todo
Remediation logic	❌ None	No auto-remediation exists yet; manual scan only
Alerting / escalation	❌ None	No Slack/Mercury integration for findings
Resource pressure trends	❌ Not tracked	Suggested in last session but not implemented

Gaps

No automation: health scans are one-off manual runs (via an autopilot session), so a pod could sit in CrashLoopBackOff for hours without detection.
Overlap with ops-janitor: ops-janitor handles stale-job cleanup and queue hygiene, and the boundary between a “health check” and a “janitorial task” is fuzzy. This risks duplicate effort or conflicting remediation actions.
No remediation logic: even when a problem is detected, nothing performs the actual fix (restart pod, scale down, delete stale job, etc.). In practice this is a monitoring-only project.
No alerting: findings stay in session logs. There is no way to escalate to pvs or trigger Slack notifications when critical issues surface.
Limited scope: only the hermes namespace is checked. The gitlab, monitoring, and other namespaces, where problems also matter, are not covered.
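
To make the detection gap concrete, here is a minimal sketch of how crash-looping or OOMKilled containers could be picked out of the JSON returned by `kubectl get pods -n hermes -o json`. The function name `find_problem_pods` and the sample pod names are hypothetical, and nothing like this exists in the project yet. Only the `containerStatuses` field layout is taken from the Kubernetes pod status schema.

```python
from typing import Any


def find_problem_pods(pod_list: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (pod_name, condition) pairs for containers in a known-bad state.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    """
    findings: list[tuple[str, str]] = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unknown>")
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            last_terminated = cs.get("lastState", {}).get("terminated") or {}
            # A container stuck restarting reports this waiting reason.
            if waiting.get("reason") == "CrashLoopBackOff":
                findings.append((name, "CrashLoopBackOff"))
            # The previous container instance was killed for exceeding its memory limit.
            if last_terminated.get("reason") == "OOMKilled":
                findings.append((name, "OOMKilled"))
    return findings


# Hypothetical sample data, shaped like real kubectl output but heavily trimmed.
sample = {
    "items": [
        {"metadata": {"name": "api-0"},
         "status": {"containerStatuses": [
             {"state": {"waiting": {"reason": "CrashLoopBackOff"}}, "lastState": {}}]}},
        {"metadata": {"name": "worker-1"},
         "status": {"containerStatuses": [
             {"state": {"running": {}},
              "lastState": {"terminated": {"reason": "OOMKilled"}}}]}},
        {"metadata": {"name": "ok-2"},
         "status": {"containerStatuses": [
             {"state": {"running": {}}, "lastState": {}}]}},
    ]
}

problems = find_problem_pods(sample)
```

Keeping the function pure (parsed JSON in, findings out) means the same logic could be reused for other namespaces by changing only the `kubectl` invocation that feeds it.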

Next steps

Define a clear separation from ops-janitor: cluster-health = detection + alerting; ops-janitor = janitorial cleanup actions.
Decide on check intervals: crash-loop/OOMKilled every 5 min, disk/Longhorn every 30 min, stale jobs every hour.
Draft remediation playbook: what action to take for each condition type (e.g., CrashLoopBackOff → restart pod + alert; OOMKilled → alert only; disk fault → alert).
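
The proposed intervals and playbook could be captured as plain data, which keeps the decisions reviewable before any automation is written. The sketch below is one possible shape, not an agreed design: the check names, the action names (`restart_pod`, `alert`), the helper functions, and the alert-only fallback for unknown conditions are all assumptions. The intervals and condition-to-action pairs come from the items above.

```python
# Check intervals in seconds, taken from the proposal above.
CHECK_INTERVAL_SECONDS: dict[str, int] = {
    "crash_loop": 5 * 60,
    "oom_killed": 5 * 60,
    "disk_longhorn": 30 * 60,
    "stale_jobs": 60 * 60,
}

# Condition -> ordered actions, taken from the draft playbook examples.
PLAYBOOK: dict[str, tuple[str, ...]] = {
    "CrashLoopBackOff": ("restart_pod", "alert"),
    "OOMKilled": ("alert",),
    "disk_fault": ("alert",),
}


def due_checks(last_run: dict[str, float], now: float) -> list[str]:
    """Return the names of checks whose interval has elapsed, sorted by name.

    A check that has never run (missing from `last_run`) is always due.
    """
    return sorted(
        name
        for name, interval in CHECK_INTERVAL_SECONDS.items()
        if now - last_run.get(name, float("-inf")) >= interval
    )


def actions_for(condition: str) -> tuple[str, ...]:
    """Look up the playbook actions for a condition.

    Assumption: conditions not in the playbook fall back to alert-only,
    so nothing is remediated automatically without an explicit entry.
    """
    return PLAYBOOK.get(condition, ("alert",))


# Example: all checks last ran at t=0 and 30 minutes have passed.
last_run = {name: 0.0 for name in CHECK_INTERVAL_SECONDS}
due_now = due_checks(last_run, now=30 * 60)
```

In this example `due_now` contains the two 5-minute checks and the 30-minute Longhorn check, while the hourly stale-job check is not yet due.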