Cluster Health — Architecture & Status Review

What the project is

Continuous monitoring and auto-remediation for the hermes K8s namespace. The project aims to detect and remediate:

  • CrashLoopBackOff pods
  • OOMKilled pods
  • Disk / Longhorn volume faults
  • Stale jobs (completed, failed, running past deadline)
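The three stale-job categories above could be told apart from the Job object that `kubectl get jobs -o json` returns. The following is a minimal sketch, not the project's implementation: `classify_job` and `max_runtime_seconds` are hypothetical names, and the runtime limit is an assumed project-level setting, since the notes do not say where the deadline comes from.

```python
from datetime import datetime, timezone


def parse_k8s_time(ts):
    """Parse a Kubernetes RFC 3339 timestamp such as 2026-06-09T10:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def classify_job(job, now, max_runtime_seconds):
    """Return 'completed', 'failed', 'running_past_deadline', or None.

    max_runtime_seconds is a hypothetical project-level limit (an assumption).
    """
    status = job.get("status", {})
    # Finished jobs carry a condition of type Complete or Failed with status "True".
    for cond in status.get("conditions", []):
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Complete":
            return "completed"
        if cond.get("type") == "Failed":
            return "failed"
    # A job with active pods is still running; compare its age to the limit.
    start = status.get("startTime")
    if status.get("active", 0) > 0 and start:
        age = (now - parse_k8s_time(start)).total_seconds()
        if age > max_runtime_seconds:
            return "running_past_deadline"
    return None
```

A job that returns None is either still within its limit or has not started yet, so it would not be reported.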

Current state

Area | Status | Notes
Scope defined | ✅ Done | index.md created Jun 18
First scan run | ✅ Done | Session 2026-06-09: clean, 0 problem pods, 2/2 nodes Ready
Automated checks | ❌ Not started | t-002 “Decide which checks to automate” still todo
Remediation logic | ❌ None | No auto-remediation exists yet; manual scan only
Alerting / escalation | ❌ None | No Slack/Mercury integration for findings
Resource pressure trends | ❌ Not tracked | Suggested in last session but not implemented

Gaps & risks

  1. No automation — health scans are one-off manual runs (via autopilot session). A pod could CrashLoopBackOff for hours without detection.

  2. Overlap with ops-janitor — ops-janitor handles stale-job cleanup and queue hygiene. Boundary between “health check” and “janitorial task” is fuzzy. Risk of duplicate effort or conflicting remediation actions.

  3. No remediation logic — even if a problem is detected, nothing performs the actual fix (restart pod, scale down, delete stale job, etc.). In practice this makes it a monitoring-only project.

  4. No alerting — findings stay in session logs. No way to escalate to pvs or trigger Slack notifications when critical issues surface.

  5. Limited scope — only checks hermes namespace. Doesn’t cover gitlab, monitoring, or other namespaces where problems also matter.

Phase 1: Design & boundary clarity (metis)

  • Define clear separation from ops-janitor: cluster-health = detection + alerting; ops-janitor = janitorial cleanup actions.
  • Decide on check intervals: crash-loop/OOMKilled every 5 min, disk/Longhorn every 30 min, stale jobs every hour.
  • Draft remediation playbook: what action to take for each condition type (e.g., CrashLoopBackOff → restart pod + alert; OOMKilled → alert only; disk fault → alert).
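The intervals and playbook entries above could be captured as one small policy table that the scanner reads. This is only a sketch under assumptions: `CheckPolicy`, `POLICIES`, `due_checks`, and the action names are hypothetical, and the stale-job entry has no action because the draft playbook does not specify one yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckPolicy:
    interval_minutes: int  # how often the check runs
    actions: tuple         # playbook actions, in order (names are assumptions)


POLICIES = {
    "CrashLoopBackOff": CheckPolicy(5, ("restart_pod", "alert")),
    "OOMKilled": CheckPolicy(5, ("alert",)),
    "DiskOrLonghornFault": CheckPolicy(30, ("alert",)),
    "StaleJob": CheckPolicy(60, ()),  # no action defined in the draft playbook
}


def due_checks(minutes_since_start):
    """Return the checks whose interval divides the elapsed minutes."""
    return [
        name
        for name, policy in POLICIES.items()
        if minutes_since_start % policy.interval_minutes == 0
    ]
```

With a scheduler that ticks every 5 minutes, `due_checks(30)` would select the two pod checks plus the disk check, and `due_checks(60)` would select all four.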

Phase 2: Implement cron-based scanner (apollo)

  • Write a Python script that runs kubectl checks across configured namespaces.
  • Structured output (JSON) per check: condition, affected resources, severity.
  • Integrate with Mercury for task creation on critical findings.
  • Deploy as CronJob or wire into existing hermes-autopilot tick.
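One possible shape for the scanner's pod checks is sketched below. It reads the container status fields of `kubectl get pods -o json` and emits one JSON-ready record per condition. The function names, the output keys, and the severity mapping are assumptions made for illustration; the notes do not define them.

```python
import json
import subprocess

# Assumed severity mapping; the notes do not assign severities per condition.
SEVERITY = {"CrashLoopBackOff": "critical", "OOMKilled": "high"}


def fetch_pods(namespace):
    """Run kubectl and return the parsed pod list for one namespace."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def find_pod_problems(pod_list):
    """Group affected pods by condition and return structured findings."""
    affected = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        resource = f"{meta.get('namespace', '')}/{meta.get('name', '')}"
        for cs in pod.get("status", {}).get("containerStatuses", []):
            state = cs.get("state", {})
            reasons = (
                (state.get("waiting") or {}).get("reason"),
                (state.get("terminated") or {}).get("reason"),
                (cs.get("lastState", {}).get("terminated") or {}).get("reason"),
            )
            for reason in reasons:
                if reason in SEVERITY:
                    affected.setdefault(reason, set()).add(resource)
    return [
        {
            "condition": condition,
            "affected_resources": sorted(resources),
            "severity": SEVERITY[condition],
        }
        for condition, resources in sorted(affected.items())
    ]
```

Keeping `find_pod_problems` separate from the kubectl call means the parsing can be tested against saved JSON without cluster access. A run over several namespaces would call `find_pod_problems(fetch_pods(ns))` for each configured namespace and print the result with `json.dumps`.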

Phase 3: Alerting & escalation (mercury)

  • Configure Slack notifications via @slack_bot for Critical + High severity.
  • Wire to Mercury wiki_task_create for persistent tracking.
  • Add dedup: don’t re-alert on same issue within 4h window.
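The 4h dedup rule could be implemented by remembering when each (condition, resource) pair was last alerted. The sketch below keeps that state in memory, which is an assumption made for brevity; `AlertDeduper` and `should_alert` are hypothetical names, and the Slack and Mercury calls are intentionally left out.

```python
from datetime import datetime, timedelta, timezone

DEDUP_WINDOW = timedelta(hours=4)


class AlertDeduper:
    """Suppress repeat alerts for the same issue inside a time window."""

    def __init__(self, window=DEDUP_WINDOW):
        self.window = window
        self._last_sent = {}  # (condition, resource) -> time of last alert

    def should_alert(self, condition, resource, now=None):
        """Return True and record the time if an alert should be sent."""
        now = now or datetime.now(timezone.utc)
        key = (condition, resource)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # already alerted within the window
        self._last_sent[key] = now
        return True
```

If the scanner runs as a CronJob, each run starts a fresh process, so the last-sent map would need to be persisted somewhere (a file, a ConfigMap, or the tracking task itself) for the window to hold across runs.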