t-001 (stale-job sweep): DONE. Scanned 76 jobs across namespaces (52 Complete, 19 Failed, ~10 Running) and identified stale failed jobs; no cleanup has been applied yet.
t-002 (prune done/ tasks): TODO — not started.
t-003 (seed index.md): DONE.
todos.json is out of sync with index.md: its entries still show status: todo and owner: hermes, and t-001 should be marked done in the JSON as well.
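Since t-002 has not been started, a rough sketch of what the done/ pruner might look like could help scope it. It assumes the 30-day TTL from the phased plan, that done/ holds one file per finished task, and that file modification time is an acceptable proxy for completion date. The function names are hypothetical. Nothing is deleted unless apply=True is passed.

```python
import time
from pathlib import Path

TTL_DAYS = 30  # taken from the phased plan; an assumption until t-002 is specified


def find_expired(done_dir, ttl_days=TTL_DAYS, now=None):
    """Return files directly under done_dir whose mtime is older than ttl_days."""
    now = time.time() if now is None else now
    cutoff = now - ttl_days * 86400
    return sorted(
        p for p in Path(done_dir).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def prune(done_dir, ttl_days=TTL_DAYS, apply=False, now=None):
    """List expired task files; delete them only when apply=True."""
    expired = find_expired(done_dir, ttl_days, now)
    if apply:
        for path in expired:
            path.unlink()
    return expired
```

Calling prune("done") returns the candidates without touching them, so the list can be reviewed before a second call with apply=True.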
Gaps / Risks
Stale jobs identified but not cleaned up: the cluster still carries 52 Complete + 19 Failed Jobs, which consume etcd space.
No recurring cadence defined (one-off sweep, no CronJob/scheduler).
todos.json drifts from actual progress — unreliable as source of truth.
All tasks owned by “hermes” (now obsolete); need owner split across apollo/mercury/metis per policy.
No dry-run or approval gate documented before deletion.
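The drift and ownership gaps above could be detected mechanically rather than by eye. The sketch below is one possible check, not the existing tooling. It assumes todos.json parses to a list of objects with "id", "status", and "owner" keys (the real schema may differ) and takes the statuses from this report as the expected values. It only flags obsolete owners; it does not pick replacements, since the apollo/mercury/metis split is a policy decision.

```python
# Statuses as stated in this report. The JSON shape (a list of objects with
# "id", "status", "owner") is an assumption about todos.json.
REPORTED = {"t-001": "done", "t-002": "todo", "t-003": "done"}
OBSOLETE_OWNERS = {"hermes"}


def find_drift(todos, reported=REPORTED, obsolete=OBSOLETE_OWNERS):
    """Return (task_id, field, current_value, expected_value) per mismatch.

    expected_value is None for owner issues: the new owner is not decided here.
    """
    issues = []
    for task in todos:
        expected = reported.get(task["id"])
        if expected is not None and task.get("status") != expected:
            issues.append((task["id"], "status", task.get("status"), expected))
        if task.get("owner") in obsolete:
            issues.append((task["id"], "owner", task["owner"], None))
    return issues


def sync_status(todos, reported=REPORTED):
    """Return a copy of todos with statuses overwritten from the report."""
    return [
        {**task, "status": reported.get(task["id"], task.get("status"))}
        for task in todos
    ]
```

Running find_drift on the parsed file before and after a sync gives a simple way to confirm that the only remaining issues are owner reassignments.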
Recommended approach
Fix data integrity first — sync todos.json to reflect actual done state, correct owners.
Execute remaining sweep — apply cleanup with dry-run output reviewed by pvs.
Automate — turn stale-job sweep into a scheduled CronJob with safe thresholds (age > N days, status filter).
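The status filter, age threshold, and dry-run gate described above can be sketched as follows. This is an illustration under stated assumptions, not the sweep that was actually run: it expects the parsed output of `kubectl get jobs -A -o json`, treats a Job as failed when it has a Failed condition with status "True", and measures age from metadata.creationTimestamp. The threshold N is left as a parameter because no value has been agreed. Function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse a Kubernetes timestamp such as 2024-01-01T00:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_failed(job):
    """True when the Job carries a Failed condition with status "True"."""
    conditions = job.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Failed" and c.get("status") == "True" for c in conditions
    )


def select_stale(job_list, max_age_days, now=None):
    """Return (namespace, name) for failed Jobs created more than max_age_days ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for job in job_list.get("items", []):
        meta = job["metadata"]
        if is_failed(job) and parse_ts(meta["creationTimestamp"]) < cutoff:
            stale.append((meta["namespace"], meta["name"]))
    return stale


def delete_commands(stale, approved=False):
    """Build kubectl argv lists; stay in client-side dry-run until approved."""
    extra = [] if approved else ["--dry-run=client"]
    return [["kubectl", "delete", "job", name, "-n", ns, *extra] for ns, name in stale]
```

Keeping the commands as argv lists means the dry-run output can be printed for review first, and the same list regenerated with approved=True only after sign-off.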
Phased plan
Phase | Goal | Owner
1 | Sync todos.json + index.md, correct owners | apollo
2 | Run actual cleanup on stale failed jobs (dry-run → review → apply) | apollo
3 | Implement done/ task pruner (30-day TTL) | apollo
4 | Convert sweep to CronJob with alerting | mercury
5 | Audit full scope: add CRD cleanup, orphaned PVCs if needed |