Cluster Health Session - 2026-06-23

Issue: hermes-memory-overcommit-check Init:Error (init container pip failure)

Task: d826c7f8

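Root Cause: The hermes-memory-overcommit-check CronJob uses the python:3.12-slim base image for its init container install-deps, which runs a bare pip install -q --no-cache-dir kubernetes. The container fails with ModuleNotFoundError: No module named 'pip', meaning the pip module is not importable in the image as pulled. The image is also reported to carry the PEP 668 marker (an EXTERNALLY-MANAGED file), but that marker alone would raise an externally-managed-environment error, not a missing-module error, so the missing pip module is the operative failure.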
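To confirm what the interpreter inside the image actually sees, a small probe can be run in a throwaway container. The sketch below is a hypothetical diagnostic (the name pip_status is an assumption, not part of the CronJob); it reports whether the pip module is importable and whether a PEP 668 EXTERNALLY-MANAGED marker file sits in the standard-library directory.

```python
import importlib.util
import sysconfig
from pathlib import Path


def pip_status() -> dict:
    """Report pip availability and PEP 668 marker presence for this interpreter."""
    stdlib_dir = Path(sysconfig.get_path("stdlib"))
    return {
        # True when `python3 -m pip` would be able to import the module.
        "pip_importable": importlib.util.find_spec("pip") is not None,
        # PEP 668 flags an externally managed environment with this marker file.
        "externally_managed": (stdlib_dir / "EXTERNALLY-MANAGED").exists(),
    }
```

Calling print(pip_status()) in the image would show which of the two conditions applies; a False value for pip_importable would match the ModuleNotFoundError seen in the init container.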

Investigation Summary:

  • Last schedule time: 2026-06-23T01:30:00Z (failed)
  • CronJob spec unchanged since creation 31d ago
  • failedJobsHistoryLimit is 1, so only the most recent failed Job is retained

Proposed Fix Options:

Option A - Use python3 -m ensurepip before pip install:

initContainers:
- name: install-deps
  command:
  - sh
  - -c
  # Bootstrap pip first; && stops the chain if ensurepip fails instead of hiding the error.
  - "python3 -m ensurepip --default-pip && python3 -m pip install -q --no-cache-dir kubernetes"
  image: python:3.12-slim
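The bootstrap step in Option A can also be sketched in Python, which makes the "only if missing" check explicit and lets a failed bootstrap surface as an exception. This is a minimal sketch, not the CronJob's actual code: bootstrap_pip and its injectable find_spec and run parameters are hypothetical names added so the logic can be exercised without touching the real environment.

```python
import importlib.util
import subprocess
import sys


def bootstrap_pip(find_spec=importlib.util.find_spec, run=subprocess.run) -> bool:
    """Install pip via ensurepip only when it is missing.

    Returns True if ensurepip was invoked, False if pip was already present.
    With the default runner, a failed bootstrap raises CalledProcessError.
    """
    if find_spec("pip") is not None:
        return False
    # check=True surfaces a failed bootstrap instead of silently continuing.
    run([sys.executable, "-m", "ensurepip", "--default-pip"], check=True)
    return True
```

With the defaults, bootstrap_pip() is a no-op on an image that already has pip, so it should be safe to keep in place if the base image changes later.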

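Option B - Switch to python:3.12-slim-bookworm, which pins the Debian release and is expected to include pip (the base may differ from what python:3.12-slim currently resolves to, so verify before rollout): same spec, different image tag.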

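Option C - Use get-pip.py from a trusted source (python:3.12-slim does not ship curl, so the download goes through Python's urllib):

initContainers:
- name: install-deps
  command:
  - sh
  - -c
  - "python3 -c \"import urllib.request as r; r.urlretrieve('https://bootstrap.pypa.io/get-pip.py', '/tmp/get-pip.py')\" && python3 /tmp/get-pip.py && python3 -m pip install -q --no-cache-dir kubernetes"
  image: python:3.12-slim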
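Option C does not say how trust in the downloaded script would be established. One possible approach, assumed here purely for illustration, is to compare the download against a SHA-256 digest recorded ahead of time by the operator before executing it. The helper name verify_sha256 and the idea of a pinned digest are hypothetical and not part of the existing spec.

```python
import hashlib


def verify_sha256(payload: bytes, expected_hex: str) -> bytes:
    """Return payload unchanged if its SHA-256 digest matches expected_hex."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_hex.lower():
        # Refuse to hand back a script whose content differs from the pinned copy.
        raise ValueError(f"checksum mismatch: got {actual}")
    return payload
```

A pinned digest has to be updated whenever the upstream script changes, which adds maintenance that Option A avoids.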

Recommendation: Option A is the safest choice. It uses the built-in ensurepip module, needs no external download of get-pip.py, and keeps the same base image.

Status: Blocked awaiting pvs sign-off per infrastructure policy (CronJob modification in hermes namespace). Waiting for explicit permission to apply fix.