Queue Observability Snapshot — 2026-06-24

Queue Depth

QueueCount
done446
active7
blocked6
queued1

Throughput (last 24h)

  • Tasks completed: 67 (healthy throughput)

DLQ Signals (last 24h, scanned files in done/)

SignalCountStatus
API connection failures12⚠️ elevated
HTTP 5000✅ normal
Timeouts0✅ normal
Tool missing (exit 127)0✅ normal
Unknown failure tag0✅ clean (classifier working)
CrashLoopBackOff1low
OOMKilled0✅ normal
Assertion failed0✅ normal
CI test timeout0✅ normal

Stuck Tasks (>4h idle in active/)

  • c1790169.md — 13.0h idle (decomposed parent not auto-archived)
  • fix-test-assertion-failed-20260623.md — 11.8h idle
  • fix-test-ci-timeout-20260623.md — 9.6h idle
  • crashloop-hermes-qdrant-2026-06-22.md — 5.8h idle

Blocked Queue (oldest first)

FileAgePriority
calliope-pvc-io-error-2026-06-18.md6.0d⚠️ >4d, needs pvs review
add-gpu-node-health-check-fallback-before-ocr-fa.md1.9d
b4ed689e-diagnose-hermes-agent-crashloopbackoff-27-restarts.md1.4d
investigate-stuck-tasks-2026-06-23.md0.9d
fix-infra-crashloopbackoff-20260623.md0.7d

Metrics File

  • Last modified: 3.3h ago (fresh)
  • Total lines: 540 (cumulative)

Anomalies

  1. API connection failures elevated — 12 in 24h vs baseline ~3/day. Not a spike per se, but above normal. Likely transient network blips or API gateway retries. Monitor next run.
  2. 4 stuck tasks in active queue — idle >4h without heartbeat. c1790169 is a decomposed parent that should have been archived to done/ per dispatcher convention. The other 3 are fix-it cards that may be waiting on human follow-up or external blockers.

Action Items

  • Low priority: Move c1790169.md from active→done (decomposed, children exist)
  • Monitor: API connection failure count next session — if >20, investigate infra
  • Stale blocked item: calliope-pvc-io-error at 6d old — flag for pvs review per skill convention