Queue Infrastructure Observability — Daily Snapshot 2026-05-30

Executive Summary

Metric | Value | Healthy?
Throughput (tasks done/24h) | 26 | ✅ OK
DLQ signals (24h) | 6 (api:3, crashloop:2, timeout:1) | ⚠️ Moderate: API failures persistent
Stuck tasks (>4h no mtime update) | 5 | ❌ ALERT: all 5 active items stuck
Blocked queue depth | 19 | ❌ HIGH: alert condition (>10)
“unknown” failure tags | 0 | ✅ Classifier stable

Queue Depth

State | Count
Done | 1,849
Active | 5 (all stuck >4h)
Blocked | 19
Queued | 0

Tick Health (last 24h in metrics.jsonl)

No new tick events since 2026-05-28T06:43. The emitter appears to have been silent for ~37 hours. This is not unusual: the emitter has been intermittently down since Phase 0. Filesystem-based scanning (this report’s method) compensates for the gap.
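A minimal sketch of how the emitter gap could be computed from metrics.jsonl. It assumes each line is a JSON object with an ISO-8601 timestamp under a key named "ts"; that key name and the function name are hypothetical, since this report does not show the file's schema.

```python
import json
from datetime import datetime, timezone


def hours_since_last_tick(lines, now):
    """Return hours between `now` and the newest tick, or None if no tick parses."""
    latest = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the scan
        if not isinstance(event, dict):
            continue
        ts = event.get("ts")  # "ts" is an assumed field name
        if not isinstance(ts, str):
            continue
        try:
            when = datetime.fromisoformat(ts)
        except ValueError:
            continue
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # assumption: naive stamps are UTC
        if latest is None or when > latest:
            latest = when
    if latest is None:
        return None
    return (now - latest).total_seconds() / 3600
```

Passing `now` explicitly keeps the function testable; a caller would typically supply `datetime.now(timezone.utc)` and the lines of an open file handle.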

Last 3 recorded ticks:

Timestamp | Worker | Outcome | Task
2026-05-29T13:21 | hermes-autopilot-19094 | success | VT-011 personas (4fd3d183)
2026-05-28T06:43 | hermes-autopilot-18252 | success | project-wiki-lint-daily
2026-05-28T02:30 | hermes-autopilot-16567 | ambiguous | VT-011 personas (4fd3d183)

Note: All recent ticks have duration_s: 0 and output_first_line: "test-tick". These are stub-only ticks from the autopilot dispatcher placeholder, not real work output. Actual progress is happening in filesystem-level queue processing (wiki-cleanup tasks, VT-011 visual tests), but the emitter may not be capturing it.
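The stub pattern described above can be separated from real ticks with a small filter. This is a sketch only: the field names duration_s and output_first_line come from this report, while the function names and the assumption that each tick is already parsed into a dict are illustrative.

```python
def is_stub_tick(event):
    """True when a tick matches the placeholder pattern noted in this report."""
    return (
        event.get("duration_s") == 0
        and event.get("output_first_line") == "test-tick"
    )


def split_ticks(events):
    """Partition parsed tick dicts into (real, stubs), preserving order."""
    real, stubs = [], []
    for event in events:
        (stubs if is_stub_tick(event) else real).append(event)
    return real, stubs
```

If `real` comes back empty for a 24h window, the tick log says nothing about actual throughput and the filesystem scan remains the only usable signal.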

DLQ Scan (last 24h done files — 26 tasks)

Failure signature | Count
API connection failure | 3
CrashLoopBackOff / OOMKilled / BackOff | 2
Timeout | 1
HTTP 500 | 0 ✅ (was 146 on May 25)
Tool missing (exit 127) | 0
“unknown” tag | 0 ✅ (classifier still clean)

DLQ Detail

  • API failures (3 signals across 2 tasks): Tasks c1402cf2 and f7a5a4d7. Both are wiki-cleanup tasks where the worker called out to an external API. Likely transient infra. Same pattern as prior days, not escalating.
  • CrashLoopBackOff (2): Tasks 5d5ed1fc and efcf52a8 (“investigate-image-pull-failures”). Both are pod crashloop tasks, expected given the ongoing K8s cluster issues.
  • Timeout (1): Task d6959ebe (“delete-jobs-running-too-long-without-progress”) timed out.
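For reference, a signature scan of this kind can be sketched as an ordered list of regex rules. The tag names mirror the summary line (api, crashloop, timeout); the regular expressions themselves are assumptions for illustration, not the classifier's actual rules, and the input is assumed to be text already known to describe a failure.

```python
import re
from collections import Counter

# Ordered rules: the first matching pattern wins. Patterns are illustrative guesses.
SIGNATURES = [
    ("crashloop", re.compile(r"CrashLoopBackOff|OOMKilled|BackOff")),
    ("tool_missing", re.compile(r"exit (?:code |status )?127|command not found")),
    ("http_500", re.compile(r"\bHTTP 500\b|\b500 Internal Server Error\b")),
    ("timeout", re.compile(r"timed out|timeout", re.IGNORECASE)),
    ("api", re.compile(r"connection (?:refused|reset|error)|APIConnectionError", re.IGNORECASE)),
]


def classify(failure_text):
    """Return the first matching failure tag, or "unknown" if no rule matches."""
    for tag, pattern in SIGNATURES:
        if pattern.search(failure_text):
            return tag
    return "unknown"


def summarize(failure_texts):
    """Count tags and render them as 'tag:count' pairs, most frequent first."""
    counts = Counter(classify(text) for text in failure_texts)
    summary = ", ".join(f"{tag}:{n}" for tag, n in counts.most_common())
    return counts, summary
```

Keeping "unknown" as the explicit fallback is what makes the “unknown” tag count a useful health signal: a nonzero value means a failure appeared that no rule covers.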

Active Queue Stuck Tasks (>4h)

None of the 5 active items has had an mtime update in more than 4 hours:

Task | Age | Status
d337453b-vt-011-personas-persona-list | ~26h | Stuck with consecutive_stuck=2, backoff bucket 90min. QT coaching given twice but not executed.
02b0b4d0-address-crashloopbackoff-candidates | ~12h | Empty task body, only frontmatter
project-wiki-lint-daily | ~58h | Active project, last real progress May 29. Coaching notes present but no output in 24h.
wiki-cleanup-dupes-0527 | ~17h | Stalled
quartz-poc-crashloop-2026-05-25.stale | ~8d | Stale file, should be archived
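The stuck check is an mtime comparison against the 4-hour threshold. A minimal sketch follows, assuming active tasks are regular files in a single directory; the directory layout and both function names are hypothetical.

```python
import time
from pathlib import Path

STUCK_AFTER_HOURS = 4  # threshold used in this report


def list_mtimes(active_dir):
    """Return (file name, mtime in epoch seconds) for each regular file."""
    return [
        (path.name, path.stat().st_mtime)
        for path in Path(active_dir).iterdir()
        if path.is_file()
    ]


def stuck_items(entries, now, threshold_hours=STUCK_AFTER_HOURS):
    """Return (name, age in hours) for entries older than the threshold, oldest first."""
    stuck = []
    for name, mtime in entries:
        age_hours = (now - mtime) / 3600
        if age_hours > threshold_hours:
            stuck.append((name, round(age_hours, 1)))
    stuck.sort(key=lambda item: item[1], reverse=True)
    return stuck
```

A caller would combine the two as `stuck_items(list_mtimes(some_dir), time.time())`. Note that mtime only shows the file was not rewritten; a task that is progressing without touching its file would still be reported as stuck.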

Blocked Queue — Bloat Alert (19 items, 12 >7d old)

The blocked queue has grown to 19 items, exceeding the alert threshold of 10. Items older than 14 days (6 tasks):

Task | Age (days)
cluster-alert-kubepodnotready-62b52008 | 16.9d
hermes-04-tests-and-trust | 16.8d
legend-architecture-review-and-roadmap | 15.1d
legend-mattermost-keycloak-sso | 15.1d
legend-docs-qa-testing | 15.2d
f710f083 | 15.2d

These need triage: either unblock (if the blocking reason has been resolved) or archive with a note to pvs for manual review.
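The depth alert and the 14-day cut can be expressed as one small check. This is a sketch under assumptions: the input is a mapping of task name to age in days (how ages are derived is not shown in this report), and the function and constant names are hypothetical.

```python
BLOCKED_ALERT_DEPTH = 10  # alert threshold stated in this report
STALE_AFTER_DAYS = 14     # cut-off used for the triage list


def triage_blocked(ages_days, alert_depth=BLOCKED_ALERT_DEPTH, stale_days=STALE_AFTER_DAYS):
    """Summarize a blocked queue given {task name: age in days}."""
    stale = sorted(
        (name for name, age in ages_days.items() if age > stale_days),
        key=lambda name: ages_days[name],
        reverse=True,  # oldest first, so the longest-blocked items get triaged first
    )
    return {
        "depth": len(ages_days),
        "alert": len(ages_days) > alert_depth,
        "stale": stale,
    }
```

The function only lists candidates; whether to unblock or archive each one remains a manual decision.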

Handoff Loss Check

The previous handoff loss was recorded in metrics on May 27: orphans.txt was lost in the move from /tmp to the scripts dir. There is no fresh evidence today, but this remains a structural gap because temp files are not persisted across pod restarts.

Infra Pod Status (filesystem proxy — kubectl unavailable)

  • Autopilot workers: Active (latest worker ID 19094 on May 28)
  • Queue-metrics-exporter: Not directly verifiable without kubectl, but JSONL file exists and was writable until May 28
  • Qdrant: No OOMKilled evidence in task logs today

Action Items for pvs

  1. Blocked queue cleanup: 6 items >14 days old need manual triage (unblock or archive)
  2. Stub-only tick pattern: All ticks show duration_s: 0 and output_first_line: "test-tick". The emitter may be recording placeholder events instead of real work. Investigate the autopilot dispatcher logic.
  3. Active stuck tasks: 5/5 active items stuck >4h. The VT-011 visual test has received coaching that was never executed. Consider escalating or closing the stale coaching notes.