Queue Infrastructure Observability — Daily Snapshot 2026-05-30
Executive Summary
| Metric | Value | Healthy? |
|---|---|---|
| Throughput (tasks done/24h) | 26 | ✅ OK |
| DLQ signals (24h) | 6 (api:3, crashloop:2, timeout:1) | ⚠️ Moderate — API failures persistent |
| Stuck tasks (>4h no mtime update) | 5 | ❌ ALERT — all 5 active items stuck |
| Blocked queue depth | 19 | ❌ HIGH — alert condition (>10) |
| “unknown” failure tags | 0 | ✅ Classifier stable |
Queue Depth
| State | Count |
|---|---|
| Done | 1,849 |
| Active | 5 (all stuck >4h) |
| Blocked | 19 |
| Queued | 0 |
Tick Health (last 24h in metrics.jsonl)
No new tick events have been recorded since 2026-05-28T06:43, so the emitter appears to have been silent for ~37 hours. This is not unusual: the emitter has been intermittently down since Phase 0. Filesystem-based scanning (this report’s method) compensates for the gap.
Last 3 recorded ticks:
| Timestamp | Worker | Outcome | Task |
|---|---|---|---|
| 2026-05-29T13:21 | hermes-autopilot-19094 | success | VT-011 personas (4fd3d183) |
| 2026-05-28T06:43 | hermes-autopilot-18252 | success | project-wiki-lint-daily |
| 2026-05-28T02:30 | hermes-autopilot-16567 | ambiguous | VT-011 personas (4fd3d183) |
Note: All recent ticks have duration_s: 0 and output_first_line: "test-tick" — these are stub-only ticks from the autopilot dispatcher placeholder, not real work output. Actual progress is happening in filesystem-level queue processing (wiki-cleanup tasks, VT-011 visual tests) but may not be properly captured by the emitter.
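The stub check described in the note can be sketched roughly as follows. This is a minimal illustration, not the exporter's actual logic: the field names duration_s and output_first_line come from the note above, while the timestamp field name "ts" and both function names are assumptions made for the example.

```python
import json
from datetime import datetime


def is_stub_tick(event):
    """True when a tick looks like a dispatcher placeholder rather than real work."""
    return event.get("duration_s") == 0 and event.get("output_first_line") == "test-tick"


def summarize_ticks(lines, now):
    """Count stub ticks and report hours since the newest tick.

    `lines` is an iterable of JSONL strings; `now` is a naive datetime.
    The "ts" key is a hypothetical name for the tick timestamp field.
    """
    events = [json.loads(line) for line in lines if line.strip()]
    stubs = sum(1 for event in events if is_stub_tick(event))
    latest = max((datetime.fromisoformat(e["ts"]) for e in events), default=None)
    gap_h = None if latest is None else round((now - latest).total_seconds() / 3600, 1)
    return {"total": len(events), "stub": stubs, "hours_since_last": gap_h}
```

If every recent event counts as a stub, the emitter is recording placeholders only, and the hours_since_last value gives the silence window to compare against filesystem activity.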
DLQ Scan (last 24h done files — 26 tasks)
| Failure signature | Count |
|---|---|
| API connection failure | 3 |
| CrashLoopBackOff / OOMKilled / BackOff | 2 |
| Timeout | 1 |
| HTTP 500 | 0 ✅ (was 146 on May 25) |
| Tool missing (exit 127) | 0 |
| “unknown” tag | 0 ✅ (classifier still clean) |
DLQ Detail
- API failures (3): Three signals are counted, but only two tasks are identified: c1402cf2 and f7a5a4d7. Both are wiki-cleanup tasks where the worker called out to an external API. Likely a transient infrastructure issue; the pattern matches prior days and is not escalating.
- CrashLoopBackOff (2): Tasks 5d5ed1fc and efcf52a8 (“investigate-image-pull-failures”) — pod crashloop tasks. Expected given ongoing K8s cluster issues.
- Timeout (1): d6959ebe — “delete-jobs-running-too-long-without-progress” task timed out.
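A signature scan of this kind could look roughly like the sketch below. The tags api, crashloop, and timeout appear in the summary table; the tags http500 and tool_missing, all of the regular expressions, and the function names are assumptions for illustration and may not match the real classifier. The input is assumed to be text already identified as failure output.

```python
import re
from collections import Counter

# Ordered (tag, pattern) pairs; the first match wins.
# Patterns are illustrative guesses, not the production rules.
SIGNATURES = [
    ("crashloop", re.compile(r"CrashLoopBackOff|OOMKilled|BackOff")),
    ("http500", re.compile(r"\bHTTP 500\b")),
    ("tool_missing", re.compile(r"exit (code )?127\b")),
    ("timeout", re.compile(r"timed? ?out", re.IGNORECASE)),
    ("api", re.compile(r"API connection (failure|error)", re.IGNORECASE)),
]


def classify(text):
    """Return the first matching failure tag, or "unknown" if nothing matches."""
    for tag, pattern in SIGNATURES:
        if pattern.search(text):
            return tag
    return "unknown"


def tally(texts):
    """Count failure tags across a batch of failure excerpts."""
    return Counter(classify(text) for text in texts)
```

Keeping "unknown" as an explicit fallback tag is what makes a nonzero count there a useful signal that the pattern list needs extending.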
Active Queue Stuck Tasks (>4h)
All 5 active items have been unmoved for >4 hours:
| Task | Age | Status |
|---|---|---|
| d337453b-vt-011-personas-persona-list | ~26h | Stuck with consecutive_stuck=2, backoff bucket 90min. QT coaching given twice but not executed. |
| 02b0b4d0-address-crashloopbackoff-candidates | ~12h | Empty task body — only frontmatter |
| project-wiki-lint-daily | ~58h | Active project, last real progress May 29. Coaching notes present but no output in 24h. |
| wiki-cleanup-dupes-0527 | ~17h | Stalled |
| quartz-poc-crashloop-2026-05-25.stale | ~8d | Stale file, should be archived |
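The stuck-task check (no mtime update for more than 4 hours) can be approximated with a filesystem scan like the one below. It is a sketch under assumptions: the function name and the idea of one file per task in a single directory are hypothetical, and only the 4-hour threshold comes from this report.

```python
import time
from pathlib import Path

STUCK_AFTER_S = 4 * 3600  # ">4h no mtime update", the threshold used in this report


def find_stuck(active_dir, now=None, threshold_s=STUCK_AFTER_S):
    """Return (file name, age in hours) for files untouched longer than threshold_s.

    Results are sorted oldest first. `now` is a Unix timestamp and defaults
    to the current time; passing it explicitly makes the scan reproducible.
    """
    now = time.time() if now is None else now
    stuck = []
    for entry in Path(active_dir).iterdir():
        if not entry.is_file():
            continue  # skip subdirectories
        age_s = now - entry.stat().st_mtime
        if age_s > threshold_s:
            stuck.append((entry.name, round(age_s / 3600, 1)))
    return sorted(stuck, key=lambda item: item[1], reverse=True)
```

Note that mtime only reflects writes to the task file itself, so a task whose worker is busy but not touching the file would still be reported as stuck.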
Blocked Queue — Bloat Alert (19 items, 12 >7d old)
The blocked queue has accumulated to 19 items, exceeding the alert threshold of 10. Items older than 14 days (6 tasks):
| Task | Age (days) |
|---|---|
| cluster-alert-kubepodnotready-62b52008 | 16.9d |
| hermes-04-tests-and-trust | 16.8d |
| legend-architecture-review-and-roadmap | 15.1d |
| legend-mattermost-keycloak-sso | 15.1d |
| legend-docs-qa-testing | 15.2d |
| f710f083 | 15.2d |
These need triage: either unblock (if reason resolved) or archive with note to pvs for manual review.
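The two thresholds used in this section (depth above 10 raises the alert, age above 14 days marks an item for triage) can be combined in a small helper. This is a hedged sketch: the function name, the dictionary input shape, and the "recent-example-task" entry are hypothetical, while the other two sample ages are taken from the table above.

```python
def triage_blocked(ages_days, depth_alert=10, old_after_days=14.0):
    """Summarize a blocked queue.

    `ages_days` maps task name to age in days. Returns the queue depth,
    whether the depth exceeds the alert threshold, and the names of tasks
    older than `old_after_days`, oldest first.
    """
    stale = sorted(
        (name for name, age in ages_days.items() if age > old_after_days),
        key=lambda name: ages_days[name],
        reverse=True,
    )
    return {
        "depth": len(ages_days),
        "depth_alert": len(ages_days) > depth_alert,
        "stale": stale,
    }


if __name__ == "__main__":
    sample = {
        "cluster-alert-kubepodnotready-62b52008": 16.9,
        "f710f083": 15.2,
        "recent-example-task": 2.0,  # hypothetical entry for contrast
    }
    print(triage_blocked(sample))
```

Sorting the stale list oldest first keeps the manual review order aligned with how long each item has been waiting.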
Handoff Loss Check
The previous handoff loss was recorded in metrics on May 27: orphans.txt was lost in the handoff from /tmp to the scripts dir. There is no fresh evidence today, but this remains a structural gap because temp files are not persisted across pod restarts.
Infra Pod Status (filesystem proxy — kubectl unavailable)
- Autopilot workers: Active (latest worker ID 19094 on May 28)
- Queue-metrics-exporter: Not directly verifiable without kubectl, but JSONL file exists and was writable until May 28
- Qdrant: No OOMKilled evidence in task logs today
Action Items for pvs
- Blocked queue cleanup: 6 items >14 days old need manual triage (unblock or archive)
- Stub-only tick pattern: All ticks show duration_s: 0 and output_first_line: "test-tick". The emitter may be recording placeholder events instead of real work. Investigate the autopilot dispatcher logic.
- Active stuck tasks: 5/5 active items stuck >4h. The VT-011 visual test has received coaching but not execution; consider escalating or closing stale coaching notes.