Queue Infrastructure Observability — Daily Snapshot 2026-05-30

Executive Summary

Metric	Value	Healthy?
Throughput (tasks done/24h)	26	✅ OK
DLQ signals (24h)	6 (api:3, crashloop:2, timeout:1)	⚠️ Moderate — API failures persistent
Stuck tasks (>4h no mtime update)	5	❌ ALERT — all 5 active items stuck
Blocked queue depth	19	❌ HIGH — alert condition (>10)
“unknown” failure tags	0	✅ Classifier stable

Queue Depth

State	Count
Done	1,849
Active	5 (all stuck >4h)
Blocked	19
Queued	0

Tick Health (last 24h in metrics.jsonl)

No new tick events since 2026-05-28T06:43. The emitter appears silent for ~37 hours. This is not unusual — the emitter has been intermittently down post-Phase-0. Filesystem-based scanning (this report’s method) compensates for this gap.

Last 3 recorded ticks:

Timestamp	Worker	Outcome	Task
2026-05-29T13:21	hermes-autopilot-19094	success	VT-011 personas (4fd3d183)
2026-05-28T06:43	hermes-autopilot-18252	success	project-wiki-lint-daily
2026-05-28T02:30	hermes-autopilot-16567	ambiguous	VT-011 personas (4fd3d183)

Note: All recent ticks have duration_s: 0 and output_first_line: "test-tick" — these are stub-only ticks from the autopilot dispatcher placeholder, not real work output. Actual progress is happening in filesystem-level queue processing (wiki-cleanup tasks, VT-011 visual tests) but may not be properly captured by the emitter.
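The stub-tick check described above can be sketched as follows. This is a minimal illustration, not the emitter's actual code: the field names duration_s and output_first_line come from this report, while the "ts" timestamp field and the function names are assumptions.

```python
import json
from datetime import datetime


def is_stub(tick):
    """A tick counts as a stub if it matches the placeholder pattern in this report."""
    return tick.get("duration_s") == 0 and tick.get("output_first_line") == "test-tick"


def tick_health(jsonl_lines, now):
    """Summarize tick events: total, stub count, hours since the newest tick.

    Assumes each JSON object carries an ISO timestamp under "ts" (hypothetical name).
    """
    ticks = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or corrupt lines
        if isinstance(obj, dict):
            ticks.append(obj)

    stamps = [datetime.fromisoformat(t["ts"]) for t in ticks if "ts" in t]
    silent_hours = None
    if stamps:
        silent_hours = (now - max(stamps)).total_seconds() / 3600

    return {
        "total": len(ticks),
        "stub": sum(is_stub(t) for t in ticks),
        "silent_hours": silent_hours,
    }
```

A stub count equal to the total would match the pattern noted above, where every recent tick is a placeholder.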

DLQ Scan (last 24h done files — 26 tasks)

Failure signature	Count
API connection failure	3
CrashLoopBackOff / OOMKilled / BackOff	2
Timeout	1
HTTP 500	0 ✅ (was 146 on May 25)
Tool missing (exit 127)	0
“unknown” tag	0 ✅ (classifier still clean)

DLQ Detail

API failures (3): Only two task IDs are recorded for the three signals, c1402cf2 and f7a5a4d7. Both are wiki-cleanup tasks where the worker called out to an external API. Likely transient infra. Pattern: same as prior days, not escalating.
CrashLoopBackOff (2): Tasks 5d5ed1fc and efcf52a8 (“investigate-image-pull-failures”), both pod crashloop tasks. Expected given ongoing K8s cluster issues.
Timeout (1): d6959ebe — “delete-jobs-running-too-long-without-progress” task timed out.
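A signature scan like the one summarized above might look like the sketch below. The tags api, crashloop, and timeout mirror the DLQ summary; the regular expressions, the other tag names, and the function names are assumptions for illustration, not the real classifier's rules.

```python
import re
from collections import Counter

# Ordered list: the first matching signature wins. Patterns are illustrative guesses.
SIGNATURES = [
    ("crashloop", re.compile(r"CrashLoopBackOff|OOMKilled|BackOff")),
    ("http_500", re.compile(r"HTTP 500")),
    ("tool_missing", re.compile(r"exit (?:code |status )?127|command not found")),
    ("timeout", re.compile(r"timed out|timeout", re.IGNORECASE)),
    ("api", re.compile(r"API connection (?:failure|error)", re.IGNORECASE)),
]


def classify(text):
    """Return the first matching failure tag, or None if no signature is present."""
    for tag, pattern in SIGNATURES:
        if pattern.search(text):
            return tag
    return None


def count_signals(texts):
    """Count failure tags across the bodies of done-task files."""
    counts = Counter()
    for text in texts:
        tag = classify(text)
        if tag is not None:
            counts[tag] += 1
    return counts
```

Because the first match wins, a task body that mentions both a crashloop and a timeout is counted once, under crashloop. Whether the real classifier resolves overlaps this way is not stated in this report.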

Active Queue Stuck Tasks (>4h)

All 5 active items have been unmoved for >4 hours:

Task	Age	Status
d337453b-vt-011-personas-persona-list	~26h	Stuck with consecutive_stuck=2, backoff bucket 90min. QT coaching given twice but not executed.
02b0b4d0-address-crashloopbackoff-candidates	~12h	Empty task body — only frontmatter
project-wiki-lint-daily	~58h	Active project, last real progress May 29. Coaching notes present but no output in 24h.
wiki-cleanup-dupes-0527	~17h	Stalled
quartz-poc-crashloop-2026-05-25.stale	~8d	Stale file, should be archived
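The ">4h no mtime update" rule used for this table can be sketched as a small filesystem scan. The directory layout and function name are assumptions; only the 4-hour threshold and the use of file mtime come from this report.

```python
import time
from pathlib import Path

STUCK_AFTER_S = 4 * 3600  # threshold from this report: more than 4h without an mtime update


def find_stuck(active_dir, now=None):
    """Return (file name, age in hours) for files whose mtime is older than the threshold.

    active_dir is a hypothetical path to the active-queue directory.
    """
    now = time.time() if now is None else now
    stuck = []
    for path in sorted(Path(active_dir).iterdir()):
        if not path.is_file():
            continue
        age_s = now - path.stat().st_mtime
        if age_s > STUCK_AFTER_S:
            stuck.append((path.name, round(age_s / 3600, 1)))
    return stuck
```

Note that mtime only shows that a file was not rewritten. A task that makes progress without touching its queue file would still be flagged by this kind of scan.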

Blocked Queue — Bloat Alert (19 items, 12 >7d old)

The blocked queue has grown to 19 items, exceeding the alert threshold of 10. Twelve items are older than 7 days; the 6 older than 14 days are:

Task	Age (days)
cluster-alert-kubepodnotready-62b52008	16.9d
hermes-04-tests-and-trust	16.8d
legend-architecture-review-and-roadmap	15.1d
legend-mattermost-keycloak-sso	15.1d
legend-docs-qa-testing	15.2d
f710f083	15.2d

These need triage: either unblock (if reason resolved) or archive with note to pvs for manual review.
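The depth alert and the 14-day triage cut can be expressed as a short check. This is a sketch under stated assumptions: the thresholds (depth above 10, age above 14 days) are taken from this report, while the function name and the input shape (a mapping of task name to age in days) are hypothetical.

```python
def triage_blocked(ages_days, depth_alert=10, review_after_days=14):
    """Flag queue bloat and list the tasks old enough to need manual triage.

    ages_days maps task name -> age in days (hypothetical input shape).
    """
    needs_triage = sorted(
        ((name, age) for name, age in ages_days.items() if age > review_after_days),
        key=lambda item: -item[1],  # oldest first
    )
    return {
        "depth": len(ages_days),
        "alert": len(ages_days) > depth_alert,
        "needs_triage": needs_triage,
    }


# Usage with the six oldest items listed above (the other 13 blocked items are omitted,
# so the depth alert does not fire in this example).
report = triage_blocked({
    "cluster-alert-kubepodnotready-62b52008": 16.9,
    "hermes-04-tests-and-trust": 16.8,
    "legend-architecture-review-and-roadmap": 15.1,
    "legend-mattermost-keycloak-sso": 15.1,
    "legend-docs-qa-testing": 15.2,
    "f710f083": 15.2,
})
```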

Handoff Loss Check

The previous handoff loss was recorded in metrics on May 27: orphans.txt was lost in the handoff from /tmp to the scripts dir. There is no fresh evidence today, but this remains a structural gap because temp files are not persisted across pod restarts.
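One way to narrow this gap is to copy a temp artifact to persistent storage and verify it before deleting the source. The sketch below is an assumption about how such a handoff could be made safer, not a description of the current scripts; the function names and the destination directory are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path):
    """Hash a file's contents so source and copy can be compared."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def handoff(src, dest_dir):
    """Copy src into dest_dir, verify the copy, then remove the source.

    dest_dir should be on storage that survives pod restarts (hypothetical path).
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name

    shutil.copy2(src, dest)  # copies contents and metadata
    if _sha256(src) != _sha256(dest):
        raise OSError(f"handoff verification failed for {src}")

    src.unlink()  # delete the temp file only after the copy is confirmed
    return dest
```

Copying first and deleting last means an interruption leaves the file in at least one location, which a plain move across filesystems may not guarantee.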

Infra Pod Status (filesystem proxy — kubectl unavailable)

Autopilot workers: Active (latest worker ID 19094 on May 28)
Queue-metrics-exporter: Not directly verifiable without kubectl, but JSONL file exists and was writable until May 28
Qdrant: No OOMKilled evidence in task logs today

Action Items for pvs

Blocked queue cleanup: 6 items >14 days old need manual triage (unblock or archive)
Stub-only tick pattern: All recent ticks show duration_s: 0 and output_first_line: "test-tick", so the emitter may be recording placeholder events instead of real work. Investigate autopilot dispatcher logic.
Active stuck tasks: 5/5 active items stuck >4h. The VT-011 visual test has received coaching but not execution — consider escalating or closing stale coaching notes.
