Daily Bill Scan — Review (2026-06-22)

What it is

Script /opt/data/bin/bill-scanner.py runs daily via autopilot.
Scans IMAP inbox (INBOX, [Gmail]/All Mail) for emails matching bill keywords (bill, invoice, payment, etc.).
Downloads attachments, converts PDFs to images via PyMuPDF, sends images to inference server (.106:8080) for OCR extraction of vendor/amount/due-date.
Results logged to /opt/data/bills/scan-log.jsonl, OCR text saved under /opt/data/bills/processed/.

Running — autopilot scans every ~10 min, processes on --process-attachments flag.
Last run 2026-06-10: 5 bill emails found, 0 new attachments (all duplicates). Pipeline healthy.
Active open items: Unitywater $493.71 d u e 26 J u n; L a w n f ee d C o .$ 125 new vendor (needs verification).
PyMuPDF fixed in venv (/opt/data/.venv/) — shebang corrected to #!/opt/data/.venv/bin/python3.
~20 sessions logged since May 2026, mostly “no new bills” cycles.

No deduplication — scanner re-scans all emails every tick; no record of already-processed message IDs in a durable store. Duplicate detection is implicit (checking if attachments dir already has files for msg_id).
No alerting/notification — extracted bills just sit in /opt/data/bills/processed/. pvs never gets notified unless autopilot logs are manually checked.
No due-date tracking — bills with upcoming deadlines (Unitywater 26 Jun) not fed into any task/calendar system. Just noted in index.md.
Test/dummy data contamination — PowerCo NZ test files and Superloop “test_421” still on disk from May 5 run. Not cleaned up.
Single keyword list — KEYWORDS list is short; bills with creative subject lines could slip through (e.g., “statement”, “remittance”, “utility”).
No structured output schema — OCR results are free-form text blobs. No consistent JSON structure for downstream automation.
Dependent on GPU node — .106:8080 is the single point of failure for OCR. No fallback.

Keep the existing scanner as-is (it works). Layer improvements without rewriting.
Add a processed-messages store (/opt/data/bills/scanned_ids.json) to skip re-download.
Route structured bill data into Mercury tasks so pvs gets notified on due dates.
Clean up test data once and document what to keep vs delete.

Phase 1 — hygiene (immediate):

Phase 2 — notification (short term):

Parse OCR output into structured JSON (vendor, amount, due date, bill number).
Create Mercury task when a new bill is detected or a due date is approaching within 3 days.
Add a weekly summary email to pvs.

Phase 3 — calendar integration (medium term):