Daily Bill Scan — Review (2026-06-22)
What it is
- Script /opt/data/bin/bill-scanner.py runs daily via autopilot.
- Scans IMAP inbox (INBOX, [Gmail]/All Mail) for emails matching bill keywords (bill, invoice, payment, etc.).
- Downloads attachments, converts PDFs to images via PyMuPDF, sends images to inference server (.106:8080) for OCR extraction of vendor/amount/due-date.
- Results logged to /opt/data/bills/scan-log.jsonl, OCR text saved under /opt/data/bills/processed/.
Current state
- Running: autopilot scans every ~10 min and processes attachments when the --process-attachments flag is set.
- Last run 2026-06-10: 5 bill emails found, 0 new attachments (all duplicates). Pipeline healthy.
- Active open items: Unitywater 125 new vendor (needs verification).
- PyMuPDF fixed in venv (/opt/data/.venv/): shebang corrected to #!/opt/data/.venv/bin/python3.
- ~20 sessions logged since May 2026, mostly “no new bills” cycles.
Gaps / risks
- No durable deduplication: scanner re-scans all emails every tick; there is no record of already-processed message IDs in a durable store. Duplicate detection is implicit (checking whether the attachments dir already has files for msg_id).
- No alerting/notification: extracted bills just sit in /opt/data/bills/processed/. pvs never gets notified unless autopilot logs are manually checked.
- No due-date tracking: bills with upcoming deadlines (Unitywater 26 Jun) are not fed into any task/calendar system. Just noted in index.md.
- Test/dummy data contamination: PowerCo NZ test files and Superloop “test_421” are still on disk from the May 5 run. Not cleaned up.
- Single keyword list: KEYWORDS list is short; bills with creative subject lines could slip through (e.g., “statement”, “remittance”, “utility”).
- No structured output schema: OCR results are free-form text blobs. No consistent JSON structure for downstream automation.
- Dependent on GPU node: .106:8080 is the single point of failure for OCR. No fallback.
Recommended approach
- Keep the existing scanner as-is (it works). Layer improvements without rewriting.
- Add a processed-messages store (/opt/data/bills/scanned_ids.json) to skip re-download.
- Route structured bill data into Mercury tasks so pvs gets notified on due dates.
- Clean up test data once and document what to keep vs delete.
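
A minimal sketch of the processed-messages store, assuming the file holds a flat JSON list of message-ID strings. The path comes from the recommendation above; the function names (load_seen, save_seen, filter_new) and the file layout are hypothetical, not taken from bill-scanner.py.

```python
import json
import os
import tempfile
from pathlib import Path

# Path from the recommendation above. The layout (a flat JSON list of
# message-ID strings) is an assumption for illustration.
STORE_PATH = Path("/opt/data/bills/scanned_ids.json")


def load_seen(path: Path = STORE_PATH) -> set[str]:
    """Return the set of already-processed message IDs (empty if no store yet)."""
    if not path.exists():
        return set()
    with path.open("r", encoding="utf-8") as fh:
        return set(json.load(fh))


def save_seen(seen: set[str], path: Path = STORE_PATH) -> None:
    """Write the ID set via a temp file and rename, so a crash mid-write
    cannot leave a half-written store behind."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(sorted(seen), fh, indent=2)
    os.replace(tmp_name, path)


def filter_new(message_ids: list[str], seen: set[str]) -> list[str]:
    """Keep only IDs that have not been processed before, preserving order."""
    return [mid for mid in message_ids if mid not in seen]
```

On each tick the scanner would call load_seen(), download only what filter_new() returns, add those IDs to the set, and call save_seen() once at the end. Writing through a temp file plus os.replace keeps the store readable even if the process is killed mid-write.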
Phased plan
Phase 1 — hygiene (immediate):
- Add processed ID tracking to avoid redundant downloads.
- Delete confirmed test/dummy files (PowerCo NZ, Superloop test_421).
- Expand keyword list with common billing terms.
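
The keyword expansion can be illustrated with a small sketch. The three base terms are the ones named in this review; the rest of the real KEYWORDS list is not shown here, so the tuple below is a partial stand-in. subject_matches is a hypothetical helper, and the scanner's actual matching logic may differ.

```python
# Base terms named in this review. The full KEYWORDS list in the script is
# not reproduced here, so treat this tuple as a partial stand-in.
KEYWORDS = ("bill", "invoice", "payment")

# Candidate additions taken from the "Gaps / risks" section.
EXTRA_KEYWORDS = ("statement", "remittance", "utility")


def subject_matches(subject: str, keywords: tuple[str, ...]) -> bool:
    """Case-insensitive substring match of any keyword against a subject line."""
    lowered = subject.lower()
    return any(kw in lowered for kw in keywords)


# A subject the short list misses but the expanded list catches.
subject = "Your June statement is ready"
missed = not subject_matches(subject, KEYWORDS)
caught = subject_matches(subject, KEYWORDS + EXTRA_KEYWORDS)
```

Substring matching is deliberately loose, so a broader list will also raise the false-positive rate (for example "bill" matches "billboard"). The OCR step then has to reject non-bills.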
Phase 2 — notification (short term):
- Parse OCR output into structured JSON (vendor, amount, due date, bill number).
- Create a Mercury task when a new bill is detected or a due date is within 3 days.
- Add a weekly summary email to pvs.
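
A sketch of what the structured record and the 3-day check might look like. The field names follow the Phase 2 list; the types, the BillRecord class, and is_due_soon are assumptions for illustration. Mercury task creation is left out because its interface is not described in this review.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date, timedelta
from typing import Optional


@dataclass
class BillRecord:
    # Field names follow the Phase 2 list; the types are an assumption.
    vendor: str
    amount: Optional[float]
    due_date: Optional[date]
    bill_number: Optional[str]

    def to_json(self) -> str:
        """Serialize to one JSON object, with the due date as ISO 8601 text."""
        data = asdict(self)
        data["due_date"] = self.due_date.isoformat() if self.due_date else None
        return json.dumps(data)


def is_due_soon(bill: BillRecord, today: date, window_days: int = 3) -> bool:
    """True when the due date falls between today and today + window_days."""
    if bill.due_date is None:
        return False
    return today <= bill.due_date <= today + timedelta(days=window_days)


# Example using the Unitywater due date from this review (year assumed 2026).
bill = BillRecord("Unitywater", None, date(2026, 6, 26), None)
notify = is_due_soon(bill, today=date(2026, 6, 23))
```

Optional fields let a record be stored even when OCR fails to extract a value, which keeps partial results visible instead of dropping them. Overdue bills are excluded by this check and would need a separate rule.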
Phase 3 — calendar integration (medium term):
- Export due dates as iCal/VTODO or feed into Google Calendar.
- Auto-recurring bills: detect patterns, auto-create next-period tasks.
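
For the iCal export, a minimal sketch of emitting one VTODO per bill using only the standard library. The property names (UID, DTSTAMP, DUE, SUMMARY) come from the iCalendar format (RFC 5545); the PRODID value, the UID scheme, and the function names are hypothetical. Line folding for long lines and recurrence rules are omitted.

```python
from datetime import date, datetime, timezone


def escape_text(value: str) -> str:
    """Escape characters that are special in iCalendar TEXT values."""
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def bill_to_vtodo(uid: str, vendor: str, due: date, stamp: datetime) -> str:
    """Build a single-VTODO calendar; `stamp` must be timezone-aware."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//bill-scanner//EN",  # hypothetical product identifier
        "BEGIN:VTODO",
        f"UID:{uid}",
        f"DTSTAMP:{stamp.astimezone(timezone.utc):%Y%m%dT%H%M%SZ}",
        f"DUE;VALUE=DATE:{due:%Y%m%d}",
        f"SUMMARY:{escape_text('Pay ' + vendor + ' bill')}",
        "END:VTODO",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"
```

A stable UID per bill (for example derived from the email message ID) would let a calendar client update an existing task instead of creating a duplicate when the export is re-run.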