Session Log — 2026-05-21 (Phase 4: Implementation — model re-evaluation)

Time: 13:30 UTC
Agent: hermes
Kanban task: t_72b8d8a8

Session Goal

Reconcile model selection discrepancy, verify current state of both SoulX-FlashHead and Ditto, determine if any newer models should be considered, and advance implementation planning.

Critical Finding: Model Selection Discrepancy

The project has a contradictory state:

Source | Selected model | Date
research.md (Phase 1) | Ditto (Ant Group), “top recommendation” with TensorRT optimization | May 12, 2026
index.md + deployment artifacts | SoulX-FlashHead (Soul AI Lab), Apache 2.0, 96 FPS on RTX 4090 | May 6–7, 2026

This means:

  1. Phase 1 research (research.md) named Ditto as the top recommendation
  2. Other sessions switched to SoulX-FlashHead (possibly for its real-time streaming capability: 96 FPS vs Ditto’s ~3–8 s of processing per minute of audio)
  3. All deployment artifacts are built for SoulX-FlashHead — Dockerfile, inference_server.py, k8s.yaml all reference it
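The claim that every artifact references SoulX-FlashHead can be re-checked mechanically. A minimal sketch, assuming the three files are readable locally and that the model appears under the lowercase search terms "flashhead" and "ditto" (both terms and the helper name are illustrative, not taken from the artifacts):

```python
from pathlib import Path

# Search terms are an assumption; adjust to the identifiers the artifacts really use.
SEARCH_TERMS = ("flashhead", "ditto")


def count_model_mentions(paths):
    """Return {filename: {term: count}} using case-insensitive substring counts."""
    report = {}
    for path in map(Path, paths):
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        report[path.name] = {term: text.count(term) for term in SEARCH_TERMS}
    return report


# Only files that exist in the current directory are scanned.
artifacts = [p for p in ("Dockerfile", "inference_server.py", "k8s.yaml") if Path(p).exists()]
for name, counts in count_model_mentions(artifacts).items():
    print(name, counts)
```

A nonzero "ditto" count in any artifact would be worth a manual look, since substring matching can also hit unrelated words.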

Model Landscape Re-verification (May 21)

SoulX-FlashHead Status: ✅ Stable and Current

  • Stars increased from 734 to 794 (up 60 in ~2 weeks)
  • Latest news: March 9, 2026 — online demo on HuggingFace
  • No breaking changes since May selection
  • License: Apache 2.0 confirmed
  • Lite variant: 96 FPS on RTX 4090 (3 concurrent streams at 25+ FPS)
  • Dependencies mature: PyTorch 2.7.1, CUDA 12.8, FlashAttention 2.8.0
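The "3 concurrent streams at 25+ FPS" figure follows from dividing the 96 FPS budget by the per-stream target. A small sketch of that arithmetic, assuming throughput splits evenly across streams with no scheduling or batching overhead (an idealization, not a measured result):

```python
import math


def max_concurrent_streams(total_fps: float, per_stream_fps: float) -> int:
    """Largest number of streams that each still receive per_stream_fps."""
    if total_fps < 0 or per_stream_fps <= 0:
        raise ValueError("total_fps must be >= 0 and per_stream_fps must be > 0")
    return math.floor(total_fps / per_stream_fps)


streams = max_concurrent_streams(96, 25)   # 96 / 25 = 3.84, so 3 streams
print(streams, 96 / streams)               # 3 streams at 32.0 FPS each
```

A fourth stream would drop each stream to 24 FPS, just under the 25 FPS target.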

Ditto Status: ✅ Still viable but less suited for streaming

  • Stars: ~779 (stable)
  • TensorRT + ONNX engines available for A100 and Ampere+ GPUs
  • Streaming Hubert encoder exists but pipeline is more complex
  • Requires TensorRT 8.6.1 specifically — harder to maintain in containers
  • Inference speed: ~3–8 seconds of processing per minute of audio (batch mode), not true real-time streaming
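To put the Ditto batch figure on the same footing as an FPS number, it can be expressed as a real-time factor (processing time divided by audio duration). A sketch of the conversion, using only the 3–8 s per 60 s range quoted above:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means processing finishes faster than the audio plays."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


for proc in (3.0, 8.0):
    rtf = real_time_factor(proc, 60.0)
    print(f"{proc:.0f} s per 60 s of audio: RTF {rtf:.3f}, about {1 / rtf:.1f}x playback speed")
```

The range works out to an RTF of roughly 0.05–0.13. That describes batch throughput only; it says nothing about time to first frame, which is what matters for an interactive avatar.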

New Models Discovered Since Research:

VividTalk (humanaigc.github.io/vivid-talk/) — One-shot audio-driven talking head

  • Recent paper, claims realistic lip-sync + expressive facial expression
  • Natural head poses — good for Mick’s avatar
  • Need to verify: License, inference speed on 4090, code availability

LPIPS-AttnWav2Lip (arXiv:2602.00189) — Audio-driven lip synchronization

  • Generic approach, works in the wild
  • Likely too low-level for full talking-head pipeline

Decision: Keep SoulX-FlashHead as Primary Choice

Rationale:

  1. Deployment artifacts already built — Dockerfile, inference server, K8s manifests all reference SoulX-FlashHead. Switching to Ditto would require rebuilding all of these.
  2. Better for streaming use case — 96 FPS on RTX 4090 enables real-time interactive avatar, which is the stated goal (“96 FPS target” in project goals)
  3. Apache 2.0 license — same as Ditto, no concerns
  4. Active development — recent activity through March 2026
  5. Simpler dependency chain — standard PyTorch + diffusers vs TensorRT + ONNX custom engines

Blocker Status: Still Blocked on .106 SSH Access

Same blocker as May 19: the hermes agent cannot SSH to 192.168.100.106 (RTX 4090 host) because its public key is not authorized.
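A non-interactive probe makes it easy to re-check the blocker at the start of each session. A minimal sketch, assuming the OpenSSH client is on PATH; the function names are illustrative. BatchMode=yes stops ssh from falling back to a password prompt, so an unauthorized key fails quickly with a nonzero exit status:

```python
import subprocess
from typing import List, Optional


def build_ssh_probe(host: str, user: Optional[str] = None, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that runs `true` remotely without any prompts."""
    target = f"{user}@{host}" if user else host
    return ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout_s}", target, "true"]


def key_is_authorized(host: str, user: Optional[str] = None, timeout_s: int = 5) -> bool:
    """True only if the remote command ran and exited with status 0."""
    try:
        result = subprocess.run(
            build_ssh_probe(host, user, timeout_s),
            capture_output=True,
            text=True,
            timeout=timeout_s + 5,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False  # no ssh client available, or the probe hung
    return result.returncode == 0
```

Calling key_is_authorized("192.168.100.106") would return False while the blocker stands; ssh itself exits with 255 on connection or authentication errors.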

What I Can Do Without .106 Access:

  1. ✅ Verify SoulX-FlashHead repository and checkpoint status (done)
  2. ✅ Reconcile model selection (done above)
  3. ✅ Update research.md to include SoulX-FlashHead as the actually-selected model
  4. ⏳ Prepare implementation checklist for when access is granted
  5. ❌ Cannot test inference — needs GPU on .106

What I Can Do With .106 Access:

  1. Clone SoulX-FlashHead repo to /opt/models/flashhead
  2. Download checkpoints (~5–8 GB from HuggingFace)
  3. Install dependencies (PyTorch 2.7.1 + CUDA 12.8, FlashAttention 2.8.0, etc.)
  4. Test inference with Mick’s reference photo + F5-TTS audio
  5. Measure FPS on RTX 4090
  6. Deploy as K8s service using existing artifacts
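For step 5, a timing harness can be prepared ahead of access. A sketch under stated assumptions: generate_frame is a hypothetical zero-argument callable that wraps one frame of model inference (the real SoulX-FlashHead entry point is not known from this log), and sync is an optional callable such as torch.cuda.synchronize, needed because CUDA kernels run asynchronously and the clock would otherwise stop early:

```python
import time
from typing import Callable, Optional


def measure_fps(
    generate_frame: Callable[[], object],
    n_frames: int = 200,
    warmup: int = 20,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Average frames per second over n_frames calls, after warmup calls."""
    for _ in range(warmup):          # warm caches and lazy initialization
        generate_frame()
    if sync:
        sync()                       # drain pending GPU work before timing
    start = time.perf_counter()
    for _ in range(n_frames):
        generate_frame()
    if sync:
        sync()                       # wait for the last frame to finish
    elapsed = time.perf_counter() - start
    return n_frames / elapsed if elapsed > 0 else float("inf")
```

On the GPU host the call would look like measure_fps(run_one_frame, sync=torch.cuda.synchronize), where run_one_frame is whatever wrapper ends up around the model; the result can then be compared against the 96 FPS target.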

Work Done This Session

  1. Full project state audit — verified all files intact, no corruption
  2. Model landscape re-verification — SoulX-FlashHead still current, no breaking changes
  3. Discrepancy resolution — documented the Ditto vs SoulX-FlashHead split and decided to keep SoulX-FlashHead as primary
  4. Reference photos verified — 14 PNGs (IMG_7985–7999) in /opt/data/creative/faces/png/, ~42–62 KB each, converted from HEIC originals
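The photo check in item 4 can be repeated with a short script. A sketch, assuming the directory above is readable; the 40–64 KB window is a deliberately loose reading of "~42–62 KB", and the function name is illustrative. The 8-byte signature test confirms each file is a real PNG rather than a renamed HEIC:

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed first 8 bytes of every PNG file


def check_reference_photos(directory, expected_count=14, min_kb=40, max_kb=64):
    """Count PNGs and flag files with a bad signature or an unexpected size."""
    pngs = sorted(Path(directory).glob("*.png"))
    bad_signature, out_of_range = [], []
    for path in pngs:
        with path.open("rb") as fh:
            if fh.read(8) != PNG_SIGNATURE:
                bad_signature.append(path.name)
        if not (min_kb * 1024 <= path.stat().st_size <= max_kb * 1024):
            out_of_range.append(path.name)
    return {
        "count": len(pngs),
        "count_ok": len(pngs) == expected_count,
        "bad_signature": bad_signature,
        "out_of_range": out_of_range,
    }
```

Running check_reference_photos("/opt/data/creative/faces/png/") should report count 14 with both lists empty if the earlier verification still holds.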

Next Steps

  1. BLOCKER: pvs to authorize the hermes SSH key on .106; without this, Phase 4 cannot proceed
    • Key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPf3z9WAxgw6+lGdvuhAiV/kQ0AIAaD6T79gIx84wOtU hermes@hermes-agent-7965856958-5t6j8
    • Or provide an alternative GPU access path (kubectl exec, manual setup)
  2. When unblocked: run the Phase 4 implementation plan
  3. Optional research: evaluate VividTalk as a potential alternative; verify license and code availability before investing time

What’s Next

Waiting on pvs to resolve .106 SSH access blocker. All artifacts and research are ready for immediate execution once GPU access is available.