Session Log — 2026-05-21 (Phase 4: Implementation — model re-evaluation)

Time: 13:30 UTC
Agent: hermes
Kanban task: t_72b8d8a8

Session Goal

Reconcile model selection discrepancy, verify current state of both SoulX-FlashHead and Ditto, determine if any newer models should be considered, and advance implementation planning.

Critical Finding: Model Selection Discrepancy

The project documents disagree on which model was selected:

Source	Selected Model	Date
`research.md` (Phase 1)	Ditto (Ant Group) — “top recommendation” with TensorRT optimization	May 12, 2026
`index.md` + deployment artifacts	SoulX-FlashHead (Soul AI Lab) — Apache 2.0, 96 FPS on RTX 4090	May 6–7, 2026

This means:

Phase 1 research correctly identified Ditto as the best option
Subsequent sessions switched to SoulX-FlashHead (possibly for real-time streaming capability: 96 FPS vs Ditto’s ~3-8 s of processing per minute of audio)
All deployment artifacts are built for SoulX-FlashHead — Dockerfile, inference_server.py, k8s.yaml all reference it

Model Landscape Re-verification (May 21)

SoulX-FlashHead Status: ✅ Stable and Current

Stars increased from 734 → 794 (up 60 in ~2 weeks)
Latest news: March 9, 2026 — online demo on HuggingFace
No breaking changes since the model was selected (May 6–7)
License: Apache 2.0 confirmed
Lite variant: 96 FPS on RTX 4090 (3 concurrent streams at 25+ FPS)
Dependencies mature: PyTorch 2.7.1, CUDA 12.8, FlashAttention 2.8.0
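The "3 concurrent streams at 25+ FPS" figure can be sanity-checked with simple arithmetic. The sketch below assumes the 96 FPS aggregate throughput divides evenly across streams, which ignores batching and scheduling overhead; the function name is illustrative and not part of SoulX-FlashHead.

```python
def max_concurrent_streams(total_fps: float, per_stream_fps: float) -> int:
    """Streams that can each sustain per_stream_fps from a shared FPS budget.

    Assumption: throughput splits evenly across streams with no overhead.
    """
    if per_stream_fps <= 0:
        raise ValueError("per_stream_fps must be positive")
    return int(total_fps // per_stream_fps)


# Figures quoted in the log: 96 FPS aggregate, 25 FPS needed per stream.
streams = max_concurrent_streams(96, 25)
per_stream = 96 / streams          # FPS available to each stream
print(streams, per_stream)         # 3 32.0
```

Under that even-split assumption each of the 3 streams gets about 32 FPS, which is consistent with the "25+ FPS" claim; a fourth stream would drop to 24 FPS.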

Ditto Status: ✅ Still viable but less suited for streaming

Stars: ~779 (stable)
TensorRT + ONNX engines available for A100 and Ampere+ GPUs
Streaming Hubert encoder exists but pipeline is more complex
Requires TensorRT 8.6.1 specifically — harder to maintain in containers
Inference speed: ~3-8 seconds of processing per minute of audio (batch mode), which is fast offline throughput but not true real-time streaming
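Read as a real-time factor (processing time divided by audio duration), the quoted 3-8 seconds per minute works out as below. The result is well under 1, so the "not true real-time" caveat presumably concerns batch-style delivery and latency rather than raw throughput; that reading is an assumption, not something the log states.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; values below 1.0 are faster than playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Range quoted in the log: 3 to 8 seconds of compute per 60 seconds of audio.
rtf_low = real_time_factor(3, 60)
rtf_high = real_time_factor(8, 60)
print(f"RTF range: {rtf_low:.3f} to {rtf_high:.3f}")  # RTF range: 0.050 to 0.133
```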

New Models Discovered Since Research:

VividTalk (humanaigc.github.io/vivid-talk/) — One-shot audio-driven talking head

Recent paper, claims realistic lip-sync + expressive facial expression
Natural head poses — good for Mick’s avatar
Need to verify: License, inference speed on 4090, code availability

LPIPS-AttnWav2Lip (ArXiv 2602.00189) — Audio-driven lip synchronization

Generic approach, works in the wild
Likely too low-level for full talking-head pipeline

Decision: Keep SoulX-FlashHead as Primary Choice

Rationale:

Deployment artifacts already built — Dockerfile, inference server, K8s manifests all reference SoulX-FlashHead. Switching to Ditto would require rebuilding all of these.
Better for streaming use case — 96 FPS on RTX 4090 enables real-time interactive avatar, which is the stated goal (“96 FPS target” in project goals)
Apache 2.0 license — same as Ditto, no concerns
Active development — recent activity through March 2026
Simpler dependency chain — standard PyTorch + diffusers vs TensorRT + ONNX custom engines

Blocker Status: Still Blocked on .106 SSH Access

Same blocker as May 19: the Hermes agent cannot SSH to 192.168.100.106 (RTX 4090 host). Its public key is not authorized on the host.
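A non-interactive probe is one way to re-check the blocker without hanging on a password prompt. This is a minimal sketch: the remote user name `hermes` is a guess (the log does not name the account), and a non-zero exit can also mean a network or host-key problem rather than a missing key.

```python
import subprocess


def build_ssh_probe(host: str, user: str, timeout_s: int = 5) -> list[str]:
    """Build an ssh command that fails fast instead of prompting for input."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # never prompt for a password
        "-o", f"ConnectTimeout={timeout_s}",   # give up quickly if unreachable
        f"{user}@{host}",
        "true",                                # no-op remote command
    ]


def ssh_login_works(host: str, user: str) -> bool:
    """Return True if a key-based login succeeds (ssh exits with status 0)."""
    result = subprocess.run(build_ssh_probe(host, user), capture_output=True, text=True)
    return result.returncode == 0
```

For example, `ssh_login_works("192.168.100.106", "hermes")` would return `False` while the key remains unauthorized.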

What I Can Do Without .106 Access:

✅ Verify SoulX-FlashHead repository and checkpoint status (done)
✅ Reconcile model selection (done above)
✅ Update research.md to include SoulX-FlashHead as the actually-selected model
⏳ Prepare implementation checklist for when access is granted
❌ Cannot test inference — needs GPU on .106

What I Can Do With .106 Access:

Clone SoulX-FlashHead repo to /opt/models/flashhead
Download checkpoints (~5–8 GB from HuggingFace)
Install dependencies (PyTorch 2.7.1 + CUDA 12.8, FlashAttention 2.8.0, etc.)
Test inference with Mick’s reference photo + F5-TTS audio
Measure FPS on RTX 4090
Deploy as K8s service using existing artifacts
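For the "Measure FPS" step, a small timing harness along these lines may be enough. It is only a sketch: `render_frame` is a hypothetical placeholder for whatever call produces one output frame, since the log does not describe the SoulX-FlashHead inference API.

```python
import time
from typing import Callable


def measure_fps(render_frame: Callable[[], object],
                n_frames: int = 200,
                warmup: int = 20) -> float:
    """Average frames per second over n_frames calls, after a warm-up phase.

    render_frame is a hypothetical zero-argument callable. For GPU work it
    should block until the frame is actually finished (for example by calling
    torch.cuda.synchronize()), because CUDA kernel launches are asynchronous
    and would otherwise make the timing look too fast.
    """
    for _ in range(warmup):            # warm-up: caches, lazy init, autotuning
        render_frame()
    start = time.perf_counter()
    for _ in range(n_frames):
        render_frame()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed
```

The measured value can then be compared against the 96 FPS figure quoted for the Lite variant on an RTX 4090.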

Work Done This Session

Full project state audit — verified all files intact, no corruption
Model landscape re-verification — SoulX-FlashHead still current, no breaking changes
Discrepancy resolution — documented the Ditto vs SoulX-FlashHead split and decided to keep SoulX-FlashHead as primary
Reference photos verified — 14 PNGs (IMG_7985–7999) in /opt/data/creative/faces/png/, ~42–62 KB each, converted from HEIC originals
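The reference photo check can be repeated with a short script. The sketch below assumes lowercase `.png` extensions (the glob is case-sensitive on Linux) and reports sizes in KiB; the helper name is illustrative.

```python
from pathlib import Path


def summarize_pngs(directory: str) -> dict:
    """Count the PNG files in a directory and report their size range in KiB."""
    files = sorted(Path(directory).glob("*.png"))
    sizes_kb = [f.stat().st_size / 1024 for f in files]
    return {
        "count": len(files),
        "min_kb": min(sizes_kb) if sizes_kb else None,
        "max_kb": max(sizes_kb) if sizes_kb else None,
        "names": [f.name for f in files],
    }
```

Calling `summarize_pngs("/opt/data/creative/faces/png/")` should report a count of 14 and a size range of roughly 42 to 62 KB if the state matches this session's audit.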

Next Steps

BLOCKER: pvs to authorize hermes SSH key on .106 — without this, Phase 4 cannot proceed
- Key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPf3z9WAxgw6+lGdvuhAiV/kQ0AIAaD6T79gIx84wOtU hermes@hermes-agent-7965856958-5t6j8
- Or provide alternative GPU access path (kubectl exec, manual setup)
When unblocked: Run Phase 4 implementation plan
Optional research: Evaluate VividTalk as potential alternative — verify license and code availability before investing time

What’s Next

Waiting on pvs to resolve .106 SSH access blocker. All artifacts and research are ready for immediate execution once GPU access is available.
