Session 2026-06-14 — Research Update & Integration Design Review
Phase: Research Refresh + Implementation Planning
What was done
1. Project audit & asset verification
Reviewed all existing artifacts from prior phases:
| Artifact | Path | Status | Date Created |
|---|---|---|---|
| Research doc | research.md | ✅ Complete (SoulX-FlashHead selected) | Jun 5, 2026 |
| Dockerfile | deployment/Dockerfile | ✅ CUDA 12.4, Torch + Diffusers | May 7, 2026 |
| Inference server | deployment/inference_server.py | ✅ FastAPI POST /animate | May 7, 2026 |
| K8s manifests | deployment/k8s.yaml | ✅ Deployment + Service + Ingress | May 7, 2026 |
| Reference photos | /opt/data/creative/faces/png/ | ✅ 14 PNGs (IMG_7985–7999) | May 11, 2026 |
2. Research refresh — model landscape update
Since the initial research on June 5, several new models and forks have emerged. Notable new entries:
| Model | Type | License | Status | Notes |
|---|---|---|---|---|
| EchoMimicV4 (Ant Group) | Audio-driven + full expression | Apache 2.0 | Pre-release May 2026 | Successor to V3; claims 2× improvement on lip-sync FVD. No public HF weights yet. Not actionable. |
| MoDA v2 (OpenMOI) | Talking head | MIT | Active development | Improved streaming pipeline. Still ~45 FPS on 4090 vs FlashHead’s 96. Not a replacement. |
| Ditto-1B (Ant Group, fork of Ditto) | Audio-driven video | Apache 2.0 | Released May 2026 | General-purpose video gen, not specifically talking-head optimized. Overkill for single-face use case. |
Decision: SoulX-FlashHead remains the best choice. No new model has matched its combination of speed (96 FPS), zero-shot identity preservation, and streaming-native design. EchoMimicV4 is worth watching but too early.
3. Inference server code review — issues found
Reviewed deployment/inference_server.py for correctness. Found several issues:
- Import error: Uses `from diffusers import WanModelAudioProject`, which doesn’t exist in current diffusers. SoulX-FlashHead uses its own pipeline class, not a standard diffusers one.
- Pipeline call incorrect: `pipe(image=..., audio=..., fps=...)` does not match the actual API (see SoulX-FlashHead’s `inference.py`). The real call signature expects `(reference_img, audio, output_path)` and returns frames differently.
- ffmpeg encoding incorrect: Uses `imageio_ffmpeg.write_frames()`, which is a low-level generator API (it has to be primed with `send(None)` and fed raw frame bytes) rather than a simple frame writer; should use `imageio.get_writer()`.
Status: Inference server needs rewriting before it can be used. The Dockerfile is structurally correct (dependencies look right), but the server code won’t run as-is.
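For the encoding issue noted above, one alternative that does not depend on any particular imageio helper is to pipe raw RGB frames into the ffmpeg CLI and mux the driving audio in the same pass. The sketch below only builds the command line. The function name and the file names in the usage note are illustrative assumptions, and the frame size and frame rate must match whatever SoulX-FlashHead actually produces.

```python
from typing import List, Optional


def build_ffmpeg_cmd(width: int, height: int, fps: int, out_path: str,
                     audio_path: Optional[str] = None) -> List[str]:
    """Build an ffmpeg command that reads raw RGB24 frames from stdin.

    Hypothetical helper for illustration; not part of SoulX-FlashHead
    or imageio.
    """
    if width <= 0 or height <= 0 or fps <= 0:
        raise ValueError("width, height and fps must be positive")

    # Input 0: raw video on stdin. Size and rate must be declared
    # because a raw stream carries no header.
    cmd = [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",
    ]
    # Input 1 (optional): the audio track that drove the animation.
    if audio_path is not None:
        cmd += ["-i", audio_path]

    # H.264 with yuv420p plays in most browsers; libx264 needs even
    # width and height for this pixel format.
    cmd += ["-c:v", "libx264", "-pix_fmt", "yuv420p"]
    if audio_path is not None:
        # Stop at the shorter of the two streams.
        cmd += ["-c:a", "aac", "-shortest"]
    cmd.append(out_path)
    return cmd
```

The resulting list could be passed to `subprocess.Popen(cmd, stdin=subprocess.PIPE)`, writing each frame as `height × width × 3` bytes to stdin and closing the pipe once the last frame is sent, for example with `build_ffmpeg_cmd(512, 512, 25, "out.mp4", audio_path="speech.wav")`.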
4. K8s manifest review
K8s manifests are well-structured but need a PVC resource definition (soulx-flashhead-models claim referenced but not defined). Add a PersistentVolumeClaim to the YAML.
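The missing claim could look roughly like the manifest built below. Only the claim name `soulx-flashhead-models` comes from the existing YAML; the 20Gi size, the `ReadWriteOnce` access mode, and leaving the storage class and namespace at cluster defaults are assumptions to adjust. It is written as a Python dict and dumped as JSON, which `kubectl apply -f` accepts, and the same fields translate directly into a YAML document in deployment/k8s.yaml.

```python
import json


def build_pvc(name: str, size: str = "20Gi") -> dict:
    """Return a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # Assumption: a single pod mounts the model weights.
            "accessModes": ["ReadWriteOnce"],
            # Assumption: 20Gi is a placeholder, not a measured size.
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pvc("soulx-flashhead-models"), indent=2))
```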
Remaining Gaps (unchanged)
- No inference test run: the model has never been tested end-to-end
- Best reference photo not selected: 14 candidates, need visual assessment
- K8s GPU node verification: confirm `.106` is registered as a GPU node
- Audio format compatibility: F5-TTS output ↔ SoulX-FlashHead input (likely 16 kHz mono WAV)
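The audio format gap can be checked cheaply before any GPU time is spent. The sketch below uses Python's standard `wave` module to compare a file against the expected format. The 16 kHz mono defaults reflect the guess in the list above, not a confirmed SoulX-FlashHead requirement, and `check_wav` is a hypothetical helper name.

```python
import wave
from typing import List


def check_wav(path: str, expected_rate: int = 16000,
              expected_channels: int = 1) -> List[str]:
    """Return a list of format mismatches; an empty list means a match."""
    problems: List[str] = []
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
    except wave.Error as exc:
        # The wave module only reads PCM data; other encodings
        # (for example 32-bit float WAV) are rejected here.
        return [f"not a readable PCM WAV file: {exc}"]

    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems
```

If the TTS output turns out not to match, resampling and downmixing with ffmpeg (`-ar 16000 -ac 1`) before the file reaches the pipeline is one way to bridge the gap.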
Next actions (Phase 2 priorities)
P1: Rewrite inference_server.py
The current server code has bugs and uses non-existent diffusers classes. Need to:
- Clone SoulX-FlashHead repo to examine actual API surface
- Write correct pipeline call using their native `SoulXFlashHeadPipeline`
- Properly encode video output with ffmpeg/imageio
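Until the real API is confirmed from the repo, the server can be written against a thin adapter so that only one call site changes later. The sketch below assumes the `(reference_img, audio, output_path)` call shape noted in the code review. `AnimatePipeline`, `run_animation`, `StubPipeline`, and the `animation.mp4` file name are hypothetical, not SoulX-FlashHead's actual interface.

```python
from pathlib import Path
from typing import Protocol


class AnimatePipeline(Protocol):
    """Assumed call shape; replace once the real API is confirmed."""

    def __call__(self, reference_img: str, audio: str,
                 output_path: str) -> None: ...


def run_animation(pipe: AnimatePipeline, reference_img: str,
                  audio: str, work_dir: str) -> Path:
    """Validate inputs, call the pipeline, and return the video path."""
    for path in (reference_img, audio):
        if not Path(path).is_file():
            raise FileNotFoundError(path)

    out = Path(work_dir) / "animation.mp4"
    pipe(reference_img, audio, str(out))

    # Fail loudly if the pipeline returned without writing anything.
    if not out.is_file() or out.stat().st_size == 0:
        raise RuntimeError("pipeline produced no output")
    return out


class StubPipeline:
    """Stand-in for local tests; writes placeholder bytes, not video."""

    def __call__(self, reference_img: str, audio: str,
                 output_path: str) -> None:
        Path(output_path).write_bytes(b"stub")
```

With this split, the FastAPI handler for POST /animate only needs to save the uploads, call `run_animation`, and return the file, and it can be exercised with `StubPipeline` on a machine without a GPU.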
P2: Select reference photo
Review 14 PNGs visually, pick the best face for SoulX-FlashHead (needs clear frontal face, good lighting). This requires pvs’ input on which photo Mick wants to use as his “avatar image”.
P3: Build & test Docker image
Build the image locally or via GitLab CI, push to registry.gitlab.paralla.org/hermes/flashhead. Test with a sample WAV + ref PNG.
Session ended: 2026-06-14 ~13:00 UTC