Session 2026-06-14 — Research Update & Integration Design Review

Phase: Research Refresh + Implementation Planning

What was done

1. Project audit & asset verification

Reviewed all existing artifacts from prior phases:

| Artifact | Path | Status | Date Created |
|---|---|---|---|
| Research doc | research.md | ✅ Complete (SoulX-FlashHead selected) | Jun 5, 2026 |
| Dockerfile | deployment/Dockerfile | ✅ CUDA 12.4, Torch + Diffusers | May 7, 2026 |
| Inference server | deployment/inference_server.py | ✅ FastAPI POST /animate | May 7, 2026 |
| K8s manifests | deployment/k8s.yaml | ✅ Deployment + Service + Ingress | May 7, 2026 |
| Reference photos | /opt/data/creative/faces/png/ | ✅ 14 PNGs (IMG_7985–7999) | May 11, 2026 |

2. Research refresh — model landscape update

Since the initial research on June 5, several new models and forks have emerged. Notable entries:

| Model | Type | License | Status | Notes |
|---|---|---|---|---|
| EchoMimicV4 (Ant Group) | Audio-driven + full expression | Apache 2.0 | Pre-release May 2026 | Successor to V3; claims 2× improvement on lip-sync FVD. No public HF weights yet. Not actionable. |
| MoDA v2 (OpenMOI) | Talking head | MIT | Active development | Improved streaming pipeline. Still ~45 FPS on 4090 vs FlashHead’s 96. Not a replacement. |
| Ditto-1B (Ant Group, fork of Ditto) | Audio-driven video | Apache 2.0 | Released May 2026 | General-purpose video gen, not specifically talking-head optimized. Overkill for single-face use case. |

Decision: SoulX-FlashHead remains the best choice. No new model matches its combination of speed (96 FPS), zero-shot identity preservation, and streaming-native design. EchoMimicV4 is worth watching, but with no public weights it is too early to evaluate.

3. Inference server code review — issues found

Reviewed deployment/inference_server.py for correctness. Found several issues:

  • Import error: Uses from diffusers import WanModelAudioProject which doesn’t exist in current diffusers. SoulX-FlashHead uses its own pipeline class, not a standard diffusers one.
  • Pipeline call incorrect: pipe(image=..., audio=..., fps=...) does not match the actual API (see SoulX-FlashHead’s inference.py). The real call signature expects (reference_img, audio, output_path) and returns frames in a different form than the server expects.
  • Encoding step unreliable: relies on imageio_ffmpeg.write_frames(), a low-level generator API (primed with send(None), then fed raw frame bytes) rather than a simple one-call writer; imageio.get_writer() is the better fit.

Status: Inference server needs rewriting before it can be used. The Dockerfile is structurally correct (dependencies look right), but the server code won’t run as-is.
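
An import problem like the one above can be caught at image build time instead of at first request. The following is a minimal sketch of such a preflight check, assuming it runs inside the built image; the helper name missing_symbols is hypothetical and not part of the existing server.

```python
import importlib


def missing_symbols(required):
    """Return "module.symbol" strings that cannot be resolved.

    `required` maps a module name to the attribute names expected on it.
    A module that fails to import counts all of its symbols as missing.
    """
    missing = []
    for module_name, symbols in required.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            missing.extend(f"{module_name}.{name}" for name in symbols)
            continue
        missing.extend(
            f"{module_name}.{name}" for name in symbols if not hasattr(module, name)
        )
    return missing


if __name__ == "__main__":
    # The class name is the one the current server imports; per the review
    # it is not expected to resolve, so this should exit non-zero.
    problems = missing_symbols({"diffusers": ["WanModelAudioProject"]})
    if problems:
        raise SystemExit("unresolved imports: " + ", ".join(problems))
```

Running this as a Dockerfile step or CI job would turn a missing class into a failed build rather than a crashing pod.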

4. K8s manifest review

The K8s manifests are well-structured, but they reference a claim named soulx-flashhead-models that is never defined. Add a PersistentVolumeClaim resource with that name to the YAML.
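
As a rough sketch of the missing resource, the snippet below builds a core/v1 PersistentVolumeClaim as a plain dict and prints it as JSON, which kubectl accepts and which is also valid YAML. Only the claim name comes from the manifests; the storage size, the access mode, and the helper name build_pvc are assumptions to be replaced with real values.

```python
import json


def build_pvc(name, storage, access_mode="ReadWriteOnce", namespace=None):
    """Build a core/v1 PersistentVolumeClaim manifest as a plain dict."""
    metadata = {"name": name}
    if namespace:
        metadata["namespace"] = namespace
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": metadata,
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": storage}},
        },
    }


if __name__ == "__main__":
    # "50Gi" is a placeholder, not a measured size; set it from the actual
    # model weights. No storageClassName is set, so the cluster default applies.
    print(json.dumps(build_pvc("soulx-flashhead-models", "50Gi"), indent=2))
```

The output can be piped to kubectl apply -f - or pasted into k8s.yaml as an additional document.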


Remaining Gaps (unchanged)

  1. No inference test run — model never tested end-to-end
  2. Best reference photo not selected — 14 candidates, need visual assessment
  3. K8s GPU node verification — confirm .106 is registered as GPU node
  4. Audio format compatibility — F5-TTS output ↔ SoulX-FlashHead input (likely 16kHz mono WAV)
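
For gap 4, a quick format check on a sample F5-TTS output would settle the question before any model run. The sketch below uses only the standard library; the 16 kHz mono target is the "likely" value from the note above, not a confirmed requirement, and both function names are hypothetical. Note that the wave module reads integer PCM only, so a float-encoded WAV will raise wave.Error, which is itself a useful signal.

```python
import wave


def wav_format(path):
    """Read sample rate, channel count, sample width (bytes) and duration."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width": wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }


def format_mismatches(info, sample_rate=16000, channels=1):
    """List differences between a file's format and the expected input format."""
    problems = []
    if info["sample_rate"] != sample_rate:
        problems.append(
            f"sample rate {info['sample_rate']} Hz, expected {sample_rate} Hz"
        )
    if info["channels"] != channels:
        problems.append(f"{info['channels']} channel(s), expected {channels}")
    return problems
```

An empty list from format_mismatches(wav_format(path)) means the file matches the assumed target; anything else indicates a resampling or downmix step is needed between TTS and animation.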

Next actions (Phase 2 priorities)

P1: Rewrite inference_server.py

The current server code has bugs and uses non-existent diffusers classes. Need to:

  1. Clone SoulX-FlashHead repo to examine actual API surface
  2. Write correct pipeline call using their native SoulXFlashHeadPipeline
  3. Properly encode video output with ffmpeg/imageio
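
One way to keep the rewrite testable before the real API is confirmed is to isolate the pipeline behind a plain callable. The sketch below assumes the (reference_img, audio, output_path) call shape noted in the review and further assumes the pipeline writes the finished video to output_path; both need checking against the cloned repo. AnimateFn and animate_to_bytes are hypothetical names.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical adapter type: any callable taking
# (reference_img, audio, output_path) as file paths. The real
# SoulX-FlashHead entry point may differ and would be wrapped to fit.
AnimateFn = Callable[[str, str, str], object]


def animate_to_bytes(
    run_pipeline: AnimateFn, image_bytes: bytes, audio_bytes: bytes
) -> bytes:
    """Write uploads to a scratch dir, run the pipeline, return the video bytes."""
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        ref_path = work / "reference.png"
        audio_path = work / "speech.wav"
        out_path = work / "output.mp4"
        ref_path.write_bytes(image_bytes)
        audio_path.write_bytes(audio_bytes)

        # Assumption: the pipeline encodes the video itself at out_path.
        run_pipeline(str(ref_path), str(audio_path), str(out_path))

        if not out_path.is_file() or out_path.stat().st_size == 0:
            raise RuntimeError("pipeline finished without writing a video file")
        return out_path.read_bytes()
```

The FastAPI POST /animate handler would then only read the two uploads, call animate_to_bytes with the real pipeline wrapper, and return the bytes, so the request handling can be unit-tested with a fake callable and no GPU.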

P2: Select reference photo

Review 14 PNGs visually, pick the best face for SoulX-FlashHead (needs clear frontal face, good lighting). This requires pvs’ input on which photo Mick wants to use as his “avatar image”.

P3: Build & test Docker image

Build the image locally or via GitLab CI, push to registry.gitlab.paralla.org/hermes/flashhead. Test with a sample WAV + ref PNG.
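
The sample-file test could be scripted with a small standard-library client along these lines. It is a sketch under assumptions: the request is taken to be multipart/form-data, and the field names "image" and "audio" are hypothetical, since the final request shape of POST /animate depends on the rewritten server.

```python
import urllib.request
import uuid


def encode_multipart(files):
    """Encode {field: (filename, content_type, data)} as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for field, (filename, content_type, data) in files.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        )
        parts.append(header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def post_animate(base_url, image_bytes, audio_bytes, timeout=300):
    """POST one reference image and one WAV to /animate; return the response body."""
    body, content_type = encode_multipart({
        "image": ("reference.png", "image/png", image_bytes),  # hypothetical field
        "audio": ("speech.wav", "audio/wav", audio_bytes),     # hypothetical field
    })
    request = urllib.request.Request(
        f"{base_url}/animate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Writing the returned bytes to a file and opening it in a video player would double as the first end-to-end inference test.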


Session ended: 2026-06-14 ~13:00 UTC