Session 2026-06-14 — Research Update & Integration Design Review

Phase: Research Refresh + Implementation Planning

What was done

1. Project audit & asset verification

Reviewed all existing artifacts from prior phases:

| Artifact | Path | Status | Date Created |
| --- | --- | --- | --- |
| Research doc | `research.md` | ✅ Complete (SoulX-FlashHead selected) | Jun 5, 2026 |
| Dockerfile | `deployment/Dockerfile` | ✅ CUDA 12.4, Torch + Diffusers | May 7, 2026 |
| Inference server | `deployment/inference_server.py` | ✅ FastAPI POST /animate | May 7, 2026 |
| K8s manifests | `deployment/k8s.yaml` | ✅ Deployment + Service + Ingress | May 7, 2026 |
| Reference photos | `/opt/data/creative/faces/png/` | ✅ 14 PNGs (IMG_7985–7999) | May 11, 2026 |

2. Research refresh — model landscape update

Several new models and forks have emerged since the initial research on June 5. Notable entries:

| Model | Type | License | Status | Notes |
| --- | --- | --- | --- | --- |
| EchoMimicV4 (Ant Group) | Audio-driven + full expression | Apache 2.0 | Pre-release May 2026 | Successor to V3; claims 2× improvement on lip-sync FVD. No public HF weights yet. Not actionable. |
| MoDA v2 (OpenMOI) | Talking head | MIT | Active development | Improved streaming pipeline. Still ~45 FPS on 4090 vs FlashHead’s 96. Not a replacement. |
| Ditto-1B (Ant Group, fork of Ditto) | Audio-driven video | Apache 2.0 | Released May 2026 | General-purpose video gen, not specifically talking-head optimized. Overkill for single-face use case. |

Decision: SoulX-FlashHead remains the best choice. None of the new models matches its combination of speed (96 FPS), zero-shot identity preservation, and streaming-native design. EchoMimicV4 is worth watching, but with no public weights it is too early to evaluate.

3. Inference server code review — issues found

Reviewed `deployment/inference_server.py` for correctness. Found three issues:

Import error: uses `from diffusers import WanModelAudioProject`, which doesn’t exist in current diffusers. SoulX-FlashHead ships its own pipeline class rather than a standard diffusers one.
Incorrect pipeline call: `pipe(image=..., audio=..., fps=...)` does not match the actual API (see SoulX-FlashHead’s `inference.py`). The real call signature expects `(reference_img, audio, output_path)` and returns frames in a different form.
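Fragile video encoding: uses `imageio_ffmpeg.write_frames()`, which is a low-level generator API in the imageio-ffmpeg package (it has to be primed and then fed raw frame bytes) rather than a one-call frame writer; the higher-level `imageio.get_writer()` is the simpler choice.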

Status: Inference server needs rewriting before it can be used. The Dockerfile is structurally correct (dependencies look right), but the server code won’t run as-is.

4. K8s manifest review

The K8s manifests are well-structured but reference a `soulx-flashhead-models` PersistentVolumeClaim that is never defined. Add the PVC resource to `deployment/k8s.yaml`.
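
A minimal sketch of the missing resource, for discussion only. The claim name comes from the existing manifests; the access mode, the 50Gi size, and the commented-out storage class are assumptions that need to be matched to the cluster and to the actual size of the model weights.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: soulx-flashhead-models   # name referenced by the existing manifests
spec:
  accessModes:
    - ReadWriteOnce              # assumption: a single pod mounts the model volume
  resources:
    requests:
      storage: 50Gi              # assumption: placeholder size, not measured
  # storageClassName: <cluster-specific>   # assumption: left unset to use the default class
```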

Remaining Gaps (unchanged)

No inference test run: model never tested end-to-end
Best reference photo not selected: 14 candidates, need visual assessment
K8s GPU node verification: confirm .106 is registered as GPU node
Audio format compatibility: F5-TTS output ↔ SoulX-FlashHead input (likely 16 kHz mono WAV)
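
The audio compatibility gap can be probed before any end-to-end run. The sketch below uses only Python's standard `wave` module to compare a WAV file against an expected format. The function name `check_wav_format` is hypothetical, and the 16 kHz / mono / 16-bit defaults are taken from the "likely" guess above rather than from confirmed SoulX-FlashHead requirements.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1,
                     expected_sampwidth=2):
    """Return a list of mismatches; an empty list means the file matches.

    The defaults (16 kHz, mono, 16-bit PCM) are assumptions, not confirmed
    SoulX-FlashHead input requirements.
    """
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
            sampwidth = wav.getsampwidth()
    except wave.Error as exc:
        # The wave module only reads PCM WAV; other encodings raise wave.Error.
        return [f"not a readable PCM WAV file: {exc}"]

    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"channel count is {channels}, expected {expected_channels}")
    if sampwidth != expected_sampwidth:
        problems.append(
            f"sample width is {sampwidth} bytes, expected {expected_sampwidth} bytes"
        )
    return problems
```

Running this on a sample F5-TTS output shows whether a resampling or downmix step is needed between the two stages.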

Next actions (Phase 2 priorities)

P1: Rewrite inference_server.py

The current server code has bugs and uses non-existent diffusers classes. Need to:

Clone SoulX-FlashHead repo to examine actual API surface
Write correct pipeline call using their native SoulXFlashHeadPipeline
Properly encode video output with ffmpeg/imageio
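
For the encoding step, one option is to pipe raw frames straight into the `ffmpeg` CLI and mux the driving audio in the same pass. This is a sketch under stated assumptions, not the planned implementation: it assumes the pipeline's frames have already been converted to raw RGB24 bytes (for a NumPy `uint8` array, `frame.tobytes()`), that `ffmpeg` with libx264 is on the PATH inside the image, and the helper names `build_ffmpeg_cmd` and `encode_video` are hypothetical.

```python
import subprocess


def build_ffmpeg_cmd(width, height, fps, audio_path, output_path):
    """Build an ffmpeg command that reads raw RGB24 frames from stdin
    and muxes them with the given audio file into output_path."""
    return [
        "ffmpeg", "-y",
        # Input 0: raw video frames arriving on stdin.
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",
        # Input 1: the driving audio track.
        "-i", audio_path,
        # H.264 video in yuv420p for broad player compatibility, AAC audio.
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        # Stop at the shorter of the two streams.
        "-shortest",
        output_path,
    ]


def encode_video(frames, width, height, fps, audio_path, output_path):
    """Stream frames (an iterable of RGB24 byte strings) into ffmpeg."""
    frame_size = width * height * 3
    cmd = build_ffmpeg_cmd(width, height, fps, audio_path, output_path)
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    try:
        for frame in frames:
            if len(frame) != frame_size:
                raise ValueError(
                    f"frame has {len(frame)} bytes, expected {frame_size}"
                )
            proc.stdin.write(frame)
    finally:
        try:
            proc.stdin.close()
        except BrokenPipeError:
            pass  # ffmpeg already exited; the return code reports the failure
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError(f"ffmpeg exited with code {returncode}")
```

Keeping the command construction in its own function makes it checkable without a GPU or the model. `imageio.get_writer()` remains an alternative if muxing audio is handled separately.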

P2: Select reference photo

Review 14 PNGs visually, pick the best face for SoulX-FlashHead (needs clear frontal face, good lighting). This requires pvs’ input on which photo Mick wants to use as his “avatar image”.

P3: Build & test Docker image

Build the image locally or via GitLab CI, push to registry.gitlab.paralla.org/hermes/flashhead. Test with a sample WAV + ref PNG.

Session ended: 2026-06-14 ~13:00 UTC
