Talking-Head Model Research — June 2026

Date: 2026-06-05
Agent: hermes
Kanban Task: t_b3774ddc (voice-pipeline board)

Objective

Select the best audio-driven talking-head model for animating Mick’s avatar in sync with F5-TTS voice output. Inference target: the RTX 4090 on host .106. Criteria: speed, quality, lip-sync accuracy, streaming capability, and license.

Model Comparison Matrix

Model	Repo	License	FPS (RTX 4090)	Lip-Sync Quality	Streaming?	Notes
SoulX-FlashHead	Soul-AILab/SoulX-FlashHead	Apache 2.0 ✅	96 FPS (Lite)	Excellent — zero drift, oracle-guided distillation	✅ Frame-by-frame pipeline via Temporal Audio Context Cache	Already selected in prior sessions; Dockerfile + inference server + k8s manifests exist
LivePortrait	KwaiVGI/LivePortrait	MIT ✅ (bundled InsightFace detection models are non-commercial ⚠️)	~30-60 FPS (image→anim)	No native audio-driven lip-sync; needs pairing with Wav2Lip or similar	❌ Batch mode only	Great for general animation; poor fit for audio-driven talking head
Wav2Lip-HD	N/A	NC / CC-BY-NC-SA 4.0 ⚠️	~30 FPS (lip region only)	Very good lip sync, but limited to mouth area only	❌ Batch processing	Non-commercial license — blocker for production use; also lacks head/face movement beyond lips
DiffTalk	N/A	Research / unclear ⚠️	Unknown — no public benchmarks	Good per paper	❌ Unclear	Insufficient data for decision; no clear production path
SyncTalk	N/A	Research / unclear ⚠️	Unknown	Moderate — focuses on audio-video sync rather than realistic animation	❌ Batch only	Still in research phase; not production-ready

Additional Considerations

EchoMimic V3 (by Ant Group)

Not initially flagged as a priority, but worth noting:

Audio-driven talking head with full facial expression control
~42 FPS on RTX 4090 (per their benchmarks)
Open-source under Apache 2.0 (need to verify current state)
More complex pipeline — requires face detection + landmark tracking before inference

Decision: Keep as backup option if SoulX-FlashHead has issues. Not pursuing for now.

Selection Decision: ✅ SoulX-FlashHead

Already selected in prior sessions (May 2026), and the decision still holds:

Speed: 96 FPS on RTX 4090 enables real-time streaming avatar (critical for interactive use cases)
Zero-shot identity preservation: Single reference image + audio → video with no per-subject training needed
Streaming-native: Temporal Audio Context Cache supports frame-by-frame generation for low-latency pipeline
License: Apache 2.0 — production-safe
Infrastructure ready: Dockerfile, inference_server.py, k8s.yaml already built in deployment/

Existing Assets

✅ 14 reference photos of Mick at /opt/data/creative/faces/png/ (IMG_7985–IMG_7999)
✅ Dockerfile: Based on CUDA 12.4, installs SoulX-FlashHead + dependencies from source
✅ Inference server: FastAPI with a POST /animate endpoint that takes audio plus a reference image and returns a WebM stream
✅ K8s manifests: Deployment (GPU nodeSelector), Service, Ingress for flashhead.hermes.paralla.org

Integration Design Notes

Pipeline Architecture

Text → F5-TTS (.193) → Audio WAV → SoulX-FlashHead (.106, RTX 4090) → WebM Video Stream
                                          ↑
                                Reference Image (PNG/JPG from /opt/data/creative/faces/png/)
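The two-stage flow above can be sketched as a small orchestration function. This is a minimal sketch, not the real integration: the synthesize and animate callables are hypothetical stand-ins for the F5-TTS and SoulX-FlashHead clients, and their signatures are assumptions made for illustration only.

```python
from pathlib import Path
from typing import Callable

# Hypothetical stage signatures; neither is the real F5-TTS or SoulX-FlashHead API.
Synthesize = Callable[[str], bytes]        # text -> WAV bytes
Animate = Callable[[bytes, Path], bytes]   # (WAV bytes, reference image) -> WebM bytes


def is_wav(data: bytes) -> bool:
    """Cheap container check: RIFF header with a WAVE form type."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def run_pipeline(text: str, reference_image: Path,
                 synthesize: Synthesize, animate: Animate) -> bytes:
    """Run Text -> WAV -> video, failing early if the TTS stage returns non-WAV data."""
    if not text.strip():
        raise ValueError("text must not be empty")
    wav = synthesize(text)
    if not is_wav(wav):
        raise ValueError("TTS stage did not return RIFF/WAVE audio")
    return animate(wav, reference_image)
```

Passing the stages in as callables keeps the orchestration testable with stubs before either service is reachable, and lets batch and streaming clients share the same entry point later.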

Options for Streaming vs Batch

Mode	Latency	Use Case	Complexity
Batch	Audio duration + processing time (~2-5× real-time)	Pre-recorded videos, content generation	Low — simple request/response
Streaming (frame-by-frame)	~100-500 ms end-to-end latency (the 96 FPS target leaves ~10.4 ms of compute per frame)	Interactive video calls, live presence	Medium: requires chunked audio input and temporal cache management
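To make the streaming numbers concrete, the sketch below computes the per-frame time budget and the audio window that drives each video frame. The 24 kHz sample rate in the usage note is an illustrative assumption, and frame_windows is a hypothetical helper; how SoulX-FlashHead actually chunks audio is not confirmed here.

```python
from typing import Iterator, Tuple


def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame at the given frame rate."""
    return 1000.0 / fps


def frame_windows(num_samples: int, sample_rate: int,
                  fps: float) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices of the audio window for each video frame.

    Boundaries are rounded from the exact fractional position so that windows
    stay contiguous even when sample_rate / fps is not an integer.
    """
    samples_per_frame = sample_rate / fps
    frame = 0
    while True:
        start = round(frame * samples_per_frame)
        if start >= num_samples:
            return
        end = min(round((frame + 1) * samples_per_frame), num_samples)
        yield start, end
        frame += 1
```

At 96 FPS the budget is roughly 10.4 ms per frame, and one second of 24 kHz audio splits into 96 windows of 250 samples each.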

Initial Recommendation

Start with batch mode for proof-of-concept, then iterate to streaming if interactive use cases emerge. Batch mode is simpler and leverages the existing inference server design directly.

Remaining Gaps

No actual inference test: the Dockerfile installs the model, but no inference run has been logged
Reference image quality — 14 photos exist but none selected/verified as “best” for SoulX-FlashHead
K8s GPU nodeSelector — need to verify that .106 is registered as a GPU-capable K8s node (nvidia.com/gpu resource)
Audio format compatibility — F5-TTS output format needs to match SoulX-FlashHead’s expected audio input
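The audio-format gap can be checked mechanically before any inference run. The sketch below reads a WAV header with the standard-library wave module and reports mismatches against an expected format. The expected values are supplied by the caller because SoulX-FlashHead's required input format has not been confirmed; describe_wav and format_mismatches are hypothetical helper names.

```python
import wave
from typing import Dict, List


def describe_wav(path: str) -> Dict[str, int]:
    """Read basic format fields from a PCM WAV file header."""
    with wave.open(path, "rb") as wf:
        return {
            "sample_rate": wf.getframerate(),
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "frames": wf.getnframes(),
        }


def format_mismatches(actual: Dict[str, int],
                      expected: Dict[str, int]) -> List[str]:
    """List every expected field whose actual value differs."""
    return [
        f"{key}: got {actual.get(key)}, expected {want}"
        for key, want in expected.items()
        if actual.get(key) != want
    ]
```

An empty list from format_mismatches means the file matches on every field that was checked. Note that the wave module handles PCM data only, so a floating-point WAV would need a different reader.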

Research Update: June 14, 2026

New model candidates since June 5

Model	Type	License	FPS (RTX 4090)	Status	Notes
EchoMimicV4	Audio-driven + full expression	Apache 2.0	~45 FPS claimed	Pre-release May 2026 — no public weights yet	Successor to V3; claims 2× lip-sync FVD improvement. Not actionable yet.
MoDA v2	Talking head streaming	MIT	~45 FPS	Active development on OpenMOI	Improved pipeline, but still half the speed of FlashHead. Keep as backup.
Ditto-1B	General video gen (audio-driven)	Apache 2.0	Unknown — heavy model	Released May 2026	General-purpose, not talking-head optimized. Overkill for single-face use case.

Decision: SoulX-FlashHead remains best choice

No new model has matched its combination of speed (96 FPS), zero-shot identity preservation, streaming-native design, and Apache 2.0 license. EchoMimicV4 is worth monitoring but too early for integration.

Inference server bugs found

Reviewed deployment/inference_server.py — several issues prevent it from running:

Invalid import: from diffusers import WanModelAudioProject fails because no such class exists in the current diffusers version
Wrong pipeline call signature: SoulX-FlashHead uses its own pipeline class, not a standard diffusers one
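Video encoding error: imageio_ffmpeg.write_frames() is used as if it were a one-shot writer. The function does exist in imageio-ffmpeg, but it returns a generator that must be primed with send(None) and then fed raw frame bytes, so it was likely misused rather than missing. imageio.get_writer() with append_data() is the simpler replacement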

Fix required: Server needs to be rewritten using the actual SoulX-FlashHead API surface from their repo. See session log for details.

K8s manifest gap

Missing PersistentVolumeClaim definition — the Deployment references claimName: soulx-flashhead-models but no PVC resource exists. Needs to be added to k8s.yaml.
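One way to produce the missing resource is to generate it. The sketch below builds a PersistentVolumeClaim manifest as a plain dictionary and serializes it to JSON, which kubectl accepts alongside YAML. Only the claim name soulx-flashhead-models comes from the Deployment; the 50Gi size and ReadWriteOnce access mode in the usage note are placeholder assumptions, and build_pvc is a hypothetical helper.

```python
import json
from typing import Optional


def build_pvc(name: str, storage: str,
              access_mode: str = "ReadWriteOnce",
              storage_class: Optional[str] = None) -> dict:
    """Build a core/v1 PersistentVolumeClaim manifest as a dict."""
    spec = {
        "accessModes": [access_mode],
        "resources": {"requests": {"storage": storage}},
    }
    # Omit storageClassName so the cluster default applies unless one is given.
    if storage_class is not None:
        spec["storageClassName"] = storage_class
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": spec,
    }


def to_manifest_json(manifest: dict) -> str:
    """Serialize a manifest for `kubectl apply -f -`."""
    return json.dumps(manifest, indent=2)
```

For example, to_manifest_json(build_pvc("soulx-flashhead-models", "50Gi")) yields a document that can be piped to kubectl apply -f -. The size should be set from the actual model weight footprint once that is known.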

Next Phase: Implementation

Phase 2 priorities:

Rewrite inference_server.py using correct SoulX-FlashHead API
Select best reference photo from 14 candidates (visual assessment needed)
Build & test Docker image — push to registry.gitlab.paralla.org
Verify K8s GPU node (.106 nvidia.com/gpu resource availability)
Add PVC definition to k8s.yaml
Test end-to-end pipeline: Text → F5-TTS WAV → FlashHead video
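For the GPU node verification step, a small parser over the output of kubectl get nodes -o json is enough to confirm whether any node advertises the nvidia.com/gpu resource. This is a minimal sketch: gpu_nodes is a hypothetical helper, and it reads the standard status.allocatable field of each node object.

```python
import json
from typing import Dict

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(node_list: dict) -> Dict[str, int]:
    """Map node name -> allocatable GPU count for nodes advertising nvidia.com/gpu."""
    result: Dict[str, int] = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "1".
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


def gpu_nodes_from_json(text: str) -> Dict[str, int]:
    """Convenience wrapper for raw `kubectl get nodes -o json` output."""
    return gpu_nodes(json.loads(text))
```

If the node backing .106 is absent from the result, the NVIDIA device plugin is probably not running there, and the Deployment's GPU request would leave the pod unschedulable.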
