Talking-Head Model Research — June 2026
Date: 2026-06-05
Agent: hermes
Kanban Task: t_b3774ddc (voice-pipeline board)
Objective
Select the best audio-driven talking-head model for animating Mick’s avatar, synced with F5-TTS voice output. Target inference on RTX 4090 (.106). Criteria: speed, quality, lip-sync accuracy, streaming capability, license.
Model Comparison Matrix
| Model | Repo | License | FPS (RTX 4090) | Lip-Sync Quality | Streaming? | Notes |
|---|---|---|---|---|---|---|
| SoulX-FlashHead | Soul-AILab/SoulX-FlashHead | Apache 2.0 ✅ | 96 FPS (Lite) | Excellent — zero drift, oracle-guided distillation | ✅ Frame-by-frame pipeline via Temporal Audio Context Cache | Already selected in prior sessions; Dockerfile + inference server + k8s manifests exist |
| LivePortrait | Kuaishou/LivePortrait | Apache 2.0 ✅ | ~30-60 FPS (image→anim) | Not audio-driven lip-sync natively — needs pairing with Wav2Lip or similar | ❌ Batch mode only | Great for general animation; poor fit for audio-driven talking head |
| Wav2Lip-HD | N/A | NC / CC-BY-NC-SA 4.0 ⚠️ | ~30 FPS (lip region only) | Very good lip sync, but limited to mouth area only | ❌ Batch processing | Non-commercial license — blocker for production use; also lacks head/face movement beyond lips |
| DiffTalk | N/A | Research / unclear ⚠️ | Unknown — no public benchmarks | Good per paper | ❌ Unclear | Insufficient data for decision; no clear production path |
| SyncTalk | N/A | Research / unclear ⚠️ | Unknown | Moderate — focuses on audio-video sync rather than realistic animation | ❌ Batch only | Still in research phase; not production-ready |
Additional Considerations
EchoMimic V3 (by Ant Group)
Not initially flagged as a priority, but worth noting:
- Audio-driven talking head with full facial expression control
- ~42 FPS on RTX 4090 (per their benchmarks)
- Open-source under Apache 2.0 (need to verify current state)
- More complex pipeline — requires face detection + landmark tracking before inference
Decision: Keep as backup option if SoulX-FlashHead has issues. Not pursuing for now.
Selection Decision: ✅ SoulX-FlashHead
Already selected in prior sessions (May 2026), and the decision still holds:
- Speed: 96 FPS on RTX 4090 enables real-time streaming avatar (critical for interactive use cases)
- Zero-shot identity preservation: Single reference image + audio → video with no per-subject training needed
- Streaming-native: Temporal Audio Context Cache supports frame-by-frame generation for low-latency pipeline
- License: Apache 2.0 — production-safe
- Infrastructure ready: Dockerfile, inference_server.py, k8s.yaml already built in deployment/
Existing Assets
- ✅ 14 reference photos of Mick at /opt/data/creative/faces/png/ (IMG_7985–IMG_7999)
- ✅ Dockerfile: Based on CUDA 12.4, installs SoulX-FlashHead + dependencies from source
- ✅ Inference server: FastAPI with POST /animate endpoint that takes audio + ref image and returns a WebM stream
- ✅ K8s manifests: Deployment (GPU nodeSelector), Service, Ingress for flashhead.hermes.paralla.org
Integration Design Notes
Pipeline Architecture
Text → F5-TTS (.193) → Audio WAV → SoulX-FlashHead (.106, RTX 4090) → WebM Video Stream
Reference Image (PNG/JPG from /opt/data/creative/faces/png/) → SoulX-FlashHead (second input)
Options for Streaming vs Batch
| Mode | Latency | Use Case | Complexity |
|---|---|---|---|
| Batch | Audio duration + processing time (~2-5× real-time) | Pre-recorded videos, content generation | Low — simple request/response |
| Streaming (frame-by-frame) | ~100-500 ms end-to-end (estimate); ~10 ms per frame at the 96 FPS target | Interactive video calls, live presence | Medium: requires chunked audio input, temporal cache management |
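The streaming figures are easier to reason about once the per-frame arithmetic is written out. The sketch below derives the steady-state frame budget from the 96 FPS target, the number of audio samples that drive one video frame, and the headroom over playback speed. The 25 FPS playback rate and 24 kHz sample rate are illustrative assumptions, not confirmed values for F5-TTS or SoulX-FlashHead.

```python
def frame_budget_ms(generation_fps: float) -> float:
    """Time available to produce one frame at a given generation rate."""
    return 1000.0 / generation_fps


def samples_per_frame(sample_rate_hz: int, video_fps: int) -> int:
    """Audio samples covered by one video frame (must divide evenly)."""
    if sample_rate_hz % video_fps:
        raise ValueError("sample rate is not a multiple of the video frame rate")
    return sample_rate_hz // video_fps


def realtime_factor(generation_fps: float, playback_fps: float) -> float:
    """How many times faster than playback the model generates frames."""
    return generation_fps / playback_fps


if __name__ == "__main__":
    GENERATION_FPS = 96      # from the comparison matrix (Lite variant)
    PLAYBACK_FPS = 25        # assumption: output video frame rate
    SAMPLE_RATE_HZ = 24_000  # assumption: TTS output sample rate

    print(f"{frame_budget_ms(GENERATION_FPS):.2f} ms per frame")            # 10.42 ms per frame
    print(f"{samples_per_frame(SAMPLE_RATE_HZ, PLAYBACK_FPS)} samples")     # 960 samples
    print(f"{realtime_factor(GENERATION_FPS, PLAYBACK_FPS):.2f}x realtime") # 3.84x realtime
```

Under these assumptions the model has roughly 3.8× headroom over playback, so most of any observed latency would come from audio chunking, encoding, and transport rather than frame generation itself.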
Initial Recommendation
Start with batch mode for proof-of-concept, then iterate to streaming if interactive use cases emerge. Batch mode is simpler and leverages the existing inference server design directly.
Remaining Gaps
- No actual inference test — model has been installed (see Dockerfile) but no test run logged
- Reference image quality — 14 photos exist but none selected/verified as “best” for SoulX-FlashHead
- K8s GPU nodeSelector: need to verify that .106 is registered as a GPU-capable K8s node (nvidia.com/gpu resource)
- Audio format compatibility: F5-TTS output format needs to match SoulX-FlashHead’s expected audio input
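The audio format gap can be checked mechanically before any inference run. The following is a minimal sketch using only the standard library `wave` module to read a WAV header and report mismatches against a target spec. The `EXPECTED` values (16 kHz, mono, 16-bit) are placeholders, since the note does not record what SoulX-FlashHead actually expects; `AudioSpec`, `read_spec`, and `mismatches` are hypothetical helper names.

```python
import wave
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioSpec:
    sample_rate_hz: int
    channels: int
    sample_width_bytes: int


# Placeholder target: replace once FlashHead's real input spec is confirmed.
EXPECTED = AudioSpec(sample_rate_hz=16_000, channels=1, sample_width_bytes=2)


def read_spec(path: str) -> AudioSpec:
    """Read sample rate, channel count, and sample width from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return AudioSpec(wav.getframerate(), wav.getnchannels(), wav.getsampwidth())


def mismatches(actual: AudioSpec, expected: AudioSpec = EXPECTED) -> list:
    """Return one human-readable line per field that differs from the target."""
    problems = []
    for field in ("sample_rate_hz", "channels", "sample_width_bytes"):
        got, want = getattr(actual, field), getattr(expected, field)
        if got != want:
            problems.append(f"{field}: got {got}, expected {want}")
    return problems
```

An empty list from `mismatches(read_spec("tts_output.wav"))` means no resampling step is needed. Note that `wave` only reads integer PCM files; if the TTS stage writes float WAVs, `wave.open` raises `wave.Error`, which is itself a useful signal that a conversion step is required.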
Research Update: June 14, 2026
New model candidates since June 5
| Model | Type | License | FPS (RTX 4090) | Status | Notes |
|---|---|---|---|---|---|
| EchoMimicV4 | Audio-driven + full expression | Apache 2.0 | ~45 FPS claimed | Pre-release May 2026 — no public weights yet | Successor to V3; claims 2× lip-sync FVD improvement. Not actionable yet. |
| MoDA v2 | Talking head streaming | MIT | ~45 FPS | Active development on OpenMOI | Improved pipeline, but still half the speed of FlashHead. Keep as backup. |
| Ditto-1B | General video gen (audio-driven) | Apache 2.0 | Unknown — heavy model | Released May 2026 | General-purpose, not talking-head optimized. Overkill for single-face use case. |
Decision: SoulX-FlashHead remains best choice
No new model has matched its combination of speed (96 FPS), zero-shot identity preservation, streaming-native design, and Apache 2.0 license. EchoMimicV4 is worth monitoring but too early for integration.
Inference server bugs found
Reviewed deployment/inference_server.py — several issues prevent it from running:
- Invalid import: from diffusers import WanModelAudioProject doesn’t exist in the current diffusers version
- Wrong pipeline call signature: SoulX-FlashHead uses its own pipeline class, not a standard diffusers one
- Video encoding error: the imageio_ffmpeg.write_frames() call does not work as written (imageio-ffmpeg exposes write_frames as a generator that is fed frames via send(), not a one-shot writer); should use imageio.get_writer()
Fix required: Server needs to be rewritten using the actual SoulX-FlashHead API surface from their repo. See session log for details.
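For the video encoding fix, a writer based on `imageio.get_writer()` could look roughly like the sketch below. It assumes the model yields H×W×3 uint8 frames and that the `imageio-ffmpeg` backend is installed; the function name `write_webm`, the 25 FPS default, and the VP9 codec choice are illustrative assumptions rather than anything taken from the SoulX-FlashHead repo. The `get_writer` parameter exists only so the function can be exercised without ffmpeg.

```python
from contextlib import closing
from typing import Any, Callable, Iterable, Optional


def write_webm(
    frames: Iterable[Any],
    path: str,
    fps: int = 25,  # assumption: output frame rate
    get_writer: Optional[Callable[..., Any]] = None,
) -> int:
    """Encode frames to a WebM file and return how many frames were written."""
    if get_writer is None:
        # Imported lazily so this module loads even where imageio is absent.
        import imageio.v2 as imageio

        get_writer = imageio.get_writer

    count = 0
    # closing() guarantees the writer is finalized even if a frame fails.
    with closing(get_writer(path, format="FFMPEG", fps=fps, codec="libvpx-vp9")) as writer:
        for frame in frames:
            writer.append_data(frame)
            count += 1
    return count
```

Because `frames` is consumed lazily, the same function works with a generator that yields frames as the model produces them, which keeps peak memory flat for long clips.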
K8s manifest gap
Missing PersistentVolumeClaim definition — the Deployment references claimName: soulx-flashhead-models but no PVC resource exists. Needs to be added to k8s.yaml.
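One way to produce the missing resource is to generate it, which keeps the claim name in a single place. The sketch below builds a PersistentVolumeClaim manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts and which can also be appended to k8s.yaml as a separate document. Only the claim name comes from the Deployment; the 20Gi size, the ReadWriteOnce access mode, and the `build_pvc` helper are assumptions to adjust for the actual cluster.

```python
import json
from typing import Optional


def build_pvc(
    name: str,
    size: str = "20Gi",                  # assumption: enough for model weights
    access_mode: str = "ReadWriteOnce",  # assumption: single-node mount
    storage_class: Optional[str] = None,
) -> dict:
    """Return a PersistentVolumeClaim manifest as a JSON-serializable dict."""
    spec = {
        "accessModes": [access_mode],
        "resources": {"requests": {"storage": size}},
    }
    if storage_class is not None:
        # Omitted by default so the cluster's default StorageClass applies.
        spec["storageClassName"] = storage_class
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    # Name must match the claimName referenced by the Deployment.
    print(json.dumps(build_pvc("soulx-flashhead-models"), indent=2))
```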
Next Phase: Implementation
Phase 2 priorities:
- Rewrite inference_server.py using correct SoulX-FlashHead API
- Select best reference photo from 14 candidates (visual assessment needed)
- Build & test Docker image — push to registry.gitlab.paralla.org
- Verify K8s GPU node (.106 nvidia.com/gpu resource availability)
- Add PVC definition to k8s.yaml
- Test end-to-end pipeline: Text → F5-TTS WAV → FlashHead video
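The GPU node verification in the list above can be scripted against the JSON that `kubectl get node <node-name> -o json` returns. The sketch below reads the `nvidia.com/gpu` entry from the node's `status.allocatable` map; the helper name `allocatable_gpus` and the sample node are illustrative, and the sample is trimmed to the fields the check needs.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(node: dict) -> int:
    """Return the number of schedulable GPUs a node advertises (0 if none)."""
    allocatable = node.get("status", {}).get("allocatable", {})
    # Kubernetes reports resource quantities as strings, e.g. "1".
    return int(allocatable.get(GPU_RESOURCE, "0"))


if __name__ == "__main__":
    # Illustrative stand-in for real `kubectl get node ... -o json` output.
    sample = json.loads(
        '{"status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "1"}}}'
    )
    print(allocatable_gpus(sample))  # 1
```

A result of 0 usually means the NVIDIA device plugin is not running on that node, in which case the Deployment's GPU request would leave the pod unschedulable.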