Talking-Head Model Research — June 2026

Date: 2026-06-05
Agent: hermes
Kanban Task: t_b3774ddc (voice-pipeline board)

Objective

Select the best audio-driven talking-head model for animating Mick’s avatar, synced with F5-TTS voice output. Target inference on RTX 4090 (.106). Criteria: speed, quality, lip-sync accuracy, streaming capability, license.


Model Comparison Matrix

| Model | Repo | License | FPS (RTX 4090) | Lip-Sync Quality | Streaming? | Notes |
|---|---|---|---|---|---|---|
| SoulX-FlashHead | Soul-AILab/SoulX-FlashHead | Apache 2.0 ✅ | 96 FPS (Lite) | Excellent: zero drift, oracle-guided distillation | ✅ Frame-by-frame pipeline via Temporal Audio Context Cache | Already selected in prior sessions; Dockerfile + inference server + k8s manifests exist |
| LivePortrait | Kuaishou/LivePortrait | Apache 2.0 ✅ | ~30-60 FPS (image→anim) | Not audio-driven lip-sync natively; needs pairing with Wav2Lip or similar | ❌ Batch mode only | Great for general animation; poor fit for audio-driven talking head |
| Wav2Lip-HD | N/A | NC / CC-BY-NC-SA 4.0 ⚠️ | ~30 FPS (lip region only) | Very good lip sync, but limited to mouth area only | ❌ Batch processing | Non-commercial license is a blocker for production use; also lacks head/face movement beyond lips |
| DiffTalk | N/A | Research / unclear ⚠️ | Unknown (no public benchmarks) | Good per paper | ❌ Unclear | Insufficient data for decision; no clear production path |
| SyncTalk | N/A | Research / unclear ⚠️ | Unknown | Moderate: focuses on audio-video sync rather than realistic animation | ❌ Batch only | Still in research phase; not production-ready |

Additional Considerations

EchoMimic V3 (by Ant Group)

Not initially flagged as a priority, but worth noting:

  • Audio-driven talking head with full facial expression control
  • ~42 FPS on RTX 4090 (per their benchmarks)
  • Open-source under Apache 2.0 (need to verify current state)
  • More complex pipeline — requires face detection + landmark tracking before inference

Decision: Keep as backup option if SoulX-FlashHead has issues. Not pursuing for now.


Selection Decision: ✅ SoulX-FlashHead

Already selected in prior sessions (May 2026), and the decision still holds:

  1. Speed: 96 FPS on RTX 4090 enables a real-time streaming avatar (critical for interactive use cases)
  2. Zero-shot identity preservation: a single reference image plus audio yields video, with no per-subject training needed
  3. Streaming-native: Temporal Audio Context Cache supports frame-by-frame generation for a low-latency pipeline
  4. License: Apache 2.0 — production-safe
  5. Infrastructure ready: Dockerfile, inference_server.py, k8s.yaml already built in deployment/

Existing Assets

  • 14 reference photos of Mick at /opt/data/creative/faces/png/ (IMG_7985–IMG_7999)
  • Dockerfile: Based on CUDA 12.4, installs SoulX-FlashHead + dependencies from source
  • Inference server: FastAPI with POST /animate endpoint — takes audio + ref image, returns webm stream
  • K8s manifests: Deployment (GPU nodeSelector), Service, Ingress for flashhead.hermes.paralla.org

Integration Design Notes

Pipeline Architecture

Text → F5-TTS (.193) → Audio WAV → SoulX-FlashHead (.106, RTX 4090) → WebM Video Stream
                                          ↑
                                Reference Image (PNG/JPG from /opt/data/creative/faces/png/)
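
The two-stage flow in the diagram can be sketched as a small orchestration function. This is only an illustration under stated assumptions: the names run_pipeline, PipelineResult, fake_tts, and fake_animate are hypothetical, and the two stages are passed in as callables so the sketch runs without either service. In a real deployment the callables would wrap the F5-TTS request and the POST /animate request.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineResult:
    audio_bytes: int  # size of the intermediate WAV payload
    video: bytes      # encoded video returned by the animation stage


def run_pipeline(
    text: str,
    ref_image: bytes,
    synthesize: Callable[[str], bytes],
    animate: Callable[[bytes, bytes], bytes],
) -> PipelineResult:
    """Text -> WAV (TTS stage) -> video (animation stage)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    wav = synthesize(text)           # stage 1: would call F5-TTS
    if not wav:
        raise RuntimeError("TTS stage returned no audio")
    video = animate(wav, ref_image)  # stage 2: would call POST /animate
    return PipelineResult(audio_bytes=len(wav), video=video)


# Stub stages (hypothetical) so the sketch runs offline.
def fake_tts(text: str) -> bytes:
    return b"RIFF" + text.encode("utf-8")


def fake_animate(wav: bytes, image: bytes) -> bytes:
    return b"webm:" + wav[:4] + image[:3]


result = run_pipeline("Hello", b"PNGDATA", fake_tts, fake_animate)
print(result.audio_bytes, result.video)
```

Injecting the stages keeps the orchestration testable before either service is reachable, which matters while the inference server is still unverified.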

Options for Streaming vs Batch

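| Mode | Latency | Use Case | Complexity |
|---|---|---|---|
| Batch | Audio duration + processing time (~2-5× real-time) | Pre-recorded videos, content generation | Low: simple request/response |
| Streaming (frame-by-frame) | ~100-500 ms end-to-end (estimate; at the 96 FPS target, generation alone is ~10 ms per frame, so the rest would be buffering, encoding, and transport) | Interactive video calls, live presence | Medium: requires chunked audio input, temporal cache management |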
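
To relate the 96 FPS figure to latency, the per-frame compute budget and the headroom over playback speed can be worked out directly. The sketch below is plain arithmetic; the 25 FPS playback rate is an assumption (the note does not state the output frame rate), and the function names are illustrative. It suggests that the ~100-500 ms figure above covers more than the generation of a single frame.

```python
def frame_budget_ms(fps: float) -> float:
    """Milliseconds available per frame at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def realtime_factor(generation_fps: float, playback_fps: float) -> float:
    """How many times faster than playback the generator runs."""
    return generation_fps / playback_fps


GENERATION_FPS = 96.0  # from the comparison matrix (Lite variant)
PLAYBACK_FPS = 25.0    # assumption: output video frame rate

print(f"per-frame generation time: {frame_budget_ms(GENERATION_FPS):.2f} ms")
print(f"playback frame interval:   {frame_budget_ms(PLAYBACK_FPS):.2f} ms")
print(f"headroom over real time:   {realtime_factor(GENERATION_FPS, PLAYBACK_FPS):.2f}x")
```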

Initial Recommendation

Start with batch mode for proof-of-concept, then iterate to streaming if interactive use cases emerge. Batch mode is simpler and leverages the existing inference server design directly.
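
If streaming is pursued later, the chunked audio input it requires could look roughly like the sketch below, which slices PCM samples into one chunk per video frame. The 16 kHz sample rate and 25 FPS frame rate are assumptions, and chunk_audio is a hypothetical helper; how SoulX-FlashHead actually consumes audio windows has to be taken from its repository.

```python
from typing import Iterator, List, Sequence


def chunk_audio(
    samples: Sequence[int], sample_rate: int, video_fps: int
) -> Iterator[List[int]]:
    """Yield one fixed-size chunk of samples per video frame.

    The final chunk is zero-padded (silence) so every chunk has equal length.
    """
    if sample_rate % video_fps != 0:
        raise ValueError("sample_rate must be a multiple of video_fps")
    step = sample_rate // video_fps  # samples per video frame
    for start in range(0, len(samples), step):
        chunk = list(samples[start:start + step])
        chunk.extend([0] * (step - len(chunk)))  # pad the tail if short
        yield chunk


# Assumed values: 16 kHz mono audio, 25 FPS video -> 640 samples per frame.
samples = [1] * 16100  # one second of audio plus 100 extra samples
chunks = list(chunk_audio(samples, sample_rate=16000, video_fps=25))
print(len(chunks), len(chunks[-1]))
```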


Remaining Gaps

  1. No actual inference test: the Dockerfile installs the model, but no test run has been logged
  2. Reference image quality — 14 photos exist but none selected/verified as “best” for SoulX-FlashHead
  3. K8s GPU nodeSelector — need to verify that .106 is registered as a GPU-capable K8s node (nvidia.com/gpu resource)
  4. Audio format compatibility — F5-TTS output format needs to match SoulX-FlashHead’s expected audio input
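
For gap 4, a quick format check on the WAV produced by the TTS stage can be done with the standard library alone. The expected values below (16 kHz, mono, 16-bit PCM) are assumptions chosen for illustration, not confirmed SoulX-FlashHead requirements, and the 24 kHz input is only an example of a mismatch. wav_mismatches and make_silent_wav are hypothetical helpers.

```python
import io
import wave

# Assumed target format; replace with the values from the model's docs.
EXPECTED_RATE = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit PCM


def wav_mismatches(wav_bytes: bytes) -> list[str]:
    """Return human-readable differences from the expected WAV format."""
    problems = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(
                f"sample rate {wav.getframerate()} Hz, expected {EXPECTED_RATE} Hz"
            )
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(
                f"{wav.getnchannels()} channels, expected {EXPECTED_CHANNELS}"
            )
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(
                f"{wav.getsampwidth()} bytes/sample, expected {EXPECTED_SAMPLE_WIDTH}"
            )
    return problems


def make_silent_wav(rate: int, channels: int = 1, frames: int = 100) -> bytes:
    """Build a short silent 16-bit WAV in memory for testing."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * frames)
    return buf.getvalue()


print(wav_mismatches(make_silent_wav(24000)))
print(wav_mismatches(make_silent_wav(16000)))
```

If a mismatch is found, resampling before the animation stage is the usual remedy; where that step belongs in the pipeline is a design decision still to be made.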

Research Update: June 14, 2026

New model candidates since June 5

| Model | Type | License | FPS (RTX 4090) | Status | Notes |
|---|---|---|---|---|---|
| EchoMimicV4 | Audio-driven + full expression | Apache 2.0 | ~45 FPS claimed | Pre-release May 2026; no public weights yet | Successor to V3; claims 2× lip-sync FVD improvement. Not actionable yet. |
| MoDA v2 | Talking head streaming | MIT | ~45 FPS | Active development on OpenMOI | Improved pipeline, but still half the speed of FlashHead. Keep as backup. |
| Ditto-1B | General video gen (audio-driven) | Apache 2.0 | Unknown (heavy model) | Released May 2026 | General-purpose, not talking-head optimized. Overkill for single-face use case. |

Decision: SoulX-FlashHead remains best choice

No new model has matched its combination of speed (96 FPS), zero-shot identity preservation, streaming-native design, and Apache 2.0 license. EchoMimicV4 is worth monitoring but too early for integration.

Inference server bugs found

Reviewed deployment/inference_server.py — several issues prevent it from running:

  1. Invalid import: the statement "from diffusers import WanModelAudioProject" fails because that name does not exist in the current diffusers version
  2. Wrong pipeline call signature: SoulX-FlashHead uses its own pipeline class, not a standard diffusers one
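  3. Video encoding error: the imageio_ffmpeg.write_frames() call fails. The function does exist in imageio-ffmpeg, but it is a generator that must be primed with send(None) and then fed raw frame bytes, so the likely cause is misuse rather than a missing function (to be confirmed against the session log). imageio.get_writer() is the simpler replacement.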

Fix required: Server needs to be rewritten using the actual SoulX-FlashHead API surface from their repo. See session log for details.
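
Import problems like the first issue can be caught before the server starts with a small preflight check. The sketch below is a generic pattern, not part of the existing server: has_symbol, missing_symbols, and SERVER_IMPORTS are hypothetical names, and the entries in SERVER_IMPORTS are examples taken from the issues listed above. The demonstration uses the standard-library json module so it runs anywhere.

```python
import importlib
from typing import Iterable, List, Tuple


def has_symbol(module_name: str, symbol: str) -> bool:
    """True if the module imports cleanly and exposes the given name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, symbol)


def missing_symbols(required: Iterable[Tuple[str, str]]) -> List[str]:
    """List every 'module.symbol' pair that cannot be resolved."""
    return [f"{mod}.{sym}" for mod, sym in required if not has_symbol(mod, sym)]


# Example list for the inference server (not evaluated here, since the
# packages may be absent or slow to import on a development machine).
SERVER_IMPORTS = [
    ("diffusers", "WanModelAudioProject"),
    ("imageio", "get_writer"),
]

# Demonstration with a module that is always available.
print(missing_symbols([("json", "dumps"), ("json", "no_such_name")]))
```

Running missing_symbols(SERVER_IMPORTS) inside the built image and failing fast on a non-empty result would surface this class of bug at container start instead of on the first request.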

K8s manifest gap

Missing PersistentVolumeClaim definition — the Deployment references claimName: soulx-flashhead-models but no PVC resource exists. Needs to be added to k8s.yaml.
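
A minimal sketch of the missing resource, generated from Python so it stays in the same language as the server code. Only the claim name comes from the Deployment; the 50Gi size and the ReadWriteOnce access mode are assumptions, no storage class is set, and build_pvc is a hypothetical helper. The output is JSON, which is also valid YAML, so it can be appended to k8s.yaml as a separate document or applied directly with kubectl.

```python
import json


def build_pvc(name: str, size: str, access_mode: str = "ReadWriteOnce") -> dict:
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
        },
    }


# Claim name taken from the Deployment; the size is an assumed placeholder.
pvc = build_pvc("soulx-flashhead-models", "50Gi")
print(json.dumps(pvc, indent=2))
```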


Next Phase: Implementation

Phase 2 priorities:

  1. Rewrite inference_server.py using correct SoulX-FlashHead API
  2. Select best reference photo from 14 candidates (visual assessment needed)
  3. Build & test Docker image — push to registry.gitlab.paralla.org
  4. Verify K8s GPU node (.106 nvidia.com/gpu resource availability)
  5. Add PVC definition to k8s.yaml
  6. Test end-to-end pipeline: Text → F5-TTS WAV → FlashHead video
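
For priority 4, the GPU check can be scripted against the JSON that "kubectl get node <name> -o json" prints. The sketch below only parses such output; the sample document and the node name "gpu-node" are made up for illustration, and allocatable_gpus is a hypothetical helper.

```python
import json


def allocatable_gpus(node: dict) -> int:
    """Number of nvidia.com/gpu resources a node advertises as allocatable."""
    allocatable = node.get("status", {}).get("allocatable", {})
    # Kubernetes reports resource quantities as strings, e.g. "1".
    return int(allocatable.get("nvidia.com/gpu", "0"))


# Made-up sample in the shape of a Node object.
sample_node = json.loads(
    """
    {
      "metadata": {"name": "gpu-node"},
      "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "1"}}
    }
    """
)

count = allocatable_gpus(sample_node)
print(f"{sample_node['metadata']['name']}: {count} allocatable GPU(s)")
```

A result of 0 would typically mean the NVIDIA device plugin is not running on the node, in which case the Deployment's GPU request could not be scheduled.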