Session 1 — 2026-06-05

Task ID

t_b3774ddc — synthetic-avatar: Build Mick talking-head avatar pipeline

Board: voice-pipeline (default board)
Status: ready → research complete, next phase TBD


What was done

1. Kanban setup

  • Created task card t_b3774ddc on the default board, with a body describing the project scope: “Add realistic animated avatar of Mick (Mick’s voice clone) to F5-TTS output. Audio in → video out.”
  • Model selection was noted as TBD — evaluating SadTalker, LivePortrait, Wav2Lip, DiffTalk, SyncTalk

2. Asset audit

Confirmed existing assets:

| Asset | Location | Status |
|---|---|---|
| Reference photos | /opt/data/creative/faces/png/ (14 PNG files, IMG_7985–IMG_7999) | ✅ Verified |
| Dockerfile | deployment/Dockerfile | ✅ Built for SoulX-FlashHead + CUDA 12.4 |
| Inference server | deployment/inference_server.py | ✅ FastAPI, POST /animate (audio + ref image → webm) |
| K8s manifests | deployment/k8s.yaml | ✅ Deployment (GPU nodeSelector), Service, Ingress flashhead.hermes.paralla.org |

3. Model research

Evaluated talking-head models available as of June 2026 for inference on RTX 4090 (.106):

| Model | License | FPS (RTX 4090) | Lip-sync | Streaming? | Verdict |
|---|---|---|---|---|---|
| SoulX-FlashHead | Apache 2.0 ✅ | 96 FPS (Lite) | Excellent, zero drift | ✅ Frame-by-frame via Temporal Audio Context Cache | Selected (already selected in prior sessions; all assets match this choice) |
| LivePortrait | Apache 2.0 ✅ | ~30-60 (image→anim) | Not audio-driven natively | ❌ Batch only | Good for general animation but not focused on audio lip-sync |
| Wav2Lip-HD | NC / CC-BY-NC-SA ⚠️ | ~30 (lips only) | Very good, lips only | ❌ Batch | Non-commercial license (blocker); limited to mouth region |
| DiffTalk | Unclear ⚠️ | Unknown | Good (per paper) | ❌ Unclear | Insufficient data, no production path |
| SyncTalk | Research / unclear ⚠️ | Unknown | Moderate | ❌ Batch only | Not production-ready |

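Decision: Keep SoulX-FlashHead as selected. The reported 96 FPS (Lite variant) on an RTX 4090 leaves enough headroom for a real-time streaming avatar, though this figure has not yet been verified on .106. All deployment artifacts already reference this model.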
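
As a rough check on the real-time claim, the headroom arithmetic can be sketched as below. The 96 FPS figure is the one recorded in the table; the 25 FPS playback rate is an assumption, since this log does not state the output frame rate.

```python
# Headroom arithmetic for real-time playback.
MODEL_FPS = 96.0     # SoulX-FlashHead (Lite) on RTX 4090, as recorded in the research table
PLAYBACK_FPS = 25.0  # ASSUMPTION: target video frame rate, not confirmed in this log


def realtime_factor(model_fps: float, playback_fps: float) -> float:
    """Seconds of video generated per second of wall-clock time."""
    return model_fps / playback_fps


def per_frame_budget_ms(fps: float) -> float:
    """Time per frame at the given rate, in milliseconds."""
    return 1000.0 / fps


if __name__ == "__main__":
    print(f"real-time factor:  {realtime_factor(MODEL_FPS, PLAYBACK_FPS):.2f}x")
    print(f"generation time:   {per_frame_budget_ms(MODEL_FPS):.1f} ms per frame")
    print(f"playback interval: {per_frame_budget_ms(PLAYBACK_FPS):.1f} ms per frame")
```

Under that assumption the model would produce about 3.84 seconds of video per second of wall-clock time, before any encoding or network overhead is counted.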

4. Integration design notes

Wrote research.md with pipeline architecture:

Text → F5-TTS (.193) → Audio WAV → SoulX-FlashHead (.106, RTX 4090) → WebM Video Stream
                                          ↑
                                Reference Image (from /opt/data/creative/faces/png/)
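
For the batch path, the SoulX-FlashHead step of the diagram above might be called roughly as follows. This is a minimal sketch using only the standard library. The HTTPS scheme and the multipart field names "audio" and "image" are assumptions and should be checked against deployment/inference_server.py before use.

```python
import urllib.request
import uuid

# ASSUMPTIONS: the scheme and the field names "audio" / "image" are not
# confirmed by this log. The host and the /animate path come from the asset audit.
ANIMATE_URL = "https://flashhead.hermes.paralla.org/animate"


def build_animate_request(wav_bytes: bytes, png_bytes: bytes,
                          url: str = ANIMATE_URL) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST for /animate."""
    boundary = uuid.uuid4().hex
    parts = []
    for field, filename, ctype, payload in (
        ("audio", "speech.wav", "audio/wav", wav_bytes),
        ("image", "reference.png", "image/png", png_bytes),
    ):
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            f"Content-Type: {ctype}\r\n\r\n"
        )
        parts.append(header.encode("ascii") + payload + b"\r\n")
    # The closing boundary marks the end of the multipart body.
    body = b"".join(parts) + f"--{boundary}--\r\n".encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Sending the request with urllib.request.urlopen(req) and writing resp.read() to a .webm file would complete the round trip, assuming the server returns the video in the response body.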

Two modes available:

  • Batch: full clip generation (~2-5× real-time), simple request/response — start here for POC
  • Streaming (frame-by-frame): ~100-500ms latency, requires Temporal Audio Context Cache — iterate after batch works
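
To make the streaming mode concrete, the sketch below splits an audio buffer into chunks aligned to video frames, which is the granularity at which a frame-by-frame generator would be fed. It does not implement the Temporal Audio Context Cache, whose internals are not described in this log. The 16 kHz sample rate and 25 FPS frame rate in the usage note are assumptions.

```python
from typing import Iterator, Sequence


def frame_chunks(samples: Sequence[int], sample_rate: int,
                 fps: int) -> Iterator[Sequence[int]]:
    """Yield one slice of audio samples per video frame.

    The last chunk may be shorter than the rest if the audio does not
    end on a frame boundary.
    """
    if sample_rate % fps != 0:
        raise ValueError("sample_rate must be a multiple of fps for exact alignment")
    step = sample_rate // fps  # audio samples covered by one video frame
    for start in range(0, len(samples), step):
        yield samples[start:start + step]
```

With the assumed 16000 Hz audio and 25 FPS video, each frame covers 640 samples, or 40 ms of speech, so the ~100-500ms latency budget noted above corresponds to roughly 2 to 12 frames of buffering.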

Remaining gaps

  1. No actual inference test logged (model installed but not tested)
  2. Best reference photo not selected from 14 candidates
  3. GPU K8s node registration for .106 needs verification (nvidia.com/gpu resource)
  4. F5-TTS output format compatibility with SoulX-FlashHead input TBD
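
For gap 3, one way to check GPU registration is to read the node's allocatable resources from the JSON that `kubectl get node <node-name> -o json` prints. The helper below is a minimal sketch; the node name is left as a placeholder because this log does not record it.

```python
import json


def allocatable_gpus(node: dict) -> int:
    """Return the nvidia.com/gpu count from a Kubernetes Node object, or 0 if absent."""
    allocatable = node.get("status", {}).get("allocatable", {})
    # Kubernetes reports resource quantities as strings, e.g. "1".
    return int(allocatable.get("nvidia.com/gpu", "0"))


def allocatable_gpus_from_json(text: str) -> int:
    """Same check, starting from the raw JSON text printed by kubectl."""
    return allocatable_gpus(json.loads(text))
```

A result of 0 would suggest that the NVIDIA device plugin is not running on the node or that the node has not registered its GPU, in which case the Deployment's GPU request could not be scheduled.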

Next phase: Implementation

Phase 2 should focus on:

  1. Test inference on .106 (SSH or kubectl exec)
  2. Select best reference photo
  3. Verify audio format compatibility
  4. Deploy minimal test instance
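
For step 3, the WAV header of an F5-TTS output file can be inspected with the standard library before anything is sent to the model. This is a minimal sketch: the expected values (16000 Hz, mono, 16-bit) are placeholders, since the input format SoulX-FlashHead requires is still TBD in this log.

```python
import wave
from dataclasses import dataclass


@dataclass
class WavInfo:
    sample_rate: int
    channels: int
    sample_width_bytes: int
    duration_s: float


def inspect_wav(path: str) -> WavInfo:
    """Read header fields from a PCM WAV file (wave raises wave.Error on non-PCM data)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return WavInfo(rate, wf.getnchannels(), wf.getsampwidth(),
                       wf.getnframes() / rate)


def mismatches(info: WavInfo, expected_rate: int = 16000,
               expected_channels: int = 1, expected_width: int = 2) -> list[str]:
    """List differences from the expected format. Defaults are ASSUMED placeholders."""
    problems = []
    if info.sample_rate != expected_rate:
        problems.append(f"sample rate {info.sample_rate} Hz, expected {expected_rate} Hz")
    if info.channels != expected_channels:
        problems.append(f"{info.channels} channel(s), expected {expected_channels}")
    if info.sample_width_bytes != expected_width:
        problems.append(f"{info.sample_width_bytes * 8}-bit samples, expected {expected_width * 8}-bit")
    return problems
```

An empty list from mismatches() means the file matches the placeholder format. Any entries indicate a resampling or conversion step that would be needed between F5-TTS and the avatar model.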