Session 1 — 2026-06-05
Task ID
t_b3774ddc — synthetic-avatar: Build Mick talking-head avatar pipeline
Board: voice-pipeline (default board)
Status: ready → research complete, next phase TBD
What was done
1. Kanban setup
- Created task card t_b3774ddc on the default board with body describing the project scope: “Add realistic animated avatar of Mick (Mick’s voice clone) to F5-TTS output. Audio in → video out.”
- Model selection was noted as TBD; evaluating SadTalker, LivePortrait, Wav2Lip, DiffTalk, SyncTalk
2. Asset audit
Confirmed existing assets:
| Asset | Location | Status |
|---|---|---|
| Reference photos | /opt/data/creative/faces/png/ (14 PNG files, IMG_7985–IMG_7999) | ✅ Verified |
| Dockerfile | deployment/Dockerfile | ✅ Built for SoulX-FlashHead + CUDA 12.4 |
| Inference server | deployment/inference_server.py | ✅ FastAPI, POST /animate (audio + ref image → webm) |
| K8s manifests | deployment/k8s.yaml | ✅ Deployment (GPU nodeSelector), Service, Ingress flashhead.hermes.paralla.org |
3. Model research
Evaluated talking-head models available as of June 2026 for inference on RTX 4090 (.106):
| Model | License | FPS (RTX 4090) | Lip-sync | Streaming? | Verdict |
|---|---|---|---|---|---|
| SoulX-FlashHead | Apache 2.0 ✅ | 96 FPS (Lite) | Excellent, zero drift | ✅ Frame-by-frame via Temporal Audio Context Cache | Selected; already chosen in prior sessions, and all existing assets match this choice |
| LivePortrait | Apache 2.0 ✅ | ~30-60 (image→anim) | Not audio-driven natively | ❌ Batch only | Good for general animation, but not focused on audio-driven lip-sync |
| Wav2Lip-HD | NC / CC-BY-NC-SA ⚠️ | ~30 (lips only) | Very good, lips only | ❌ Batch | Non-commercial license — blocker; limited to mouth region |
| DiffTalk | Unclear ⚠️ | Unknown | Good (per paper) | ❌ Unclear | Insufficient data, no production path |
| SyncTalk | Research / unclear ⚠️ | Unknown | Moderate | ❌ Batch only | Not production-ready |
Decision: Keep SoulX-FlashHead. Its 96 FPS (Lite) on the RTX 4090 is fast enough for a real-time streaming avatar, and all deployment artifacts already reference it.
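To make the real-time claim concrete, here is a rough headroom calculation. The 96 FPS figure comes from the table above; the 25 FPS playback rate is an assumption, since these notes do not state the target video frame rate.

```python
def realtime_headroom(generation_fps: float, playback_fps: float) -> float:
    """Return how many times faster than playback the model generates frames."""
    if generation_fps <= 0 or playback_fps <= 0:
        raise ValueError("frame rates must be positive")
    return generation_fps / playback_fps


def frame_budget_ms(fps: float) -> float:
    """Return the time available per frame, in milliseconds, at a given rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


GENERATION_FPS = 96.0  # from the model table (SoulX-FlashHead Lite, RTX 4090)
PLAYBACK_FPS = 25.0    # ASSUMPTION: target playback rate, not stated in these notes

print(f"headroom: {realtime_headroom(GENERATION_FPS, PLAYBACK_FPS):.2f}x real-time")
print(f"generation time per frame: {frame_budget_ms(GENERATION_FPS):.2f} ms")
print(f"playback budget per frame: {frame_budget_ms(PLAYBACK_FPS):.2f} ms")
```

Under that assumption the model produces frames about 3.84 times faster than they are consumed. This covers model compute only, not network transfer or video encoding.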
4. Integration design notes
Wrote research.md with pipeline architecture:
Text → F5-TTS (.193) → Audio WAV → SoulX-FlashHead (.106, RTX 4090) → WebM Video Stream
                                          ↑
                Reference Image (from /opt/data/creative/faces/png/)
Two modes available:
- Batch: full clip generation (~2-5× real-time), simple request/response — start here for POC
- Streaming (frame-by-frame): ~100-500ms latency, requires Temporal Audio Context Cache — iterate after batch works
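For the batch POC, a client only needs to upload one audio file and one reference image to POST /animate and read back the WebM. Below is a minimal standard-library sketch that builds such a request without sending it. The form field names "audio" and "image", the file names, and the base URL are hypothetical; the real names depend on how deployment/inference_server.py declares its parameters.

```python
import uuid
import urllib.request


def build_multipart(files: dict[str, tuple[str, bytes, str]]) -> tuple[bytes, str]:
    """Encode files as multipart/form-data.

    `files` maps form field name -> (filename, payload, MIME type).
    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for field, (filename, payload, mime) in files.items():
        chunks.append(f"--{boundary}\r\n".encode())
        chunks.append(
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'.encode()
        )
        chunks.append(f"Content-Type: {mime}\r\n\r\n".encode())
        chunks.append(payload)
        chunks.append(b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def build_animate_request(base_url: str, wav: bytes, image: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST /animate request."""
    # ASSUMPTION: the field names "audio" and "image" are hypothetical.
    body, content_type = build_multipart({
        "audio": ("speech.wav", wav, "audio/wav"),
        "image": ("reference.png", image, "image/png"),
    })
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/animate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

To send it, pass the request to `urllib.request.urlopen(req, timeout=...)` and write `response.read()` to a `.webm` file. A generous timeout is sensible because batch mode returns only after the full clip is generated.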
Remaining gaps
- No actual inference test logged (model installed but not tested)
- Best reference photo not selected from 14 candidates
- GPU K8s node registration for .106 needs verification (nvidia.com/gpu resource)
- F5-TTS output format compatibility with SoulX-FlashHead input TBD
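The GPU registration gap can be checked by reading the node's allocatable resources from `kubectl get node <name> -o json`. The sketch below parses that JSON and returns the nvidia.com/gpu count. The sample document is hand-written to show the shape, and the node name in it is hypothetical; it is not output from this cluster.

```python
import json


def allocatable_gpus(node: dict) -> int:
    """Return the number of allocatable nvidia.com/gpu resources on a Node object.

    `node` is the parsed JSON of a Kubernetes Node. Returns 0 when the
    resource is absent, which usually means the NVIDIA device plugin has
    not registered the GPU with the kubelet.
    """
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get("nvidia.com/gpu", "0"))


# Hand-written sample in the shape kubectl emits; not real cluster output.
sample = json.loads("""
{
  "metadata": {"name": "gpu-node-example"},
  "status": {"allocatable": {"cpu": "16", "nvidia.com/gpu": "1"}}
}
""")

print(allocatable_gpus(sample))
```

A result of 0 for the .106 node would mean the Deployment's GPU request cannot be scheduled there, regardless of the nodeSelector.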
Next phase: Implementation
Phase 2 should focus on:
- Test inference on .106 (SSH or kubectl exec)
- Select best reference photo
- Verify audio format compatibility
- Deploy minimal test instance
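For the audio format check, one starting point is to read the WAV header of an F5-TTS output file and compare it with whatever SoulX-FlashHead turns out to expect. The sketch below uses Python's standard-library `wave` module. The 24 kHz source rate and the 16 kHz mono target are placeholders, because neither model's actual format is recorded in these notes.

```python
import io
import wave
from typing import NamedTuple


class WavFormat(NamedTuple):
    sample_rate: int
    channels: int
    sample_width_bytes: int
    duration_s: float


def inspect_wav(data: bytes) -> WavFormat:
    """Read the header of a PCM WAV file held in memory."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        return WavFormat(rate, wav.getnchannels(), wav.getsampwidth(),
                         wav.getnframes() / rate)


def format_mismatches(actual: WavFormat, want_rate: int, want_channels: int) -> list[str]:
    """List human-readable differences between the actual and wanted formats."""
    problems = []
    if actual.sample_rate != want_rate:
        problems.append(f"sample rate {actual.sample_rate} Hz, expected {want_rate} Hz")
    if actual.channels != want_channels:
        problems.append(f"{actual.channels} channel(s), expected {want_channels}")
    return problems


def silent_wav(rate: int, channels: int, seconds: float) -> bytes:
    """Generate a silent 16-bit PCM WAV, standing in for real TTS output."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    return buf.getvalue()


# ASSUMPTION: the 24 kHz source and the 16 kHz mono target are placeholders.
fmt = inspect_wav(silent_wav(rate=24000, channels=1, seconds=1.0))
print(fmt)
print(format_mismatches(fmt, want_rate=16000, want_channels=1))
```

The `wave` module reads only PCM-encoded WAV and raises `wave.Error` on other encodings such as IEEE float, so an exception here is itself a useful compatibility signal.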