Session 1 — 2026-06-05

Task ID

t_b3774ddc — synthetic-avatar: Build Mick talking-head avatar pipeline

Board: voice-pipeline (default board)
Status: ready → research complete, next phase TBD

What was done

1. Kanban setup

Created task card t_b3774ddc on the default board, with a body describing the project scope: “Add realistic animated avatar of Mick (Mick’s voice clone) to F5-TTS output. Audio in → video out.”
Model selection was noted on the card as TBD, with SadTalker, LivePortrait, Wav2Lip, DiffTalk, and SyncTalk listed as candidates to evaluate.

2. Asset audit

Confirmed existing assets:

| Asset | Location | Status |
| --- | --- | --- |
| Reference photos | `/opt/data/creative/faces/png/` (14 PNG files, IMG_7985–IMG_7999) | ✅ Verified |
| Dockerfile | `deployment/Dockerfile` | ✅ Built for SoulX-FlashHead + CUDA 12.4 |
| Inference server | `deployment/inference_server.py` | ✅ FastAPI, POST /animate (audio + ref image → webm) |
| K8s manifests | `deployment/k8s.yaml` | ✅ Deployment (GPU nodeSelector), Service, Ingress flashhead.hermes.paralla.org |
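
For reference, a batch call to the inference server could look roughly like the sketch below. It uses only the Python standard library. The endpoint path and host come from the table above; the `https` scheme, the form field names `audio` and `ref_image`, the file names, and the MIME types are assumptions and should be checked against `deployment/inference_server.py` before use.

```python
import uuid
from urllib import request

# Host and path are taken from the asset table; the https scheme is an assumption.
ANIMATE_URL = "https://flashhead.hermes.paralla.org/animate"


def build_multipart(files: dict[str, tuple[str, bytes, str]]) -> tuple[bytes, str]:
    """Encode {field: (filename, data, mime)} as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for field, (filename, data, mime) in files.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
            f"Content-Type: {mime}\r\n\r\n"
        ).encode()
        parts.append(header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def animate(audio_wav: bytes, ref_png: bytes, url: str = ANIMATE_URL) -> bytes:
    """POST audio plus a reference image; return the raw response body.

    The response is expected to be WebM video per the notes above.
    """
    body, content_type = build_multipart({
        "audio": ("speech.wav", audio_wav, "audio/wav"),   # hypothetical field name
        "ref_image": ("ref.png", ref_png, "image/png"),    # hypothetical field name
    })
    req = request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with request.urlopen(req, timeout=300) as resp:
        return resp.read()
```

Keeping the multipart encoding in its own function means the request shape can be inspected or unit-tested without a running server.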

3. Model research

Evaluated talking-head models available as of June 2026 for inference on the RTX 4090 host (.106):

| Model | License | FPS (RTX 4090) | Lip-sync | Streaming? | Verdict |
| --- | --- | --- | --- | --- | --- |
| SoulX-FlashHead | Apache 2.0 ✅ | 96 FPS (Lite) | Excellent, zero drift | ✅ Frame-by-frame via Temporal Audio Context Cache | Selected (chosen in prior sessions; all existing assets match this choice) |
| LivePortrait | Apache 2.0 ✅ | ~30-60 (image→anim) | Not audio-driven natively | ❌ Batch only | Good for general animation, but not focused on audio-driven lip-sync |
| Wav2Lip-HD | NC / CC-BY-NC-SA ⚠️ | ~30 (lips only) | Very good, lips only | ❌ Batch | Blocked by non-commercial license; limited to mouth region |
| DiffTalk | Unclear ⚠️ | Unknown | Good (per paper) | ❌ Unclear | Insufficient data, no production path |
| SyncTalk | Research / unclear ⚠️ | Unknown | Moderate | ❌ Batch only | Not production-ready |

Decision: Keep SoulX-FlashHead. Its 96 FPS (Lite) on the RTX 4090 leaves headroom for a real-time streaming avatar, and all deployment artifacts already reference it.
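
As a rough sanity check on the real-time claim, the headroom can be worked out from the frame rates alone. The sketch below is only arithmetic: the 96 FPS figure is from the comparison above, while the 25 fps playback rate is an assumption (the output frame rate is not recorded in these notes). It ignores audio feature extraction, video encoding, and network time.

```python
def realtime_factor(generation_fps: float, playback_fps: float) -> float:
    """Seconds of video generated per second of compute."""
    if generation_fps <= 0 or playback_fps <= 0:
        raise ValueError("frame rates must be positive")
    return generation_fps / playback_fps


def frame_time_ms(fps: float) -> float:
    """Milliseconds per frame at the given frame rate."""
    if fps <= 0:
        raise ValueError("frame rate must be positive")
    return 1000.0 / fps


GENERATION_FPS = 96.0  # SoulX-FlashHead Lite on RTX 4090, per the comparison
PLAYBACK_FPS = 25.0    # assumption: target output frame rate is not stated

factor = realtime_factor(GENERATION_FPS, PLAYBACK_FPS)
spent_ms = frame_time_ms(GENERATION_FPS)   # time to generate one frame
budget_ms = frame_time_ms(PLAYBACK_FPS)    # time available per displayed frame
print(f"real-time factor: {factor:.2f}x")
print(f"per-frame: {spent_ms:.1f} ms spent of {budget_ms:.1f} ms budget")
```

Under these assumptions the factor is 3.84, with about 10.4 ms spent per frame against a 40 ms budget. A different playback rate changes the margin proportionally.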

4. Integration design notes

Wrote research.md with pipeline architecture:

Text → F5-TTS (.193) → Audio WAV → SoulX-FlashHead (.106, RTX 4090) → WebM Video Stream
                                          ↑
                                Reference Image (from /opt/data/creative/faces/png/)

Two modes available:

Batch: full clip generation (~2-5× real-time), simple request/response — start here for POC
Streaming (frame-by-frame): ~100-500ms latency, requires Temporal Audio Context Cache — iterate after batch works

Remaining gaps

No actual inference test logged (model installed but not tested)
Best reference photo not selected from 14 candidates
GPU K8s node registration for .106 needs verification (nvidia.com/gpu resource)
F5-TTS output format compatibility with SoulX-FlashHead input TBD
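
The GPU registration gap could be checked with a small script along these lines. It shells out to `kubectl get node <name> -o json` and reads the `nvidia.com/gpu` entry under `status.allocatable`. The node name passed in is hypothetical, since the Kubernetes node name for .106 is not recorded here; this is a sketch, not a tested procedure.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_count(node: dict, resource: str = GPU_RESOURCE) -> int:
    """Return the allocatable GPU count from a Kubernetes Node object.

    Returns 0 when the resource is absent, which typically means the
    NVIDIA device plugin has not registered GPUs on that node.
    """
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(resource, "0"))


def fetch_node(name: str) -> dict:
    """Fetch a Node object as a dict via kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def check_gpu_node(name: str) -> bool:
    """True if the named node advertises at least one allocatable GPU."""
    return gpu_count(fetch_node(name)) >= 1
```

For example, `check_gpu_node("gpu-node-106")` (a placeholder name) would return `False` if the node exists but advertises no `nvidia.com/gpu` resource.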

Next phase: Implementation

Phase 2 should focus on:

Test inference on .106 (SSH or kubectl exec)
Select best reference photo
Verify audio format compatibility
Deploy minimal test instance
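
For the audio compatibility step, one possible starting point is to inspect the WAV header of an F5-TTS output file and compare it with whatever SoulX-FlashHead expects. The sketch below uses Python's standard `wave` module. The expected sample rate, channel count, and sample width are parameters because the model's actual input requirements are still TBD in these notes; the defaults shown are placeholders, not confirmed values.

```python
import wave
from dataclasses import dataclass


@dataclass(frozen=True)
class WavInfo:
    sample_rate: int
    channels: int
    sample_width_bytes: int
    duration_s: float


def inspect_wav(path: str) -> WavInfo:
    """Read basic format fields from a PCM WAV file.

    The wave module raises wave.Error for formats it cannot parse,
    for example IEEE float WAV files.
    """
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return WavInfo(
            sample_rate=rate,
            channels=wf.getnchannels(),
            sample_width_bytes=wf.getsampwidth(),
            duration_s=wf.getnframes() / rate,
        )


def format_mismatches(
    info: WavInfo,
    expected_rate: int,
    expected_channels: int = 1,      # placeholder default, not a confirmed requirement
    expected_width_bytes: int = 2,   # placeholder default (16-bit PCM)
) -> list[str]:
    """List human-readable differences between a file and the expected format."""
    problems = []
    if info.sample_rate != expected_rate:
        problems.append(
            f"sample rate {info.sample_rate} Hz, expected {expected_rate} Hz"
        )
    if info.channels != expected_channels:
        problems.append(f"{info.channels} channel(s), expected {expected_channels}")
    if info.sample_width_bytes != expected_width_bytes:
        problems.append(
            f"{info.sample_width_bytes}-byte samples, expected {expected_width_bytes}"
        )
    return problems
```

An empty list from `format_mismatches` means the header fields match the expectation; any entries point to a resampling or conversion step that would be needed between the two services.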
