Multi-model tiering: best model for planning/design/audit, carnice for build, mercury for infra/admin
Design for routing task types to different models/agents, and for serving a model far larger than the 4090’s VRAM (MiniMax-3 first, Kimi-2.6 later) by splitting it across GPU + system RAM + NVMe. The host prep listed below is done; the model serving, the architect agent, and the routing layer are not implemented yet.
Decisions (pvs, 2026-06-19) + prep already actioned
- #1 GPUs: no 2nd GPU, but OK to split across all available GPUs if it helps. Reality: the 4090 is the only capable card; cosmos’s 980+1050Ti and .193’s 1050Ti are 4GB and on other boxes, so multi-GPU would need llama.cpp RPC and buys little for the big MoE. Core strategy stays 4090 (attention) + RAM + NVMe; RPC-offloading a few expert layers to the weak GPUs is a marginal tuning option.
- #2 DONE: .107 (openclaw-k8s-2) drained, removed from the cluster, VM stopped → freed 16GB on host .191. Cluster is now single-node (.190). Added a hermes-ns LimitRange: default request 50m CPU / 64Mi, default limit 500m / 512Mi.
- #3: use the best-performance split (attention-on-GPU / experts-on-RAM+NVMe).
- #4: latest MiniMax-3 (not M2 / M2.7) at the highest quant that fits (best quality, speed irrelevant) — likely Q6_K/Q8 once disk/RAM allow.
- #5 DONE: .106 RAM 12 → 40GB and disk 200 → 800GB (fs auto-grown to 788G, 658G free) — ready to stage MiniMax-3 now and Kimi-2.6 later. jarvis/carnice reloaded fine after the reboot (23GB VRAM, healthy).
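For reference, the LimitRange from decision #2 could be expressed roughly as below. This is a minimal sketch that emits a JSON manifest; the namespace name (hermes) and the object name (hermes-defaults) are assumptions, and only the four resource values come from the decision above.

```python
import json


def limit_range_manifest(namespace: str, name: str) -> dict:
    """Build a core/v1 LimitRange that sets per-container defaults."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Applied when a container declares no requests.
                    "defaultRequest": {"cpu": "50m", "memory": "64Mi"},
                    # Applied when a container declares no limits.
                    "default": {"cpu": "500m", "memory": "512Mi"},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Namespace and object name are hypothetical placeholders.
    print(json.dumps(limit_range_manifest("hermes", "hermes-defaults"), indent=2))
```

The printed JSON can be piped to kubectl apply -f -, since kubectl accepts JSON manifests as well as YAML.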
Still NOT implemented: the model serving itself (download MiniMax-3, llama-server with MoE offload, the architect agent, the routing layer) — design below.
0. The hard constraint that shapes the whole design
There is one capable GPU, the RTX 4090 (24GB) on .106, and it is already full (carnice-v2-27b-Q5 uses ~23.5/24GB). When this design was drafted the box had only 12GB RAM and a 200GB SSD (now 40GB and 800GB, see decision #5). cosmos (.189) has 1.8TB of disk but only weak 4GB GPUs. So the planner (big MoE) and the builder (carnice-27B) cannot both occupy the 4090 at once; they must time-share it. The operator tier (mercury) lives on cosmos and never touches the 4090, so it runs in parallel.
1. The three tiers (task-type → model → agent)
| Tier | Task types | Model | Where | Agent |
|---|---|---|---|---|
| Architect | planning, design, audit, review | MiniMax-3 (→ Kimi-2.6 later) | 4090 (.106) + RAM + NVMe | new hermes-architect agent |
| Builder | implementation / coding | carnice-v2-27b-Q5 (as now) | 4090 (.106) | hermes (current) |
| Operator | infra-in-kubes, email, admin | carnice-9b / Gemma | cosmos (.189) | mercury |
Standing rule (effective immediately in routing): any infra work inside
Kubernetes goes to mercury, never hermes. Encoded as a routing rule on
task-type infra.
Architect and Builder time-share the 4090; Operator (mercury) runs concurrently on cosmos. This matches the natural workflow: audit/plan (Architect) → implement (Builder) are sequential phases, so swapping the 4090 at phase boundaries is cheap relative to the work done in each phase.
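The tier table and the concurrency rule above can be sketched as plain data plus two small functions. This is an illustration only: the Tier class, the uses_4090 flag, and the function names are assumptions made for the example, not existing control-plane code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    model: str
    host: str
    agent: str
    uses_4090: bool  # True if the tier's model must occupy the 4090


TIERS = {
    "architect": Tier("Architect", "MiniMax-3", ".106", "hermes-architect", True),
    "builder": Tier("Builder", "carnice-v2-27b-Q5", ".106", "hermes", True),
    "operator": Tier("Operator", "carnice-9b / Gemma", "cosmos (.189)", "mercury", False),
}

TASK_TYPE_TO_TIER = {
    "planning": "architect",
    "design": "architect",
    "audit": "architect",
    "review": "architect",
    "implementation": "builder",
    "infra": "operator",  # standing rule: infra in Kubernetes goes to mercury
    "email": "operator",
    "admin": "operator",
}


def route(task_type: str) -> Tier:
    """Return the tier that handles a task type, or raise on unknown types."""
    try:
        return TIERS[TASK_TYPE_TO_TIER[task_type]]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None


def can_run_concurrently(a: Tier, b: Tier) -> bool:
    """Two tiers conflict only if they are different tiers that both need the 4090."""
    return not (a.uses_4090 and b.uses_4090 and a.name != b.name)
```

With this shape, route("infra").agent yields mercury, and an audit task and an implementation task are reported as non-concurrent because both need the 4090.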
2. Agent architecture — task-based routing across peer agents (researched)
Best practice for 2026 multi-model systems is task-based routing: send each task
to a domain-specialised agent, rather than one agent calling sub-agents as tools.
We already have peer agents (hermes, mercury, calliope) each with its own model
endpoint — so the clean fit is to add one more peer agent (hermes-architect)
and put a router in the queue/control-plane that dispatches by task type. This
beats the “supervisor + subagents-as-tools” pattern here because:
- each tier needs a different model endpoint (independent serving, independent GPU/RAM budgets) — peer agents give that natively;
- context isolation between planning and implementation is desirable;
- it reuses the existing per-agent deployment + --source + config pattern.
(Subagents-as-tools would be right if one model spawned helpers sharing its context; that’s not our case.) Sources: Augment “AI model routing guide”, LangChain “choosing a multi-agent architecture”, Oreate “subagents vs multi-agent”.
3. Control-plane feature: choose the model per task type
Extends the existing tasks.paralla.org control plane.
- Task type field (planning | design | audit | review | implementation | infra | email | admin), inferred on creation and editable on the card.
- Settings → “Model routing” page: a table mapping each task-type → tier → agent/model, fully editable (this is the “select which model works on which tasks” feature). Defaults = the table in §1.
- Per-task override: a dropdown on a task to force a specific tier/model.
- Queue dispatcher: when a task becomes active, route it to the mapped agent’s API (hermes-architect, hermes, or mercury). For 4090-contended tiers, the dispatcher coordinates with the 4090 model-manager (§4.3).
- Stored in wiki/control-plane/model-routing.json; served via /api/routing.
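One possible shape for the routing file and the per-task override is sketched below. The JSON layout (a top-level routes object) and the override_agent field name are assumptions made for illustration; only the file path, the task types, and the three agent names come from this design.

```python
import json

KNOWN_AGENTS = {"hermes-architect", "hermes", "mercury"}

# Defaults mirror the tier table in section 1.
DEFAULT_ROUTES = {
    "planning": "hermes-architect",
    "design": "hermes-architect",
    "audit": "hermes-architect",
    "review": "hermes-architect",
    "implementation": "hermes",
    "infra": "mercury",
    "email": "mercury",
    "admin": "mercury",
}


def parse_routing(text: str) -> dict:
    """Parse the routing file body and reject agents the dispatcher cannot reach."""
    routes = json.loads(text)["routes"]
    bad = {t: a for t, a in routes.items() if a not in KNOWN_AGENTS}
    if bad:
        raise ValueError(f"unknown agents in routing table: {bad}")
    return routes


def resolve_agent(task: dict, routes: dict) -> str:
    """Pick the agent for a task; a per-task override wins over the table."""
    forced = task.get("override_agent")
    if forced is not None:
        if forced not in KNOWN_AGENTS:
            raise ValueError(f"unknown override agent: {forced!r}")
        return forced
    return routes[task["type"]]


if __name__ == "__main__":
    body = json.dumps({"routes": DEFAULT_ROUTES}, indent=2)
    routes = parse_routing(body)
    print(resolve_agent({"type": "infra"}, routes))
    print(resolve_agent({"type": "audit", "override_agent": "hermes"}, routes))
```

Validating agent names at parse time means a typo on the settings page fails loudly instead of stranding tasks in the queue.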
4. Serving MiniMax-3 on the 4090 (Phase 1) — GPU + RAM + NVMe split
Sizing assumes a MiniMax-M2-class model (the exact MiniMax-3 release is still to be confirmed, see §7): ~229B total, ~10B active, 256 experts (8 active/token). At Q4_K it is ~115GB on disk; with experts offloaded it needs only ~16GB VRAM + lots of RAM/NVMe for the expert weights.
4.1 Correct the layer placement (important)
The optimal split is the opposite of “MoE layers on GPU”:
- On GPU (4090): the attention + shared/dense layers + KV cache. They’re small, compute-heavy, and hit every token → huge benefit from the GPU. (~16-20GB)
- In system RAM (page cache): the MoE expert FFN weights — they’re the bulk of the params but sparse (only 8/256 experts run per token), so they tolerate being off-GPU. More RAM = more hot experts cached = faster.
- On NVMe (mmap): the full 115GB weight file lives on the SSD and is mmap’d; cold experts page in from disk on demand (the slow part — acceptable per “even if really slow”).
llama.cpp does exactly this via --n-cpu-moe N (keep N MoE layers on CPU) or
--override-tensor "\.ffn_.*_exps\.weight=CPU" (all experts to CPU), with the
model mmap’d from disk. Source: llama.cpp MoE-offload guide (Doctor-Shotgun),
--n-cpu-moe docs.
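As a rough illustration of the two offload variants, the launch command could be assembled as below. This is a sketch under stated assumptions: the model path, context size, and port are placeholders, and the expert-tensor regex is the one quoted above. No flag is needed for mmap because it is llama.cpp's default loading mode.

```python
import shlex
from typing import List, Optional

# Regex quoted in this section: matches the MoE expert FFN weight tensors.
EXPERT_TENSORS_TO_CPU = r"\.ffn_.*_exps\.weight=CPU"


def llama_server_argv(
    model_path: str,
    n_cpu_moe: Optional[int] = None,
    ctx_size: int = 8192,  # placeholder value, tune to the available VRAM
    port: int = 8080,
) -> List[str]:
    """Build a llama-server command line with MoE experts kept off the GPU."""
    argv = [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", "999",  # offload every layer the GPU is allowed to hold
        "--ctx-size", str(ctx_size),
        "--host", "127.0.0.1",
        "--port", str(port),
    ]
    if n_cpu_moe is None:
        # Variant A: send all expert tensors to CPU memory.
        argv += ["--override-tensor", EXPERT_TENSORS_TO_CPU]
    else:
        # Variant B: keep the experts of the first N layers on CPU.
        argv += ["--n-cpu-moe", str(n_cpu_moe)]
    return argv


if __name__ == "__main__":
    # Hypothetical path; the real GGUF filename is not decided in this doc.
    print(shlex.join(llama_server_argv("/models/minimax-3-Q4_K_M.gguf")))
```

Starting with all experts on CPU and then lowering the CPU share via --n-cpu-moe until VRAM is nearly full is one plausible way to tune the split.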
4.2 Fitting it on .106 — RAM/disk rebalance
- Disk: the original 200GB SSD holds the ~115GB Q4 model (fits; ~85GB headroom). ✅ for MiniMax. The disk has since been grown to 800GB (decision #5), so it can stage MiniMax-3 now and Kimi-2.6 later.
- RAM: .106 started at 12GB, far too little to cache experts. Planned rebalance on host .191 (62GB total): gaming-vm (101) is stopped → 32GB free, and reduce openclaw-k8s-2 (.107) 16→8GB. That allows .106 ≈ 12 → ~40GB (leaving agents-VM 4GB + k8s-2 8GB + ~8GB host). As actioned (decisions #2 and #5), .107 was removed outright rather than shrunk, and .106 now has 40GB. With ~40GB RAM and ~16GB on GPU, the page cache holds a large fraction of the hot experts; the remainder pages from the SSD. (More RAM later → fewer SSD page-ins → faster.)
- Verify before building: exact host headroom. The second check, that reducing k8s-2 to 8GB is safe for its pods, no longer applies now that .107 has been removed.
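The planned rebalance can be sanity-checked with quick arithmetic. The sketch below uses only the figures quoted in the bullets above (the plan as drafted, with k8s-2 shrunk to 8GB); the function names are made up for the example.

```python
HOST_RAM_GB = 62  # host .191 total

PLANNED_ALLOCATION_GB = {
    ".106 (4090 VM)": 40,
    "agents-VM": 4,
    "openclaw-k8s-2 (.107)": 8,
    "host reserve": 8,
}

ORIGINAL_SSD_GB = 200
MODEL_Q4_GB = 115


def ram_slack_gb(total_gb: int, allocation_gb: dict) -> int:
    """RAM left on the host after every planned allocation."""
    return total_gb - sum(allocation_gb.values())


def disk_headroom_gb(disk_gb: int, model_gb: int) -> int:
    """Disk space left after staging the model file."""
    return disk_gb - model_gb


if __name__ == "__main__":
    print("RAM slack (GB):", ram_slack_gb(HOST_RAM_GB, PLANNED_ALLOCATION_GB))
    print("Disk headroom (GB):", disk_headroom_gb(ORIGINAL_SSD_GB, MODEL_Q4_GB))
```

The plan as drafted leaves about 2GB of host RAM unassigned, which is why the exact host headroom is worth verifying before building.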
4.3 The 4090 time-share (“model manager”)
A small model-manager service on .106 owns the 4090 and runs one big model at a time:
- Watches the queue’s active tier. If a Tier-1 (Architect) task is up and carnice is loaded → stop carnice, start MiniMax (llama-server with the MoE offload flags). If a Tier-2 (Builder) task is up → reverse.
- Swaps happen at phase boundaries (audit/plan → implement), so they’re rare; load cost (unload 23GB / load 16GB GPU + warm mmap) is amortised over a batch.
- Exposed as a “4090 mode: Architect / Builder” control in the dashboard (auto by default, manual override available), reusing the loop-control pattern.
- Alternative considered: co-resident (carnice-Q4 ~14GB + MiniMax-attention ~8GB ≈ 22GB) to avoid swaps — rejected for v1: too tight with KV caches and degrades carnice quality. Revisit if a 2nd GPU appears.
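The swap logic described above amounts to a small reconcile loop. The sketch below models only the decision, not the process control: the ModelManager class, the model labels, and the (action, model) tuples are hypothetical, and a real service would translate the returned actions into stopping and starting the serving processes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Which model each 4090-contended tier needs; the Operator tier is absent
# because mercury runs on cosmos and never touches the 4090.
MODEL_FOR_TIER = {
    "architect": "minimax-3",
    "builder": "carnice-v2-27b-Q5",
}


@dataclass
class ModelManager:
    loaded: Optional[str] = "carnice-v2-27b-Q5"  # carnice is resident today
    manual_mode: Optional[str] = None  # "architect" / "builder" forces a mode; None = auto

    def target_model(self, active_tier: str) -> Optional[str]:
        """Model the 4090 should hold; None means this tier does not need it."""
        tier = self.manual_mode or active_tier
        return MODEL_FOR_TIER.get(tier)

    def reconcile(self, active_tier: str) -> List[Tuple[str, str]]:
        """Return the stop/start actions needed, updating the recorded state."""
        target = self.target_model(active_tier)
        if target is None or target == self.loaded:
            return []  # nothing to swap
        actions = []
        if self.loaded is not None:
            actions.append(("stop", self.loaded))
        actions.append(("start", target))
        self.loaded = target
        return actions
```

Because reconcile returns an empty list when the right model is already resident, a batch of Architect tasks triggers one swap rather than one per task.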
4.4 Expected performance
With experts mostly on CPU/NVMe, MiniMax-3 will run at a few tokens/sec (single digits), dominated by SSD/RAM expert reads. That’s fine for the Architect tier (planning/design/audit are infrequent, quality > speed — explicitly the goal).
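A back-of-envelope model shows why expert reads dominate. The sketch below is an estimate under loud assumptions: it treats all ~10B active parameters as read from RAM or NVMe for every token (an upper bound, since attention and shared layers sit on the GPU), and the bandwidth and hit-rate arguments are placeholders to be replaced with measured values, not benchmarks of this hardware.

```python
def read_per_token_gb(active_params_b: float, file_gb: float, total_params_b: float) -> float:
    """Approximate GB of weights touched per token at the file's average bytes/param."""
    return active_params_b * file_gb / total_params_b


def tokens_per_sec(read_gb: float, hit_rate: float, ram_gbps: float, nvme_gbps: float) -> float:
    """Throughput if reads are the only cost.

    hit_rate is the fraction of expert reads served from the RAM page cache;
    the rest come from NVMe.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    seconds = hit_rate * read_gb / ram_gbps + (1.0 - hit_rate) * read_gb / nvme_gbps
    return 1.0 / seconds


if __name__ == "__main__":
    # Model figures from section 4: ~10B active of ~229B total, ~115GB at Q4_K.
    read_gb = read_per_token_gb(10, 115, 229)
    # Placeholder bandwidths (GB/s); measure on .106 before trusting any result.
    for hit in (0.5, 0.9, 0.99):
        print(hit, round(tokens_per_sec(read_gb, hit, ram_gbps=40.0, nvme_gbps=3.0), 2))
```

The useful takeaway is the shape rather than the numbers: throughput is very sensitive to the page-cache hit rate, which is why more RAM translates directly into speed.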
5. Kimi-2.6 (Phase 2, deferred)
Kimi K2.6 = ~1T total / 32B active / 384 experts, ~340GB even at 2-bit. It would not fit on .106’s original 200GB SSD and needs far more RAM page cache than MiniMax-3. Original path: stage the weights on cosmos’s 1.8TB SSD and either (a) attach that disk to .106, or (b) serve Kimi from cosmos over the network with the 4090 doing attention (complex). With .106’s disk now at 800GB (658G free, decision #5), staging the file locally is also an option. Treat as a later phase once MiniMax-3 is proven; same offload technique, bigger everything.
6. Rollout phases
- Routing layer (no new model): add task type, the Model-routing settings page, per-task override, and the dispatcher; wire infra→mercury and keep Architect→hermes (carnice) as a temporary stand-in. Ships value immediately.
- MiniMax-3 serving: RAM/disk rebalance on .191; download Q4 GGUF to .106; stand up llama-server with --n-cpu-moe / --override-tensor; the model-manager + 4090 mode toggle; the hermes-architect agent pointed at it.
- Flip Architect tasks to MiniMax-3, measure quality/throughput, tune --n-cpu-moe and RAM.
- Kimi-2.6 once disk/RAM allow.
7. Status of the open questions
- ✅ 2nd GPU / k8s-2 / split / best-perf / disk — all resolved & prepped above.
- ⚠️ Block-volume redundancy: removing .107 dropped hermes-data-block to a single replica (.190 only); Longhorn can’t rebuild a 2nd replica with one node. Fallback = the stale RWX hermes-data backup. Re-add redundancy when a 2nd storage node returns, or accept single-replica + periodic backups.
- ⛏ Remaining for the MiniMax-3 serving build (the second rollout phase in §6): confirm the exact latest MiniMax-3 release and the top quant that fits ~40GB RAM + 24GB VRAM + 788G disk (start Q4_K_M to validate, then push to Q6/Q8 for quality); decide RPC multi-GPU (worth wiring the weak GPUs?); and the 4090 time-share trigger (auto-by-queue vs manual mode).