Multi-model tiering: best model for planning/design/audit, carnice for build, mercury for infra/admin

Design for routing task types to different models/agents, and for serving a model far larger than the 4090’s VRAM (MiniMax-3 first, Kimi-2.6 later) by splitting it across GPU + system RAM + NVMe. Nothing here is implemented yet.

Decisions (pvs, 2026-06-19) + prep already actioned

#1 GPUs: no 2nd GPU, but OK to split across all available GPUs if it helps. Reality: the 4090 is the only capable card; cosmos’s 980+1050Ti and .193’s 1050Ti are 4GB and on other boxes, so multi-GPU would need llama.cpp RPC and buys little for the big MoE. Core strategy stays 4090 (attention) + RAM + NVMe; RPC-offloading a few expert layers to the weak GPUs is a marginal tuning option.
#2 DONE: .107 (openclaw-k8s-2) drained, removed from the cluster, VM stopped → freed 16GB on host .191. Cluster is now single-node (.190). Added a hermes-ns LimitRange: default request 50m CPU / 64Mi, default limit 500m / 512Mi.
#3: use the best-performance split (attention-on-GPU / experts-on-RAM+NVMe).
#4: latest MiniMax-3 (not M2 / M2.7) at the highest quant that fits (best quality, speed irrelevant) — likely Q6_K/Q8 once disk/RAM allow.
#5 DONE: .106 RAM 12 → 40GB and disk 200 → 800GB (fs auto-grown to 788G, 658G free) — ready to stage MiniMax-3 now and Kimi-2.6 later. jarvis/carnice reloaded fine after the reboot (23GB VRAM, healthy).

Still NOT implemented: the model serving itself (download MiniMax-3, llama-server with MoE offload, the architect agent, the routing layer) — design below.

0. The hard constraint that shapes the whole design

There is one capable GPU, the RTX 4090 (24GB) on .106, and it is already full (carnice-v2-27b-Q5 uses ~23.5/24GB). When this design was drafted the box had only 12GB RAM and a 200GB SSD; decision #5 above has since raised that to 40GB and 800GB. cosmos (.189) has 1.8TB of disk but only weak 4GB GPUs. So the planner (big MoE) and the builder (carnice-27B) cannot both occupy the 4090 at once: they must time-share it. The operator tier (mercury) lives on cosmos and never touches the 4090, so it runs in parallel.

1. The three tiers (task-type → model → agent)

Tier	Task types	Model	Where	Agent
Architect	planning, design, audit, review	MiniMax-3 (→ Kimi-2.6 later)	4090 (.106) + RAM + NVMe	new `hermes-architect` agent
Builder	implementation / coding	carnice-v2-27b-Q5 (as now)	4090 (.106)	hermes (current)
Operator	infra-in-kubes, email, admin	carnice-9b / Gemma	cosmos (.189)	mercury

Standing rule (effective immediately in routing): any infra work inside Kubernetes goes to mercury, never hermes. Encoded as a routing rule on task-type infra.

Architect and Builder time-share the 4090; Operator (mercury) runs concurrently on cosmos. This matches the natural workflow: audit/plan (Architect) → implement (Builder) are sequential phases, so swapping the 4090 at phase boundaries is cheap relative to the work done in each phase.

2. Agent architecture — task-based routing across peer agents (researched)

Best practice for 2026 multi-model systems is task-based routing: send each task to a domain-specialised agent, rather than one agent calling sub-agents as tools. We already have peer agents (hermes, mercury, calliope) each with its own model endpoint — so the clean fit is to add one more peer agent (hermes-architect) and put a router in the queue/control-plane that dispatches by task type. This beats the “supervisor + subagents-as-tools” pattern here because:

each tier needs a different model endpoint (independent serving, independent GPU/RAM budgets) — peer agents give that natively;
context isolation between planning and implementation is desirable;
it reuses the existing per-agent deployment + --source + config pattern.

(Subagents-as-tools would be right if one model spawned helpers sharing its context; that’s not our case.)

Sources: Augment “AI model routing guide”, LangChain “choosing a multi-agent architecture”, Oreate “subagents vs multi-agent”.

3. Control-plane feature: choose the model per task type

Extends the existing tasks.paralla.org control plane.

Task type field (planning | design | audit | review | implementation | infra | email | admin), inferred on creation and editable on the card.
Settings → “Model routing” page: a table mapping each task-type → tier → agent/model, fully editable (this is the “select which model works on which tasks” feature). Defaults = the table in §1.
Per-task override: a dropdown on a task to force a specific tier/model.
Queue dispatcher: when a task becomes active, route it to the mapped agent’s API (hermes-architect, hermes, or mercury). For 4090-contended tiers, the dispatcher coordinates with the 4090 model-manager (§4.3).
Stored in wiki/control-plane/model-routing.json; served via /api/routing.
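
The routing described above can be sketched as a small lookup plus an override check. This is a minimal sketch, not the control plane's actual code: the tier keys (`architect`, `builder`, `operator`), the `Task` shape and the `route` function are hypothetical names, and letting the infra rule win over a per-task override is one assumed reading of the standing rule in §1.

```python
from dataclasses import dataclass
from typing import Optional

# Default task-type -> tier mapping, copied from the tiers table in section 1.
DEFAULT_ROUTING = {
    "planning": "architect",
    "design": "architect",
    "audit": "architect",
    "review": "architect",
    "implementation": "builder",
    "infra": "operator",
    "email": "operator",
    "admin": "operator",
}

# Tier -> agent; the agent names are the ones used in this document.
TIER_AGENTS = {
    "architect": "hermes-architect",
    "builder": "hermes",
    "operator": "mercury",
}

# Tiers that contend for the 4090 and therefore need the model-manager.
GPU_CONTENDED_TIERS = {"architect", "builder"}


@dataclass
class Task:
    title: str
    task_type: str
    tier_override: Optional[str] = None  # per-task dropdown, if set


def route(task: Task, routing: dict = DEFAULT_ROUTING) -> dict:
    """Pick the tier and agent for a task that has just become active."""
    if task.task_type == "infra":
        # Assumed reading of the standing rule: infra never leaves mercury,
        # even if someone set an override on the card.
        tier = "operator"
    elif task.tier_override is not None:
        tier = task.tier_override
    else:
        tier = routing[task.task_type]
    return {
        "tier": tier,
        "agent": TIER_AGENTS[tier],
        "needs_4090": tier in GPU_CONTENDED_TIERS,
    }


if __name__ == "__main__":
    print(route(Task("Audit the storage layout", "audit")))
    print(route(Task("Rotate ingress certs", "infra", tier_override="builder")))
```

The `needs_4090` flag is where the dispatcher would hand off to the model-manager before calling the agent's API; the on-disk shape of model-routing.json is not specified here, so the dictionaries above are only one possible layout.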

4. Serving MiniMax-3 on the 4090 (Phase 1) — GPU + RAM + NVMe split

MiniMax-M2-class: ~229B total, ~10B active, 256 experts (8 active/token). At Q4_K it is ~115GB on disk; with experts offloaded it needs only ~16GB VRAM + lots of RAM/NVMe for the expert weights.
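
The ~115GB figure follows from simple arithmetic: parameters times bits per weight, divided by 8. A minimal sketch, assuming the MiniMax-M2-class count of ~229B also holds for MiniMax-3 (not confirmed), and using nominal bits-per-weight values; real GGUF files mix tensor types and carry scale overhead, so actual sizes will differ.

```python
def gguf_size_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Rough on-disk size in decimal GB: params * bits / 8."""
    return total_params_billions * bits_per_weight / 8


# Assumption: ~229B total parameters, as for the M2-class model above.
PARAMS_B = 229

# Nominal bits per weight. 4.0 is a flat 4-bit estimate (it reproduces the
# ~115GB figure); 6.5625 and 8.5 are the block-level rates of Q6_K and Q8_0.
QUANTS = {"flat 4-bit": 4.0, "Q6_K": 6.5625, "Q8_0": 8.5}

FREE_DISK_GB = 658  # free space on .106 after the disk was grown (decision #5)

if __name__ == "__main__":
    for name, bpw in QUANTS.items():
        size = gguf_size_gb(PARAMS_B, bpw)
        fits = "fits" if size < FREE_DISK_GB else "does not fit"
        print(f"{name}: ~{size:.0f} GB, {fits} on disk")
```

Disk is only the first gate: a larger quant also pushes more expert bytes through RAM and NVMe for every token, which is the trade-off that picking the "highest quant that fits" has to weigh.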

4.1 Correct the layer placement (important)

The optimal split is the opposite of “MoE layers on GPU”:

On GPU (4090): the attention + shared/dense layers + KV cache. They’re small, compute-heavy, and hit every token → huge benefit from the GPU. (~16-20GB)
In system RAM (page cache): the MoE expert FFN weights — they’re the bulk of the params but sparse (only 8/256 experts run per token), so they tolerate being off-GPU. More RAM = more hot experts cached = faster.
On NVMe (mmap): the full 115GB weight file lives on the SSD and is mmap’d; cold experts page in from disk on demand (the slow part — acceptable per “even if really slow”).

llama.cpp does exactly this via --n-cpu-moe N (keep the expert weights of the first N layers on CPU) or --override-tensor "\.ffn_.*_exps\.weight=CPU" (all experts to CPU), with the model mmap’d from disk. Source: llama.cpp MoE-offload guide (Doctor-Shotgun), --n-cpu-moe docs.
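
A launch command using those flags might be assembled as below. This is a sketch under assumptions: the model path, port and context size are placeholders, and `llama_server_cmd` is a hypothetical helper. The flags themselves (`-m`, `--n-gpu-layers`, `--ctx-size`, `--host`, `--port`, `--n-cpu-moe`, `--override-tensor`) are llama-server options; mmap is on by default, so no flag is passed for it.

```python
import shlex
from typing import List, Optional


def llama_server_cmd(
    model_path: str,
    n_cpu_moe: Optional[int] = None,
    port: int = 8080,        # placeholder port
    ctx_size: int = 32768,   # placeholder context length
) -> List[str]:
    """Build a llama-server argv that keeps attention on GPU and experts off it."""
    cmd = [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", "999",  # offload every layer; experts are pulled back below
        "--ctx-size", str(ctx_size),
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    if n_cpu_moe is None:
        # All expert FFN tensors stay on CPU (regex taken from the text above).
        cmd += ["--override-tensor", r"\.ffn_.*_exps\.weight=CPU"]
    else:
        # Only the experts of the first N layers stay on CPU; the rest go to VRAM.
        cmd += ["--n-cpu-moe", str(n_cpu_moe)]
    return cmd


if __name__ == "__main__":
    # Hypothetical path; the real file name depends on the release chosen.
    print(shlex.join(llama_server_cmd("/models/minimax.gguf")))
    print(shlex.join(llama_server_cmd("/models/minimax.gguf", n_cpu_moe=48)))
```

Starting with all experts on CPU and then lowering the `--n-cpu-moe` value until VRAM is nearly full is one way to tune, since each layer of experts moved to the GPU removes RAM/NVMe reads for that layer.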

4.2 Fitting it on .106 — RAM/disk rebalance

Disk: the original 200GB SSD holds the ~115GB Q4 model (fits; ~85GB headroom). ✅ for MiniMax. Decision #5 has since grown the disk to 800GB (658G free), which also leaves room for higher quants.
RAM: .106 started at 12GB, far too little to cache experts. The plan was to rebalance on host .191 (62GB total): gaming-vm (101) is stopped → 32GB free, and reduce openclaw-k8s-2 (.107) 16→8GB. That allows .106 ≈ 12 → ~40GB (leaving agents-VM 4GB + k8s-2 8GB + ~8GB host). In the event .107 was removed outright (decision #2) and .106 now has 40GB (decision #5). With ~40GB RAM and ~16GB on GPU, the page cache holds a large fraction of the hot experts; the remainder pages from the SSD. (More RAM later → fewer SSD page-ins → faster.)
Verify before building: exact host headroom. The original check that k8s-2 could safely drop to 8GB no longer applies now that .107 is gone.
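
The planned rebalance can be sanity-checked with a few lines of arithmetic. A minimal sketch using only the figures from this section; it assumes no other guests are running on host .191, which is exactly what "verify exact host headroom" should confirm.

```python
HOST_TOTAL_GB = 62  # host .191

# Planned allocations from the paragraph above (GB).
PLANNED_GB = {
    ".106 (4090 box)": 40,
    "agents-VM": 4,
    "openclaw-k8s-2 (.107)": 8,
    "host reserve": 8,
}


def headroom_gb(total: int, allocations: dict) -> int:
    """RAM left on the host after all planned allocations."""
    return total - sum(allocations.values())


if __name__ == "__main__":
    slack = headroom_gb(HOST_TOTAL_GB, PLANNED_GB)
    print(f"allocated {sum(PLANNED_GB.values())} GB of {HOST_TOTAL_GB} GB, slack {slack} GB")
    # -> allocated 60 GB of 62 GB, slack 2 GB
```

The plan leaves only about 2GB of slack. The decisions list records that .107 was later removed entirely rather than shrunk, which returns its 8GB share to the pool.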

4.3 The 4090 time-share (“model manager”)

A small model-manager service on .106 owns the 4090 and runs one big model at a time:

Watches the queue’s active tier. If a Tier-1 (Architect) task is up and carnice is loaded → stop carnice, start MiniMax (llama-server with the MoE offload flags). If a Tier-2 (Builder) task is up → reverse.
Swaps happen at phase boundaries (audit/plan → implement), so they’re rare; load cost (unload 23GB / load 16GB GPU + warm mmap) is amortised over a batch.
Exposed as a “4090 mode: Architect / Builder” control in the dashboard (auto by default, manual override available), reusing the loop-control pattern.
Alternative considered: co-resident (carnice-Q4 ~14GB + MiniMax-attention ~8GB ≈ 22GB) to avoid swaps — rejected for v1: too tight with KV caches and degrades carnice quality. Revisit if a 2nd GPU appears.
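
The swap logic above amounts to a small reconcile loop. This is a sketch, not the planned service: `ModelManager`, its `start`/`stop` callbacks and the mode names are hypothetical, and the real version would wrap whatever actually starts and stops carnice and llama-server on .106.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Which 4090 mode each GPU-bound tier needs. The operator tier is absent on
# purpose: mercury runs on cosmos and must never trigger a swap.
MODE_FOR_TIER = {"architect": "architect", "builder": "builder"}


@dataclass
class ModelManager:
    start: Callable[[str], None]        # hypothetical: launch the model for a mode
    stop: Callable[[str], None]         # hypothetical: unload the model for a mode
    loaded: Optional[str] = "builder"   # carnice is what runs on the 4090 today
    manual_mode: Optional[str] = None   # dashboard override; None means auto

    def desired_mode(self, active_tier: str) -> Optional[str]:
        if self.manual_mode is not None:
            return self.manual_mode
        # Tiers that do not use the 4090 leave the current model in place.
        return MODE_FOR_TIER.get(active_tier, self.loaded)

    def reconcile(self, active_tier: str) -> bool:
        """Swap models if needed; return True when a swap happened."""
        want = self.desired_mode(active_tier)
        if want == self.loaded:
            return False
        if self.loaded is not None:
            self.stop(self.loaded)   # free VRAM before loading the other model
        if want is not None:
            self.start(want)
        self.loaded = want
        return True


if __name__ == "__main__":
    log = []
    mgr = ModelManager(start=lambda m: log.append(("start", m)),
                       stop=lambda m: log.append(("stop", m)))
    mgr.reconcile("architect")   # builder -> architect
    mgr.reconcile("operator")    # no swap
    print(log)
```

Calling `reconcile` only when the queue's active tier changes keeps swaps at phase boundaries, as intended, rather than on every task.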

4.4 Expected performance

With experts mostly on CPU/NVMe, MiniMax-3 will run at a few tokens/sec (single digits), dominated by SSD/RAM expert reads. That’s fine for the Architect tier (planning/design/audit are infrequent, quality > speed — explicitly the goal).
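
A back-of-envelope model shows why throughput is bound by where the expert bytes come from. This is a rough sketch under loud assumptions: every number passed in is a placeholder, not a measurement from .106, and the model ignores GPU compute, random-access penalties and OS overhead, so it gives an optimistic ceiling only.

```python
def tokens_per_sec(
    active_expert_gb: float,   # expert bytes touched per token (GB)
    ram_hit_fraction: float,   # share of those bytes already in page cache
    ram_gbps: float,           # effective RAM read bandwidth (GB/s)
    nvme_gbps: float,          # effective NVMe read bandwidth (GB/s)
) -> float:
    """Upper-bound decode rate when expert reads dominate each token."""
    ram_time = active_expert_gb * ram_hit_fraction / ram_gbps
    nvme_time = active_expert_gb * (1.0 - ram_hit_fraction) / nvme_gbps
    return 1.0 / (ram_time + nvme_time)


if __name__ == "__main__":
    # Placeholder inputs: ~10B active params at ~4 bits is about 5 GB per token
    # (an overestimate, since attention weights sit on the GPU). The bandwidths
    # and hit rates are illustrative guesses, not benchmarks.
    for hit in (0.3, 0.6, 0.9):
        rate = tokens_per_sec(5.0, hit, ram_gbps=40.0, nvme_gbps=3.0)
        print(f"page-cache hit rate {hit:.0%}: <= {rate:.1f} tok/s")
```

Whatever the real bandwidths turn out to be, the miss term dominates: every expert byte served from NVMe costs far more than one served from RAM, which is why more RAM is the first lever to pull.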

5. Kimi-2.6 (Phase 2, deferred)

Kimi K2.6 = ~1T total / 32B active / 384 experts, ~340GB even at 2-bit. It would not fit on .106’s original 200GB SSD and needs far more RAM page-cache. Path as originally planned: stage the weights on cosmos’s 1.8TB SSD and either (a) attach that disk to .106, or (b) serve Kimi from cosmos over the network with the 4090 doing attention (complex). Since decision #5 grew .106’s disk to 800GB, staging locally is now also possible. Treat as a later phase once MiniMax-3 is proven; same offload technique, bigger everything.

6. Rollout phases

Routing layer (no new model): add task type, the Model-routing settings page, per-task override, and the dispatcher — wire infra→mercury and keep Architect→hermes (carnice) as a temporary stand-in. Ships value immediately.
MiniMax-3 serving: RAM/disk rebalance on .191; download Q4 GGUF to .106; stand up llama-server with --n-cpu-moe/--override-tensor; the model-manager + 4090 mode toggle; the hermes-architect agent pointed at it.
Flip Architect tasks to MiniMax-3, measure quality/throughput, tune --n-cpu-moe and RAM.
Kimi-2.6 once disk/RAM allow.

7. Status of the open questions

✅ 2nd GPU / k8s-2 / split / best-perf / disk — all resolved & prepped above.
⚠️ Block-volume redundancy — removing .107 dropped hermes-data-block to a single replica (.190 only); Longhorn can’t rebuild a 2nd replica with one node. Fallback = the stale RWX hermes-data backup. Re-add redundancy when a 2nd storage node returns, or accept single-replica + periodic backups.
⛏ Remaining for the MiniMax-3 serving build (rollout step 2): confirm the exact latest MiniMax-3 release and the top quant that fits ~40GB RAM + 24GB VRAM + 788G disk (start Q4_K_M to validate, then push to Q6/Q8 for quality); decide RPC multi-GPU (worth wiring the weak GPUs?); and the 4090 time-share trigger (auto-by-queue vs manual mode).
