From c7075fb7960f2b210a7f2688a20ba8a0c5763436 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 21:38:08 +0000
Subject: [PATCH] docs(paged): MoE 35B-A3B NVFP4 fair re-run with
 max_prefill_tokens budget

Budget 256/512 sweep on the A3B MoE under patch 0013. Mirror image of the
dense case: stock MoE was never prefill-starved (3B active, TTFT 84.8s @npl128),
so the budget is a decode-throughput lever paid for in TTFT, not a TTFT fix.
Budget 256 lifts decode_agg +14% (292->333.5 tok/s) and restores monotonic
decode scaling (kills the stock +7.4% plateau, now +20% into npl128), moving
llama 36.0%->41.1% of vLLM decode. Gap not closed: vLLM still ~2.4x decode and
~12x lower TTFT @npl128.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .../patches/paged/QWEN36_NVFP4_BENCH.md       | 61 +++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
index dcf284e94..aba4fabc4 100644
--- a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
+++ b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
@@ -166,3 +166,64 @@ from 34.5% to 41.3% of vLLM decode. It does **not** close the gap - vLLM still d
 faster and keeps TTFT ~12x lower at npl128, and scales monotonically where llama plateaus. At
 light/moderate concurrency the budget is net-negative for TTFT in this all-at-once workload, so it
 should be applied selectively (high-concurrency serving), not as an unconditional default.
+
+## MoE 35B-A3B fair re-run (max_prefill_tokens on)
+
+Same build (HEAD 151343b, P0+P1 patch 0015), same flags (`-c 131072 --parallel 128 -b 2048
+-ub 512 -ngl 99 -fa on`, `LLAMA_KV_PAGED=1`), same all-at-once harness (512-tok unique prompt,
+gen 256, temp 0, ignore_eos). Swept budget 256 (the dense-run winner) plus its neighbor 512.
+
+### Primary table - budget 256 vs 512 (decode_agg tok/s / TTFT mean ms / peak host GB)
+
+| npl | stock (no budget) | budget 256 (best) | budget 512 | vLLM |
+|----:|------------------:|------------------:|-----------:|-----:|
+| 8   | 170.2 / 855   / -    | 169.3 / 1655  / 38.95 | 172.1 / 1488  / 38.82 | 202.0 / 799  |
+| 32  | 235.4 / 4970  / -    | 239.0 / 9034  / 42.93 | 234.7 / 7260  / 42.72 | 462.0 / 2308 |
+| 64  | 271.7 / 7205  / -    | 277.0 / 16249 / 51.96 | 274.5 / 13660 / 52.53 | 624.5 / 4072 |
+| 128 | 292.2 / 84800 / -    | **333.5 / 98106 / 61.42** | 300.8 / 92470 / 61.45 | 811.1 / 7980 |
+
+Peak host GB (paged KV, budget-independent): ~38.9 (npl8) -> ~42.8 (npl32) -> ~52 (npl64) ->
+~61.4 (npl128). Far below the dense run (94 GB @npl128) - only ~3B params are active, so the KV
+plus activations footprint stays light even fully saturated.
+
+### MoE inverts the dense story: the budget buys decode, NOT TTFT
+
+Unlike the dense 27B (where the stock run was prefill-starved to 491 s TTFT @npl128 and the budget
+cut it 38%), the MoE stock run was **never prefill-starved**: 3B active params make prefill cheap,
+so stock TTFT @npl128 was already only 84.8 s. Capping prefill therefore cannot rescue TTFT - it
+can only **defer first tokens to free decode steps**. Result at npl128 with budget 256:
+
+- **decode_agg: 292.2 -> 333.5 tok/s (+14.1%)** vs the stock run (decode-starved, not prefill-starved).
+- **TTFT mean: 84.8 s -> 98.1 s (+15.7%, WORSE)** - the budget costs latency here.
+- llama decode as % of vLLM @npl128: **36.0% -> 41.1%**. TTFT now ~12.3x vLLM's 7.98 s.
+
+Budget 512 is the milder trade (decode +2.9% to 300.8, TTFT +9.0% to 92.5 s @npl128). Budget 256
+maximizes decode throughput; budget 512 gives up less TTFT. At npl 8/32/64 both budgets are
+roughly flat on decode (within about 2% of stock) and clearly raise TTFT (e.g. npl64 7.2 s ->
+16.2 s @b256), the same all-at-once burst artifact seen in the dense run.
+
+### Does the ~3B-active decode scale better now? Yes - the plateau is gone
+
+The headline win is the **decode scaling curve**, not any single point:
+
+| npl step | stock decode_agg | budget-256 decode_agg |
+|---------:|-----------------:|----------------------:|
+| 8 -> 32  | 170 -> 235 (+38%) | 169 -> 239 (+41%) |
+| 32 -> 64 | 235 -> 272 (+16%) | 239 -> 277 (+16%) |
+| 64 -> 128| 272 -> 292 (**+7.4%**, plateauing) | 277 -> 333.5 (**+20.4%**, still climbing) |
+
+Stock MoE decode **plateaus** at saturation (+7.4% over the last doubling) because unbounded
+prefills keep stealing steps from the many ready decode slots. Budget 256 removes that ceiling -
+decode keeps climbing +20% into npl128, so more of the 128 slots actually decode concurrently.
+This is the cleanest evidence that patch 0013 protects in-flight decode once enough slots are live.
+
+### Bottom line (MoE)
+
+For the A3B MoE the prefill budget is a **decode-throughput lever, paid for in TTFT** - the mirror
+image of the dense case. Budget 256 lifts decode_agg +14% @npl128 and, more importantly, restores
+monotonic decode scaling (kills the stock plateau), moving llama from 36.0% to 41.1% of vLLM
+decode - the same ~41% ceiling the dense run hit. It does **not** close the gap: vLLM still decodes
+~2.4x faster (811 vs 333.5) and holds TTFT ~12x lower (8.0 s vs 98.1 s) @npl128, and scales
+monotonically and steeply where llama only partially recovers. Net: apply the budget to saturated
+MoE serving when decode throughput is the objective and some extra TTFT is acceptable; for
+latency-sensitive MoE serving leave it off (stock TTFT was already not the bottleneck here).