fix(paged): revert S3 decode-stable scheduler to default-OFF (A/B regression)

Patch 0041 (LLAMA_PAGED_DECODE_STABLE) was made default-on-when-paged, but a measured end-to-end A/B proved that is a serving mistake. S3 defers prefill admission on the period-8 cadence, which delays prompt admission: 2.5x worse TTFT (60s vs 24s at N=256) and 20-29% lower end-to-end throughput, with no end-to-end win at any concurrency. Its apparent decode_agg gain was a metric artifact (faster per-step decode bought by starving prefill).

Flip the s3_enabled default so an unset LLAMA_PAGED_DECODE_STABLE means OFF; the mechanism stays available as an explicit opt-in (LLAMA_PAGED_DECODE_STABLE=1) for decode-dominated, low-arrival traffic where TTFT is not a concern. The default now prefers prompt prefill admission for good TTFT. S1 (patch 0040) keeps shipping default-on; only S3's default changes.

Re-exports patch 0041 (change folded into its source commit) and updates the README 0041 row plus the decode-serving narrative to record the A/B finding.

Greedy md5 gate unchanged (single-sequence llama-completion path, not update_slots): paged MoE 8cb0ce23, dense 5951a5b4.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 19:06:43 -04:00 · 2026-06-29 05:00:11 +00:00
parent b028c81eda
commit f1c98ff0b9
2 changed files with 57 additions and 24 deletions
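
For readers who want the default flip in isolation, the following is a minimal sketch of the env-resolution logic before and after this commit, plus the period parsing that the patch leaves unchanged. The function names (`s3_default_before`, `s3_default_after`, `s3_period`) are hypothetical and take the raw `getenv` results as arguments so the logic can be exercised without touching the process environment; in the patch itself the same expressions live in static lambdas inside `update_slots()`.

```cpp
#include <cstdlib>

// Hypothetical helper (not in the patch): pre-revert resolution.
// An explicit LLAMA_PAGED_DECODE_STABLE value wins; otherwise S3 is ON
// whenever LLAMA_KV_PAGED is set at all.
int s3_default_before(const char * decode_stable, const char * kv_paged) {
    if (decode_stable) { return std::atoi(decode_stable); }
    return kv_paged != nullptr ? 1 : 0;
}

// Hypothetical helper (not in the patch): post-revert resolution.
// Unset means OFF regardless of LLAMA_KV_PAGED; S3 is opt-in via =1.
int s3_default_after(const char * decode_stable) {
    return decode_stable ? std::atoi(decode_stable) : 0;
}

// Hypothetical helper (not in the patch): LLAMA_PAGED_PREFILL_PERIOD parsing.
// Unset or non-positive values fall back to the default period of 8.
int s3_period(const char * period) {
    int p = period ? std::atoi(period) : 8;
    return p > 0 ? p : 8;
}
```

Under this sketch, a paged deployment that never set `LLAMA_PAGED_DECODE_STABLE` resolves to 1 before the commit and 0 after it, while an explicit `=1` or `=0` resolves the same way in both versions.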
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -86,7 +86,7 @@ orthogonal to the paged allocator.

 ---

-## 3. Patch series (0001-0041)
+## 3. Patch series (0001-0043)

 Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
 decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
@@ -135,14 +135,19 @@ hides.
 | # | What it does | Bit-exact |
 |---|---|---|
 | 0040 | **S1 paged decode-graph reuse** - the paged decode inputs (`input_block_table` / `input_gather_idxs`) never overrode `can_reuse` (defaults to false), so any graph carrying a paged input could never be reused. Add a correct `can_reuse` keyed on the (256-bucketed) block-table dims + a live-mctx refresh from the owning attn input. `LLAMA_PAGED_NO_GRAPH_REUSE=1` forces the pre-S1 path. | yes (md5 byte-identical reuse on/off; dense `5951a5b4`, paged-MoE `8cb0ce23`) |
-| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). **Default ON under paged KV** (enabled when `LLAMA_KV_PAGED` is set; `LLAMA_PAGED_DECODE_STABLE=0` forces it off). | yes (byte-identical on/off; per-stream independent in serving) |
+| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). **Default OFF** (opt-in via `LLAMA_PAGED_DECODE_STABLE=1`): a measured end-to-end A/B proved default-on is a serving mistake - deferring prefill admission on the period-8 cadence gives **2.5x worse TTFT** (60s vs 24s at N=256) and **20-29% lower end-to-end throughput**, with no end-to-end win at any concurrency; its apparent `decode_agg` gain was a metric artifact (faster per-step decode bought by starving prefill). Default prefers prompt prefill admission for good TTFT; opt in only for decode-dominated, low-arrival traffic where TTFT is not a concern. | yes (byte-identical on/off; per-stream independent in serving) |

 Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
 graph reuse **0% -> 72.2%**, host window `hostproc` **15.98 -> 6.31 ms/step**,
 decode **4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9
 sustained)**. S1 is necessary but **not** sufficient alone (13.8% reuse - prefill
-co-batching churns the shape nearly every step); S3 is the multiplier, so they
-ship and are measured together. The static batched-bench A/B isolates the S1
+co-batching churns the shape nearly every step); S3 is the multiplier of that
+per-step decode metric. **But those are per-step decode numbers, not an end-to-end
+serving win**: a later end-to-end A/B showed S3-default-on regresses real serving
+(2.5x worse TTFT, 20-29% lower end-to-end throughput, no win at any concurrency),
+because the period-8 cadence defers prefill admission. So **only S1 (0040) ships
+default-on; S3 (0041) now defaults OFF and is opt-in** (`LLAMA_PAGED_DECODE_STABLE=1`,
+for decode-dominated low-arrival traffic). The static batched-bench A/B isolates the S1
 mechanism: paged decode reuse 0% -> 95.5% (throughput flat there, since the static
 regime is GPU-bound). **S2 (double-buffer `set_inputs`) was dropped**: the Phase-0
 profile put `set_inputs` at ~0.05 ms/step (the cost is the rebuild, not the input
@@ -168,12 +173,13 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 | 0022 | **GDN recurrence occupancy/coalescing retune** - column-folding (NUM_WARPS/COLS_PER_WARP) raises memory-level parallelism on the bandwidth-bound B=128 recurrence kernel; per-column f32 FMA order unchanged. 73.4%->84.6% of GB10 peak BW. | dense +11.1% / MoE +8.3% |
 | 0028 | **Recurrent conv-tap gather fusion** - the last `k_get_rows` in the GDN decode path (the conv-state tap gather) becomes an indexed in-kernel read. | dense ~377 t/s / MoE ~784 t/s |

-### MoE NVFP4 quant (0023, 0025)
+### MoE NVFP4 quant (0023, 0025, 0043)

 | # | What it does | Bit-exact |
 |---|---|---|
 | 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) |
-| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) |
+| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Was env-gated `LLAMA_MOE_FORCE_GRAPHS`; now ON by default via 0043. | yes (graph replay re-issues identical kernels) |
+| 0043 | **MoE decode graph default-on (D1)** - flip 0025 to ON by default: capture/replay the full-step decode CUDA graph (incl. the grouped-MMQ MoE dispatch) instead of re-issuing every kernel each step. Guard is `should_use_mmq()` (FALSE for the large-M NVFP4 prefill of 0034, so prefill keeps graphs disabled - its per-expert host-loop genuinely syncs). `LLAMA_MOE_NO_FORCE_GRAPHS=1` forces the conservative pre-0025 disable for A/B. D1 profiling: the per-expert host-loop (the only device->host MoE-routing readback) is never hit on the NVFP4 grouped path (sync count identical graphs on/off); steady decode is ~99% GPU-busy, so the cost removed is per-step host kernel RE-ISSUE, not a sync. | yes (md5 byte-identical default/off/forced; paged-MoE `8cb0ce23`, dense `5951a5b4`) |

 ### Pool reclaim, block-table cache, backend gate

@@ -338,6 +344,21 @@ llama is losing. The MoE GEMM kernel is *not* where the gap lives.
 - **Lever 2 - graph/stream coverage: FLAT.** Bit-exact graph coverage was
  exhausted by 0025; more graph/stream overlap is a no-op or small regression on
  this model.
+- **D1 premise "static decode is host-sync-bound on the MoE-routing readback":
+  REFUTED.** The hypothesis was that the dominant decode cost is the device->host
+  readback of MoE routing before launching the per-expert GEMMs (mul_mat_id's
+  per-expert host-loop fallback). Profiling (GB10, q36-35b-a3b-nvfp4, batched-bench
+  npl128) shows the opposite: on NVFP4 the grouped stream-k MMQ id-path is what
+  runs (routing stays device-side), so the host-loop fallback is **never hit** -
+  `cudaStreamSynchronize` count is *identical* with CUDA graphs on vs off (1457
+  either way; only the kernel-launch count changes, ~100k vs ~229k). Steady-decode
+  GPU-busy is **~99%** (1% idle), i.e. static decode is GPU-bound, not idle waiting
+  on a sync. The one actionable residual the profile surfaced - per-step host
+  kernel **re-issue** when the step is not graph-captured - shipped as 0043
+  (default-on full-step decode graph), worth +2.6% (npl128) to +5-13% (npl32). The
+  larger continuous-serving host cost is the graph **rebuild** (0040/0041), and the
+  irreducible floor is the per-step logits-D2H-before-sampling serial point - none
+  of which is the MoE-routing readback.
 - **Lever 3 - act-quant fusion: FLAT.** The W4A4 act-quant tax is removable only
  by W4A16 (a precision change, rejected) or a structural kernel rewrite; no
  further bit-exact lever clears it. 0023 already banks the de-dup.
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch
@@ -1,4 +1,4 @@
-From ee8021b56ed0effe493a64aa50449ab928dd6b29 Mon Sep 17 00:00:00 2001
+From ddff2279f23f18cadfbbb907a397d66b3609e9cd Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Sun, 28 Jun 2026 20:00:24 +0200
 Subject: [PATCH 41/41] feat(paged): S3 decode-shape-stable scheduling (patch
@@ -23,30 +23,36 @@ budget; no new slot states, no batch-formation rewrite, zero libllama changes.
 BIT-EXACT: only changes WHICH step a prompt chunk is admitted in. Each sequence's
 decode logits depend on its own tokens + its own KV only (the paged decode read is
 per-stream, attention is permutation-invariant over the co-batched set), so
-deferring another slot's prefill never changes a generating slot's output.
-DEFAULT-ON under paged KV: with LLAMA_KV_PAGED set this enables by default
-(LLAMA_PAGED_DECODE_STABLE=0 forces it off); otherwise unset => byte-identical to
-patch 0016. Does
+deferring another slot's prefill never changes a generating slot's output. Does
 not run in the single-sequence greedy md5 gate (that path is llama-completion).

+DEFAULT-OFF (A/B finding): a measured end-to-end A/B proved that making S3
+default-on under paged KV is a serving mistake. Deferring prefill admission on the
+period-8 cadence delays prompt admission: 2.5x worse TTFT (60s vs 24s at N=256)
+and 20-29% lower end-to-end throughput, with no end-to-end win at any concurrency.
+Its apparent decode_agg gain was a metric artifact (faster per-step decode bought
+by starving prefill). So S3 now defaults OFF (prefer prompt prefill admission for
+good TTFT) and is opt-in via LLAMA_PAGED_DECODE_STABLE=1, intended only for
+decode-dominated, low-arrival traffic where TTFT is not a concern. With
+LLAMA_PAGED_DECODE_STABLE unset => byte-identical to patch 0016.
+
 Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
 S1+S3 vs baseline (graphs rebuilt every step): graph reuse 0% -> 72.2%, hostproc
 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean,
-at vLLM's ~5.9 sustained). Remaining 28% rebuilds are request-boundary D/seq-set
-churn + the prefill-cadence steps; closing them needs a padded/fixed-slot decode
-shape (scoped follow-up, see DECODE_SERVING_SCOPE.md).
+at vLLM's ~5.9 sustained). NOTE these are per-step decode metrics; the A/B above
+shows they do not translate to an end-to-end serving win, hence default-off.

 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
- tools/server/server-context.cpp | 40 ++++++++++++++++++++++++++++++++-
- 1 file changed, 39 insertions(+), 1 deletion(-)
+ tools/server/server-context.cpp | 46 ++++++++++++++++++++++++++++++++-
+ 1 file changed, 45 insertions(+), 1 deletion(-)

 diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 64775dc..fc0231a 100644
+index 64775dc..a77e267 100644
 --- a/tools/server/server-context.cpp
 +++ b/tools/server/server-context.cpp
-@@ -3138,11 +3138,49 @@ private:
+@@ -3138,11 +3138,55 @@ private:
         }
         int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
 
@@ -66,16 +72,22 @@ index 64775dc..fc0231a 100644
 +        // Each sequence's decode logits depend on its own tokens + its own KV only
 +        // (the paged decode read is per-stream, attention is permutation-invariant
 +        // over the co-batched set), so deferring another slot's prefill never
-+        // changes a generating slot's output. DEFAULT-ON under paged KV: with
-+        // LLAMA_KV_PAGED set it enables by default (LLAMA_PAGED_DECODE_STABLE=0
-+        // forces off); otherwise byte-identical to patch 0016. Does not run in the
-+        // single-sequence greedy md5 gate (that path is llama-completion, not update_slots).
++        // changes a generating slot's output. Does not run in the single-sequence
++        // greedy md5 gate (that path is llama-completion, not update_slots).
++        //
++        // DEFAULT-OFF (A/B finding): an end-to-end A/B proved S3-on is a serving
++        // mistake. Deferring prefill admission on the period-8 cadence delays prompt
++        // admission: 2.5x worse TTFT (60s vs 24s at N=256) and 20-29% lower end-to-end
++        // throughput, with no end-to-end win at any concurrency. Its apparent
++        // decode_agg gain was a metric artifact (faster per-step decode bought by
++        // starving prefill). So the default prefers prompt prefill admission for good
++        // TTFT; S3 is opt-in (LLAMA_PAGED_DECODE_STABLE=1) only for decode-dominated,
++        // low-arrival traffic where TTFT is not a concern.
 +        bool decode_only_step = false;
 +        {
 +            static const int s3_enabled = [](){
 +                const char * e = getenv("LLAMA_PAGED_DECODE_STABLE");
-+                if (e) { return atoi(e); }                          // explicit override (=0 forces off)
-+                return getenv("LLAMA_KV_PAGED") != nullptr ? 1 : 0; // default ON under paged KV
++                return e ? atoi(e) : 0;                             // default OFF; opt-in via LLAMA_PAGED_DECODE_STABLE=1
 +            }();
 +            if (s3_enabled && n_decode_in_batch > 0) {
 +                static const int s3_period = [](){ const char * e = getenv("LLAMA_PAGED_PREFILL_PERIOD"); int p = e ? atoi(e) : 8; return p > 0 ? p : 8; }();