feat(paged): close the continuous-serving decode gap (S1+S3, patches 0040/0041)

Add the two decode-serving graph-reuse levers (validated on GB10) that close
the host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in
real continuous serving while tying it in static batched-bench).

- 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode
  llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml
  graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape
  can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5
  byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%.
- 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of
  decode steps so the scheduler emits the reuse-stable pure-decode shape S1
  can reuse. Default-off policy on top of 0016; bit-exact (per-stream
  independent).

S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4):
graph reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52
tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient
(13.8%); S3 is the multiplier.

S2 (double-buffer set_inputs) dropped: Phase-0 put set_inputs at ~0.05
ms/step, so it has nothing to recover.

README patch table + DECODE_SERVING_SCOPE.md updated with results and the
padded/fixed-slot follow-up.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 19:06:43 -04:00 · 2026-06-28 18:04:28 +00:00
parent 000705321f
commit d706980c2b
4 changed files with 482 additions and 5 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -86,9 +86,10 @@ orthogonal to the paged allocator.

 ---

-## 3. Patch series (0001-0031)
+## 3. Patch series (0001-0041)

-29 patches (0005 and 0027 are intentionally unused). "Bit-exact" = greedy md5 /
+Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
+decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
 `test-backend-ops` byte-identical to the relevant baseline; the gate methodology
 is in section 5.

@@ -122,6 +123,33 @@ bit-exact. 0017 is the dense FP4-GEMM occupancy-tune track: bit-exact gate green
 but every cheap occupancy lever regressed on GB10, so nothing is enabled - it
 ships as the parity gate + default-off instrumentation only.)

+### Decode-serving graph reuse (0040, 0041)
+
+These two close the **continuous-serving** decode gap (distinct from the static
+batched-bench decode kernel, which is already at vLLM parity - see
+[`docs/DECODE_SERVING_SCOPE.md`](docs/DECODE_SERVING_SCOPE.md)). In serving the
+host rebuilt the ggml graph on **every** decode step (layer-A graph reuse was 0%),
+so the GPU idled while the host rebuilt. This is the host-bound -39% gap that the
+static bench hides.
+
+| # | What it does | Bit-exact |
+|---|---|---|
+| 0040 | **S1 paged decode-graph reuse** - the paged decode inputs (`input_block_table` / `input_gather_idxs`) never overrode `can_reuse` (defaults to false), so any graph carrying a paged input could never be reused. Add a correct `can_reuse` keyed on the (256-bucketed) block-table dims + a live-mctx refresh from the owning attn input. `LLAMA_PAGED_NO_GRAPH_REUSE=1` forces the pre-S1 path. | yes (md5 byte-identical reuse on/off; dense `5951a5b4`, paged-MoE `8cb0ce23`) |
+| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). `LLAMA_PAGED_DECODE_STABLE=1` to enable. | yes (default-off byte-identical; per-stream independent in serving) |
+
+Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
+graph reuse **0% -> 72.2%**, host window `hostproc` **15.98 -> 6.31 ms/step**,
+decode **4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9
+sustained)**. S1 is necessary but **not** sufficient alone (13.8% reuse - prefill
+co-batching churns the shape nearly every step); S3 is the multiplier, so they
+ship and are measured together. The static batched-bench A/B isolates the S1
+mechanism: paged decode reuse 0% -> 95.5% (throughput flat there, since the static
+regime is GPU-bound). **S2 (double-buffer `set_inputs`) was dropped**: the Phase-0
+profile put `set_inputs` at ~0.05 ms/step (the cost is the rebuild, not the input
+copy), so it has nothing to recover. The remaining ~28% of serving steps rebuild on
+request-boundary churn (decode width D, seq-id set) or on prefill-cadence steps; a
+padded/fixed-slot decode shape to capture them is scoped in `docs/DECODE_SERVING_SCOPE.md`.
+
 ### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)

 These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact.
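
Reading aid for the 0040 row in the README hunk above: a minimal, dependency-free sketch of the 256-bucketed shape key that S1's `can_reuse` compares. `pad_up`, `block_table_n_view`, `table_shape` and `shape_reusable` are hypothetical names for illustration only (the patch itself uses `GGML_PAD` and reads the dims from the kv-cache context); the rule shown, `n_view = PAD(n_gather, 256)` clamped to `n_kv`, is the one stated in the patch description.

```cpp
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of n. Stand-in for ggml's GGML_PAD so the
// sketch compiles on its own (assumption: n > 0).
inline int64_t pad_up(int64_t x, int64_t n) {
    assert(n > 0);
    return ((x + n - 1) / n) * n;
}

// Hypothetical helper: the block-table view length, bucketed to 256 and
// clamped to the physical window n_kv. Returns 0 when nothing is gathered.
inline int64_t block_table_n_view(int64_t n_gather, int64_t n_kv) {
    if (n_gather <= 0) {
        return 0;
    }
    const int64_t padded = pad_up(n_gather, 256);
    return padded > n_kv ? n_kv : padded;
}

// The two dims of the block-table tensor. Only the SHAPE decides reuse; the
// table contents are refilled on every step.
struct table_shape {
    int64_t n_view;
    int64_t n_stream;
};

// True while the shape the graph was built with still matches the current step.
inline bool shape_reusable(const table_shape & built, int64_t n_gather, int64_t n_kv, int64_t n_stream) {
    const int64_t n_view = block_table_n_view(n_gather, n_kv);
    return n_view > 0 && built.n_view == n_view && built.n_stream == n_stream;
}
```

Under these assumptions a graph built at `n_gather = 200` stays reusable until `n_gather` crosses 256 or the stream count changes, which is why a steady decode can reuse the graph across many steps.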
--- a/backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md
@@ -1,7 +1,39 @@
-# DECODE_SERVING_SCOPE - the continuous-serving decode gap (design only)
+# DECODE_SERVING_SCOPE - the continuous-serving decode gap

-**Status: DESIGN + SCOPE + RANKED LEVER PLAN ONLY. No kernel written, no GPU
-run in this pass (the GPU was busy with prefill agents).** Per the
+**Status: S1 + S3 IMPLEMENTED, GPU-validated, bit-exact, shipped as patches
+0040 (S1) + 0041 (S3). S2 DROPPED (measured non-target). See the results block
+below; the rest of this doc is the design/rationale those patches implement.**
+
+## Results (GB10, measured)
+
+Phase 0 confirmed host-bound: serving graph reuse **0% over ~5k steps** (layer-A
+rebuilds every step), `hostproc` 3.44 ms/step vs 1.59 static - the +1.85 ms IS the
+graph rebuild; `set_inputs` 0.047 ms and block-table 0.002 ms are negligible.
+
+- **S1 (patch 0040)** - root cause: the paged decode inputs never overrode
+  `can_reuse` (defaults false), so the graph could never be reused. Fixed with a
+  256-bucketed-shape `can_reuse` + live-mctx refresh. Static batched-bench A/B:
+  paged decode reuse **0% -> 95.5%**, bit-exact (md5 byte-identical reuse on/off).
+  Necessary but **not** sufficient in serving (13.8% reuse alone - prefill
+  co-batching churns the shape).
+- **S3 (patch 0041)** - keeps prefill out of decode steps so the scheduler emits
+  reuse-stable pure-decode steps. **S1+S3 together (128-client staggered serving,
+  MoE Qwen3.6-35B-A3B-NVFP4): reuse 0% -> 72.2%, `hostproc` 15.98 -> 6.31 ms/step,
+  decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9).**
+- **S2 (double-buffer set_inputs) - DROPPED.** Phase 0 put `set_inputs` at
+  ~0.05 ms/step: it is not the cost (the rebuild is), so S2 has nothing to recover.
+- **Follow-up to ~100% reuse:** the remaining ~28% of serving steps rebuild on
+  request-boundary churn (decode width D, seq-id set) or on S3 prefill-cadence steps. Capturing them
+  needs a **padded/fixed-slot decode shape** (pad the decode width to a fixed
+  bucket with masked-inert dummy slots so `n_tokens` and the seq-id set stay
+  constant across arrivals/completions - the lever S1 section (a) describes).
+  Deferred: S1+S3 already reach vLLM-parity on the mean; padding is server-side,
+  invasive, and not exercised by the single-sequence md5 gate (needs a per-stream
+  serving-determinism gate). It is the next lever, not a shipped one.
+
+---
+
+Per the
 "profile-don't-assume" rule in
 [`.agents/vllm-parity-methodology.md`](../../../../.agents/vllm-parity-methodology.md),
 **Phase 0 (section 5) is to confirm the bottleneck on GPU before touching any
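
The deferred padded/fixed-slot lever described in the results block above is not implemented by this commit. Purely as an illustration of the idea, and assuming a hypothetical `padded_batch` type and `pad_decode_batch` helper that exist nowhere in the patch series, padding the decode width to a fixed bucket could look roughly like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical padded decode batch: `width` is the constant token count the
// graph would see; live[i] == 1 marks a real sequence, 0 an inert dummy slot
// that the attention mask would have to exclude.
struct padded_batch {
    size_t               width;
    std::vector<uint8_t> live;
};

// Round the number of live decode sequences up to a multiple of `bucket` and
// fill the tail with dummy slots. Arrivals and completions inside one bucket
// leave `width` unchanged.
inline padded_batch pad_decode_batch(size_t n_live, size_t bucket) {
    assert(bucket > 0);
    padded_batch b;
    b.width = ((n_live + bucket - 1) / bucket) * bucket;
    b.live.assign(b.width, 0);
    for (size_t i = 0; i < n_live; ++i) {
        b.live[i] = 1;
    }
    return b;
}
```

With a bucket of 8, going from 5 to 6 live sequences keeps the width at 8, so the shape the graph was built with would survive the arrival. How dummy slots are masked and how slots map to sequence ids is left open here, as it is in the design note.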
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0040-feat-paged-S1-paged-decode-graph-reuse-across-servin.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0040-feat-paged-S1-paged-decode-graph-reuse-across-servin.patch
@@ -0,0 +1,322 @@
+From b81fa71360c3f6b46e97c6ad504efc10bdaea484 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Sun, 28 Jun 2026 20:00:04 +0200
+Subject: [PATCH 40/41] feat(paged): S1 paged decode-graph reuse across serving
+ steps (patch 0040)
+
+The continuous-serving decode gap (paged ~3.7 vs vLLM ~5.9 tok/s/seq) is
+host-bound: llama-context layer-A graph reuse was 0% in serving, so the host
+rebuilt the ggml graph EVERY decode step (the +1.85 ms/step the Phase-0 profile
+attributes to the rebuild; set_inputs/block-table are negligible). Root cause:
+the paged decode inputs (input_block_table / input_gather_idxs in paged-attn.cpp)
+never overrode llm_graph_input_i::can_reuse, which defaults to false - so any
+graph carrying a paged input could never be reused, even with a constant batch
+shape. (This is also why the paged decode graph rebuilt in static batched-bench.)
+
+S1 gives the paged inputs a correct can_reuse:
+  - reuse iff the input tensor dims are unchanged. The block table is
+    [n_view, n_stream] with n_view = PAD(n_gather, 256) clamped to n_kv, so it is
+    bucketed to 256 and stays constant across a 256-token decode window; n_stream
+    follows n_seqs_unq. The index CONTENTS are refilled at set_input on every step
+    (incl. reused steps), so a reused graph reads the current step's cells.
+  - the stored kv-cache context is refreshed from the owning attn input
+    (llm_graph_input_attn_kv, whose mctx is updated per-decode by attn_kv /
+    mem_hybrid can_reuse earlier in the input list), so a reused graph picks up the
+    live memory context. mem_hybrid::can_reuse now also refreshes inp_attn->mctx.
+
+Master switch paged_attn::decode_graph_reuse() (ON by default when paged;
+LLAMA_PAGED_NO_GRAPH_REUSE=1 forces the pre-S1 rebuild-every-step path for A/B).
+Also surfaces the run-wide graph-reuse rate in the [L5INSTR] exit line
+(l5_add_proc) since llama-server does not print llama_perf.
+
+BIT-EXACT: greedy md5 byte-identical with reuse ON vs OFF on every path -
+dense 5951a5b4d624ce891e22ab5fca9bc439, paged-MoE 8cb0ce23777bf55f92f63d0292c756b0.
+Reuse only skips the host-side rebuild; set_inputs still re-runs every step.
+
+Measured (GB10): batched-bench paged decode graph reuse 0% -> 95.5% (hostproc
+dense 3.31->2.66, MoE 2.44->1.82 ms/step); static throughput flat as expected
+(static regime is GPU-bound). The serving payoff needs S3 (patch 0041): S1 alone
+holds only 13.8% reuse in serving because co-batched prefill churns the shape
+nearly every step.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/llama-context.cpp |  3 ++
+ src/llama-graph.cpp   | 13 +++++-
+ src/paged-attn.cpp    | 94 ++++++++++++++++++++++++++++++++++++++-----
+ src/paged-attn.h      | 14 +++++++
+ 4 files changed, 112 insertions(+), 12 deletions(-)
+
+diff --git a/src/llama-context.cpp b/src/llama-context.cpp
+index c408eef..306a506 100644
+--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
+@@ -1347,6 +1347,7 @@ bool llama_context::set_adapter_cvec(
+ 
+ extern "C" void l5_add_setinp(double ns);
+ extern "C" void l5_add_hostproc(double ns);
+extern "C" void l5_add_proc(int reused); // [S1] per-step graph-reuse counter
+ static inline double l5c_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; }
+ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, llm_graph_type gtype, llama_memory_context_i * mctx, ggml_status & ret) {
+     double _l5_t0=l5c_now_ns();
+@@ -1374,7 +1375,9 @@ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, ll
+         }
+ 
+         n_reused++;
+        l5_add_proc(1);
+     } else {
+        l5_add_proc(0);
+         res->reset();
+ 
+         ggml_backend_sched_reset(sched.get());
+diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
+index 931258d..0337742 100644
+--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
+@@ -699,6 +699,12 @@ bool llm_graph_input_mem_hybrid::can_reuse(const llm_graph_params & params) {
+ 
+     this->mctx = mctx;
+ 
+    // [S1] refresh the attn sub-input's memory context so paged decode inputs
+    // (which read owner->mctx in their can_reuse, run later in the input list)
+    // pick up the live per-decode context on a reused graph. Harmless for the
+    // non-paged path: inp_attn->mctx is only consumed at graph-build time there.
+    inp_attn->mctx = mctx->get_attn();
+
+     bool res = true;
+ 
+     res &= inp_attn->self_k_idxs->ne[0] == params.ubatch.n_tokens;
+@@ -2370,8 +2376,11 @@ ggml_tensor * llm_graph_context::build_attn(
+     ggml_tensor * kq_mask_g   = kq_mask;
+     ggml_tensor * block_table = nullptr;
+     const bool is_decode = (q_cur->ne[2] == k->ne[3]); // 1 query token per stream
+-    if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, &k, &v, &kq_mask_g, &block_table))) {
+-        paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+    // [S1] pass `inp` (the attn input) as the reuse owner: its mctx is refreshed
+    // per-decode by attn_kv/mem_hybrid can_reuse, and the paged inputs read it so
+    // a reused graph picks up the live memory context.
+    if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, inp, &k, &v, &kq_mask_g, &block_table))) {
+        paged_attn::gather(ctx0, res, mctx_cur, inp, &k, &v, &kq_mask_g);
+     }
+ 
+     ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il, block_table);
+diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
+index ebd92be..d543c7f 100644
+--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
+@@ -11,9 +11,13 @@
+ #include <ctime>
+ namespace { static inline double l5_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; } }
+ double g_l5_t_gbt=0, g_l5_t_setinp=0, g_l5_t_hostproc=0; long g_l5_n_gbt=0, g_l5_n_setinp=0, g_l5_n_hostproc=0;
+// [S1] graph-reuse counters across the whole run (the serving reuse-rate signal -
+// llama-server does not print llama_perf, so surface it here at process exit).
+long g_l5_n_proc=0, g_l5_n_reused=0;
+ extern "C" void l5_add_setinp(double ns){ g_l5_t_setinp+=ns; g_l5_n_setinp++; }
+ extern "C" void l5_add_hostproc(double ns){ g_l5_t_hostproc+=ns; g_l5_n_hostproc++; }
+-namespace { struct L5Printer { ~L5Printer(){ fprintf(stderr,"[L5INSTR] get_block_table n=%ld sum=%.2fms mean=%.4fms | set_inputs n=%ld sum=%.2fms mean=%.4fms | hostproc n=%ld sum=%.2fms mean=%.4fms\n", g_l5_n_gbt, g_l5_t_gbt/1e6, g_l5_n_gbt? g_l5_t_gbt/1e6/g_l5_n_gbt:0.0, g_l5_n_setinp, g_l5_t_setinp/1e6, g_l5_n_setinp? g_l5_t_setinp/1e6/g_l5_n_setinp:0.0, g_l5_n_hostproc, g_l5_t_hostproc/1e6, g_l5_n_hostproc? g_l5_t_hostproc/1e6/g_l5_n_hostproc:0.0 ); } } g_l5_printer; }
+extern "C" void l5_add_proc(int reused){ g_l5_n_proc++; if (reused) g_l5_n_reused++; }
+namespace { struct L5Printer { ~L5Printer(){ fprintf(stderr,"[L5INSTR] get_block_table n=%ld sum=%.2fms mean=%.4fms | set_inputs n=%ld sum=%.2fms mean=%.4fms | hostproc n=%ld sum=%.2fms mean=%.4fms | graph_reuse %ld/%ld = %.1f%%\n", g_l5_n_gbt, g_l5_t_gbt/1e6, g_l5_n_gbt? g_l5_t_gbt/1e6/g_l5_n_gbt:0.0, g_l5_n_setinp, g_l5_t_setinp/1e6, g_l5_n_setinp? g_l5_t_setinp/1e6/g_l5_n_setinp:0.0, g_l5_n_hostproc, g_l5_t_hostproc/1e6, g_l5_n_hostproc? g_l5_t_hostproc/1e6/g_l5_n_hostproc:0.0, g_l5_n_reused, g_l5_n_proc, g_l5_n_proc? 100.0*g_l5_n_reused/g_l5_n_proc:0.0 ); } } g_l5_printer; }
+ 
+ 
+ namespace paged_attn {
+@@ -28,17 +32,52 @@ static bool debug() {
+     return d;
+ }
+ 
+// [S1] paged decode-graph reuse master switch. ON by default whenever paging is
+// active; LLAMA_PAGED_NO_GRAPH_REUSE=1 forces it off (A/B probe / safety hatch).
+bool decode_graph_reuse() {
+    static const bool on = active() && (std::getenv("LLAMA_PAGED_NO_GRAPH_REUSE") == nullptr);
+    return on;
+}
+
+ namespace {
+ 
+// [S1] Recompute the block-table view length the SAME way in_kernel_decode()
+// builds it, so can_reuse() can compare against the stored tensor dim. n_view is
+// PAD(n_gather,256) clamped to the physical window n_kv: it only changes when
+// n_gather crosses a 256 boundary, so a steady decode reuses across many steps.
+static inline int64_t paged_block_table_n_view(const llama_kv_cache_context * mctx) {
+    const int64_t n_gather = (int64_t) mctx->get_n_gather();
+    if (n_gather <= 0) {
+        return 0;
+    }
+    int64_t n_view = GGML_PAD(n_gather, 256);
+    const int64_t n_kv = (int64_t) mctx->get_n_kv();
+    if (n_view > n_kv) {
+        n_view = n_kv;
+    }
+    return n_view;
+}
+
+// [S1] Number of attention streams the paged inputs build over - matches K->ne[3]
+// at build time and the n_stream used by can_reuse_kq_mask in llama-graph.cpp.
+static inline int64_t paged_n_stream(const llm_graph_params & params) {
+    return params.cparams.kv_unified ? 1 : (int64_t) params.ubatch.n_seqs_unq;
+}
+
+ // Graph input that, at set_input time, fills an I32 [n_gather, n_stream] tensor
+ // with each stream's non-empty cell indices (position-sorted, padded with a
+-// masked/empty cell) by delegating to the kv-cache context. Private to this
+-// unit; default can_reuse()==false keeps the graph from being reused across
+-// decodes (n_gather grows every step).
+// masked/empty cell) by delegating to the kv-cache context. Private to this unit.
+//
+// [S1] can_reuse: the graph topology depends only on the tensor SHAPE
+// [n_gather, n_stream] - the index CONTENTS are refilled at set_input every step,
+// so they need not match. n_gather is UNPADDED here (the gather path is used for
+// prefill / transposed-V fallback), so it grows every decode and reuse rarely
+// holds - correct and harmless. mctx is refreshed from the owning attn input
+// (whose mctx is updated by attn_kv/mem_hybrid can_reuse earlier in the input list).
+ class input_gather_idxs : public llm_graph_input_i {
+ public:
+-    input_gather_idxs(const llama_kv_cache_context * mctx, ggml_tensor * idxs)
+-        : mctx(mctx), idxs(idxs) {}
+    input_gather_idxs(const llama_kv_cache_context * mctx, const llm_graph_input_attn_kv * owner, ggml_tensor * idxs)
+        : mctx(mctx), owner(owner), idxs(idxs) {}
+ 
+     void set_input(const llama_ubatch * ubatch) override {
+         GGML_UNUSED(ubatch);
+@@ -46,17 +85,37 @@ public:
+         mctx->get_gather_idxs((int32_t *) idxs->data);
+     }
+ 
+    bool can_reuse(const llm_graph_params & params) override {
+        if (!owner || !paged_attn::decode_graph_reuse()) {
+            return false;
+        }
+        mctx = owner->mctx; // refresh to the live per-decode context
+        const int64_t n_gather = (int64_t) mctx->get_n_gather();
+        if (n_gather <= 0) {
+            return false;
+        }
+        return idxs->ne[0] == n_gather && idxs->ne[1] == paged_n_stream(params);
+    }
+
+     const llama_kv_cache_context * mctx;
+    const llm_graph_input_attn_kv * owner;
+     ggml_tensor * idxs;
+ };
+ 
+ // Block table filler for the in-kernel paged read: fills an I32 [n_blk, n_stream]
+ // tensor with each stream's position-ordered cells, padded to n_blk (per column)
+ // with a masked empty cell, by delegating to the kv-cache context.
+//
+// [S1] can_reuse: reuse iff the block-table tensor dims [n_view, n_stream] are
+// unchanged - n_view is bucketed to 256 (paged_block_table_n_view), so the decode
+// graph reuses across every step within a 256-token window. The table CONTENTS
+// are refilled at set_input on every step (incl. reused steps), so the reused
+// graph reads the current step's cells. mctx is refreshed from the owning attn
+// input so the reused graph's set_input/get_block_table uses the live context.
+ class input_block_table : public llm_graph_input_i {
+ public:
+-    input_block_table(const llama_kv_cache_context * mctx, ggml_tensor * idxs, uint32_t n_blk)
+-        : mctx(mctx), idxs(idxs), n_blk(n_blk) {}
+    input_block_table(const llama_kv_cache_context * mctx, const llm_graph_input_attn_kv * owner, ggml_tensor * idxs, uint32_t n_blk)
+        : mctx(mctx), owner(owner), idxs(idxs), n_blk(n_blk) {}
+ 
+     void set_input(const llama_ubatch * ubatch) override {
+         GGML_UNUSED(ubatch);
+@@ -66,7 +125,20 @@ public:
+         g_l5_t_gbt += l5_now_ns()-_t; g_l5_n_gbt++;
+     }
+ 
+    bool can_reuse(const llm_graph_params & params) override {
+        if (!owner || !paged_attn::decode_graph_reuse()) {
+            return false;
+        }
+        mctx = owner->mctx; // refresh to the live per-decode context
+        const int64_t n_view = paged_block_table_n_view(mctx);
+        if (n_view <= 0 || n_view != (int64_t) n_blk) {
+            return false;
+        }
+        return idxs->ne[0] == n_view && idxs->ne[1] == paged_n_stream(params);
+    }
+
+     const llama_kv_cache_context * mctx;
+    const llm_graph_input_attn_kv * owner;
+     ggml_tensor * idxs;
+     uint32_t n_blk;
+ };
+@@ -76,6 +148,7 @@ public:
+ void gather(ggml_context * ctx0,
+             llm_graph_result * res,
+             const llama_kv_cache_context * mctx,
+            const llm_graph_input_attn_kv * owner,
+             ggml_tensor ** k,
+             ggml_tensor ** v,
+             ggml_tensor ** kq_mask) {
+@@ -114,7 +187,7 @@ void gather(ggml_context * ctx0,
+     // n_stream, so column s gathers from stream s of the source.
+     ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_gather, n_stream);
+     ggml_set_input(idx);
+-    res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, idx)));
+    res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, owner, idx)));
+ 
+     // --- gather K: collapse (head_dim, n_head) so cells become the row axis ---
+     {
+@@ -156,6 +229,7 @@ void gather(ggml_context * ctx0,
+ bool in_kernel_decode(ggml_context * ctx0,
+                       llm_graph_result * res,
+                       const llama_kv_cache_context * mctx,
+                      const llm_graph_input_attn_kv * owner,
+                       ggml_tensor ** k,
+                       ggml_tensor ** v,
+                       ggml_tensor ** kq_mask,
+@@ -221,7 +295,7 @@ bool in_kernel_decode(ggml_context * ctx0,
+ 
+     ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
+     ggml_set_input(idx);
+-    res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
+    res->add_input(llm_graph_input_ptr(new input_block_table(mctx, owner, idx, (uint32_t) n_view)));
+ 
+     // Present K and V as [d, h, n_view, ns] VIEWS of the full physical window:
+     // identical per-cell (nb1,nb2) and per-stream (nb3) strides, only the cell
+diff --git a/src/paged-attn.h b/src/paged-attn.h
+index 23e2184..fafe821 100644
+--- a/src/paged-attn.h
+++ b/src/paged-attn.h
+@@ -21,18 +21,31 @@ struct ggml_context;
+ struct ggml_tensor;
+ class  llm_graph_result;
+ class  llama_kv_cache_context;
+class  llm_graph_input_attn_kv;
+ 
+ namespace paged_attn {
+ 
+ // true iff env LLAMA_KV_PAGED is set (evaluated once).
+ bool active();
+ 
+// [S1] true iff the paged decode-graph reuse (layer-A can_reuse on the paged
+// inputs) is ENABLED. Default ON when active(); LLAMA_PAGED_NO_GRAPH_REUSE=1
+// forces it off (A/B probe / safety hatch). When off the paged inputs keep the
+// stock default can_reuse()==false, i.e. the pre-S1 behaviour (rebuild every
+// step). Bit-exact either way - reuse only skips the host-side graph rebuild,
+// set_inputs still re-runs every step.
+bool decode_graph_reuse();
+
+ // Gather K, V and the kq_mask down to the current sequence's non-empty cells.
+ // No-op (returns immediately) unless active(). On return *k, *v and *kq_mask
+ // point at the compacted tensors; pass them straight to build_attn_mha.
+// `owner` is the attention input that owns the live (per-decode-refreshed) memory
+// context; the paged input reads owner->mctx in can_reuse so a reused graph picks
+// up the fresh context (see input_gather_idxs::can_reuse). May be null (no reuse).
+ void gather(ggml_context * ctx0,
+             llm_graph_result * res,
+             const llama_kv_cache_context * mctx,
+            const llm_graph_input_attn_kv * owner,
+             ggml_tensor ** k,
+             ggml_tensor ** v,
+             ggml_tensor ** kq_mask);
+@@ -50,6 +63,7 @@ void gather(ggml_context * ctx0,
+ bool in_kernel_decode(ggml_context * ctx0,
+                       llm_graph_result * res,
+                       const llama_kv_cache_context * mctx,
+                      const llm_graph_input_attn_kv * owner,
+                       ggml_tensor ** k,
+                       ggml_tensor ** v,
+                       ggml_tensor ** kq_mask,
+-- 
+2.43.0
+
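
The root cause patch 0040 describes (one input whose `can_reuse` stays at the default `false` vetoes reuse of the whole graph) can be modelled in a few lines. This is a simplified sketch, not the llama.cpp types: `graph_input`, `legacy_block_table`, `bucketed_block_table` and `graph_can_reuse` are hypothetical stand-ins, and the single `n_view_now` argument replaces the full graph parameters.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Simplified stand-in for a graph input interface. The default mirrors the
// behaviour the patch describes: no override means "never reusable".
struct graph_input {
    virtual ~graph_input() = default;
    virtual bool can_reuse(long n_view_now) {
        (void) n_view_now;
        return false;
    }
};

// Pre-S1 situation: the paged input inherits the default.
struct legacy_block_table : graph_input {};

// S1-style input: reusable while the bucketed dim it was built with is unchanged.
struct bucketed_block_table : graph_input {
    explicit bucketed_block_table(long n_view_built) : n_view_built(n_view_built) {}
    bool can_reuse(long n_view_now) override { return n_view_now == n_view_built; }
    long n_view_built;
};

// A graph is reusable only if every input agrees. Every input is consulted
// (no short-circuit), because in the patch can_reuse also refreshes state.
inline bool graph_can_reuse(const std::vector<std::unique_ptr<graph_input>> & inputs, long n_view_now) {
    bool ok = true;
    for (const auto & in : inputs) {
        assert(in != nullptr);
        const bool cur = in->can_reuse(n_view_now);
        ok = ok && cur;
    }
    return ok;
}
```

In this model a graph holding a `legacy_block_table` is rejected on every step regardless of shape, while one holding a `bucketed_block_table(256)` is accepted until the bucketed dim changes.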
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch
@@ -0,0 +1,95 @@
+From ef2765d85829c9ede2fc9aa90523386d765c9040 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Sun, 28 Jun 2026 20:00:24 +0200
+Subject: [PATCH 41/41] feat(paged): S3 decode-shape-stable scheduling (patch
+ 0041)
+
+The S1 paged decode-graph reuse (patch 0040) is necessary but not sufficient in
+continuous serving: with cont_batching a co-batched prefill chunk inflates the
+step from n_tokens==D (pure decode) to D+P, which changes the ubatch shape and
+breaks llama-context layer-A reuse on (nearly) every step. Measured: S1 alone
+holds only 13.8% graph reuse in a 128-client serving load.
+
+S3 makes the scheduler EMIT graph-reusable steps to match what S1 makes reusable.
+While there is live decode load it runs PURE-decode steps (skip Phase-2 prompt
+admission) so the decode batch shape stays constant, and admits a prefill chunk
+only on a bounded cadence (every LLAMA_PAGED_PREFILL_PERIOD steps, default 8) or
+when no decode is active. The deferred prefill chunk still runs within at most
+(period-1) decode steps, so prompt latency rises by a bounded amount.
+
+Pure policy change inside update_slots(), built on the patch-0016 decode-first
+budget; no new slot states, no batch-formation rewrite, zero libllama changes.
+
+BIT-EXACT: only changes WHICH step a prompt chunk is admitted in. Each sequence's
+decode logits depend on its own tokens + its own KV only (the paged decode read is
+per-stream, attention is permutation-invariant over the co-batched set), so
+deferring another slot's prefill never changes a generating slot's output.
+DEFAULT-OFF: LLAMA_PAGED_DECODE_STABLE unset => byte-identical to patch 0016. Does
+not run in the single-sequence greedy md5 gate (that path is llama-completion).
+
+Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
+S1+S3 vs baseline (graphs rebuilt every step): graph reuse 0% -> 72.2%, hostproc
+15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean,
+at vLLM's ~5.9 sustained). Remaining 28% rebuilds are request-boundary D/seq-set
+churn + the prefill-cadence steps; closing them needs a padded/fixed-slot decode
+shape (scoped follow-up, see DECODE_SERVING_SCOPE.md).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
+ 1 file changed, 34 insertions(+), 1 deletion(-)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index 64775dc..9baca33 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -3138,11 +3138,44 @@ private:
+         }
+         int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
+ 
+        // PAGED serving lever (patch 0041, S3): decode-shape-stable scheduling.
+        // Pairs with the S1 paged decode-graph reuse (patch 0040): S1 makes a
+        // pure-decode step graph-reusable, S3 makes the scheduler EMIT pure-decode
+        // steps. With continuous batching a co-batched prefill chunk inflates the
+        // step from n_tokens==D (pure decode) to D+P, which changes the ubatch
+        // shape and breaks layer-A graph reuse on (nearly) EVERY step. S3 keeps prefill out
+        // of the decode step: while there is live decode load it runs pure-decode
+        // steps (reuse holds) and admits a prefill chunk only on a bounded cadence
+        // (every LLAMA_PAGED_PREFILL_PERIOD steps, default 8) or when no decode is
+        // active. The deferred prefill chunk still runs within a few steps, so
+        // prompt latency rises by at most (period-1) decode steps.
+        //
+        // BIT-EXACT: this only changes WHICH step a prompt chunk is admitted in.
+        // Each sequence's decode logits depend on its own tokens + its own KV only
+        // (the paged decode read is per-stream, attention is permutation-invariant
+        // over the co-batched set), so deferring another slot's prefill never
+        // changes a generating slot's output. DEFAULT-OFF: env unset => no change,
+        // byte-identical to patch 0016. Does not run in the single-sequence greedy
+        // md5 gate (that path is llama-completion, not update_slots).
+        bool decode_only_step = false;
+        {
+            static const int s3_enabled = [](){ const char * e = getenv("LLAMA_PAGED_DECODE_STABLE"); return e ? atoi(e) : 0; }();
+            if (s3_enabled && n_decode_in_batch > 0) {
+                static const int s3_period = [](){ const char * e = getenv("LLAMA_PAGED_PREFILL_PERIOD"); int p = e ? atoi(e) : 8; return p > 0 ? p : 8; }();
+                static long s3_step = 0;
+                const bool prefill_due = (s3_step % s3_period) == 0;
+                s3_step++;
+                decode_only_step = !prefill_due;
+            }
+        }
+
+         auto & alora_scale       = batch.alora_scale;
+         auto & alora_disabled_id = batch.alora_disabled_id;
+ 
+         // next, batch any pending prompts without exceeding n_batch
+-        if (params_base.cont_batching || batch.size() == 0) {
+        // (patch 0041, S3) skip prompt admission on a pure-decode step to keep the
+        // decode batch shape reuse-stable
+        if ((params_base.cont_batching || batch.size() == 0) && !decode_only_step) {
+             bool add_ok = true; // false means the batch is full, skip remaining slots
+ 
+             iterate(slots, [&](server_slot & slot) {
+-- 
+2.43.0
+
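
To make the S3 cadence concrete, here is a standalone sketch of the per-step decision in the hunk above. It follows the same rules (a prefill step when the decode-step counter is a multiple of the period, fallback to 8 for a missing or non-positive period), but `parse_period`, `decode_stable_policy` and `count_decode_only` are hypothetical names, and the policy is lifted out of `update_slots()` into a small struct so it can be exercised on its own.

```cpp
#include <cassert>
#include <cstdlib>

// Parse a positive period from an environment-style string. Null, zero,
// negative and non-numeric values all fall back to 8.
inline int parse_period(const char * e) {
    const int p = e ? std::atoi(e) : 8;
    return p > 0 ? p : 8;
}

// Hypothetical standalone form of the S3 decision.
struct decode_stable_policy {
    explicit decode_stable_policy(int period) : period(period), step(0) {}

    // Returns true when prompt admission should be skipped this step.
    // The counter only advances on steps that carry decode load.
    bool decode_only_step(bool enabled, int n_decode_in_batch) {
        if (!enabled || n_decode_in_batch <= 0) {
            return false; // disabled or idle: prefill is admitted as before
        }
        const bool prefill_due = (step % period) == 0;
        ++step;
        return !prefill_due;
    }

    int  period;
    long step;
};

// Count the pure-decode steps over n_steps of uninterrupted decode load.
inline int count_decode_only(int period, int n_steps) {
    assert(period > 0);
    decode_stable_policy pol(period);
    int n = 0;
    for (int i = 0; i < n_steps; ++i) {
        n += pol.decode_only_step(true, 1) ? 1 : 0;
    }
    return n;
}
```

With the default period of 8, 16 consecutive decode steps yield 14 pure-decode steps and 2 steps where prefill may be admitted, so a deferred prompt chunk waits at most 7 decode steps in this model.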