fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 on under paged KV

FIX A (patch 0031 compose break): the chunked GDN prefill patch carried '#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but those were introduced by the dropped bf16-tau patch 0026, so on the bf16-tau-free 0001-0030 base only '#include <cstdlib>' is present and 'git apply' failed. The same 0026 drop also shifted 0031's later hunks off their context (the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template params, and the GDN_LAUNCH_ARGS list).

Regenerated 0031 against a fresh pin(0ed235ea) + 0001-0030 tree: the chunked kernel now SELF-PROVIDES the cuda_bf16.h / type_traits includes (adds them, plus the climits it needs for INT_MAX) and the dispatch guard is the 2-param 'if constexpr (!KDA && !keep_rs_t)' form. Behaviour is unchanged: 0031 stays opt-in, default OFF (GDN_CHUNK_MIN), a recorded negative. The full 0001-0042 series now applies clean on 0ed235ea ('git apply --check' green for every patch).

FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted OFF. Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set), still overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side change in update_slots(); re-exported from the dev tree, README 0041 row updated to match.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 19:06:43 -04:00 · 2026-06-28 19:37:05 +00:00
parent d706980c2b
commit 2fa8ef8fc5
3 changed files with 41 additions and 31 deletions
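The FIX B default can be read as a small precedence rule: an explicit LLAMA_PAGED_DECODE_STABLE value wins, otherwise the mere presence of LLAMA_KV_PAGED turns S3 on. The sketch below restates that rule, plus the LLAMA_PAGED_PREFILL_PERIOD parsing, as pure functions so it can be checked in isolation. The function names are hypothetical and do not appear in the patch, which inlines this logic as static lambdas inside update_slots().

```cpp
#include <cstdlib>

// Hypothetical helper (not a name from patch 0041): resolve the S3 switch from
// the raw values of LLAMA_PAGED_DECODE_STABLE and LLAMA_KV_PAGED.
// A nullptr argument means "variable is unset".
inline bool s3_enabled_from(const char * decode_stable, const char * kv_paged) {
    if (decode_stable != nullptr) {
        // Explicit override: any non-zero integer enables, "0" forces off.
        return std::atoi(decode_stable) != 0;
    }
    // No override: default ON whenever LLAMA_KV_PAGED is present at all.
    return kv_paged != nullptr;
}

// Hypothetical helper: parse LLAMA_PAGED_PREFILL_PERIOD, falling back to 8
// when the variable is unset or does not parse to a positive integer.
inline int prefill_period_from(const char * period) {
    const int p = period != nullptr ? std::atoi(period) : 8;
    return p > 0 ? p : 8;
}

// Thin wrapper that reads the real environment once, mirroring the
// "static const ... = [](){...}()" caching used in the patch.
inline bool s3_enabled() {
    static const bool enabled = s3_enabled_from(
        std::getenv("LLAMA_PAGED_DECODE_STABLE"),
        std::getenv("LLAMA_KV_PAGED"));
    return enabled;
}
```

Two consequences follow from this form: LLAMA_KV_PAGED is tested only for presence, so even LLAMA_KV_PAGED=0 counts as set here, and LLAMA_PAGED_DECODE_STABLE=1 still enables S3 when LLAMA_KV_PAGED is absent.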
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -135,7 +135,7 @@ hides.
 | # | What it does | Bit-exact |
 |---|---|---|
 | 0040 | **S1 paged decode-graph reuse** - the paged decode inputs (`input_block_table` / `input_gather_idxs`) never overrode `can_reuse` (defaults to false), so any graph carrying a paged input could never be reused. Add a correct `can_reuse` keyed on the (256-bucketed) block-table dims + a live-mctx refresh from the owning attn input. `LLAMA_PAGED_NO_GRAPH_REUSE=1` forces the pre-S1 path. | yes (md5 byte-identical reuse on/off; dense `5951a5b4`, paged-MoE `8cb0ce23`) |
-| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). `LLAMA_PAGED_DECODE_STABLE=1` to enable. | yes (default-off byte-identical; per-stream independent in serving) |
+| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). **Default ON under paged KV** (enabled when `LLAMA_KV_PAGED` is set; `LLAMA_PAGED_DECODE_STABLE=0` forces it off). | yes (byte-identical on/off; per-stream independent in serving) |

 Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
 graph reuse **0% -> 72.2%**, host window `hostproc` **15.98 -> 6.31 ms/step**,
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch
@@ -1,8 +1,8 @@
-From c9bf1bd0000000000000000000000000000031aa Mon Sep 17 00:00:00 2001
+From 37549ecce806130b36012dfd0077ad830989ec71 Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
-Date: Sun, 28 Jun 2026 12:00:00 +0000
-Subject: [PATCH] feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031)
-
+Date: Sun, 28 Jun 2026 19:30:01 +0000
+Subject: [PATCH] feat(paged): chunked parallel-scan GDN prefill kernel (patch
+ 0031)

 Implements the explicit upstream TODO at gated_delta_net.cu's
 launch_gated_delta_net ("Add chunked kernel for even faster pre-fill"). The
@@ -66,24 +66,27 @@ README section 5 (dev notes / rejected-flat levers).

 Assisted-by: Claude:opus-4.8 [Claude Code]
 ---
- ggml/src/ggml-cuda/gated_delta_net.cu |  235 ++++++++++++++++++++++++++++++++++++++++++++++++++++
- tests/test-backend-ops.cpp            |    8 ++++++++
- 2 files changed, 243 insertions(+)
+ ggml/src/ggml-cuda/gated_delta_net.cu | 237 ++++++++++++++++++++++++++
+ tests/test-backend-ops.cpp            |   8 +
+ 2 files changed, 245 insertions(+)

 diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
-index 830118a..c9bf1bd 100644
+index d071d5a..7121d80 100644
 --- a/ggml/src/ggml-cuda/gated_delta_net.cu
 +++ b/ggml/src/ggml-cuda/gated_delta_net.cu
-@@ -1,6 +1,7 @@
+@@ -1,7 +1,10 @@
 #include "gated_delta_net.cuh"
 #include "ggml-cuda/common.cuh"
 
 +#include <climits>
 #include <cstdlib>
- #include <cuda_bf16.h>
- #include <type_traits>
-@@ -407,6 +408,219 @@ static void launch_gdn_variant(
-         sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head, hyb);
++#include <cuda_bf16.h>
++#include <type_traits>
+ 
+ // Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
+ // disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+@@ -279,6 +282,219 @@ static void launch_gdn_variant(
+         sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
 }
 
 +// ============================================================================
@@ -299,10 +302,10 @@ index 830118a..c9bf1bd 100644
 +        neqk1_magic, rq3_magic, scale, state_dst_d, ids_d, rs_head);
 +}
 +
- template <bool KDA, bool keep_rs_t, bool STATE_BF16, bool HYBRID>
+ template <bool KDA, bool keep_rs_t>
 static void launch_gated_delta_net(
         const float * q_d, const float * k_d, const float * v_d,
-@@ -425,6 +639,27 @@ static void launch_gated_delta_net(
+@@ -297,6 +513,27 @@ static void launch_gated_delta_net(
     const uint3 neqk1_magic = init_fastdiv_values(neqk1);
     const uint3 rq3_magic   = init_fastdiv_values(rq3);
 
@@ -311,7 +314,7 @@ index 830118a..c9bf1bd 100644
 +    // head dim (S_v==128) and a prefill token threshold; decode (n_tokens small) keeps the tuned
 +    // sequential recurrence. Mathematically equivalent up to FP reduction order (NEW per-path md5;
 +    // validated benign by test-backend-ops NMSE + greedy output). Toggle: GDN_CHUNK_OFF / GDN_CHUNK_MIN.
-+    if constexpr (!KDA && !keep_rs_t && !STATE_BF16 && !HYBRID) {
++    if constexpr (!KDA && !keep_rs_t) {
 +        // OPT-IN: this chunked path is bit-exact-benign (test-backend-ops green) but, at C=16
 +        // (forced by GB10 99KB dyn-smem opt-in, all-shared), it is NOT yet faster than the tuned
 +        // sequential recurrence on this model (measured ~22%% slower S_PP, grid-starved at low
@@ -328,13 +331,13 @@ index 830118a..c9bf1bd 100644
 +    }
 +
 #define GDN_LAUNCH_ARGS \
-         q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, hyb, \
+         q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
         H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
 diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index c0233eb..951bffc 100644
+index ac30e47..4e40d23 100644
 --- a/tests/test-backend-ops.cpp
 +++ b/tests/test-backend-ops.cpp
-@@ -9459,6 +9459,14 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+@@ -9398,6 +9398,14 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
     test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 100, 1));
     test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 200, 1));
     test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 127, 2));
@@ -349,6 +352,6 @@ index c0233eb..951bffc 100644
     test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64,  64, 1, 1, false, true));
     test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64,  33, 1, 1, false, true));
     test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 100, 1, 1, false, true));
-
 -- 
 2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch
@@ -1,4 +1,4 @@
-From ef2765d85829c9ede2fc9aa90523386d765c9040 Mon Sep 17 00:00:00 2001
+From ee8021b56ed0effe493a64aa50449ab928dd6b29 Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Sun, 28 Jun 2026 20:00:24 +0200
 Subject: [PATCH 41/41] feat(paged): S3 decode-shape-stable scheduling (patch
@@ -24,7 +24,9 @@ BIT-EXACT: only changes WHICH step a prompt chunk is admitted in. Each sequence'
 decode logits depend on its own tokens + its own KV only (the paged decode read is
 per-stream, attention is permutation-invariant over the co-batched set), so
 deferring another slot's prefill never changes a generating slot's output.
-DEFAULT-OFF: LLAMA_PAGED_DECODE_STABLE unset => byte-identical to patch 0016. Does
+DEFAULT-ON under paged KV: with LLAMA_KV_PAGED set this enables by default
+(LLAMA_PAGED_DECODE_STABLE=0 forces it off); otherwise unset => byte-identical to
+patch 0016. Does
 not run in the single-sequence greedy md5 gate (that path is llama-completion).

 Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
@@ -37,14 +39,14 @@ shape (scoped follow-up, see DECODE_SERVING_SCOPE.md).
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
- tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
- 1 file changed, 34 insertions(+), 1 deletion(-)
+ tools/server/server-context.cpp | 40 ++++++++++++++++++++++++++++++++-
+ 1 file changed, 39 insertions(+), 1 deletion(-)

 diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 64775dc..9baca33 100644
+index 64775dc..fc0231a 100644
 --- a/tools/server/server-context.cpp
 +++ b/tools/server/server-context.cpp
-@@ -3138,11 +3138,44 @@ private:
+@@ -3138,11 +3138,49 @@ private:
         }
         int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
 
@@ -64,12 +66,17 @@ index 64775dc..9baca33 100644
 +        // Each sequence's decode logits depend on its own tokens + its own KV only
 +        // (the paged decode read is per-stream, attention is permutation-invariant
 +        // over the co-batched set), so deferring another slot's prefill never
-+        // changes a generating slot's output. DEFAULT-OFF: env unset => no change,
-+        // byte-identical to patch 0016. Does not run in the single-sequence greedy
-+        // md5 gate (that path is llama-completion, not update_slots).
++        // changes a generating slot's output. DEFAULT-ON under paged KV: with
++        // LLAMA_KV_PAGED set it enables by default (LLAMA_PAGED_DECODE_STABLE=0
++        // forces off); otherwise byte-identical to patch 0016. Does not run in the
++        // single-sequence greedy md5 gate (that path is llama-completion, not update_slots).
 +        bool decode_only_step = false;
 +        {
-+            static const int s3_enabled = [](){ const char * e = getenv("LLAMA_PAGED_DECODE_STABLE"); return e ? atoi(e) : 0; }();
++            static const int s3_enabled = [](){
++                const char * e = getenv("LLAMA_PAGED_DECODE_STABLE");
++                if (e) { return atoi(e); }                          // explicit override (=0 forces off)
++                return getenv("LLAMA_KV_PAGED") != nullptr ? 1 : 0; // default ON under paged KV
++            }();
 +            if (s3_enabled && n_decode_in_batch > 0) {
 +                static const int s3_period = [](){ const char * e = getenv("LLAMA_PAGED_PREFILL_PERIOD"); int p = e ? atoi(e) : 8; return p > 0 ? p : 8; }();
 +                static long s3_step = 0;