fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 on under paged KV

FIX A (patch 0031 compose break): the chunked GDN prefill patch carried
'#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but
those were introduced by the dropped bf16-tau patch 0026, so on the
bf16-tau-free 0001-0030 base only '#include <cstdlib>' is present and 'git
apply' failed. The same 0026 drop also shifted 0031's later hunks off their
context (the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template
params, and the GDN_LAUNCH_ARGS list). Regenerated 0031 against a fresh
pin(0ed235ea) + 0001-0030 tree: the chunked kernel now SELF-PROVIDES the
cuda_bf16.h / type_traits includes (adds them, plus the climits it needs for
INT_MAX) and the dispatch guard is the 2-param 'if constexpr (!KDA &&
!keep_rs_t)' form. Behaviour is unchanged: 0031 stays opt-in and default OFF
(GDN_CHUNK_MIN), recorded as a negative result. The full 0001-0042 series now
applies cleanly on 0ed235ea ('git apply --check' passes for every patch).

FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted OFF.
Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set), still
overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side change in
update_slots(); re-exported from the dev tree, README 0041 row updated to match.
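
The resolution order described in FIX B can be sketched in isolation as below.
This is a minimal illustration, not the patch itself: resolve_s3_enabled(),
s3_enabled_from_env() and check_s3_resolution() are hypothetical names; the
patch does the same thing inline in a static lambda inside update_slots().

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (assumption, not part of patch 0041): pure resolution of
// the S3 default from the two environment values.
//   stable_env: value of LLAMA_PAGED_DECODE_STABLE, or nullptr if unset
//   paged_env:  value of LLAMA_KV_PAGED, or nullptr if unset
static int resolve_s3_enabled(const char * stable_env, const char * paged_env) {
    if (stable_env) {
        return std::atoi(stable_env);    // explicit override; "0" forces off
    }
    return paged_env != nullptr ? 1 : 0; // default ON only when paged KV is set
}

// Hypothetical wrapper: reads the real environment once per call.
static int s3_enabled_from_env() {
    return resolve_s3_enabled(std::getenv("LLAMA_PAGED_DECODE_STABLE"),
                              std::getenv("LLAMA_KV_PAGED"));
}

// Walks the cases named in the commit message.
static void check_s3_resolution() {
    assert(resolve_s3_enabled(nullptr, nullptr) == 0); // neither set: off
    assert(resolve_s3_enabled(nullptr, "1")     == 1); // paged KV only: on
    assert(resolve_s3_enabled("0",     "1")     == 0); // override wins: off
    assert(resolve_s3_enabled("1",     nullptr) == 1); // explicit on without paged KV
}
```

Note that, as in the patch, only the presence of LLAMA_KV_PAGED is tested, so
any value of that variable (including "0") turns the default on unless
LLAMA_PAGED_DECODE_STABLE overrides it.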

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-28 19:37:05 +00:00
parent d706980c2b
commit 2fa8ef8fc5
3 changed files with 41 additions and 31 deletions

@@ -135,7 +135,7 @@ hides.
| # | What it does | Bit-exact |
|---|---|---|
| 0040 | **S1 paged decode-graph reuse** - the paged decode inputs (`input_block_table` / `input_gather_idxs`) never overrode `can_reuse` (defaults to false), so any graph carrying a paged input could never be reused. Add a correct `can_reuse` keyed on the (256-bucketed) block-table dims + a live-mctx refresh from the owning attn input. `LLAMA_PAGED_NO_GRAPH_REUSE=1` forces the pre-S1 path. | yes (md5 byte-identical reuse on/off; dense `5951a5b4`, paged-MoE `8cb0ce23`) |
| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). `LLAMA_PAGED_DECODE_STABLE=1` to enable. | yes (default-off byte-identical; per-stream independent in serving) |
| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). **Default ON under paged KV** (enabled when `LLAMA_KV_PAGED` is set; `LLAMA_PAGED_DECODE_STABLE=0` forces it off). | yes (byte-identical on/off; per-stream independent in serving) |
Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
graph reuse **0% -> 72.2%**, host window `hostproc` **15.98 -> 6.31 ms/step**,

@@ -1,8 +1,8 @@
From c9bf1bd0000000000000000000000000000031aa Mon Sep 17 00:00:00 2001
From 37549ecce806130b36012dfd0077ad830989ec71 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sun, 28 Jun 2026 12:00:00 +0000
Subject: [PATCH] feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031)
Date: Sun, 28 Jun 2026 19:30:01 +0000
Subject: [PATCH] feat(paged): chunked parallel-scan GDN prefill kernel (patch
0031)
Implements the explicit upstream TODO at gated_delta_net.cu's
launch_gated_delta_net ("Add chunked kernel for even faster pre-fill"). The
@@ -66,24 +66,27 @@ README section 5 (dev notes / rejected-flat levers).
Assisted-by: Claude:opus-4.8 [Claude Code]
---
ggml/src/ggml-cuda/gated_delta_net.cu | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++
tests/test-backend-ops.cpp | 8 ++++++++
2 files changed, 243 insertions(+)
ggml/src/ggml-cuda/gated_delta_net.cu | 237 ++++++++++++++++++++++++++
tests/test-backend-ops.cpp | 8 +
2 files changed, 245 insertions(+)
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index 830118a..c9bf1bd 100644
index d071d5a..7121d80 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -1,6 +1,7 @@
@@ -1,7 +1,10 @@
#include "gated_delta_net.cuh"
#include "ggml-cuda/common.cuh"
+#include <climits>
#include <cstdlib>
#include <cuda_bf16.h>
#include <type_traits>
@@ -407,6 +408,219 @@ static void launch_gdn_variant(
sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head, hyb);
+#include <cuda_bf16.h>
+#include <type_traits>
// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
@@ -279,6 +282,219 @@ static void launch_gdn_variant(
sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
}
+// ============================================================================
@@ -299,10 +302,10 @@ index 830118a..c9bf1bd 100644
+ neqk1_magic, rq3_magic, scale, state_dst_d, ids_d, rs_head);
+}
+
template <bool KDA, bool keep_rs_t, bool STATE_BF16, bool HYBRID>
template <bool KDA, bool keep_rs_t>
static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
@@ -425,6 +639,27 @@ static void launch_gated_delta_net(
@@ -297,6 +513,27 @@ static void launch_gated_delta_net(
const uint3 neqk1_magic = init_fastdiv_values(neqk1);
const uint3 rq3_magic = init_fastdiv_values(rq3);
@@ -311,7 +314,7 @@ index 830118a..c9bf1bd 100644
+ // head dim (S_v==128) and a prefill token threshold; decode (n_tokens small) keeps the tuned
+ // sequential recurrence. Mathematically equivalent up to FP reduction order (NEW per-path md5;
+ // validated benign by test-backend-ops NMSE + greedy output). Toggle: GDN_CHUNK_OFF / GDN_CHUNK_MIN.
+ if constexpr (!KDA && !keep_rs_t && !STATE_BF16 && !HYBRID) {
+ if constexpr (!KDA && !keep_rs_t) {
+ // OPT-IN: this chunked path is bit-exact-benign (test-backend-ops green) but, at C=16
+ // (forced by GB10 99KB dyn-smem opt-in, all-shared), it is NOT yet faster than the tuned
+ // sequential recurrence on this model (measured ~22%% slower S_PP, grid-starved at low
@@ -328,13 +331,13 @@ index 830118a..c9bf1bd 100644
+ }
+
#define GDN_LAUNCH_ARGS \
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, hyb, \
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index c0233eb..951bffc 100644
index ac30e47..4e40d23 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -9459,6 +9459,14 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
@@ -9398,6 +9398,14 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 100, 1));
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 200, 1));
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 127, 2));
@@ -349,6 +352,6 @@ index c0233eb..951bffc 100644
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 64, 1, 1, false, true));
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 33, 1, 1, false, true));
test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 100, 1, 1, false, true));
--
2.43.0

@@ -1,4 +1,4 @@
From ef2765d85829c9ede2fc9aa90523386d765c9040 Mon Sep 17 00:00:00 2001
From ee8021b56ed0effe493a64aa50449ab928dd6b29 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sun, 28 Jun 2026 20:00:24 +0200
Subject: [PATCH 41/41] feat(paged): S3 decode-shape-stable scheduling (patch
@@ -24,7 +24,9 @@ BIT-EXACT: only changes WHICH step a prompt chunk is admitted in. Each sequence'
decode logits depend on its own tokens + its own KV only (the paged decode read is
per-stream, attention is permutation-invariant over the co-batched set), so
deferring another slot's prefill never changes a generating slot's output.
DEFAULT-OFF: LLAMA_PAGED_DECODE_STABLE unset => byte-identical to patch 0016. Does
DEFAULT-ON under paged KV: with LLAMA_KV_PAGED set this enables by default
(LLAMA_PAGED_DECODE_STABLE=0 forces it off); otherwise unset => byte-identical to
patch 0016. Does
not run in the single-sequence greedy md5 gate (that path is llama-completion).
Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
@@ -37,14 +39,14 @@ shape (scoped follow-up, see DECODE_SERVING_SCOPE.md).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
tools/server/server-context.cpp | 40 ++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 64775dc..9baca33 100644
index 64775dc..fc0231a 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -3138,11 +3138,44 @@ private:
@@ -3138,11 +3138,49 @@ private:
}
int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
@@ -64,12 +66,17 @@ index 64775dc..9baca33 100644
+ // Each sequence's decode logits depend on its own tokens + its own KV only
+ // (the paged decode read is per-stream, attention is permutation-invariant
+ // over the co-batched set), so deferring another slot's prefill never
+ // changes a generating slot's output. DEFAULT-OFF: env unset => no change,
+ // byte-identical to patch 0016. Does not run in the single-sequence greedy
+ // md5 gate (that path is llama-completion, not update_slots).
+ // changes a generating slot's output. DEFAULT-ON under paged KV: with
+ // LLAMA_KV_PAGED set it enables by default (LLAMA_PAGED_DECODE_STABLE=0
+ // forces off); otherwise byte-identical to patch 0016. Does not run in the
+ // single-sequence greedy md5 gate (that path is llama-completion, not update_slots).
+ bool decode_only_step = false;
+ {
+ static const int s3_enabled = [](){ const char * e = getenv("LLAMA_PAGED_DECODE_STABLE"); return e ? atoi(e) : 0; }();
+ static const int s3_enabled = [](){
+ const char * e = getenv("LLAMA_PAGED_DECODE_STABLE");
+ if (e) { return atoi(e); } // explicit override (=0 forces off)
+ return getenv("LLAMA_KV_PAGED") != nullptr ? 1 : 0; // default ON under paged KV
+ }();
+ if (s3_enabled && n_decode_in_batch > 0) {
+ static const int s3_period = [](){ const char * e = getenv("LLAMA_PAGED_PREFILL_PERIOD"); int p = e ? atoi(e) : 8; return p > 0 ? p : 8; }();
+ static long s3_step = 0;