From 2fa8ef8fc53ccb2d932a1a89472486dfc80e0b59 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Sun, 28 Jun 2026 19:37:05 +0000
Subject: [PATCH] fix(paged): make patch 0031 apply on the 0001-0030 base;
 default S3 on under paged KV

FIX A (patch 0031 compose break): the chunked GDN prefill patch carried
'#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but
those were introduced by the dropped bf16-tau patch 0026, so on the
bf16-tau-free 0001-0030 base only '#include ' is present and 'git apply'
failed. The same 0026 drop also shifted 0031's later hunks off their context
(the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template params, and
the GDN_LAUNCH_ARGS list).

Regenerated 0031 against a fresh pin (0ed235ea) + 0001-0030 tree: the chunked
kernel now SELF-PROVIDES the cuda_bf16.h / type_traits includes (adds them,
plus the climits it needs for INT_MAX) and the dispatch guard is the 2-param
'if constexpr (!KDA && !keep_rs_t)' form. Behaviour is unchanged: 0031 stays
opt-in, default OFF (GDN_CHUNK_MIN), a recorded negative. The full 0001-0042
series now applies clean on 0ed235ea ('git apply --check' green for every
patch).

FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted
OFF. Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set),
still overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side
change in update_slots(); re-exported from the dev tree, README 0041 row
updated to match.

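For reference, the FIX B default can be read as the following minimal sketch.
It mirrors the s3_enabled lambda added to update_slots() in the 0041 hunk of
this commit. The helper name s3_default and its two parameters are
hypothetical: the patch itself reads both variables with getenv() directly,
and the values are passed in here only so the decision can be checked without
touching the process environment.

```cpp
#include <cstdlib>

// Hypothetical helper, not part of the patch: same decision as the
// s3_enabled lambda, with the two environment values passed as arguments.
//   decode_stable : value of LLAMA_PAGED_DECODE_STABLE, or nullptr if unset
//   kv_paged      : value of LLAMA_KV_PAGED, or nullptr if unset
static int s3_default(const char * decode_stable, const char * kv_paged) {
    if (decode_stable) {
        return std::atoi(decode_stable); // explicit override (=0 forces off)
    }
    return kv_paged != nullptr ? 1 : 0;  // default ON under paged KV
}
```

Only the presence of LLAMA_KV_PAGED is tested, so under this logic
LLAMA_KV_PAGED=0 still turns S3 on unless LLAMA_PAGED_DECODE_STABLE=0 is also
set. An explicitly empty LLAMA_PAGED_DECODE_STABLE parses as 0 and disables it.
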
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 backend/cpp/llama-cpp-localai-paged/README.md | 2 +-
 ...aged-chunked-gdn-prefill-scan-kernel.patch | 43 ++++++++++---------
 ...code-shape-stable-scheduling-patch-0.patch | 27 +++++++-----
 3 files changed, 41 insertions(+), 31 deletions(-)

diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md
index f27bedd13..305342ea2 100644
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -135,7 +135,7 @@ hides.
 | # | What it does | Bit-exact |
 |---|---|---|
 | 0040 | **S1 paged decode-graph reuse** - the paged decode inputs (`input_block_table` / `input_gather_idxs`) never overrode `can_reuse` (defaults to false), so any graph carrying a paged input could never be reused. Add a correct `can_reuse` keyed on the (256-bucketed) block-table dims + a live-mctx refresh from the owning attn input. `LLAMA_PAGED_NO_GRAPH_REUSE=1` forces the pre-S1 path. | yes (md5 byte-identical reuse on/off; dense `5951a5b4`, paged-MoE `8cb0ce23`) |
-| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). `LLAMA_PAGED_DECODE_STABLE=1` to enable. | yes (default-off byte-identical; per-stream independent in serving) |
+| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). **Default ON under paged KV** (enabled when `LLAMA_KV_PAGED` is set; `LLAMA_PAGED_DECODE_STABLE=0` forces it off). | yes (byte-identical on/off; per-stream independent in serving) |
 
 Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
 graph reuse **0% -> 72.2%**, host window `hostproc` **15.98 -> 6.31 ms/step**,
diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch
index 259ed11ec..777fb5fda 100644
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch
@@ -1,8 +1,8 @@
-From c9bf1bd0000000000000000000000000000031aa Mon Sep 17 00:00:00 2001
+From 37549ecce806130b36012dfd0077ad830989ec71 Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto
-Date: Sun, 28 Jun 2026 12:00:00 +0000
-Subject: [PATCH] feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031)
-
+Date: Sun, 28 Jun 2026 19:30:01 +0000
+Subject: [PATCH] feat(paged): chunked parallel-scan GDN prefill kernel (patch
+ 0031)
 
 Implements the explicit upstream TODO at gated_delta_net.cu's
 launch_gated_delta_net ("Add chunked kernel for even faster pre-fill"). The
@@ -66,24 +66,27 @@ README section 5 (dev notes / rejected-flat levers).
 
 Assisted-by: Claude:opus-4.8 [Claude Code]
 ---
- ggml/src/ggml-cuda/gated_delta_net.cu | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++
- tests/test-backend-ops.cpp | 8 ++++++++
- 2 files changed, 243 insertions(+)
+ ggml/src/ggml-cuda/gated_delta_net.cu | 237 ++++++++++++++++++++++++++
+ tests/test-backend-ops.cpp | 8 +
+ 2 files changed, 245 insertions(+)
 
 diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
-index 830118a..c9bf1bd 100644
+index d071d5a..7121d80 100644
 --- a/ggml/src/ggml-cuda/gated_delta_net.cu
 +++ b/ggml/src/ggml-cuda/gated_delta_net.cu
-@@ -1,6 +1,7 @@
+@@ -1,7 +1,10 @@
  #include "gated_delta_net.cuh"
  #include "ggml-cuda/common.cuh"
 +#include <climits>
  #include
- #include <cuda_bf16.h>
- #include <type_traits>
-@@ -407,6 +408,219 @@ static void launch_gdn_variant(
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head, hyb);
++#include <cuda_bf16.h>
++#include <type_traits>
+ 
+ // Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
+ // disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+@@ -279,6 +282,219 @@ static void launch_gdn_variant(
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
  }
  
 +// ============================================================================
@@ -299,10 +302,10 @@ index 830118a..c9bf1bd 100644
 + neqk1_magic, rq3_magic, scale, state_dst_d, ids_d, rs_head);
 +}
 +
- template
+ template
  static void launch_gated_delta_net(
  const float * q_d, const float * k_d, const float * v_d,
-@@ -425,6 +639,27 @@ static void launch_gated_delta_net(
+@@ -297,6 +513,27 @@ static void launch_gated_delta_net(
  const uint3 neqk1_magic = init_fastdiv_values(neqk1);
  const uint3 rq3_magic = init_fastdiv_values(rq3);
  
@@ -311,7 +314,7 @@ index 830118a..c9bf1bd 100644
 + // head dim (S_v==128) and a prefill token threshold; decode (n_tokens small) keeps the tuned
 + // sequential recurrence. Mathematically equivalent up to FP reduction order (NEW per-path md5;
 + // validated benign by test-backend-ops NMSE + greedy output). Toggle: GDN_CHUNK_OFF / GDN_CHUNK_MIN.
-+ if constexpr (!KDA && !keep_rs_t && !STATE_BF16 && !HYBRID) {
++ if constexpr (!KDA && !keep_rs_t) {
 + // OPT-IN: this chunked path is bit-exact-benign (test-backend-ops green) but, at C=16
 + // (forced by GB10 99KB dyn-smem opt-in, all-shared), it is NOT yet faster than the tuned
 + // sequential recurrence on this model (measured ~22%% slower S_PP, grid-starved at low
@@ -328,13 +331,13 @@ index 830118a..c9bf1bd 100644
 + }
 +
  #define GDN_LAUNCH_ARGS \
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, hyb, \
+ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
  H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
 diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index c0233eb..951bffc 100644
+index ac30e47..4e40d23 100644
 --- a/tests/test-backend-ops.cpp
 +++ b/tests/test-backend-ops.cpp
-@@ -9459,6 +9459,14 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+@@ -9398,6 +9398,14 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
  test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 100, 1));
  test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 200, 1));
  test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 127, 2));
@@ -349,6 +352,6 @@ index c0233eb..951bffc 100644
  test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 64, 1, 1, false, true));
  test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 33, 1, 1, false, true));
  test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 100, 1, 1, false, true));
- 
 -- 
 2.43.0
+
diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch
index 9b23a7e6e..39867f1fa 100644
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch
@@ -1,4 +1,4 @@
-From ef2765d85829c9ede2fc9aa90523386d765c9040 Mon Sep 17 00:00:00 2001
+From ee8021b56ed0effe493a64aa50449ab928dd6b29 Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto
 Date: Sun, 28 Jun 2026 20:00:24 +0200
 Subject: [PATCH 41/41] feat(paged): S3 decode-shape-stable scheduling (patch
@@ -24,7 +24,9 @@ BIT-EXACT: only changes WHICH step a prompt chunk is admitted in. Each sequence'
 decode logits depend on its own tokens + its own KV only (the paged decode read is
 per-stream, attention is permutation-invariant over the co-batched set), so
 deferring another slot's prefill never changes a generating slot's output.
-DEFAULT-OFF: LLAMA_PAGED_DECODE_STABLE unset => byte-identical to patch 0016. Does
+DEFAULT-ON under paged KV: with LLAMA_KV_PAGED set this enables by default
+(LLAMA_PAGED_DECODE_STABLE=0 forces it off); otherwise unset => byte-identical to
+patch 0016. Does
 not run in the single-sequence greedy md5 gate (that path is llama-completion).
 
 Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
@@ -37,14 +39,14 @@ shape (scoped follow-up, see DECODE_SERVING_SCOPE.md).
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto
 ---
- tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
- 1 file changed, 34 insertions(+), 1 deletion(-)
+ tools/server/server-context.cpp | 40 ++++++++++++++++++++++++++++++++-
+ 1 file changed, 39 insertions(+), 1 deletion(-)
 
 diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 64775dc..9baca33 100644
+index 64775dc..fc0231a 100644
 --- a/tools/server/server-context.cpp
 +++ b/tools/server/server-context.cpp
-@@ -3138,11 +3138,44 @@ private:
+@@ -3138,11 +3138,49 @@ private:
  }
  
  int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
@@ -64,12 +66,17 @@ index 64775dc..9baca33 100644
 + // Each sequence's decode logits depend on its own tokens + its own KV only
 + // (the paged decode read is per-stream, attention is permutation-invariant
 + // over the co-batched set), so deferring another slot's prefill never
-+ // changes a generating slot's output. DEFAULT-OFF: env unset => no change,
-+ // byte-identical to patch 0016. Does not run in the single-sequence greedy
-+ // md5 gate (that path is llama-completion, not update_slots).
++ // changes a generating slot's output. DEFAULT-ON under paged KV: with
++ // LLAMA_KV_PAGED set it enables by default (LLAMA_PAGED_DECODE_STABLE=0
++ // forces off); otherwise byte-identical to patch 0016. Does not run in the
++ // single-sequence greedy md5 gate (that path is llama-completion, not update_slots).
 + bool decode_only_step = false;
 + {
-+ static const int s3_enabled = [](){ const char * e = getenv("LLAMA_PAGED_DECODE_STABLE"); return e ? atoi(e) : 0; }();
++ static const int s3_enabled = [](){
++ const char * e = getenv("LLAMA_PAGED_DECODE_STABLE");
++ if (e) { return atoi(e); } // explicit override (=0 forces off)
++ return getenv("LLAMA_KV_PAGED") != nullptr ? 1 : 0; // default ON under paged KV
++ }();
 + if (s3_enabled && n_decode_in_batch > 0) {
 + static const int s3_period = [](){ const char * e = getenv("LLAMA_PAGED_PREFILL_PERIOD"); int p = e ? atoi(e) : 8; return p > 0 ? p : 8; }();
 + static long s3_step = 0;