mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-29 19:06:43 -04:00
fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 on under paged KV
FIX A (patch 0031 compose break): the chunked GDN prefill patch carried
'#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but
those were introduced by the dropped bf16-tau patch 0026, so on the
bf16-tau-free 0001-0030 base only '#include <cstdlib>' is present and 'git
apply' failed. The same 0026 drop also shifted 0031's later hunks off their
context (the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template
params, and the GDN_LAUNCH_ARGS list). Regenerated 0031 against a fresh
pin(0ed235ea) + 0001-0030 tree: the chunked kernel now SELF-PROVIDES the
cuda_bf16.h / type_traits includes (adds them, plus the climits it needs for
INT_MAX) and the dispatch guard is the 2-param 'if constexpr (!KDA &&
!keep_rs_t)' form. Behaviour is unchanged: 0031 stays opt-in, default OFF
(GDN_CHUNK_MIN), a recorded negative. The full 0001-0042 series now applies
clean on 0ed235ea ('git apply --check' green for every patch).
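
The 2-param dispatch guard described in FIX A can be pictured with a minimal, host-only sketch. Everything here except the `KDA` / `keep_rs_t` template parameters and the `if constexpr (!KDA && !keep_rs_t)` form is an assumption for illustration: `gdn_path`, `pick_gdn_path`, and the `chunk_opt_in` flag are hypothetical names, and the real patch applies further conditions (head dim, token threshold) that are not modelled.

```cpp
// Hypothetical stand-ins for the two launch paths; names are illustrative only.
enum class gdn_path { chunked, sequential };

// Sketch of the 2-param guard: the chunked path is only a candidate when neither
// KDA nor keep_rs_t is set, and even then it stays behind a runtime opt-in.
template <bool KDA, bool keep_rs_t>
gdn_path pick_gdn_path(bool chunk_opt_in) {
    if constexpr (!KDA && !keep_rs_t) {
        // Only this instantiation compiles the chunked branch at all.
        if (chunk_opt_in) {
            return gdn_path::chunked;
        }
    }
    // Every other instantiation, and the default-OFF case, keeps the sequential recurrence.
    return gdn_path::sequential;
}
```

With the guard reduced to two parameters, the template no longer names `STATE_BF16` or `HYBRID`, which is why it composes on a base that never introduced them.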
FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted OFF.
Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set), still
overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side change in
update_slots(); re-exported from the dev tree, README 0041 row updated to match.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
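
The FIX B default can be restated as a small standalone sketch. It mirrors the lambda in the 0041 hunk, but takes the two environment values as parameters so the policy can be exercised without touching the process environment. The helper names `resolve_s3_enabled` and `s3_enabled_from_env` are hypothetical and do not appear in the patch.

```cpp
#include <cstdlib>

// Resolve the S3 switch from two raw environment values (nullptr means "unset").
// Hypothetical helper: the patch does this inline in a static lambda.
inline int resolve_s3_enabled(const char * decode_stable, const char * kv_paged) {
    if (decode_stable != nullptr) {
        return std::atoi(decode_stable); // explicit override: "0" forces S3 off
    }
    return kv_paged != nullptr ? 1 : 0;  // default ON only when paged KV is active
}

// Convenience wrapper that reads the real environment.
inline int s3_enabled_from_env() {
    return resolve_s3_enabled(std::getenv("LLAMA_PAGED_DECODE_STABLE"),
                              std::getenv("LLAMA_KV_PAGED"));
}
```

Note that, as in the hunk, only the presence of `LLAMA_KV_PAGED` is tested, so any value (including `0`) turns the default on unless `LLAMA_PAGED_DECODE_STABLE=0` is also set.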
@@ -135,7 +135,7 @@ hides.
 | # | What it does | Bit-exact |
 |---|---|---|
 | 0040 | **S1 paged decode-graph reuse** - the paged decode inputs (`input_block_table` / `input_gather_idxs`) never overrode `can_reuse` (defaults to false), so any graph carrying a paged input could never be reused. Add a correct `can_reuse` keyed on the (256-bucketed) block-table dims + a live-mctx refresh from the owning attn input. `LLAMA_PAGED_NO_GRAPH_REUSE=1` forces the pre-S1 path. | yes (md5 byte-identical reuse on/off; dense `5951a5b4`, paged-MoE `8cb0ce23`) |
-| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). `LLAMA_PAGED_DECODE_STABLE=1` to enable. | yes (default-off byte-identical; per-stream independent in serving) |
+| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). **Default ON under paged KV** (enabled when `LLAMA_KV_PAGED` is set; `LLAMA_PAGED_DECODE_STABLE=0` forces it off). | yes (byte-identical on/off; per-stream independent in serving) |

 Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
 graph reuse **0% -> 72.2%**, host window `hostproc` **15.98 -> 6.31 ms/step**,
@@ -1,8 +1,8 @@
-From c9bf1bd0000000000000000000000000000031aa Mon Sep 17 00:00:00 2001
+From 37549ecce806130b36012dfd0077ad830989ec71 Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
-Date: Sun, 28 Jun 2026 12:00:00 +0000
-Subject: [PATCH] feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031)
-
+Date: Sun, 28 Jun 2026 19:30:01 +0000
+Subject: [PATCH] feat(paged): chunked parallel-scan GDN prefill kernel (patch
+0031)

 Implements the explicit upstream TODO at gated_delta_net.cu's
 launch_gated_delta_net ("Add chunked kernel for even faster pre-fill"). The
@@ -66,24 +66,27 @@ README section 5 (dev notes / rejected-flat levers).

 Assisted-by: Claude:opus-4.8 [Claude Code]
 ---
-ggml/src/ggml-cuda/gated_delta_net.cu | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++
-tests/test-backend-ops.cpp | 8 ++++++++
-2 files changed, 243 insertions(+)
+ggml/src/ggml-cuda/gated_delta_net.cu | 237 ++++++++++++++++++++++++++
+tests/test-backend-ops.cpp | 8 +
+2 files changed, 245 insertions(+)

 diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
-index 830118a..c9bf1bd 100644
+index d071d5a..7121d80 100644
 --- a/ggml/src/ggml-cuda/gated_delta_net.cu
 +++ b/ggml/src/ggml-cuda/gated_delta_net.cu
-@@ -1,6 +1,7 @@
+@@ -1,7 +1,10 @@
 #include "gated_delta_net.cuh"
 #include "ggml-cuda/common.cuh"

 +#include <climits>
 #include <cstdlib>
-#include <cuda_bf16.h>
-#include <type_traits>
-@@ -407,6 +408,219 @@ static void launch_gdn_variant(
-sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head, hyb);
++#include <cuda_bf16.h>
++#include <type_traits>
+
+// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
+// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+@@ -279,6 +282,219 @@ static void launch_gdn_variant(
+sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
 }

 +// ============================================================================
@@ -299,10 +302,10 @@ index 830118a..c9bf1bd 100644
 + neqk1_magic, rq3_magic, scale, state_dst_d, ids_d, rs_head);
 +}
 +
-template <bool KDA, bool keep_rs_t, bool STATE_BF16, bool HYBRID>
+template <bool KDA, bool keep_rs_t>
 static void launch_gated_delta_net(
 const float * q_d, const float * k_d, const float * v_d,
-@@ -425,6 +639,27 @@ static void launch_gated_delta_net(
+@@ -297,6 +513,27 @@ static void launch_gated_delta_net(
 const uint3 neqk1_magic = init_fastdiv_values(neqk1);
 const uint3 rq3_magic = init_fastdiv_values(rq3);

@@ -311,7 +314,7 @@ index 830118a..c9bf1bd 100644
 + // head dim (S_v==128) and a prefill token threshold; decode (n_tokens small) keeps the tuned
 + // sequential recurrence. Mathematically equivalent up to FP reduction order (NEW per-path md5;
 + // validated benign by test-backend-ops NMSE + greedy output). Toggle: GDN_CHUNK_OFF / GDN_CHUNK_MIN.
-+ if constexpr (!KDA && !keep_rs_t && !STATE_BF16 && !HYBRID) {
++ if constexpr (!KDA && !keep_rs_t) {
 + // OPT-IN: this chunked path is bit-exact-benign (test-backend-ops green) but, at C=16
 + // (forced by GB10 99KB dyn-smem opt-in, all-shared), it is NOT yet faster than the tuned
 + // sequential recurrence on this model (measured ~22%% slower S_PP, grid-starved at low
@@ -328,13 +331,13 @@ index 830118a..c9bf1bd 100644
 + }
 +
 #define GDN_LAUNCH_ARGS \
-q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, hyb, \
+q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
 H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
 diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index c0233eb..951bffc 100644
+index ac30e47..4e40d23 100644
 --- a/tests/test-backend-ops.cpp
 +++ b/tests/test-backend-ops.cpp
-@@ -9459,6 +9459,14 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+@@ -9398,6 +9398,14 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
 test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 100, 1));
 test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 200, 1));
 test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 127, 2));
@@ -349,6 +352,6 @@ index c0233eb..951bffc 100644
 test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 64, 1, 1, false, true));
 test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 33, 1, 1, false, true));
 test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 100, 1, 1, false, true));

 --
 2.43.0
@@ -1,4 +1,4 @@
-From ef2765d85829c9ede2fc9aa90523386d765c9040 Mon Sep 17 00:00:00 2001
+From ee8021b56ed0effe493a64aa50449ab928dd6b29 Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Sun, 28 Jun 2026 20:00:24 +0200
 Subject: [PATCH 41/41] feat(paged): S3 decode-shape-stable scheduling (patch
@@ -24,7 +24,9 @@ BIT-EXACT: only changes WHICH step a prompt chunk is admitted in. Each sequence'
 decode logits depend on its own tokens + its own KV only (the paged decode read is
 per-stream, attention is permutation-invariant over the co-batched set), so
 deferring another slot's prefill never changes a generating slot's output.
-DEFAULT-OFF: LLAMA_PAGED_DECODE_STABLE unset => byte-identical to patch 0016. Does
+DEFAULT-ON under paged KV: with LLAMA_KV_PAGED set this enables by default
+(LLAMA_PAGED_DECODE_STABLE=0 forces it off); otherwise unset => byte-identical to
+patch 0016. Does
 not run in the single-sequence greedy md5 gate (that path is llama-completion).

 Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load):
@@ -37,14 +39,14 @@ shape (scoped follow-up, see DECODE_SERVING_SCOPE.md).
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
-tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
-1 file changed, 34 insertions(+), 1 deletion(-)
+tools/server/server-context.cpp | 40 ++++++++++++++++++++++++++++++++-
+1 file changed, 39 insertions(+), 1 deletion(-)

 diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 64775dc..9baca33 100644
+index 64775dc..fc0231a 100644
 --- a/tools/server/server-context.cpp
 +++ b/tools/server/server-context.cpp
-@@ -3138,11 +3138,44 @@ private:
+@@ -3138,11 +3138,49 @@ private:
 }
 int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)

@@ -64,12 +66,17 @@ index 64775dc..9baca33 100644
 + // Each sequence's decode logits depend on its own tokens + its own KV only
 + // (the paged decode read is per-stream, attention is permutation-invariant
 + // over the co-batched set), so deferring another slot's prefill never
-+ // changes a generating slot's output. DEFAULT-OFF: env unset => no change,
-+ // byte-identical to patch 0016. Does not run in the single-sequence greedy
-+ // md5 gate (that path is llama-completion, not update_slots).
++ // changes a generating slot's output. DEFAULT-ON under paged KV: with
++ // LLAMA_KV_PAGED set it enables by default (LLAMA_PAGED_DECODE_STABLE=0
++ // forces off); otherwise byte-identical to patch 0016. Does not run in the
++ // single-sequence greedy md5 gate (that path is llama-completion, not update_slots).
 + bool decode_only_step = false;
 + {
-+ static const int s3_enabled = [](){ const char * e = getenv("LLAMA_PAGED_DECODE_STABLE"); return e ? atoi(e) : 0; }();
++ static const int s3_enabled = [](){
++ const char * e = getenv("LLAMA_PAGED_DECODE_STABLE");
++ if (e) { return atoi(e); } // explicit override (=0 forces off)
++ return getenv("LLAMA_KV_PAGED") != nullptr ? 1 : 0; // default ON under paged KV
++ }();
 + if (s3_enabled && n_decode_in_batch > 0) {
 + static const int s3_period = [](){ const char * e = getenv("LLAMA_PAGED_PREFILL_PERIOD"); int p = e ? atoi(e) : 8; return p > 0 ? p : 8; }();
 + static long s3_step = 0;
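
The hunk above is cut off right after `s3_step`, so the admission rule itself is not visible here. The following is only a sketch of how a bounded prefill cadence could work, under the loud assumption that prefill is let through on every `period`-th step that carries decode tokens and held back otherwise. `parse_prefill_period` copies the parsing shown in the hunk; `s3_cadence` and its `decode_only` method are hypothetical and are not taken from the patch.

```cpp
#include <cstdlib>

// Parse the cadence the way the hunk does: unset or non-positive falls back to 8.
inline int parse_prefill_period(const char * raw) {
    const int p = raw ? std::atoi(raw) : 8;
    return p > 0 ? p : 8;
}

// ASSUMPTION (not from the patch): admit prefill on every period-th decode-carrying step.
struct s3_cadence {
    int  period;
    long step;

    explicit s3_cadence(int p) : period(p), step(0) {}

    // Returns true when this step should stay decode-only (prefill deferred).
    bool decode_only(bool s3_enabled, int n_decode_in_batch) {
        if (!s3_enabled || n_decode_in_batch <= 0) {
            return false;              // nothing to protect: prefill may be co-batched
        }
        ++step;                        // count only steps that carry decode tokens
        return step % period != 0;     // every period-th such step admits prefill
    }
};
```

Under this assumed rule a waiting prompt is delayed by at most `period - 1` decode steps, which is one way to read "bounded cadence"; the actual policy in `update_slots()` may differ.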