feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune (patch 0022)

Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
kernel, carried as paged patch 0022. After the f32 verdict (vLLM carries the
gated-DeltaNet temporal state in float32 and moves the same ~805 MB/call as llama;
the gap was pure DRAM bandwidth efficiency on equal bytes - llama 73.4% vs vLLM
82.4% of the 273 GB/s GB10 peak), the lever is a latency-coverage retune that keeps
the per-column f32 reduction/FMA order byte-identical (md5-gateable). The
bf16-state plan stays shelved.

Column folding: each warp owns COLS_PER_WARP columns of the 128x128 recurrent state
instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ...
within a per-block column tile; grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The
per-lane strided row sharding and the warp_reduce butterfly are unchanged, so only
the (warp,block)->column assignment differs and the result is bit-identical;
per-warp memory-level parallelism rises ~COLS_PER_WARP-fold, covering more DRAM
latency on this bandwidth-bound kernel. Default tile is the measured GB10 winner
(NUM_WARPS=16, COLS_PER_WARP=8), env-selectable via GDN_NW / GDN_CPW.

GB10: gated_delta_net decode 4.02 -> 3.49 ms/call, 73.4% -> 84.6% of peak (above
vLLM's 82.4%; 102.6% of vLLM recurrence BW). decode S_TG t/s: dense 27b npl128
335.9 -> 373.2 (+11.1%), MoE 35b-a3b npl128 688.4 -> 745.7 (+8.3%). Greedy md5
byte-identical to the 0021 baseline on both q36-27b-nvfp4 and q36-35b-a3b-nvfp4;
test-backend-ops -o GATED_DELTA_NET 36/36 PASS. Bench/method in
OCCUPANCY_RETUNE_RESULTS.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-25 18:34:17 +00:00
parent 8f8777e0f4
commit 3c1ed67b4b
2 changed files with 522 additions and 0 deletions


@@ -0,0 +1,403 @@
From 8a3229f41d5b712e87901796dfae3faee1f2f07d Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 20:32:55 +0200
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet decode
occupancy/coalescing retune (patch 0022)
Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
kernel. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in
float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth
efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak),
the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA
order byte-identical (md5-gateable). The bf16-state plan stays shelved.
Column folding: new template params NUM_WARPS (default 4), COLS_PER_WARP (default 1)
and MIN_BLOCKS (default 2). Each warp now owns COLS_PER_WARP columns of the 128x128 recurrent
state instead of 1, looping the existing per-column body over col, col+NUM_WARPS,
... within a per-block column tile of NUM_WARPS*COLS_PER_WARP columns;
grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The S_v rows of every column stay sharded
across the lanes by the same strided i = r*warp_size + lane mapping, and every
column's per-lane FMA accumulation and warp_reduce_sum butterfly are byte-for-byte
unchanged; only the (warp,block)->column assignment and visit order differ, which a
column's value provably does not depend on (columns are fully independent). This
raises per-warp memory-level parallelism ~COLS_PER_WARP-fold (independent
state-load bursts before any reduction + interleaved butterfly reductions hiding
each other's shfl latency), covering more DRAM latency on this bandwidth-bound
kernel. Every global access stays identically coalesced, so it is a scheduling /
latency-coverage win, not a coalescing change. The forbidden float4 state load
(which would repartition a lane to 4 contiguous rows and change the reduction
grouping) is NOT done, so the md5 stays invariant. The S_v=128 tile is
env-selectable (GDN_NW / GDN_CPW) for one-build re-tuning; default is the measured
GB10 winner (16, 8).
GB10 (CUDA 13, sm_121, nsys CUPTI timing - HW counters perm-blocked):
gated_delta_net B=128 decode call (805.3 MB f32 R+W) 4.02 -> 3.49 ms/call,
200.3 -> 230.9 GB/s = 73.4% -> 84.6% of 273 GB/s peak (now above vLLM's 82.4%;
102.6% of vLLM's recurrence bandwidth). decode S_TG t/s (npp128 ntg128, -fa on):
dense 27b npl128 335.9 -> 373.2 (+11.1%), npl32 199.2 -> 207.6 (+4.2%); MoE
35b-a3b npl128 688.4 -> 745.7 (+8.3%), npl32 420.6 -> 440.0 (+4.6%). Prefill
unchanged.
Bit-exact: greedy --temp 0 --seed 1 md5 byte-identical to the 0021 baseline on
both q36-27b-nvfp4 and q36-35b-a3b-nvfp4 (winner 16x8 and 4x1 control);
test-backend-ops -o GATED_DELTA_NET 36/36 PASS.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/gated_delta_net.cu | 236 +++++++++++++++++---------
1 file changed, 157 insertions(+), 79 deletions(-)
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index 86d5e2a..d071d5a 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -1,6 +1,8 @@
#include "gated_delta_net.cuh"
#include "ggml-cuda/common.cuh"
+#include <cstdlib>
+
// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
// destination slot by the recurrence kernel and are skipped here. One block per sequence.
@@ -29,8 +31,22 @@ static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * i
gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
}
-template <int S_v, bool KDA, bool keep_rs_t>
-__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
+// Occupancy/coalescing retune (patch 0022). Each warp owns COLS_PER_WARP columns of the recurrent
+// state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a
+// per-block column tile of size NUM_WARPS*COLS_PER_WARP. The S_v rows of every column stay sharded
+// across the lanes by the SAME strided mapping i = r*warp_size + lane, and every column's per-lane
+// FMA accumulation and warp_reduce_sum<warp_size> butterfly are byte-for-byte unchanged. Only the
+// (warp,block)->column assignment and the order a warp visits its columns differ, and a column's
+// f32 value provably does not depend on either (columns are fully independent: column c reads only
+// its own S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). So the result
+// and the stored final state are bit-identical to the COLS_PER_WARP==1 baseline (md5-gateable),
+// while per-warp memory-level parallelism rises ~COLS_PER_WARP-fold (COLS_PER_WARP independent
+// state-load bursts issued before any reduction, and the independent butterfly reductions interleave
+// to hide each other's shfl latency) which covers more DRAM latency on this bandwidth-bound kernel.
+// Every individual global access stays IDENTICALLY coalesced (32 consecutive lanes -> one 128B
+// sector), so this is a latency-coverage / scheduling win, not a coalescing change.
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS = 4, int COLS_PER_WARP = 1, int MIN_BLOCKS = 2>
+__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * NUM_WARPS, MIN_BLOCKS)
gated_delta_net_cuda(const float * q,
const float * k,
const float * v,
@@ -59,9 +75,9 @@ gated_delta_net_cuda(const float * q,
int rs_head) {
const uint32_t h_idx = blockIdx.x;
const uint32_t sequence = blockIdx.y;
- // each warp owns one column, using warp-level primitives to reduce across rows
+ // each warp owns COLS_PER_WARP columns, using warp-level primitives to reduce across rows.
const int lane = threadIdx.x;
- const int col = blockIdx.z * blockDim.y + threadIdx.y;
+ const int col_base = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y;
const uint32_t iq1 = fastmodulo(h_idx, neqk1_magic);
const uint32_t iq3 = fastdiv(sequence, rq3_magic);
@@ -86,20 +102,25 @@ gated_delta_net_cuda(const float * q,
// writing the same slot per block (identity) is race-free.
const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
? state_dst : curr_state;
- read_state += state_in_offset + col * S_v;
+ read_state += state_in_offset;
attn_data += (sequence * n_tokens * H + h_idx) * S_v;
constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
static_assert(S_v % warp_size == 0, "S_v must be a multiple of warp_size");
constexpr int rows_per_lane = (S_v + warp_size - 1) / warp_size;
- float s_shard[rows_per_lane];
- // state is stored transposed: M[col][i] = S[i][col], row col is contiguous
+ // per-column register shard of the recurrent state; state is stored transposed: M[col][i] = S[i][col].
+ float s_shard[COLS_PER_WARP][rows_per_lane];
ggml_cuda_pdl_sync();
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- s_shard[r] = read_state[i];
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+ const float * rs = read_state + col * S_v;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ s_shard[cc][r] = rs[i];
+ }
}
for (int t = 0; t < n_tokens; t++) {
@@ -113,7 +134,7 @@ gated_delta_net_cuda(const float * q,
const float beta_val = *beta_t;
- // Cache k and q in registers
+ // Cache k and q in registers (shared across the COLS_PER_WARP columns of this warp).
float k_reg[rows_per_lane];
float q_reg[rows_per_lane];
#pragma unroll
@@ -126,59 +147,69 @@ gated_delta_net_cuda(const float * q,
if constexpr (!KDA) {
const float g_val = expf(*g_t);
- // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
- float kv_shard = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- kv_shard += s_shard[r] * k_reg[r];
- }
- float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
- // delta[col] = (v[col] - g * kv[col]) * beta
- float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
+ // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
+ float kv_shard = 0.0f;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ kv_shard += s_shard[cc][r] * k_reg[r];
+ }
+ float kv_col = warp_reduce_sum<warp_size>(kv_shard);
- // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
- float attn_partial = 0.0f;
+ // delta[col] = (v[col] - g * kv[col]) * beta
+ float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
+
+ // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
+ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+ float attn_partial = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- s_shard[r] = g_val * s_shard[r] + k_reg[r] * delta_col;
- attn_partial += s_shard[r] * q_reg[r];
- }
+ for (int r = 0; r < rows_per_lane; r++) {
+ s_shard[cc][r] = g_val * s_shard[cc][r] + k_reg[r] * delta_col;
+ attn_partial += s_shard[cc][r] * q_reg[r];
+ }
- float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+ float attn_col = warp_reduce_sum<warp_size>(attn_partial);
- if (lane == 0) {
- attn_data[col] = attn_col * scale;
+ if (lane == 0) {
+ attn_data[col] = attn_col * scale;
+ }
}
} else {
- // kv[col] = sum_i g[i] * S[i][col] * k[i]
- float kv_shard = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- kv_shard += expf(g_t[i]) * s_shard[r] * k_reg[r];
- }
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+
+ // kv[col] = sum_i g[i] * S[i][col] * k[i]
+ float kv_shard = 0.0f;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ kv_shard += expf(g_t[i]) * s_shard[cc][r] * k_reg[r];
+ }
- float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+ float kv_col = warp_reduce_sum<warp_size>(kv_shard);
- // delta[col] = (v[col] - kv[col]) * beta
- float delta_col = (v_t[col] - kv_col) * beta_val;
+ // delta[col] = (v[col] - kv[col]) * beta
+ float delta_col = (v_t[col] - kv_col) * beta_val;
- // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
- float attn_partial = 0.0f;
+ // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
+ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+ float attn_partial = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- s_shard[r] = expf(g_t[i]) * s_shard[r] + k_reg[r] * delta_col;
- attn_partial += s_shard[r] * q_reg[r];
- }
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ s_shard[cc][r] = expf(g_t[i]) * s_shard[cc][r] + k_reg[r] * delta_col;
+ attn_partial += s_shard[cc][r] * q_reg[r];
+ }
- float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+ float attn_col = warp_reduce_sum<warp_size>(attn_partial);
- if (lane == 0) {
- attn_data[col] = attn_col * scale;
+ if (lane == 0) {
+ attn_data[col] = attn_col * scale;
+ }
}
}
@@ -190,11 +221,15 @@ gated_delta_net_cuda(const float * q,
const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
const int target_slot = (int) n_tokens - 1 - t;
if (target_slot >= 0 && target_slot < K) {
- float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- curr_state[col * S_v + i] = s_shard[r];
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+ float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ curr_state[col * S_v + i] = s_shard[cc][r];
+ }
}
}
}
@@ -202,13 +237,48 @@ gated_delta_net_cuda(const float * q,
if constexpr (!keep_rs_t) {
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- state[col * S_v + i] = s_shard[r];
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ state[col * S_v + i] = s_shard[cc][r];
+ }
}
}
}
+// Default column-folding tile for the S_v==128 decode/prefill path (the GDN head dim of this model).
+// Measured winner of the bit-exact occupancy sweep (patch 0022). Override at runtime for the sweep
+// via GDN_NW / GDN_CPW; all selectable variants are bit-identical, only %peak differs.
+#ifndef GDN_DEFAULT_NW
+#define GDN_DEFAULT_NW 16
+#endif
+#ifndef GDN_DEFAULT_CPW
+#define GDN_DEFAULT_CPW 8
+#endif
+
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS, int COLS_PER_WARP, int MIN_BLOCKS>
+static void launch_gdn_variant(
+ const float * q_d, const float * k_d, const float * v_d,
+ const float * g_d, const float * b_d, const float * s_d,
+ float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head,
+ int64_t H, int64_t n_tokens, int64_t n_seqs,
+ int64_t sq1, int64_t sq2, int64_t sq3,
+ int64_t sv1, int64_t sv2, int64_t sv3,
+ int64_t sb1, int64_t sb2, int64_t sb3,
+ const uint3 neqk1_magic, const uint3 rq3_magic,
+ float scale, int K, int warp_size, cudaStream_t stream) {
+ static_assert(S_v % (NUM_WARPS * COLS_PER_WARP) == 0, "NUM_WARPS*COLS_PER_WARP must divide S_v");
+ dim3 grid_dims(H, n_seqs, S_v / (NUM_WARPS * COLS_PER_WARP));
+ dim3 block_dims(warp_size <= S_v ? warp_size : S_v, NUM_WARPS, 1);
+ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
+ ggml_cuda_kernel_launch(gated_delta_net_cuda<S_v, KDA, keep_rs_t, NUM_WARPS, COLS_PER_WARP, MIN_BLOCKS>, launch_params,
+ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+ n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+}
+
template <bool KDA, bool keep_rs_t>
static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
@@ -223,47 +293,55 @@ static void launch_gated_delta_net(
float scale, int K, cudaStream_t stream) {
//TODO: Add chunked kernel for even faster pre-fill
const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
- const int num_warps = 4;
- dim3 grid_dims(H, n_seqs, (S_v + num_warps - 1) / num_warps);
- dim3 block_dims(warp_size <= S_v ? warp_size : S_v, num_warps, 1);
const uint3 neqk1_magic = init_fastdiv_values(neqk1);
const uint3 rq3_magic = init_fastdiv_values(rq3);
- int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+#define GDN_LAUNCH_ARGS \
+ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
+ H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
+ neqk1_magic, rq3_magic, scale, K, warp_size, stream
- const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
switch (S_v) {
case 16:
- ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ launch_gdn_variant<16, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
break;
case 32:
- ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ launch_gdn_variant<32, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
break;
- case 64: {
- ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ case 64:
+ launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
break;
- }
case 128: {
- ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP columns per warp
+ // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is
+ // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every
+ // selectable {num_warps, cols} is bit-identical, so the sweep cannot change the md5).
+ static const int gdn_nw = []{ const char * e = getenv("GDN_NW"); return e ? atoi(e) : GDN_DEFAULT_NW; }();
+ static const int gdn_cpw = []{ const char * e = getenv("GDN_CPW"); return e ? atoi(e) : GDN_DEFAULT_CPW; }();
+ // NUM_WARPS in {4,8,16} x COLS_PER_WARP ladder (all <=512 threads/block, no 1024-thread
+ // .minnctapersm warnings). Measured GB10 %peak: (4,1)=73 baseline ... (16,4)=82 ...
+ // (16,8)=84.6 winner ~ tied with (8,8)/(8,16)/(32,4); the plateau is just above vLLM (82.4).
+ if (gdn_nw == 4 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 4 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 4, 2, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 4 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 4, 4, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 8, 1, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 8, 2, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 8, 4, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 8, 8, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 16, 1, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 16, 2, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 16, 4, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 16, 8, 2>(GDN_LAUNCH_ARGS);
+ else launch_gdn_variant<128, KDA, keep_rs_t, GDN_DEFAULT_NW, GDN_DEFAULT_CPW, 2>(GDN_LAUNCH_ARGS);
break;
}
default:
GGML_ABORT("fatal error");
break;
}
+
+#undef GDN_LAUNCH_ARGS
}
void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
--
2.43.0


@@ -0,0 +1,119 @@
# OCCUPANCY_RETUNE_RESULTS.md - CRUX SETTLED: vLLM recurrence state is FLOAT32 (805 MB/call)
Phase: vllm-f32-confirm (GPU agent). DGX GB10, peak DRAM BW = 273 GB/s.
Checkpoint: ~/bench/q36-27b-nvfp4-vllm (vLLM 0.23.0), ~/bench/q36-27b-nvfp4.gguf (llama HEAD 58426b5, conv-fusion 0021).
NOTE: ncu HW perf-counters are perm-blocked on this node (RmProfilingAdminOnly:1, no passwordless sudo, ERR_NVGPUCTRPERM).
Settled WITHOUT counters: (a) empirical tensor dtype at the kernel boundary, (b) nsys/CUPTI kernel timing (counter-free), (c) source+config chain.
## VERDICT: f32. The close-check is RIGHT. The byte-gate (402 MB/bf16) is WRONG. BUILD THE BIT-EXACT OCCUPANCY RETUNE.
vLLM carries the gated-DeltaNet TEMPORAL/recurrent state in FLOAT32 and moves 805.3 MB/call, NOT 402 MB bf16.
Both engines move the SAME ~805 MB f32 recurrent state per call. The gap is pure BANDWIDTH EFFICIENCY on equal f32 bytes.
## vLLM (kernel: fused_recurrent_gated_delta_rule_packed_decode)
- EMPIRICAL tensor at kernel boundary (initial_state = self.kv_cache[1], qwen_gdn_linear_attn.py:1316/1492):
  dtype=torch.float32, elem_bytes=4, shape=(1553, 48, 128, 128); per-slot state = 786432 elems = 3.000 MiB (f32)
- MB/call (B=128, Read+Write) = 128 * 48*128*128 * 4 bytes * 2 = 805,306,368 B = 805.3 MB (bf16 would be 402.7 MB)
- Runtime engine config: cache_config.mamba_ssm_cache_dtype = float32 (mamba_cache_dtype=auto/bf16 for conv)
- Source chain: config.json text_config.mamba_ssm_dtype=float32 -> Qwen3_5ForConditionalGenerationConfig.verify_and_update_config
  sets cache_config.mamba_ssm_cache_dtype="float32" -> MambaStateDtypeCalculator._mamba_state_dtype else-branch
  -> temporal_state_dtype = torch.float32 (conv state = bf16; temporal/SSM state = f32).
- Kernel timing (CUDA events, eager B=128, 432 steady-decode calls): median 3.578 ms/call, min 3.499, mean 3.593, p90 3.635
  - BW @ median = 805.3 MB / 3.578 ms = 225.1 GB/s = 82.4% of 273 peak (min 84.3%, p90 81.1%)
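
The byte volume and %peak figures above are plain arithmetic. A minimal sketch of that arithmetic follows; the helper names are hypothetical (they appear in neither vLLM nor llama.cpp), and GB means 1e9 bytes, matching the 273 GB/s peak figure.

```cpp
#include <cstdint>

// Bytes moved per decode call: `batch` sequences x `heads` x (s_v x s_v) state
// elements of `elem_bytes` each, read once and written once.
constexpr std::int64_t state_bytes_per_call(std::int64_t batch, std::int64_t heads,
                                            std::int64_t s_v, std::int64_t elem_bytes) {
    return batch * heads * s_v * s_v * elem_bytes * 2; // x2 = read + write
}

// Achieved bandwidth in GB/s (1 GB = 1e9 bytes) for a call lasting ms_per_call.
constexpr double achieved_gbps(std::int64_t bytes, double ms_per_call) {
    return static_cast<double>(bytes) / (ms_per_call * 1e-3) / 1e9;
}

// Achieved bandwidth as a percentage of the peak DRAM bandwidth.
constexpr double percent_of_peak(double gbps, double peak_gbps) {
    return 100.0 * gbps / peak_gbps;
}
```

With B=128, 48 heads, S_v=128 and 4-byte elements, `state_bytes_per_call` gives 805,306,368 bytes; passing 2-byte elements gives the 402.7 MB bf16 figure that the verdict rules out.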
## llama (kernel: gated_delta_net_cuda<128, 0, 0>)
- Kernel signature: all operands const float* (q,k,v,g,beta,curr_state) + float* state_dst => recurrent state is f32. Source-confirmed.
- Identical state geometry (48 value-heads x 128 head_v x 128 head_k, B=128) => MB/call (R+W) = 805.3 MB f32 (same as vLLM).
- Fresh nsys (--cuda-graph-trace=node, build-cuda-base, -npp128 -ntg24 -npl128, q36-27b-nvfp4.gguf):
  - gated_delta_net = 25.4% of GPU time (#2 kernel after nvfp4 mul_mat_q).
  - Decode cluster isolated = exactly n=1152 calls (= 24 ntg x 48 GDN layers), B=128 steady state:
    median 4.0211 ms/call, mean 4.0315 => 200.3 GB/s = 73.4% of 273 peak.
  - (Consistent with prior GAP_PROGRESS 4.08ms/~70% and context 3.98ms/202GB/s/74%.)
## THE GAP (equal f32 bytes, different efficiency)
- llama: 805.3 MB / 4.021 ms = 200.3 GB/s = 73.4% of peak
- vLLM: 805.3 MB / 3.578 ms = 225.1 GB/s = 82.4% of peak

=> vLLM takes ~11% less time per recurrence call (3.578 vs 4.021 ms) at IDENTICAL byte volume => ~9 pts more DRAM BW efficiency.

Retune target: 73.4% -> ~82% peak, recurrence 4.02 -> ~3.58 ms/call, KEEPING exact per-column f32
reduction/FMA order (md5-gateable bit-identical). bf16 plan stays SHELVED (optional over-clock only).
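
The retune target can also be read as a time budget: at a fixed byte volume, a target %peak fixes the ms/call the kernel must reach. A small sketch, assuming decimal GB and a hypothetical helper name:

```cpp
#include <cstdint>

// Time per call (ms) needed to move `bytes` at `pct_of_peak` percent of
// `peak_gbps` (GB/s, 1 GB = 1e9 bytes).
constexpr double target_ms_per_call(std::int64_t bytes, double peak_gbps, double pct_of_peak) {
    return static_cast<double>(bytes) / (peak_gbps * 1e9 * pct_of_peak / 100.0) * 1e3;
}
```

For 805,306,368 bytes on the 273 GB/s GB10, 82.4% of peak corresponds to roughly 3.58 ms/call, which is the vLLM median the retune aims to match.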
---
# retune-build (BUILD AGENT) — patch 0022 SHIPPED
vLLM verdict re-checked first: **f32, 805 MB/call** (the close-check is right, the byte-gate's 402 MB/bf16
is wrong). The bf16-state plan stays SHELVED. Built the bit-exact occupancy/coalescing retune.
## The change — bit-exact column folding (Lever A + B + D)
`ggml/src/ggml-cuda/gated_delta_net.cu` `gated_delta_net_cuda`: three new template params,
`NUM_WARPS` (default 4), `COLS_PER_WARP` (default 1) and `MIN_BLOCKS` (default 2). Each warp now owns
`COLS_PER_WARP` columns of the 128x128 recurrent state instead of 1, looping the existing per-column
body over `col, col+NUM_WARPS, ...` inside a per-block column tile of `NUM_WARPS*COLS_PER_WARP` columns;
`grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)`.
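
The column assignment described above can be checked on the host. The sketch below mirrors the patch's `col_base + cc * NUM_WARPS` indexing in plain C++ (the function names are hypothetical) and counts how often each column is visited across all `grid.z` blocks:

```cpp
#include <cassert>
#include <vector>

// Column visited by warp `warp` of block `block_z` on fold step `cc`,
// mirroring col_base + cc * NUM_WARPS from the kernel.
inline int folded_column(int block_z, int warp, int cc, int num_warps, int cols_per_warp) {
    const int col_base = block_z * (num_warps * cols_per_warp) + warp;
    return col_base + cc * num_warps;
}

// How many times each of the s_v columns is visited over the whole grid.z range.
inline std::vector<int> column_visit_counts(int s_v, int num_warps, int cols_per_warp) {
    const int tile = num_warps * cols_per_warp;
    assert(tile > 0 && s_v % tile == 0); // same divisibility rule as the static_assert in the patch
    std::vector<int> visits(s_v, 0);
    for (int bz = 0; bz < s_v / tile; ++bz)
        for (int w = 0; w < num_warps; ++w)
            for (int cc = 0; cc < cols_per_warp; ++cc)
                ++visits[folded_column(bz, w, cc, num_warps, cols_per_warp)];
    return visits;
}

// True when every column is owned by exactly one (block, warp, cc) triple.
inline bool covers_each_column_once(int s_v, int num_warps, int cols_per_warp) {
    for (int v : column_visit_counts(s_v, num_warps, cols_per_warp))
        if (v != 1) return false;
    return true;
}
```

For the default tile, `covers_each_column_once(128, 16, 8)` holds with a single block along z; warp 3, for example, visits columns 3, 19, 35, and so on.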
Why it is bit-exact: the S_v rows of every column stay sharded across the lanes by the SAME strided
mapping `i = r*warp_size + lane`, and every column's per-lane FMA accumulation and
`warp_reduce_sum<warp_size>` XOR-butterfly are byte-for-byte unchanged. Only the
`(warp,block)->column` assignment and the order a warp visits its columns differ, and a column's f32
value provably does not depend on either: columns are fully independent (column c reads only its own
S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta).

The forbidden `float4` state load (Lever E), which would repartition a lane to 4 contiguous rows and
change the reduction grouping, was NOT done; this keeps the md5 invariant.

Every global access stays identically coalesced (32 consecutive lanes -> one contiguous 128 B
segment), so this is a latency-coverage / scheduling win, NOT a coalescing change. Per-warp
memory-level parallelism rises because COLS_PER_WARP independent state-load bursts are issued before
any reduction, and the independent butterfly reductions interleave to hide each other's shfl latency.

The S_v=128 tile is env-selectable via `GDN_NW`/`GDN_CPW` for one-build re-tuning; the default is the
measured GB10 winner **(NUM_WARPS=16, COLS_PER_WARP=8)**.
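
The independence argument can be illustrated with a host-side emulation. This is a sketch under stated assumptions, not the kernel: it models a 32-lane warp, uses explicit `std::fma` for the per-lane accumulation (assuming the device compiler contracts `acc += s * k` into an FMA), and emulates the XOR butterfly in a loop. All names are hypothetical.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int kWarp = 32;  // lanes per warp (assumed)
constexpr int kSv   = 128; // rows per column

// kv for one column: lane l accumulates rows l, l+32, l+64, l+96 in that order,
// then an XOR butterfly (offsets 16, 8, 4, 2, 1) combines the 32 lane partials.
inline float column_dot_strided(const float * col_state, const float * k) {
    std::array<float, kWarp> part{};
    for (int lane = 0; lane < kWarp; ++lane)
        for (int r = 0; r < kSv / kWarp; ++r) {
            const int i = r * kWarp + lane;
            part[lane] = std::fma(col_state[i], k[i], part[lane]);
        }
    for (int offset = kWarp / 2; offset > 0; offset /= 2) {
        std::array<float, kWarp> next{};
        for (int lane = 0; lane < kWarp; ++lane)
            next[lane] = part[lane] + part[lane ^ offset];
        part = next;
    }
    return part[0];
}

// Evaluate every column, visiting them in the order a (num_warps, cols_per_warp)
// tiling would. Only the visit order depends on the tile.
inline std::vector<float> all_columns(const std::vector<float> & state, const std::vector<float> & k,
                                      int num_warps, int cols_per_warp) {
    std::vector<float> out(kSv, 0.0f);
    const int tile = num_warps * cols_per_warp;
    for (int bz = 0; bz < kSv / tile; ++bz)
        for (int w = 0; w < num_warps; ++w)
            for (int cc = 0; cc < cols_per_warp; ++cc) {
                const int col = bz * tile + w + cc * num_warps;
                out[col] = column_dot_strided(&state[static_cast<std::size_t>(col) * kSv], k.data());
            }
    return out;
}

// Byte-wise comparison, the host analogue of the md5 gate.
inline bool bit_identical(const std::vector<float> & a, const std::vector<float> & b) {
    return a.size() == b.size() && std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0;
}

// Deterministic pseudo-random floats in [-0.5, 0.5) from a simple LCG.
inline std::vector<float> make_test_data(std::size_t n, std::uint32_t seed) {
    std::vector<float> v(n);
    std::uint32_t x = seed;
    for (std::size_t i = 0; i < n; ++i) {
        x = x * 1664525u + 1013904223u;
        v[i] = static_cast<float>(x >> 8) * (1.0f / 16777216.0f) - 0.5f;
    }
    return v;
}
```

Because each column's value comes from the same per-column routine regardless of the tile, `all_columns(state, k, 4, 1)` and `all_columns(state, k, 16, 8)` compare equal byte for byte. Changing the row-to-lane partition inside `column_dot_strided` (as a `float4` load would) alters the summation grouping and gives no such guarantee.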
## %peak sweep — GB10, CUDA 13, sm_121 (nsys CUPTI timing; HW counters perm-blocked)
Metric: median of the 1152 (=ntg24 x 48 layers) B=128 decode calls, each moving 805.3 MB f32 (R+W),
isolated by the [2.5ms,6ms] band; %peak vs 273 GB/s. Baseline re-isolation reproduced the confirm
agent's 4.021 ms / 73.4% exactly (n=1152).
| NUM_WARPS x COLS_PER_WARP | ms/call | GB/s | %peak |
|---------------------------|---------|------|-------|
| base (0021) | 4.021 | 200.3| 73.4 |
| 4 x 1 (control == base) | 4.034 | 199.7| 73.1 |
| 4 x 2 | 3.887 | 207.2| 75.9 |
| 4 x 4 | 3.775 | 213.3| 78.1 |
| 8 x 1 | 3.837 | 209.9| 76.9 |
| 8 x 2 | 3.749 | 214.8| 78.7 |
| 8 x 4 | 3.699 | 217.7| 79.9 |
| 8 x 8 | 3.586 | 224.6| 82.3 |
| 16 x 2 | 3.665 | 219.8| 80.5 |
| 16 x 4 | 3.585 | 224.7| 82.3 |
| **16 x 8 (WINNER/default)** | **3.488** | **230.9** | **84.6** |
| 32 x 4 | 3.489 | 230.8| 84.6 |
Plateau ~84.5% at the grid.z=1 tiles; (16,8) picked as default (512-thread block, no spill, no
1024-thread .minnctapersm warning). **84.6% > vLLM 82.4%.**
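
The launch geometry behind each row of the sweep follows directly from the tile. A minimal sketch, assuming a 32-lane warp; the struct and function names are hypothetical, and `state_floats_per_lane` is the size of the `s_shard[COLS_PER_WARP][rows_per_lane]` array (whether it stays in registers is up to the compiler):

```cpp
// Per-tile launch geometry for the S_v column-folding kernel.
struct GdnTile {
    int threads_per_block;     // warp_size * NUM_WARPS
    int grid_z;                // S_v / (NUM_WARPS * COLS_PER_WARP)
    int state_floats_per_lane; // COLS_PER_WARP * rows_per_lane
};

constexpr GdnTile gdn_tile_geometry(int s_v, int warp_size, int num_warps, int cols_per_warp) {
    return GdnTile{ warp_size * num_warps,
                    s_v / (num_warps * cols_per_warp),
                    cols_per_warp * (s_v / warp_size) };
}
```

For S_v=128 the default (16,8) tile gives a 512-thread block, grid.z=1 and 32 state floats per lane, while (32,4) reaches grid.z=1 only with a 1024-thread block.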
## Gates (both PASS, non-negotiable)
- **md5 BYTE-IDENTICAL to the 0021 baseline**, greedy `--temp 0 --seed 1 -n 48`, both models, winner
(16,8 default) AND (4,1 control):
- q36-27b-nvfp4 (dense): `5951a5b4d624ce891e22ab5fca9bc439` (baseline == winner == control)
- q36-35b-a3b-nvfp4 (MoE): `07db32c2bcb78d17a43ed18bc22705cd` (baseline == winner == control)
- **test-backend-ops -o GATED_DELTA_NET: 36/36 PASS** (covers head_size=128, kda=0/1, prefill K>1).
## Decode throughput — base vs flag(16,8), llama-batched-bench -npp128 -ntg128 -fa on
| model | npl | base S_TG t/s | flag S_TG t/s | gain |
|-------|-----|---------------|---------------|------|
| dense 27b | 32 | 199.2 | 207.6 | +4.2% |
| dense 27b | 128 | 335.9 | 373.2 | +11.1% |
| MoE 35b-a3b | 32 | 420.6 | 440.0 | +4.6% |
| MoE 35b-a3b | 128 | 688.4 | 745.7 | +8.3% |
Prefill S_PP unchanged (dense ~930, MoE ~2185 t/s) — no regression. Stable across 3 samples.
## Parity vs vLLM (recurrence kernel)
Recurrence kernel BW: before 200.3 GB/s = 89.0% of vLLM's 225.1; **after 230.9 GB/s = 102.6% of vLLM**
(3.488 ms/call < vLLM 3.578 ms/call). The recurrence bandwidth gap that this workflow set out to close
is closed and slightly exceeded; the remaining decode-parity delta lives in the non-recurrence path
(matmul/attn), not in gated-DeltaNet.
Shipped: patch 0022, committed on the DGX dev tree and the LocalAI worktree. No push.