feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune (patch 0022)

Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence kernel, carried as paged patch 0022. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak), the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA order byte-identical (md5-gateable). The bf16-state plan stays shelved. Column folding: each warp owns COLS_PER_WARP columns of the 128x128 recurrent state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a per-block column tile; grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The per-lane strided row sharding and the warp_reduce butterfly are unchanged, so only the (warp,block)->column assignment differs and the result is bit-identical; per-warp memory-level parallelism rises ~COLS_PER_WARP-fold, covering more DRAM latency on this bandwidth-bound kernel. Default tile is the measured GB10 winner (NUM_WARPS=16, COLS_PER_WARP=8), env-selectable via GDN_NW / GDN_CPW. GB10: gated_delta_net decode 4.02 -> 3.49 ms/call, 73.4% -> 84.6% of peak (above vLLM's 82.4%; 102.6% of vLLM recurrence BW). decode S_TG t/s: dense 27b npl128 335.9 -> 373.2 (+11.1%), MoE 35b-a3b npl128 688.4 -> 745.7 (+8.3%). Greedy md5 byte-identical to the 0021 baseline on both q36-27b-nvfp4 and q36-35b-a3b-nvfp4; test-backend-ops -o GATED_DELTA_NET 36/36 PASS. Bench/method in OCCUPANCY_RETUNE_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 17:12:10 -04:00 · 2026-06-25 18:34:17 +00:00
parent 8f8777e0f4
commit 3c1ed67b4b
2 changed files with 522 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch
@@ -0,0 +1,403 @@
+From 8a3229f41d5b712e87901796dfae3faee1f2f07d Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 20:32:55 +0200
+Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet decode
+ occupancy/coalescing retune (patch 0022)
+
+Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
+kernel. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in
+float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth
+efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak),
+the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA
+order byte-identical (md5-gateable). The bf16-state plan stays shelved.
+
+Column folding: two new template params NUM_WARPS (default 4) and COLS_PER_WARP
+(default 1). Each warp now owns COLS_PER_WARP columns of the 128x128 recurrent
+state instead of 1, looping the existing per-column body over col, col+NUM_WARPS,
+... within a per-block column tile of NUM_WARPS*COLS_PER_WARP columns;
+grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The S_v rows of every column stay sharded
+across the lanes by the same strided i = r*warp_size + lane mapping, and every
+column's per-lane FMA accumulation and warp_reduce_sum butterfly are byte-for-byte
+unchanged; only the (warp,block)->column assignment and visit order differ, which a
+column's value provably does not depend on (columns are fully independent). This
+raises per-warp memory-level parallelism ~COLS_PER_WARP-fold (independent
+state-load bursts before any reduction + interleaved butterfly reductions hiding
+each other's shfl latency), covering more DRAM latency on this bandwidth-bound
+kernel. Every global access stays identically coalesced, so it is a scheduling /
+latency-coverage win, not a coalescing change. The forbidden float4 state load
+(which would repartition a lane to 4 contiguous rows and change the reduction
+grouping) is NOT done, so the md5 stays invariant. The S_v=128 tile is
+env-selectable (GDN_NW / GDN_CPW) for one-build re-tuning; default is the measured
+GB10 winner (16, 8).
+
+GB10 (CUDA 13, sm_121, nsys CUPTI timing - HW counters perm-blocked):
+gated_delta_net B=128 decode call (805.3 MB f32 R+W) 4.02 -> 3.49 ms/call,
+200.3 -> 230.9 GB/s = 73.4% -> 84.6% of 273 GB/s peak (now above vLLM's 82.4%;
+102.6% of vLLM's recurrence bandwidth). decode S_TG t/s (npp128 ntg128, -fa on):
+dense 27b npl128 335.9 -> 373.2 (+11.1%), npl32 199.2 -> 207.6 (+4.2%); MoE
+35b-a3b npl128 688.4 -> 745.7 (+8.3%), npl32 420.6 -> 440.0 (+4.6%). Prefill
+unchanged.
+
+Bit-exact: greedy --temp 0 --seed 1 md5 byte-identical to the 0021 baseline on
+both q36-27b-nvfp4 and q36-35b-a3b-nvfp4 (winner 16x8 and 4x1 control);
+test-backend-ops -o GATED_DELTA_NET 36/36 PASS.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/gated_delta_net.cu | 236 +++++++++++++++++---------
+ 1 file changed, 157 insertions(+), 79 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
+index 86d5e2a..d071d5a 100644
+--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
+@@ -1,6 +1,8 @@
+ #include "gated_delta_net.cuh"
+ #include "ggml-cuda/common.cuh"
+ 
+#include <cstdlib>
+
+ // Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
+ // disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+ // destination slot by the recurrence kernel and are skipped here. One block per sequence.
+@@ -29,8 +31,22 @@ static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * i
+     gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
+ }
+ 
+-template <int S_v, bool KDA, bool keep_rs_t>
+-__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
+// Occupancy/coalescing retune (patch 0022). Each warp owns COLS_PER_WARP columns of the recurrent
+// state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a
+// per-block column tile of size NUM_WARPS*COLS_PER_WARP. The S_v rows of every column stay sharded
+// across the lanes by the SAME strided mapping i = r*warp_size + lane, and every column's per-lane
+// FMA accumulation and warp_reduce_sum<warp_size> butterfly are byte-for-byte unchanged. Only the
+// (warp,block)->column assignment and the order a warp visits its columns differ, and a column's
+// f32 value provably does not depend on either (columns are fully independent: column c reads only
+// its own S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). So the result
+// and the stored final state are bit-identical to the COLS_PER_WARP==1 baseline (md5-gateable),
+// while per-warp memory-level parallelism rises ~COLS_PER_WARP-fold (COLS_PER_WARP independent
+// state-load bursts issued before any reduction, and the independent butterfly reductions interleave
+// to hide each other's shfl latency) which covers more DRAM latency on this bandwidth-bound kernel.
+// Every individual global access stays IDENTICALLY coalesced (32 consecutive lanes -> one 128B
+// sector), so this is a latency-coverage / scheduling win, not a coalescing change.
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS = 4, int COLS_PER_WARP = 1, int MIN_BLOCKS = 2>
+__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * NUM_WARPS, MIN_BLOCKS)
+ gated_delta_net_cuda(const float * q,
+                                      const float * k,
+                                      const float * v,
+@@ -59,9 +75,9 @@ gated_delta_net_cuda(const float * q,
+                                      int           rs_head) {
+     const uint32_t h_idx    = blockIdx.x;
+     const uint32_t sequence = blockIdx.y;
+-    // each warp owns one column, using warp-level primitives to reduce across rows
+    // each warp owns COLS_PER_WARP columns, using warp-level primitives to reduce across rows.
+     const int      lane     = threadIdx.x;
+-    const int      col      = blockIdx.z * blockDim.y + threadIdx.y;
+    const int      col_base = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y;
+ 
+     const uint32_t iq1 = fastmodulo(h_idx, neqk1_magic);
+     const uint32_t iq3 = fastdiv(sequence, rq3_magic);
+@@ -86,20 +102,25 @@ gated_delta_net_cuda(const float * q,
+     // writing the same slot per block (identity) is race-free.
+     const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
+         ? state_dst : curr_state;
+-    read_state += state_in_offset + col * S_v;
+    read_state += state_in_offset;
+     attn_data += (sequence * n_tokens * H + h_idx) * S_v;
+ 
+     constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
+     static_assert(S_v % warp_size == 0, "S_v must be a multiple of warp_size");
+     constexpr int rows_per_lane = (S_v + warp_size - 1) / warp_size;
+-    float         s_shard[rows_per_lane];
+-    // state is stored transposed: M[col][i] = S[i][col], row col is contiguous
+    // per-column register shard of the recurrent state; state is stored transposed: M[col][i] = S[i][col].
+    float         s_shard[COLS_PER_WARP][rows_per_lane];
+ 
+     ggml_cuda_pdl_sync();
+ #pragma unroll
+-    for (int r = 0; r < rows_per_lane; r++) {
+-        const int i = r * warp_size + lane;
+-        s_shard[r]  = read_state[i];
+    for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+        const int     col = col_base + cc * NUM_WARPS;
+        const float * rs  = read_state + col * S_v;
+#pragma unroll
+        for (int r = 0; r < rows_per_lane; r++) {
+            const int i   = r * warp_size + lane;
+            s_shard[cc][r] = rs[i];
+        }
+     }
+ 
+     for (int t = 0; t < n_tokens; t++) {
+@@ -113,7 +134,7 @@ gated_delta_net_cuda(const float * q,
+ 
+         const float beta_val = *beta_t;
+ 
+-        // Cache k and q in registers
+        // Cache k and q in registers (shared across the COLS_PER_WARP columns of this warp).
+         float k_reg[rows_per_lane];
+         float q_reg[rows_per_lane];
+ #pragma unroll
+@@ -126,59 +147,69 @@ gated_delta_net_cuda(const float * q,
+         if constexpr (!KDA) {
+             const float g_val = expf(*g_t);
+ 
+-            // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
+-            float kv_shard = 0.0f;
+ #pragma unroll
+-            for (int r = 0; r < rows_per_lane; r++) {
+-                kv_shard += s_shard[r] * k_reg[r];
+-            }
+-            float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+            for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+                const int col = col_base + cc * NUM_WARPS;
+ 
+-            // delta[col] = (v[col] - g * kv[col]) * beta
+-            float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
+                // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
+                float kv_shard = 0.0f;
+#pragma unroll
+                for (int r = 0; r < rows_per_lane; r++) {
+                    kv_shard += s_shard[cc][r] * k_reg[r];
+                }
+                float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+ 
+-            // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
+-            // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+-            float attn_partial = 0.0f;
+                // delta[col] = (v[col] - g * kv[col]) * beta
+                float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
+
+                // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
+                // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+                float attn_partial = 0.0f;
+ #pragma unroll
+-            for (int r = 0; r < rows_per_lane; r++) {
+-                s_shard[r]  = g_val * s_shard[r] + k_reg[r] * delta_col;
+-                attn_partial += s_shard[r] * q_reg[r];
+-            }
+                for (int r = 0; r < rows_per_lane; r++) {
+                    s_shard[cc][r]  = g_val * s_shard[cc][r] + k_reg[r] * delta_col;
+                    attn_partial += s_shard[cc][r] * q_reg[r];
+                }
+ 
+-            float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+                float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+ 
+-            if (lane == 0) {
+-                attn_data[col] = attn_col * scale;
+                if (lane == 0) {
+                    attn_data[col] = attn_col * scale;
+                }
+             }
+         } else {
+-            // kv[col] = sum_i g[i] * S[i][col] * k[i]
+-            float kv_shard = 0.0f;
+ #pragma unroll
+-            for (int r = 0; r < rows_per_lane; r++) {
+-                const int i = r * warp_size + lane;
+-                kv_shard += expf(g_t[i]) * s_shard[r] * k_reg[r];
+-            }
+            for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+                const int col = col_base + cc * NUM_WARPS;
+
+                // kv[col] = sum_i g[i] * S[i][col] * k[i]
+                float kv_shard = 0.0f;
+#pragma unroll
+                for (int r = 0; r < rows_per_lane; r++) {
+                    const int i = r * warp_size + lane;
+                    kv_shard += expf(g_t[i]) * s_shard[cc][r] * k_reg[r];
+                }
+ 
+-            float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+                float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+ 
+-            // delta[col] = (v[col] - kv[col]) * beta
+-            float delta_col = (v_t[col] - kv_col) * beta_val;
+                // delta[col] = (v[col] - kv[col]) * beta
+                float delta_col = (v_t[col] - kv_col) * beta_val;
+ 
+-            // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
+-            // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+-            float attn_partial = 0.0f;
+                // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
+                // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+                float attn_partial = 0.0f;
+ #pragma unroll
+-            for (int r = 0; r < rows_per_lane; r++) {
+-                const int i = r * warp_size + lane;
+-                s_shard[r]  = expf(g_t[i]) * s_shard[r] + k_reg[r] * delta_col;
+-                attn_partial += s_shard[r] * q_reg[r];
+-            }
+                for (int r = 0; r < rows_per_lane; r++) {
+                    const int i = r * warp_size + lane;
+                    s_shard[cc][r]  = expf(g_t[i]) * s_shard[cc][r] + k_reg[r] * delta_col;
+                    attn_partial += s_shard[cc][r] * q_reg[r];
+                }
+ 
+-            float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+                float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+ 
+-            if (lane == 0) {
+-                attn_data[col] = attn_col * scale;
+                if (lane == 0) {
+                    attn_data[col] = attn_col * scale;
+                }
+             }
+         }
+ 
+@@ -190,11 +221,15 @@ gated_delta_net_cuda(const float * q,
+             const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
+             const int target_slot = (int) n_tokens - 1 - t;
+             if (target_slot >= 0 && target_slot < K) {
+-                float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
+ #pragma unroll
+-                for (int r = 0; r < rows_per_lane; r++) {
+-                    const int i = r * warp_size + lane;
+-                    curr_state[col * S_v + i] = s_shard[r];
+                for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+                    const int col = col_base + cc * NUM_WARPS;
+                    float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
+#pragma unroll
+                    for (int r = 0; r < rows_per_lane; r++) {
+                        const int i = r * warp_size + lane;
+                        curr_state[col * S_v + i] = s_shard[cc][r];
+                    }
+                 }
+             }
+         }
+@@ -202,13 +237,48 @@ gated_delta_net_cuda(const float * q,
+ 
+     if constexpr (!keep_rs_t) {
+ #pragma unroll
+-        for (int r = 0; r < rows_per_lane; r++) {
+-            const int i          = r * warp_size + lane;
+-            state[col * S_v + i] = s_shard[r];
+        for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+            const int col = col_base + cc * NUM_WARPS;
+#pragma unroll
+            for (int r = 0; r < rows_per_lane; r++) {
+                const int i          = r * warp_size + lane;
+                state[col * S_v + i] = s_shard[cc][r];
+            }
+         }
+     }
+ }
+ 
+// Default column-folding tile for the S_v==128 decode/prefill path (the GDN head dim of this model).
+// Measured winner of the bit-exact occupancy sweep (patch 0022). Override at runtime for the sweep
+// via GDN_NW / GDN_CPW; all selectable variants are bit-identical, only %peak differs.
+#ifndef GDN_DEFAULT_NW
+#define GDN_DEFAULT_NW 16
+#endif
+#ifndef GDN_DEFAULT_CPW
+#define GDN_DEFAULT_CPW 8
+#endif
+
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS, int COLS_PER_WARP, int MIN_BLOCKS>
+static void launch_gdn_variant(
+        const float * q_d, const float * k_d, const float * v_d,
+        const float * g_d, const float * b_d, const float * s_d,
+        float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head,
+        int64_t H, int64_t n_tokens, int64_t n_seqs,
+        int64_t sq1, int64_t sq2, int64_t sq3,
+        int64_t sv1, int64_t sv2, int64_t sv3,
+        int64_t sb1, int64_t sb2, int64_t sb3,
+        const uint3 neqk1_magic, const uint3 rq3_magic,
+        float scale, int K, int warp_size, cudaStream_t stream) {
+    static_assert(S_v % (NUM_WARPS * COLS_PER_WARP) == 0, "NUM_WARPS*COLS_PER_WARP must divide S_v");
+    dim3 grid_dims(H, n_seqs, S_v / (NUM_WARPS * COLS_PER_WARP));
+    dim3 block_dims(warp_size <= S_v ? warp_size : S_v, NUM_WARPS, 1);
+    const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
+    ggml_cuda_kernel_launch(gated_delta_net_cuda<S_v, KDA, keep_rs_t, NUM_WARPS, COLS_PER_WARP, MIN_BLOCKS>, launch_params,
+        q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+        n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+        sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+}
+
+ template <bool KDA, bool keep_rs_t>
+ static void launch_gated_delta_net(
+         const float * q_d, const float * k_d, const float * v_d,
+@@ -223,47 +293,55 @@ static void launch_gated_delta_net(
+         float scale, int K, cudaStream_t stream) {
+     //TODO: Add chunked kernel for even faster pre-fill
+     const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
+-    const int num_warps = 4;
+-    dim3      grid_dims(H, n_seqs, (S_v + num_warps - 1) / num_warps);
+-    dim3      block_dims(warp_size <= S_v ? warp_size : S_v, num_warps, 1);
+ 
+     const uint3 neqk1_magic = init_fastdiv_values(neqk1);
+     const uint3 rq3_magic   = init_fastdiv_values(rq3);
+ 
+-    int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+#define GDN_LAUNCH_ARGS \
+        q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
+        H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
+        neqk1_magic, rq3_magic, scale, K, warp_size, stream
+ 
+-    const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
+     switch (S_v) {
+         case 16:
+-            ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
+-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+            launch_gdn_variant<16, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
+             break;
+         case 32:
+-            ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
+-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+            launch_gdn_variant<32, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
+             break;
+-        case 64: {
+-            ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
+-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+        case 64:
+            launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
+             break;
+-        }
+         case 128: {
+-            ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
+-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+            // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP columns per warp
+            // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is
+            // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every
+            // selectable {num_warps, cols} is bit-identical, so the sweep cannot change the md5).
+            static const int gdn_nw  = []{ const char * e = getenv("GDN_NW");  return e ? atoi(e) : GDN_DEFAULT_NW;  }();
+            static const int gdn_cpw = []{ const char * e = getenv("GDN_CPW"); return e ? atoi(e) : GDN_DEFAULT_CPW; }();
+            // NUM_WARPS in {4,8,16} x COLS_PER_WARP ladder (all <=512 threads/block, no 1024-thread
+            // .minnctapersm warnings). Measured GB10 %peak: (4,1)=73 baseline ... (16,4)=82 ...
+            // (16,8)=84.7 winner ~ tied with (8,8)/(8,16)/(32,4); the plateau is just above vLLM (82.4).
+            if      (gdn_nw == 4  && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 4,  1, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 4  && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 4,  2, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 4  && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 4,  4, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 8  && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 8,  1, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 8  && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 8,  2, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 8  && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 8,  4, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 8  && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 8,  8, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 16 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 16, 1, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 16 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 16, 2, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 16 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 16, 4, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 16 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 16, 8, 2>(GDN_LAUNCH_ARGS);
+            else                                   launch_gdn_variant<128, KDA, keep_rs_t, GDN_DEFAULT_NW, GDN_DEFAULT_CPW, 2>(GDN_LAUNCH_ARGS);
+             break;
+         }
+         default:
+             GGML_ABORT("fatal error");
+             break;
+     }
+
+#undef GDN_LAUNCH_ARGS
+ }
+ 
+ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md
@@ -0,0 +1,119 @@
+# OCCUPANCY_RETUNE_RESULTS.md - CRUX SETTLED: vLLM recurrence state is FLOAT32 (805 MB/call)
+
+Phase: vllm-f32-confirm (GPU agent). DGX GB10, peak DRAM BW = 273 GB/s.
+Checkpoint: ~/bench/q36-27b-nvfp4-vllm (vLLM 0.23.0), ~/bench/q36-27b-nvfp4.gguf (llama HEAD 58426b5, conv-fusion 0021).
+NOTE: ncu HW perf-counters are perm-blocked on this node (RmProfilingAdminOnly:1, no passwordless sudo, ERR_NVGPUCTRPERM).
+Settled WITHOUT counters: (a) empirical tensor dtype at the kernel boundary, (b) nsys/CUPTI kernel timing (counter-free), (c) source+config chain.
+
+## VERDICT: f32. The close-check is RIGHT. The byte-gate (402 MB/bf16) is WRONG. BUILD THE BIT-EXACT OCCUPANCY RETUNE.
+
+vLLM carries the gated-DeltaNet TEMPORAL/recurrent state in FLOAT32 and moves 805.3 MB/call, NOT 402 MB bf16.
+Both engines move the SAME ~805 MB f32 recurrent state per call. The gap is pure BANDWIDTH EFFICIENCY on equal f32 bytes.
+
+## vLLM (kernel: fused_recurrent_gated_delta_rule_packed_decode)
+- EMPIRICAL tensor at kernel boundary (initial_state = self.kv_cache[1], qwen_gdn_linear_attn.py:1316/1492):
+    dtype=torch.float32  elem_bytes=4  shape=(1553, 48, 128, 128)  per-slot state = 786432 elems = 3.000 MiB (f32)
+- MB/call (B=128, Read+Write) = 128 * 48*128*128 * 4 bytes * 2 = 805,306,368 B = 805.3 MB  (bf16 would be 402.7 MB)
+- Runtime engine config: cache_config.mamba_ssm_cache_dtype = float32  (mamba_cache_dtype=auto/bf16 for conv)
+- Source chain: config.json text_config.mamba_ssm_dtype=float32 -> Qwen3_5ForConditionalGenerationConfig.verify_and_update_config
+    sets cache_config.mamba_ssm_cache_dtype="float32" -> MambaStateDtypeCalculator._mamba_state_dtype else-branch
+    -> temporal_state_dtype = torch.float32 (conv state = bf16; temporal/SSM state = f32).
+- Kernel timing (CUDA events, eager B=128, 432 steady-decode calls): median 3.578 ms/call, min 3.499, mean 3.593, p90 3.635
+    BW @ median = 805.3MB / 3.578ms = 225.1 GB/s = 82.4% of 273 peak  (min 84.3%, p90 81.1%)
+
+## llama (kernel: gated_delta_net_cuda<128, 0, 0>)
+- Kernel signature: all operands const float* (q,k,v,g,beta,curr_state) + float* state_dst => recurrent state is f32. Source-confirmed.
+- Identical state geometry (48 value-heads x 128 head_v x 128 head_k, B=128) => MB/call (R+W) = 805.3 MB f32 (same as vLLM).
+- Fresh nsys (--cuda-graph-trace=node, build-cuda-base, -npp128 -ntg24 -npl128, q36-27b-nvfp4.gguf):
+    gated_delta_net = 25.4% of GPU time (#2 kernel after nvfp4 mul_mat_q).
+    Decode cluster isolated = exactly n=1152 calls (= 24 ntg x 48 GDN layers), B=128 steady state:
+      median 4.0211 ms/call, mean 4.0315 => 200.3 GB/s = 73.4% of 273 peak.
+    (Consistent with prior GAP_PROGRESS 4.08ms/~70% and context 3.98ms/202GB/s/74%.)
+
+## THE GAP (equal f32 bytes, different efficiency)
+  llama   805.3 MB / 4.021 ms = 200.3 GB/s = 73.4% peak
+  vLLM    805.3 MB / 3.578 ms = 225.1 GB/s = 82.4% peak
+  => vLLM is ~11% faster per recurrence call at IDENTICAL byte volume => ~9 pts more DRAM BW efficiency.
+  Retune target: 73.4% -> ~82% peak, recurrence 4.02 -> ~3.58 ms/call, KEEPING exact per-column f32
+  reduction/FMA order (md5-gateable bit-identical). bf16 plan stays SHELVED (optional over-clock only).
+
+---
+
+# retune-build (BUILD AGENT) — patch 0022 SHIPPED
+
+vLLM verdict re-checked first: **f32, 805 MB/call** (the close-check is right, the byte-gate's 402 MB/bf16
+is wrong). The bf16-state plan stays SHELVED. Built the bit-exact occupancy/coalescing retune.
+
+## The change — bit-exact column folding (Lever A + B + D)
+
+`ggml/src/ggml-cuda/gated_delta_net.cu` `gated_delta_net_cuda`: two new template params
+`NUM_WARPS` (default 4) and `COLS_PER_WARP` (default 1) plus `MIN_BLOCKS`. Each warp now owns
+`COLS_PER_WARP` columns of the 128x128 recurrent state instead of 1, looping the existing per-column
+body over `col, col+NUM_WARPS, ...` inside a per-block column tile of `NUM_WARPS*COLS_PER_WARP` columns;
+`grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)`.
+
+Why it is bit-exact: the S_v rows of every column stay sharded across the lanes by the SAME strided
+mapping `i = r*warp_size + lane`, and every column's per-lane FMA accumulation and
+`warp_reduce_sum<warp_size>` XOR-butterfly are byte-for-byte unchanged. Only the
+`(warp,block)->column` assignment and the order a warp visits its columns differ, and a column's f32
+value provably does not depend on either (columns are fully independent — column c reads only its own
+S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). The forbidden `float4`
+state load (Lever E) — which would repartition a lane to 4 contiguous rows and change the reduction
+grouping — was NOT done; this keeps the md5 invariant. Every global access stays identically coalesced
+(32 consecutive lanes -> one 128B sector), so this is a latency-coverage / scheduling win (higher
+per-warp memory-level parallelism: COLS_PER_WARP independent state-load bursts issued before any
+reduction + the independent butterfly reductions interleave to hide each other's shfl latency), NOT a
+coalescing change. The S_v=128 tile is env-selectable via `GDN_NW`/`GDN_CPW` for one-build re-tuning;
+default is the measured GB10 winner **(NUM_WARPS=16, COLS_PER_WARP=8)**.
+
+## %peak sweep — GB10, CUDA 13, sm_121 (nsys CUPTI timing; HW counters perm-blocked)
+
+Metric: median of the 1152 (=ntg24 x 48 layers) B=128 decode calls, each moving 805.3 MB f32 (R+W),
+isolated by the [2.5ms,6ms] band; %peak vs 273 GB/s. Baseline re-isolation reproduced the confirm
+agent's 4.021 ms / 73.4% exactly (n=1152).
+
+| NUM_WARPS x COLS_PER_WARP | ms/call | GB/s | %peak |
+|---------------------------|---------|------|-------|
+| base (0021)               | 4.021   | 200.3| 73.4  |
+| 4 x 1 (control == base)   | 4.034   | 199.7| 73.1  |
+| 4 x 2                     | 3.887   | 207.2| 75.9  |
+| 4 x 4                     | 3.775   | 213.3| 78.1  |
+| 8 x 1                     | 3.837   | 209.9| 76.9  |
+| 8 x 2                     | 3.749   | 214.8| 78.7  |
+| 8 x 4                     | 3.699   | 217.7| 79.9  |
+| 8 x 8                     | 3.586   | 224.6| 82.3  |
+| 16 x 2                    | 3.665   | 219.8| 80.5  |
+| 16 x 4                    | 3.585   | 224.7| 82.3  |
+| **16 x 8  (WINNER/default)** | **3.488** | **230.9** | **84.6** |
+| 32 x 4                    | 3.489   | 230.8| 84.6  |
+
+Plateau ~84.5% at the grid.z=1 tiles; (16,8) picked as default (512-thread block, no spill, no
+1024-thread .minnctapersm warning). **84.6% > vLLM 82.4%.**
+
+## Gates (both PASS, non-negotiable)
+
+- **md5 BYTE-IDENTICAL to the 0021 baseline**, greedy `--temp 0 --seed 1 -n 48`, both models, winner
+  (16,8 default) AND (4,1 control):
+  - q36-27b-nvfp4 (dense): `5951a5b4d624ce891e22ab5fca9bc439` (baseline == winner == control)
+  - q36-35b-a3b-nvfp4 (MoE): `07db32c2bcb78d17a43ed18bc22705cd` (baseline == winner == control)
+- **test-backend-ops -o GATED_DELTA_NET: 36/36 PASS** (covers head_size=128, kda=0/1, prefill K>1).
+
+## Decode throughput — base vs flag(16,8), llama-batched-bench -npp128 -ntg128 -fa on
+
+| model | npl | base S_TG t/s | flag S_TG t/s | gain |
+|-------|-----|---------------|---------------|------|
+| dense 27b | 32  | 199.2 | 207.6 | +4.2% |
+| dense 27b | 128 | 335.9 | 373.2 | +11.1% |
+| MoE 35b-a3b | 32  | 420.6 | 440.0 | +4.6% |
+| MoE 35b-a3b | 128 | 688.4 | 745.7 | +8.3% |
+
+Prefill S_PP unchanged (dense ~930, MoE ~2185 t/s) — no regression. Stable across 3 samples.
+
+## Parity vs vLLM (recurrence kernel)
+
+Recurrence kernel BW: before 200.3 GB/s = 89.0% of vLLM's 225.1; **after 230.9 GB/s = 102.6% of vLLM**
+(3.488 ms/call < vLLM 3.578 ms/call). The recurrence bandwidth gap that this workflow set out to close
+is closed and slightly exceeded; the remaining decode-parity delta lives in the non-recurrence path
+(matmul/attn), not in gated-DeltaNet.
+
+Shipped: patch 0022, committed on the DGX dev tree and the LocalAI worktree. No push.