diff --git a/backend/cpp/llama-cpp/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch b/backend/cpp/llama-cpp/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch new file mode 100644 index 000000000..6b6eae468 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch @@ -0,0 +1,403 @@ +From 8a3229f41d5b712e87901796dfae3faee1f2f07d Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Thu, 25 Jun 2026 20:32:55 +0200 +Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet decode + occupancy/coalescing retune (patch 0022) + +Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence +kernel. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in +float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth +efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak), +the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA +order byte-identical (md5-gateable). The bf16-state plan stays shelved. + +Column folding: two new template params NUM_WARPS (default 4) and COLS_PER_WARP +(default 1). Each warp now owns COLS_PER_WARP columns of the 128x128 recurrent +state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, +... within a per-block column tile of NUM_WARPS*COLS_PER_WARP columns; +grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The S_v rows of every column stay sharded +across the lanes by the same strided i = r*warp_size + lane mapping, and every +column's per-lane FMA accumulation and warp_reduce_sum butterfly are byte-for-byte +unchanged; only the (warp,block)->column assignment and visit order differ, which a +column's value provably does not depend on (columns are fully independent). This +raises per-warp memory-level parallelism ~COLS_PER_WARP-fold (independent +state-load bursts before any reduction + interleaved butterfly reductions hiding +each other's shfl latency), covering more DRAM latency on this bandwidth-bound +kernel. Every global access stays identically coalesced, so it is a scheduling / +latency-coverage win, not a coalescing change. The forbidden float4 state load +(which would repartition a lane to 4 contiguous rows and change the reduction +grouping) is NOT done, so the md5 stays invariant. The S_v=128 tile is +env-selectable (GDN_NW / GDN_CPW) for one-build re-tuning; default is the measured +GB10 winner (16, 8). + +GB10 (CUDA 13, sm_121, nsys CUPTI timing - HW counters perm-blocked): +gated_delta_net B=128 decode call (805.3 MB f32 R+W) 4.02 -> 3.49 ms/call, +200.3 -> 230.9 GB/s = 73.4% -> 84.6% of 273 GB/s peak (now above vLLM's 82.4%; +102.6% of vLLM's recurrence bandwidth). decode S_TG t/s (npp128 ntg128, -fa on): +dense 27b npl128 335.9 -> 373.2 (+11.1%), npl32 199.2 -> 207.6 (+4.2%); MoE +35b-a3b npl128 688.4 -> 745.7 (+8.3%), npl32 420.6 -> 440.0 (+4.6%). Prefill +unchanged. + +Bit-exact: greedy --temp 0 --seed 1 md5 byte-identical to the 0021 baseline on +both q36-27b-nvfp4 and q36-35b-a3b-nvfp4 (winner 16x8 and 4x1 control); +test-backend-ops -o GATED_DELTA_NET 36/36 PASS. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/gated_delta_net.cu | 236 +++++++++++++++++--------- + 1 file changed, 157 insertions(+), 79 deletions(-) + +diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu +index 86d5e2a..d071d5a 100644 +--- a/ggml/src/ggml-cuda/gated_delta_net.cu ++++ b/ggml/src/ggml-cuda/gated_delta_net.cu +@@ -1,6 +1,8 @@ + #include "gated_delta_net.cuh" + #include "ggml-cuda/common.cuh" + ++#include ++ + // Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a + // disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the + // destination slot by the recurrence kernel and are skipped here. One block per sequence. +@@ -29,8 +31,22 @@ static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * i + gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs); + } + +-template +-__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2) ++// Occupancy/coalescing retune (patch 0022). Each warp owns COLS_PER_WARP columns of the recurrent ++// state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a ++// per-block column tile of size NUM_WARPS*COLS_PER_WARP. The S_v rows of every column stay sharded ++// across the lanes by the SAME strided mapping i = r*warp_size + lane, and every column's per-lane ++// FMA accumulation and warp_reduce_sum butterfly are byte-for-byte unchanged. Only the ++// (warp,block)->column assignment and the order a warp visits its columns differ, and a column's ++// f32 value provably does not depend on either (columns are fully independent: column c reads only ++// its own S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). So the result ++// and the stored final state are bit-identical to the COLS_PER_WARP==1 baseline (md5-gateable), ++// while per-warp memory-level parallelism rises ~COLS_PER_WARP-fold (COLS_PER_WARP independent ++// state-load bursts issued before any reduction, and the independent butterfly reductions interleave ++// to hide each other's shfl latency) which covers more DRAM latency on this bandwidth-bound kernel. ++// Every individual global access stays IDENTICALLY coalesced (32 consecutive lanes -> one 128B ++// sector), so this is a latency-coverage / scheduling win, not a coalescing change. ++template ++__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * NUM_WARPS, MIN_BLOCKS) + gated_delta_net_cuda(const float * q, + const float * k, + const float * v, +@@ -59,9 +75,9 @@ gated_delta_net_cuda(const float * q, + int rs_head) { + const uint32_t h_idx = blockIdx.x; + const uint32_t sequence = blockIdx.y; +- // each warp owns one column, using warp-level primitives to reduce across rows ++ // each warp owns COLS_PER_WARP columns, using warp-level primitives to reduce across rows. + const int lane = threadIdx.x; +- const int col = blockIdx.z * blockDim.y + threadIdx.y; ++ const int col_base = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y; + + const uint32_t iq1 = fastmodulo(h_idx, neqk1_magic); + const uint32_t iq3 = fastdiv(sequence, rq3_magic); +@@ -86,20 +102,25 @@ gated_delta_net_cuda(const float * q, + // writing the same slot per block (identity) is race-free. + const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence) + ? state_dst : curr_state; +- read_state += state_in_offset + col * S_v; ++ read_state += state_in_offset; + attn_data += (sequence * n_tokens * H + h_idx) * S_v; + + constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v; + static_assert(S_v % warp_size == 0, "S_v must be a multiple of warp_size"); + constexpr int rows_per_lane = (S_v + warp_size - 1) / warp_size; +- float s_shard[rows_per_lane]; +- // state is stored transposed: M[col][i] = S[i][col], row col is contiguous ++ // per-column register shard of the recurrent state; state is stored transposed: M[col][i] = S[i][col]. ++ float s_shard[COLS_PER_WARP][rows_per_lane]; + + ggml_cuda_pdl_sync(); + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- const int i = r * warp_size + lane; +- s_shard[r] = read_state[i]; ++ for (int cc = 0; cc < COLS_PER_WARP; cc++) { ++ const int col = col_base + cc * NUM_WARPS; ++ const float * rs = read_state + col * S_v; ++#pragma unroll ++ for (int r = 0; r < rows_per_lane; r++) { ++ const int i = r * warp_size + lane; ++ s_shard[cc][r] = rs[i]; ++ } + } + + for (int t = 0; t < n_tokens; t++) { +@@ -113,7 +134,7 @@ gated_delta_net_cuda(const float * q, + + const float beta_val = *beta_t; + +- // Cache k and q in registers ++ // Cache k and q in registers (shared across the COLS_PER_WARP columns of this warp). + float k_reg[rows_per_lane]; + float q_reg[rows_per_lane]; + #pragma unroll +@@ -126,59 +147,69 @@ gated_delta_net_cuda(const float * q, + if constexpr (!KDA) { + const float g_val = expf(*g_t); + +- // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i] +- float kv_shard = 0.0f; + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- kv_shard += s_shard[r] * k_reg[r]; +- } +- float kv_col = warp_reduce_sum(kv_shard); ++ for (int cc = 0; cc < COLS_PER_WARP; cc++) { ++ const int col = col_base + cc * NUM_WARPS; + +- // delta[col] = (v[col] - g * kv[col]) * beta +- float delta_col = (v_t[col] - g_val * kv_col) * beta_val; ++ // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i] ++ float kv_shard = 0.0f; ++#pragma unroll ++ for (int r = 0; r < rows_per_lane; r++) { ++ kv_shard += s_shard[cc][r] * k_reg[r]; ++ } ++ float kv_col = warp_reduce_sum(kv_shard); + +- // fused: S[i][col] = g * S[i][col] + k[i] * delta[col] +- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i] +- float attn_partial = 0.0f; ++ // delta[col] = (v[col] - g * kv[col]) * beta ++ float delta_col = (v_t[col] - g_val * kv_col) * beta_val; ++ ++ // fused: S[i][col] = g * S[i][col] + k[i] * delta[col] ++ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i] ++ float attn_partial = 0.0f; + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- s_shard[r] = g_val * s_shard[r] + k_reg[r] * delta_col; +- attn_partial += s_shard[r] * q_reg[r]; +- } ++ for (int r = 0; r < rows_per_lane; r++) { ++ s_shard[cc][r] = g_val * s_shard[cc][r] + k_reg[r] * delta_col; ++ attn_partial += s_shard[cc][r] * q_reg[r]; ++ } + +- float attn_col = warp_reduce_sum(attn_partial); ++ float attn_col = warp_reduce_sum(attn_partial); + +- if (lane == 0) { +- attn_data[col] = attn_col * scale; ++ if (lane == 0) { ++ attn_data[col] = attn_col * scale; ++ } + } + } else { +- // kv[col] = sum_i g[i] * S[i][col] * k[i] +- float kv_shard = 0.0f; + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- const int i = r * warp_size + lane; +- kv_shard += expf(g_t[i]) * s_shard[r] * k_reg[r]; +- } ++ for (int cc = 0; cc < COLS_PER_WARP; cc++) { ++ const int col = col_base + cc * NUM_WARPS; ++ ++ // kv[col] = sum_i g[i] * S[i][col] * k[i] ++ float kv_shard = 0.0f; ++#pragma unroll ++ for (int r = 0; r < rows_per_lane; r++) { ++ const int i = r * warp_size + lane; ++ kv_shard += expf(g_t[i]) * s_shard[cc][r] * k_reg[r]; ++ } + +- float kv_col = warp_reduce_sum(kv_shard); ++ float kv_col = warp_reduce_sum(kv_shard); + +- // delta[col] = (v[col] - kv[col]) * beta +- float delta_col = (v_t[col] - kv_col) * beta_val; ++ // delta[col] = (v[col] - kv[col]) * beta ++ float delta_col = (v_t[col] - kv_col) * beta_val; + +- // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col] +- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i] +- float attn_partial = 0.0f; ++ // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col] ++ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i] ++ float attn_partial = 0.0f; + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- const int i = r * warp_size + lane; +- s_shard[r] = expf(g_t[i]) * s_shard[r] + k_reg[r] * delta_col; +- attn_partial += s_shard[r] * q_reg[r]; +- } ++ for (int r = 0; r < rows_per_lane; r++) { ++ const int i = r * warp_size + lane; ++ s_shard[cc][r] = expf(g_t[i]) * s_shard[cc][r] + k_reg[r] * delta_col; ++ attn_partial += s_shard[cc][r] * q_reg[r]; ++ } + +- float attn_col = warp_reduce_sum(attn_partial); ++ float attn_col = warp_reduce_sum(attn_partial); + +- if (lane == 0) { +- attn_data[col] = attn_col * scale; ++ if (lane == 0) { ++ attn_data[col] = attn_col * scale; ++ } + } + } + +@@ -190,11 +221,15 @@ gated_delta_net_cuda(const float * q, + const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output + const int target_slot = (int) n_tokens - 1 - t; + if (target_slot >= 0 && target_slot < K) { +- float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset; + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- const int i = r * warp_size + lane; +- curr_state[col * S_v + i] = s_shard[r]; ++ for (int cc = 0; cc < COLS_PER_WARP; cc++) { ++ const int col = col_base + cc * NUM_WARPS; ++ float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset; ++#pragma unroll ++ for (int r = 0; r < rows_per_lane; r++) { ++ const int i = r * warp_size + lane; ++ curr_state[col * S_v + i] = s_shard[cc][r]; ++ } + } + } + } +@@ -202,13 +237,48 @@ gated_delta_net_cuda(const float * q, + + if constexpr (!keep_rs_t) { + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- const int i = r * warp_size + lane; +- state[col * S_v + i] = s_shard[r]; ++ for (int cc = 0; cc < COLS_PER_WARP; cc++) { ++ const int col = col_base + cc * NUM_WARPS; ++#pragma unroll ++ for (int r = 0; r < rows_per_lane; r++) { ++ const int i = r * warp_size + lane; ++ state[col * S_v + i] = s_shard[cc][r]; ++ } + } + } + } + ++// Default column-folding tile for the S_v==128 decode/prefill path (the GDN head dim of this model). ++// Measured winner of the bit-exact occupancy sweep (patch 0022). Override at runtime for the sweep ++// via GDN_NW / GDN_CPW; all selectable variants are bit-identical, only %peak differs. ++#ifndef GDN_DEFAULT_NW ++#define GDN_DEFAULT_NW 16 ++#endif ++#ifndef GDN_DEFAULT_CPW ++#define GDN_DEFAULT_CPW 8 ++#endif ++ ++template ++static void launch_gdn_variant( ++ const float * q_d, const float * k_d, const float * v_d, ++ const float * g_d, const float * b_d, const float * s_d, ++ float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head, ++ int64_t H, int64_t n_tokens, int64_t n_seqs, ++ int64_t sq1, int64_t sq2, int64_t sq3, ++ int64_t sv1, int64_t sv2, int64_t sv3, ++ int64_t sb1, int64_t sb2, int64_t sb3, ++ const uint3 neqk1_magic, const uint3 rq3_magic, ++ float scale, int K, int warp_size, cudaStream_t stream) { ++ static_assert(S_v % (NUM_WARPS * COLS_PER_WARP) == 0, "NUM_WARPS*COLS_PER_WARP must divide S_v"); ++ dim3 grid_dims(H, n_seqs, S_v / (NUM_WARPS * COLS_PER_WARP)); ++ dim3 block_dims(warp_size <= S_v ? warp_size : S_v, NUM_WARPS, 1); ++ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream); ++ ggml_cuda_kernel_launch(gated_delta_net_cuda, launch_params, ++ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, ++ n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); ++} ++ + template + static void launch_gated_delta_net( + const float * q_d, const float * k_d, const float * v_d, +@@ -223,47 +293,55 @@ static void launch_gated_delta_net( + float scale, int K, cudaStream_t stream) { + //TODO: Add chunked kernel for even faster pre-fill + const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size; +- const int num_warps = 4; +- dim3 grid_dims(H, n_seqs, (S_v + num_warps - 1) / num_warps); +- dim3 block_dims(warp_size <= S_v ? warp_size : S_v, num_warps, 1); + + const uint3 neqk1_magic = init_fastdiv_values(neqk1); + const uint3 rq3_magic = init_fastdiv_values(rq3); + +- int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc; ++#define GDN_LAUNCH_ARGS \ ++ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \ ++ H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \ ++ neqk1_magic, rq3_magic, scale, K, warp_size, stream + +- const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream); + switch (S_v) { + case 16: +- ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params, +- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, +- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); ++ launch_gdn_variant<16, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS); + break; + case 32: +- ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params, +- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, +- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); ++ launch_gdn_variant<32, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS); + break; +- case 64: { +- ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params, +- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, +- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); ++ case 64: ++ launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS); + break; +- } + case 128: { +- ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params, +- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, +- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); ++ // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP columns per warp ++ // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is ++ // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every ++ // selectable {num_warps, cols} is bit-identical, so the sweep cannot change the md5). ++ static const int gdn_nw = []{ const char * e = getenv("GDN_NW"); return e ? atoi(e) : GDN_DEFAULT_NW; }(); ++ static const int gdn_cpw = []{ const char * e = getenv("GDN_CPW"); return e ? atoi(e) : GDN_DEFAULT_CPW; }(); ++ // NUM_WARPS in {4,8,16} x COLS_PER_WARP ladder (all <=512 threads/block, no 1024-thread ++ // .minnctapersm warnings). Measured GB10 %peak: (4,1)=73 baseline ... (16,4)=82 ... ++ // (16,8)=84.7 winner ~ tied with (8,8)/(8,16)/(32,4); the plateau is just above vLLM (82.4). ++ if (gdn_nw == 4 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 4 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 4, 2, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 4 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 4, 4, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 8 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 8, 1, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 8 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 8, 2, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 8 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 8, 4, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 8 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 8, 8, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 16 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 16, 1, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 16 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 16, 2, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 16 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 16, 4, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 16 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 16, 8, 2>(GDN_LAUNCH_ARGS); ++ else launch_gdn_variant<128, KDA, keep_rs_t, GDN_DEFAULT_NW, GDN_DEFAULT_CPW, 2>(GDN_LAUNCH_ARGS); + break; + } + default: + GGML_ABORT("fatal error"); + break; + } ++ ++#undef GDN_LAUNCH_ARGS + } + + void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * dst) { +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md new file mode 100644 index 000000000..e05d87bd0 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md @@ -0,0 +1,119 @@ +# OCCUPANCY_RETUNE_RESULTS.md - CRUX SETTLED: vLLM recurrence state is FLOAT32 (805 MB/call) + +Phase: vllm-f32-confirm (GPU agent). DGX GB10, peak DRAM BW = 273 GB/s. +Checkpoint: ~/bench/q36-27b-nvfp4-vllm (vLLM 0.23.0), ~/bench/q36-27b-nvfp4.gguf (llama HEAD 58426b5, conv-fusion 0021). +NOTE: ncu HW perf-counters are perm-blocked on this node (RmProfilingAdminOnly:1, no passwordless sudo, ERR_NVGPUCTRPERM). +Settled WITHOUT counters: (a) empirical tensor dtype at the kernel boundary, (b) nsys/CUPTI kernel timing (counter-free), (c) source+config chain. + +## VERDICT: f32. The close-check is RIGHT. The byte-gate (402 MB/bf16) is WRONG. BUILD THE BIT-EXACT OCCUPANCY RETUNE. + +vLLM carries the gated-DeltaNet TEMPORAL/recurrent state in FLOAT32 and moves 805.3 MB/call, NOT 402 MB bf16. +Both engines move the SAME ~805 MB f32 recurrent state per call. The gap is pure BANDWIDTH EFFICIENCY on equal f32 bytes. + +## vLLM (kernel: fused_recurrent_gated_delta_rule_packed_decode) +- EMPIRICAL tensor at kernel boundary (initial_state = self.kv_cache[1], qwen_gdn_linear_attn.py:1316/1492): + dtype=torch.float32 elem_bytes=4 shape=(1553, 48, 128, 128) per-slot state = 786432 elems = 3.000 MiB (f32) +- MB/call (B=128, Read+Write) = 128 * 48*128*128 * 4 bytes * 2 = 805,306,368 B = 805.3 MB (bf16 would be 402.7 MB) +- Runtime engine config: cache_config.mamba_ssm_cache_dtype = float32 (mamba_cache_dtype=auto/bf16 for conv) +- Source chain: config.json text_config.mamba_ssm_dtype=float32 -> Qwen3_5ForConditionalGenerationConfig.verify_and_update_config + sets cache_config.mamba_ssm_cache_dtype="float32" -> MambaStateDtypeCalculator._mamba_state_dtype else-branch + -> temporal_state_dtype = torch.float32 (conv state = bf16; temporal/SSM state = f32). +- Kernel timing (CUDA events, eager B=128, 432 steady-decode calls): median 3.578 ms/call, min 3.499, mean 3.593, p90 3.635 + BW @ median = 805.3MB / 3.578ms = 225.1 GB/s = 82.4% of 273 peak (min 84.3%, p90 81.1%) + +## llama (kernel: gated_delta_net_cuda<128, 0, 0>) +- Kernel signature: all operands const float* (q,k,v,g,beta,curr_state) + float* state_dst => recurrent state is f32. Source-confirmed. +- Identical state geometry (48 value-heads x 128 head_v x 128 head_k, B=128) => MB/call (R+W) = 805.3 MB f32 (same as vLLM). +- Fresh nsys (--cuda-graph-trace=node, build-cuda-base, -npp128 -ntg24 -npl128, q36-27b-nvfp4.gguf): + gated_delta_net = 25.4% of GPU time (#2 kernel after nvfp4 mul_mat_q). + Decode cluster isolated = exactly n=1152 calls (= 24 ntg x 48 GDN layers), B=128 steady state: + median 4.0211 ms/call, mean 4.0315 => 200.3 GB/s = 73.4% of 273 peak. + (Consistent with prior GAP_PROGRESS 4.08ms/~70% and context 3.98ms/202GB/s/74%.) + +## THE GAP (equal f32 bytes, different efficiency) + llama 805.3 MB / 4.021 ms = 200.3 GB/s = 73.4% peak + vLLM 805.3 MB / 3.578 ms = 225.1 GB/s = 82.4% peak + => vLLM is ~11% faster per recurrence call at IDENTICAL byte volume => ~9 pts more DRAM BW efficiency. + Retune target: 73.4% -> ~82% peak, recurrence 4.02 -> ~3.58 ms/call, KEEPING exact per-column f32 + reduction/FMA order (md5-gateable bit-identical). bf16 plan stays SHELVED (optional over-clock only). + +--- + +# retune-build (BUILD AGENT) — patch 0022 SHIPPED + +vLLM verdict re-checked first: **f32, 805 MB/call** (the close-check is right, the byte-gate's 402 MB/bf16 +is wrong). The bf16-state plan stays SHELVED. Built the bit-exact occupancy/coalescing retune. + +## The change — bit-exact column folding (Lever A + B + D) + +`ggml/src/ggml-cuda/gated_delta_net.cu` `gated_delta_net_cuda`: two new template params +`NUM_WARPS` (default 4) and `COLS_PER_WARP` (default 1) plus `MIN_BLOCKS`. Each warp now owns +`COLS_PER_WARP` columns of the 128x128 recurrent state instead of 1, looping the existing per-column +body over `col, col+NUM_WARPS, ...` inside a per-block column tile of `NUM_WARPS*COLS_PER_WARP` columns; +`grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)`. + +Why it is bit-exact: the S_v rows of every column stay sharded across the lanes by the SAME strided +mapping `i = r*warp_size + lane`, and every column's per-lane FMA accumulation and +`warp_reduce_sum` XOR-butterfly are byte-for-byte unchanged. Only the +`(warp,block)->column` assignment and the order a warp visits its columns differ, and a column's f32 +value provably does not depend on either (columns are fully independent — column c reads only its own +S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). The forbidden `float4` +state load (Lever E) — which would repartition a lane to 4 contiguous rows and change the reduction +grouping — was NOT done; this keeps the md5 invariant. Every global access stays identically coalesced +(32 consecutive lanes -> one 128B sector), so this is a latency-coverage / scheduling win (higher +per-warp memory-level parallelism: COLS_PER_WARP independent state-load bursts issued before any +reduction + the independent butterfly reductions interleave to hide each other's shfl latency), NOT a +coalescing change. The S_v=128 tile is env-selectable via `GDN_NW`/`GDN_CPW` for one-build re-tuning; +default is the measured GB10 winner **(NUM_WARPS=16, COLS_PER_WARP=8)**. + +## %peak sweep — GB10, CUDA 13, sm_121 (nsys CUPTI timing; HW counters perm-blocked) + +Metric: median of the 1152 (=ntg24 x 48 layers) B=128 decode calls, each moving 805.3 MB f32 (R+W), +isolated by the [2.5ms,6ms] band; %peak vs 273 GB/s. Baseline re-isolation reproduced the confirm +agent's 4.021 ms / 73.4% exactly (n=1152). + +| NUM_WARPS x COLS_PER_WARP | ms/call | GB/s | %peak | +|---------------------------|---------|------|-------| +| base (0021) | 4.021 | 200.3| 73.4 | +| 4 x 1 (control == base) | 4.034 | 199.7| 73.1 | +| 4 x 2 | 3.887 | 207.2| 75.9 | +| 4 x 4 | 3.775 | 213.3| 78.1 | +| 8 x 1 | 3.837 | 209.9| 76.9 | +| 8 x 2 | 3.749 | 214.8| 78.7 | +| 8 x 4 | 3.699 | 217.7| 79.9 | +| 8 x 8 | 3.586 | 224.6| 82.3 | +| 16 x 2 | 3.665 | 219.8| 80.5 | +| 16 x 4 | 3.585 | 224.7| 82.3 | +| **16 x 8 (WINNER/default)** | **3.488** | **230.9** | **84.6** | +| 32 x 4 | 3.489 | 230.8| 84.6 | + +Plateau ~84.5% at the grid.z=1 tiles; (16,8) picked as default (512-thread block, no spill, no +1024-thread .minnctapersm warning). **84.6% > vLLM 82.4%.** + +## Gates (both PASS, non-negotiable) + +- **md5 BYTE-IDENTICAL to the 0021 baseline**, greedy `--temp 0 --seed 1 -n 48`, both models, winner + (16,8 default) AND (4,1 control): + - q36-27b-nvfp4 (dense): `5951a5b4d624ce891e22ab5fca9bc439` (baseline == winner == control) + - q36-35b-a3b-nvfp4 (MoE): `07db32c2bcb78d17a43ed18bc22705cd` (baseline == winner == control) +- **test-backend-ops -o GATED_DELTA_NET: 36/36 PASS** (covers head_size=128, kda=0/1, prefill K>1). + +## Decode throughput — base vs flag(16,8), llama-batched-bench -npp128 -ntg128 -fa on + +| model | npl | base S_TG t/s | flag S_TG t/s | gain | +|-------|-----|---------------|---------------|------| +| dense 27b | 32 | 199.2 | 207.6 | +4.2% | +| dense 27b | 128 | 335.9 | 373.2 | +11.1% | +| MoE 35b-a3b | 32 | 420.6 | 440.0 | +4.6% | +| MoE 35b-a3b | 128 | 688.4 | 745.7 | +8.3% | + +Prefill S_PP unchanged (dense ~930, MoE ~2185 t/s) — no regression. Stable across 3 samples. + +## Parity vs vLLM (recurrence kernel) + +Recurrence kernel BW: before 200.3 GB/s = 89.0% of vLLM's 225.1; **after 230.9 GB/s = 102.6% of vLLM** +(3.488 ms/call < vLLM 3.578 ms/call). The recurrence bandwidth gap that this workflow set out to close +is closed and slightly exceeded; the remaining decode-parity delta lives in the non-recurrence path +(matmul/attn), not in gated-DeltaNet. + +Shipped: patch 0022, committed on the DGX dev tree and the LocalAI worktree. No push.