mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-25 17:12:10 -04:00
feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune (patch 0022)
Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence kernel, carried as paged patch 0022. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak), the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA order byte-identical (md5-gateable). The bf16-state plan stays shelved. Column folding: each warp owns COLS_PER_WARP columns of the 128x128 recurrent state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a per-block column tile; grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The per-lane strided row sharding and the warp_reduce butterfly are unchanged, so only the (warp,block)->column assignment differs and the result is bit-identical; per-warp memory-level parallelism rises ~COLS_PER_WARP-fold, covering more DRAM latency on this bandwidth-bound kernel. Default tile is the measured GB10 winner (NUM_WARPS=16, COLS_PER_WARP=8), env-selectable via GDN_NW / GDN_CPW. GB10: gated_delta_net decode 4.02 -> 3.49 ms/call, 73.4% -> 84.6% of peak (above vLLM's 82.4%; 102.6% of vLLM recurrence BW). decode S_TG t/s: dense 27b npl128 335.9 -> 373.2 (+11.1%), MoE 35b-a3b npl128 688.4 -> 745.7 (+8.3%). Greedy md5 byte-identical to the 0021 baseline on both q36-27b-nvfp4 and q36-35b-a3b-nvfp4; test-backend-ops -o GATED_DELTA_NET 36/36 PASS. Bench/method in OCCUPANCY_RETUNE_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -0,0 +1,403 @@
|
||||
From 8a3229f41d5b712e87901796dfae3faee1f2f07d Mon Sep 17 00:00:00 2001
|
||||
From: Ettore Di Giacinto <mudler@localai.io>
|
||||
Date: Thu, 25 Jun 2026 20:32:55 +0200
|
||||
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet decode
|
||||
occupancy/coalescing retune (patch 0022)
|
||||
|
||||
Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
|
||||
kernel. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in
|
||||
float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth
|
||||
efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak),
|
||||
the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA
|
||||
order byte-identical (md5-gateable). The bf16-state plan stays shelved.
|
||||
|
||||
Column folding: two new template params NUM_WARPS (default 4) and COLS_PER_WARP
|
||||
(default 1). Each warp now owns COLS_PER_WARP columns of the 128x128 recurrent
|
||||
state instead of 1, looping the existing per-column body over col, col+NUM_WARPS,
|
||||
... within a per-block column tile of NUM_WARPS*COLS_PER_WARP columns;
|
||||
grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The S_v rows of every column stay sharded
|
||||
across the lanes by the same strided i = r*warp_size + lane mapping, and every
|
||||
column's per-lane FMA accumulation and warp_reduce_sum butterfly are byte-for-byte
|
||||
unchanged; only the (warp,block)->column assignment and visit order differ, which a
|
||||
column's value provably does not depend on (columns are fully independent). This
|
||||
raises per-warp memory-level parallelism ~COLS_PER_WARP-fold (independent
|
||||
state-load bursts before any reduction + interleaved butterfly reductions hiding
|
||||
each other's shfl latency), covering more DRAM latency on this bandwidth-bound
|
||||
kernel. Every global access stays identically coalesced, so it is a scheduling /
|
||||
latency-coverage win, not a coalescing change. The forbidden float4 state load
|
||||
(which would repartition a lane to 4 contiguous rows and change the reduction
|
||||
grouping) is NOT done, so the md5 stays invariant. The S_v=128 tile is
|
||||
env-selectable (GDN_NW / GDN_CPW) for one-build re-tuning; default is the measured
|
||||
GB10 winner (16, 8).
|
||||
|
||||
GB10 (CUDA 13, sm_121, nsys CUPTI timing - HW counters perm-blocked):
|
||||
gated_delta_net B=128 decode call (805.3 MB f32 R+W) 4.02 -> 3.49 ms/call,
|
||||
200.3 -> 230.9 GB/s = 73.4% -> 84.6% of 273 GB/s peak (now above vLLM's 82.4%;
|
||||
102.6% of vLLM's recurrence bandwidth). decode S_TG t/s (npp128 ntg128, -fa on):
|
||||
dense 27b npl128 335.9 -> 373.2 (+11.1%), npl32 199.2 -> 207.6 (+4.2%); MoE
|
||||
35b-a3b npl128 688.4 -> 745.7 (+8.3%), npl32 420.6 -> 440.0 (+4.6%). Prefill
|
||||
unchanged.
|
||||
|
||||
Bit-exact: greedy --temp 0 --seed 1 md5 byte-identical to the 0021 baseline on
|
||||
both q36-27b-nvfp4 and q36-35b-a3b-nvfp4 (winner 16x8 and 4x1 control);
|
||||
test-backend-ops -o GATED_DELTA_NET 36/36 PASS.
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||||
---
|
||||
ggml/src/ggml-cuda/gated_delta_net.cu | 236 +++++++++++++++++---------
|
||||
1 file changed, 157 insertions(+), 79 deletions(-)
|
||||
|
||||
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
|
||||
index 86d5e2a..d071d5a 100644
|
||||
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
|
||||
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
|
||||
@@ -1,6 +1,8 @@
|
||||
#include "gated_delta_net.cuh"
|
||||
#include "ggml-cuda/common.cuh"
|
||||
|
||||
+#include <cstdlib>
|
||||
+
|
||||
// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
|
||||
// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
|
||||
// destination slot by the recurrence kernel and are skipped here. One block per sequence.
|
||||
@@ -29,8 +31,22 @@ static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * i
|
||||
gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
|
||||
}
|
||||
|
||||
-template <int S_v, bool KDA, bool keep_rs_t>
|
||||
-__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
|
||||
+// Occupancy/coalescing retune (patch 0022). Each warp owns COLS_PER_WARP columns of the recurrent
|
||||
+// state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a
|
||||
+// per-block column tile of size NUM_WARPS*COLS_PER_WARP. The S_v rows of every column stay sharded
|
||||
+// across the lanes by the SAME strided mapping i = r*warp_size + lane, and every column's per-lane
|
||||
+// FMA accumulation and warp_reduce_sum<warp_size> butterfly are byte-for-byte unchanged. Only the
|
||||
+// (warp,block)->column assignment and the order a warp visits its columns differ, and a column's
|
||||
+// f32 value provably does not depend on either (columns are fully independent: column c reads only
|
||||
+// its own S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). So the result
|
||||
+// and the stored final state are bit-identical to the COLS_PER_WARP==1 baseline (md5-gateable),
|
||||
+// while per-warp memory-level parallelism rises ~COLS_PER_WARP-fold (COLS_PER_WARP independent
|
||||
+// state-load bursts issued before any reduction, and the independent butterfly reductions interleave
|
||||
+// to hide each other's shfl latency) which covers more DRAM latency on this bandwidth-bound kernel.
|
||||
+// Every individual global access stays IDENTICALLY coalesced (32 consecutive lanes -> one 128B
|
||||
+// sector), so this is a latency-coverage / scheduling win, not a coalescing change.
|
||||
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS = 4, int COLS_PER_WARP = 1, int MIN_BLOCKS = 2>
|
||||
+__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * NUM_WARPS, MIN_BLOCKS)
|
||||
gated_delta_net_cuda(const float * q,
|
||||
const float * k,
|
||||
const float * v,
|
||||
@@ -59,9 +75,9 @@ gated_delta_net_cuda(const float * q,
|
||||
int rs_head) {
|
||||
const uint32_t h_idx = blockIdx.x;
|
||||
const uint32_t sequence = blockIdx.y;
|
||||
- // each warp owns one column, using warp-level primitives to reduce across rows
|
||||
+ // each warp owns COLS_PER_WARP columns, using warp-level primitives to reduce across rows.
|
||||
const int lane = threadIdx.x;
|
||||
- const int col = blockIdx.z * blockDim.y + threadIdx.y;
|
||||
+ const int col_base = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y;
|
||||
|
||||
const uint32_t iq1 = fastmodulo(h_idx, neqk1_magic);
|
||||
const uint32_t iq3 = fastdiv(sequence, rq3_magic);
|
||||
@@ -86,20 +102,25 @@ gated_delta_net_cuda(const float * q,
|
||||
// writing the same slot per block (identity) is race-free.
|
||||
const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
|
||||
? state_dst : curr_state;
|
||||
- read_state += state_in_offset + col * S_v;
|
||||
+ read_state += state_in_offset;
|
||||
attn_data += (sequence * n_tokens * H + h_idx) * S_v;
|
||||
|
||||
constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
|
||||
static_assert(S_v % warp_size == 0, "S_v must be a multiple of warp_size");
|
||||
constexpr int rows_per_lane = (S_v + warp_size - 1) / warp_size;
|
||||
- float s_shard[rows_per_lane];
|
||||
- // state is stored transposed: M[col][i] = S[i][col], row col is contiguous
|
||||
+ // per-column register shard of the recurrent state; state is stored transposed: M[col][i] = S[i][col].
|
||||
+ float s_shard[COLS_PER_WARP][rows_per_lane];
|
||||
|
||||
ggml_cuda_pdl_sync();
|
||||
#pragma unroll
|
||||
- for (int r = 0; r < rows_per_lane; r++) {
|
||||
- const int i = r * warp_size + lane;
|
||||
- s_shard[r] = read_state[i];
|
||||
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
|
||||
+ const int col = col_base + cc * NUM_WARPS;
|
||||
+ const float * rs = read_state + col * S_v;
|
||||
+#pragma unroll
|
||||
+ for (int r = 0; r < rows_per_lane; r++) {
|
||||
+ const int i = r * warp_size + lane;
|
||||
+ s_shard[cc][r] = rs[i];
|
||||
+ }
|
||||
}
|
||||
|
||||
for (int t = 0; t < n_tokens; t++) {
|
||||
@@ -113,7 +134,7 @@ gated_delta_net_cuda(const float * q,
|
||||
|
||||
const float beta_val = *beta_t;
|
||||
|
||||
- // Cache k and q in registers
|
||||
+ // Cache k and q in registers (shared across the COLS_PER_WARP columns of this warp).
|
||||
float k_reg[rows_per_lane];
|
||||
float q_reg[rows_per_lane];
|
||||
#pragma unroll
|
||||
@@ -126,59 +147,69 @@ gated_delta_net_cuda(const float * q,
|
||||
if constexpr (!KDA) {
|
||||
const float g_val = expf(*g_t);
|
||||
|
||||
- // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
|
||||
- float kv_shard = 0.0f;
|
||||
#pragma unroll
|
||||
- for (int r = 0; r < rows_per_lane; r++) {
|
||||
- kv_shard += s_shard[r] * k_reg[r];
|
||||
- }
|
||||
- float kv_col = warp_reduce_sum<warp_size>(kv_shard);
|
||||
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
|
||||
+ const int col = col_base + cc * NUM_WARPS;
|
||||
|
||||
- // delta[col] = (v[col] - g * kv[col]) * beta
|
||||
- float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
|
||||
+ // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
|
||||
+ float kv_shard = 0.0f;
|
||||
+#pragma unroll
|
||||
+ for (int r = 0; r < rows_per_lane; r++) {
|
||||
+ kv_shard += s_shard[cc][r] * k_reg[r];
|
||||
+ }
|
||||
+ float kv_col = warp_reduce_sum<warp_size>(kv_shard);
|
||||
|
||||
- // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
|
||||
- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
|
||||
- float attn_partial = 0.0f;
|
||||
+ // delta[col] = (v[col] - g * kv[col]) * beta
|
||||
+ float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
|
||||
+
|
||||
+ // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
|
||||
+ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
|
||||
+ float attn_partial = 0.0f;
|
||||
#pragma unroll
|
||||
- for (int r = 0; r < rows_per_lane; r++) {
|
||||
- s_shard[r] = g_val * s_shard[r] + k_reg[r] * delta_col;
|
||||
- attn_partial += s_shard[r] * q_reg[r];
|
||||
- }
|
||||
+ for (int r = 0; r < rows_per_lane; r++) {
|
||||
+ s_shard[cc][r] = g_val * s_shard[cc][r] + k_reg[r] * delta_col;
|
||||
+ attn_partial += s_shard[cc][r] * q_reg[r];
|
||||
+ }
|
||||
|
||||
- float attn_col = warp_reduce_sum<warp_size>(attn_partial);
|
||||
+ float attn_col = warp_reduce_sum<warp_size>(attn_partial);
|
||||
|
||||
- if (lane == 0) {
|
||||
- attn_data[col] = attn_col * scale;
|
||||
+ if (lane == 0) {
|
||||
+ attn_data[col] = attn_col * scale;
|
||||
+ }
|
||||
}
|
||||
} else {
|
||||
- // kv[col] = sum_i g[i] * S[i][col] * k[i]
|
||||
- float kv_shard = 0.0f;
|
||||
#pragma unroll
|
||||
- for (int r = 0; r < rows_per_lane; r++) {
|
||||
- const int i = r * warp_size + lane;
|
||||
- kv_shard += expf(g_t[i]) * s_shard[r] * k_reg[r];
|
||||
- }
|
||||
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
|
||||
+ const int col = col_base + cc * NUM_WARPS;
|
||||
+
|
||||
+ // kv[col] = sum_i g[i] * S[i][col] * k[i]
|
||||
+ float kv_shard = 0.0f;
|
||||
+#pragma unroll
|
||||
+ for (int r = 0; r < rows_per_lane; r++) {
|
||||
+ const int i = r * warp_size + lane;
|
||||
+ kv_shard += expf(g_t[i]) * s_shard[cc][r] * k_reg[r];
|
||||
+ }
|
||||
|
||||
- float kv_col = warp_reduce_sum<warp_size>(kv_shard);
|
||||
+ float kv_col = warp_reduce_sum<warp_size>(kv_shard);
|
||||
|
||||
- // delta[col] = (v[col] - kv[col]) * beta
|
||||
- float delta_col = (v_t[col] - kv_col) * beta_val;
|
||||
+ // delta[col] = (v[col] - kv[col]) * beta
|
||||
+ float delta_col = (v_t[col] - kv_col) * beta_val;
|
||||
|
||||
- // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
|
||||
- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
|
||||
- float attn_partial = 0.0f;
|
||||
+ // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
|
||||
+ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
|
||||
+ float attn_partial = 0.0f;
|
||||
#pragma unroll
|
||||
- for (int r = 0; r < rows_per_lane; r++) {
|
||||
- const int i = r * warp_size + lane;
|
||||
- s_shard[r] = expf(g_t[i]) * s_shard[r] + k_reg[r] * delta_col;
|
||||
- attn_partial += s_shard[r] * q_reg[r];
|
||||
- }
|
||||
+ for (int r = 0; r < rows_per_lane; r++) {
|
||||
+ const int i = r * warp_size + lane;
|
||||
+ s_shard[cc][r] = expf(g_t[i]) * s_shard[cc][r] + k_reg[r] * delta_col;
|
||||
+ attn_partial += s_shard[cc][r] * q_reg[r];
|
||||
+ }
|
||||
|
||||
- float attn_col = warp_reduce_sum<warp_size>(attn_partial);
|
||||
+ float attn_col = warp_reduce_sum<warp_size>(attn_partial);
|
||||
|
||||
- if (lane == 0) {
|
||||
- attn_data[col] = attn_col * scale;
|
||||
+ if (lane == 0) {
|
||||
+ attn_data[col] = attn_col * scale;
|
||||
+ }
|
||||
}
|
||||
}
|
||||
|
||||
@@ -190,11 +221,15 @@ gated_delta_net_cuda(const float * q,
|
||||
const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
|
||||
const int target_slot = (int) n_tokens - 1 - t;
|
||||
if (target_slot >= 0 && target_slot < K) {
|
||||
- float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
|
||||
#pragma unroll
|
||||
- for (int r = 0; r < rows_per_lane; r++) {
|
||||
- const int i = r * warp_size + lane;
|
||||
- curr_state[col * S_v + i] = s_shard[r];
|
||||
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
|
||||
+ const int col = col_base + cc * NUM_WARPS;
|
||||
+ float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
|
||||
+#pragma unroll
|
||||
+ for (int r = 0; r < rows_per_lane; r++) {
|
||||
+ const int i = r * warp_size + lane;
|
||||
+ curr_state[col * S_v + i] = s_shard[cc][r];
|
||||
+ }
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -202,13 +237,48 @@ gated_delta_net_cuda(const float * q,
|
||||
|
||||
if constexpr (!keep_rs_t) {
|
||||
#pragma unroll
|
||||
- for (int r = 0; r < rows_per_lane; r++) {
|
||||
- const int i = r * warp_size + lane;
|
||||
- state[col * S_v + i] = s_shard[r];
|
||||
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
|
||||
+ const int col = col_base + cc * NUM_WARPS;
|
||||
+#pragma unroll
|
||||
+ for (int r = 0; r < rows_per_lane; r++) {
|
||||
+ const int i = r * warp_size + lane;
|
||||
+ state[col * S_v + i] = s_shard[cc][r];
|
||||
+ }
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
+// Default column-folding tile for the S_v==128 decode/prefill path (the GDN head dim of this model).
|
||||
+// Measured winner of the bit-exact occupancy sweep (patch 0022). Override at runtime for the sweep
|
||||
+// via GDN_NW / GDN_CPW; all selectable variants are bit-identical, only %peak differs.
|
||||
+#ifndef GDN_DEFAULT_NW
|
||||
+#define GDN_DEFAULT_NW 16
|
||||
+#endif
|
||||
+#ifndef GDN_DEFAULT_CPW
|
||||
+#define GDN_DEFAULT_CPW 8
|
||||
+#endif
|
||||
+
|
||||
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS, int COLS_PER_WARP, int MIN_BLOCKS>
|
||||
+static void launch_gdn_variant(
|
||||
+ const float * q_d, const float * k_d, const float * v_d,
|
||||
+ const float * g_d, const float * b_d, const float * s_d,
|
||||
+ float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head,
|
||||
+ int64_t H, int64_t n_tokens, int64_t n_seqs,
|
||||
+ int64_t sq1, int64_t sq2, int64_t sq3,
|
||||
+ int64_t sv1, int64_t sv2, int64_t sv3,
|
||||
+ int64_t sb1, int64_t sb2, int64_t sb3,
|
||||
+ const uint3 neqk1_magic, const uint3 rq3_magic,
|
||||
+ float scale, int K, int warp_size, cudaStream_t stream) {
|
||||
+ static_assert(S_v % (NUM_WARPS * COLS_PER_WARP) == 0, "NUM_WARPS*COLS_PER_WARP must divide S_v");
|
||||
+ dim3 grid_dims(H, n_seqs, S_v / (NUM_WARPS * COLS_PER_WARP));
|
||||
+ dim3 block_dims(warp_size <= S_v ? warp_size : S_v, NUM_WARPS, 1);
|
||||
+ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
|
||||
+ ggml_cuda_kernel_launch(gated_delta_net_cuda<S_v, KDA, keep_rs_t, NUM_WARPS, COLS_PER_WARP, MIN_BLOCKS>, launch_params,
|
||||
+ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
|
||||
+ n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
|
||||
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
|
||||
+}
|
||||
+
|
||||
template <bool KDA, bool keep_rs_t>
|
||||
static void launch_gated_delta_net(
|
||||
const float * q_d, const float * k_d, const float * v_d,
|
||||
@@ -223,47 +293,55 @@ static void launch_gated_delta_net(
|
||||
float scale, int K, cudaStream_t stream) {
|
||||
//TODO: Add chunked kernel for even faster pre-fill
|
||||
const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
|
||||
- const int num_warps = 4;
|
||||
- dim3 grid_dims(H, n_seqs, (S_v + num_warps - 1) / num_warps);
|
||||
- dim3 block_dims(warp_size <= S_v ? warp_size : S_v, num_warps, 1);
|
||||
|
||||
const uint3 neqk1_magic = init_fastdiv_values(neqk1);
|
||||
const uint3 rq3_magic = init_fastdiv_values(rq3);
|
||||
|
||||
- int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
|
||||
+#define GDN_LAUNCH_ARGS \
|
||||
+ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
|
||||
+ H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
|
||||
+ neqk1_magic, rq3_magic, scale, K, warp_size, stream
|
||||
|
||||
- const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
|
||||
switch (S_v) {
|
||||
case 16:
|
||||
- ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
|
||||
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
|
||||
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
|
||||
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
|
||||
+ launch_gdn_variant<16, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
|
||||
break;
|
||||
case 32:
|
||||
- ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
|
||||
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
|
||||
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
|
||||
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
|
||||
+ launch_gdn_variant<32, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
|
||||
break;
|
||||
- case 64: {
|
||||
- ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
|
||||
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
|
||||
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
|
||||
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
|
||||
+ case 64:
|
||||
+ launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
|
||||
break;
|
||||
- }
|
||||
case 128: {
|
||||
- ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
|
||||
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
|
||||
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
|
||||
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
|
||||
+ // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP columns per warp
|
||||
+ // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is
|
||||
+ // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every
|
||||
+ // selectable {num_warps, cols} is bit-identical, so the sweep cannot change the md5).
|
||||
+ static const int gdn_nw = []{ const char * e = getenv("GDN_NW"); return e ? atoi(e) : GDN_DEFAULT_NW; }();
|
||||
+ static const int gdn_cpw = []{ const char * e = getenv("GDN_CPW"); return e ? atoi(e) : GDN_DEFAULT_CPW; }();
|
||||
+ // NUM_WARPS in {4,8,16} x COLS_PER_WARP ladder (all <=512 threads/block, no 1024-thread
|
||||
+ // .minnctapersm warnings). Measured GB10 %peak: (4,1)=73 baseline ... (16,4)=82 ...
|
||||
+ // (16,8)=84.7 winner ~ tied with (8,8)/(8,16)/(32,4); the plateau is just above vLLM (82.4).
|
||||
+ if (gdn_nw == 4 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else if (gdn_nw == 4 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 4, 2, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else if (gdn_nw == 4 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 4, 4, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else if (gdn_nw == 8 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 8, 1, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else if (gdn_nw == 8 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 8, 2, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else if (gdn_nw == 8 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 8, 4, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else if (gdn_nw == 8 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 8, 8, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else if (gdn_nw == 16 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 16, 1, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else if (gdn_nw == 16 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 16, 2, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else if (gdn_nw == 16 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 16, 4, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else if (gdn_nw == 16 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 16, 8, 2>(GDN_LAUNCH_ARGS);
|
||||
+ else launch_gdn_variant<128, KDA, keep_rs_t, GDN_DEFAULT_NW, GDN_DEFAULT_CPW, 2>(GDN_LAUNCH_ARGS);
|
||||
break;
|
||||
}
|
||||
default:
|
||||
GGML_ABORT("fatal error");
|
||||
break;
|
||||
}
|
||||
+
|
||||
+#undef GDN_LAUNCH_ARGS
|
||||
}
|
||||
|
||||
void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
|
||||
--
|
||||
2.43.0
|
||||
|
||||
119
backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md
Normal file
119
backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md
Normal file
@@ -0,0 +1,119 @@
|
||||
# OCCUPANCY_RETUNE_RESULTS.md - CRUX SETTLED: vLLM recurrence state is FLOAT32 (805 MB/call)
|
||||
|
||||
Phase: vllm-f32-confirm (GPU agent). DGX GB10, peak DRAM BW = 273 GB/s.
|
||||
Checkpoint: ~/bench/q36-27b-nvfp4-vllm (vLLM 0.23.0), ~/bench/q36-27b-nvfp4.gguf (llama HEAD 58426b5, conv-fusion 0021).
|
||||
NOTE: ncu HW perf-counters are perm-blocked on this node (RmProfilingAdminOnly:1, no passwordless sudo, ERR_NVGPUCTRPERM).
|
||||
Settled WITHOUT counters: (a) empirical tensor dtype at the kernel boundary, (b) nsys/CUPTI kernel timing (counter-free), (c) source+config chain.
|
||||
|
||||
## VERDICT: f32. The close-check is RIGHT. The byte-gate (402 MB/bf16) is WRONG. BUILD THE BIT-EXACT OCCUPANCY RETUNE.
|
||||
|
||||
vLLM carries the gated-DeltaNet TEMPORAL/recurrent state in FLOAT32 and moves 805.3 MB/call, NOT 402 MB bf16.
|
||||
Both engines move the SAME ~805 MB f32 recurrent state per call. The gap is pure BANDWIDTH EFFICIENCY on equal f32 bytes.
|
||||
|
||||
## vLLM (kernel: fused_recurrent_gated_delta_rule_packed_decode)
|
||||
- EMPIRICAL tensor at kernel boundary (initial_state = self.kv_cache[1], qwen_gdn_linear_attn.py:1316/1492):
|
||||
dtype=torch.float32 elem_bytes=4 shape=(1553, 48, 128, 128) per-slot state = 786432 elems = 3.000 MiB (f32)
|
||||
- MB/call (B=128, Read+Write) = 128 * 48*128*128 * 4 bytes * 2 = 805,306,368 B = 805.3 MB (bf16 would be 402.7 MB)
|
||||
- Runtime engine config: cache_config.mamba_ssm_cache_dtype = float32 (mamba_cache_dtype=auto/bf16 for conv)
|
||||
- Source chain: config.json text_config.mamba_ssm_dtype=float32 -> Qwen3_5ForConditionalGenerationConfig.verify_and_update_config
|
||||
sets cache_config.mamba_ssm_cache_dtype="float32" -> MambaStateDtypeCalculator._mamba_state_dtype else-branch
|
||||
-> temporal_state_dtype = torch.float32 (conv state = bf16; temporal/SSM state = f32).
|
||||
- Kernel timing (CUDA events, eager B=128, 432 steady-decode calls): median 3.578 ms/call, min 3.499, mean 3.593, p90 3.635
|
||||
BW @ median = 805.3MB / 3.578ms = 225.1 GB/s = 82.4% of 273 peak (min 84.3%, p90 81.1%)
|
||||
|
||||
## llama (kernel: gated_delta_net_cuda<128, 0, 0>)
|
||||
- Kernel signature: all operands const float* (q,k,v,g,beta,curr_state) + float* state_dst => recurrent state is f32. Source-confirmed.
|
||||
- Identical state geometry (48 value-heads x 128 head_v x 128 head_k, B=128) => MB/call (R+W) = 805.3 MB f32 (same as vLLM).
|
||||
- Fresh nsys (--cuda-graph-trace=node, build-cuda-base, -npp128 -ntg24 -npl128, q36-27b-nvfp4.gguf):
|
||||
gated_delta_net = 25.4% of GPU time (#2 kernel after nvfp4 mul_mat_q).
|
||||
Decode cluster isolated = exactly n=1152 calls (= 24 ntg x 48 GDN layers), B=128 steady state:
|
||||
median 4.0211 ms/call, mean 4.0315 => 200.3 GB/s = 73.4% of 273 peak.
|
||||
(Consistent with prior GAP_PROGRESS 4.08ms/~70% and context 3.98ms/202GB/s/74%.)
|
||||
|
||||
## THE GAP (equal f32 bytes, different efficiency)
|
||||
llama 805.3 MB / 4.021 ms = 200.3 GB/s = 73.4% peak
|
||||
vLLM 805.3 MB / 3.578 ms = 225.1 GB/s = 82.4% peak
|
||||
=> vLLM is ~11% faster per recurrence call at IDENTICAL byte volume => ~9 pts more DRAM BW efficiency.
|
||||
Retune target: 73.4% -> ~82% peak, recurrence 4.02 -> ~3.58 ms/call, KEEPING exact per-column f32
|
||||
reduction/FMA order (md5-gateable bit-identical). bf16 plan stays SHELVED (optional over-clock only).
|
||||
|
||||
---
|
||||
|
||||
# retune-build (BUILD AGENT) — patch 0022 SHIPPED
|
||||
|
||||
vLLM verdict re-checked first: **f32, 805 MB/call** (the close-check is right, the byte-gate's 402 MB/bf16
|
||||
is wrong). The bf16-state plan stays SHELVED. Built the bit-exact occupancy/coalescing retune.
|
||||
|
||||
## The change — bit-exact column folding (Lever A + B + D)
|
||||
|
||||
`ggml/src/ggml-cuda/gated_delta_net.cu` `gated_delta_net_cuda`: two new template params
|
||||
`NUM_WARPS` (default 4) and `COLS_PER_WARP` (default 1) plus `MIN_BLOCKS`. Each warp now owns
|
||||
`COLS_PER_WARP` columns of the 128x128 recurrent state instead of 1, looping the existing per-column
|
||||
body over `col, col+NUM_WARPS, ...` inside a per-block column tile of `NUM_WARPS*COLS_PER_WARP` columns;
|
||||
`grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)`.
|
||||
|
||||
Why it is bit-exact: the S_v rows of every column stay sharded across the lanes by the SAME strided
|
||||
mapping `i = r*warp_size + lane`, and every column's per-lane FMA accumulation and
|
||||
`warp_reduce_sum<warp_size>` XOR-butterfly are byte-for-byte unchanged. Only the
|
||||
`(warp,block)->column` assignment and the order a warp visits its columns differ, and a column's f32
|
||||
value provably does not depend on either (columns are fully independent — column c reads only its own
|
||||
S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). The forbidden `float4`
|
||||
state load (Lever E) — which would repartition a lane to 4 contiguous rows and change the reduction
|
||||
grouping — was NOT done; this keeps the md5 invariant. Every global access stays identically coalesced
|
||||
(32 consecutive lanes -> one 128B sector), so this is a latency-coverage / scheduling win (higher
|
||||
per-warp memory-level parallelism: COLS_PER_WARP independent state-load bursts issued before any
|
||||
reduction + the independent butterfly reductions interleave to hide each other's shfl latency), NOT a
|
||||
coalescing change. The S_v=128 tile is env-selectable via `GDN_NW`/`GDN_CPW` for one-build re-tuning;
|
||||
default is the measured GB10 winner **(NUM_WARPS=16, COLS_PER_WARP=8)**.
|
||||
|
||||
## %peak sweep — GB10, CUDA 13, sm_121 (nsys CUPTI timing; HW counters perm-blocked)
|
||||
|
||||
Metric: median of the 1152 (=ntg24 x 48 layers) B=128 decode calls, each moving 805.3 MB f32 (R+W),
|
||||
isolated by the [2.5ms,6ms] band; %peak vs 273 GB/s. Baseline re-isolation reproduced the confirm
|
||||
agent's 4.021 ms / 73.4% exactly (n=1152).
|
||||
|
||||
| NUM_WARPS x COLS_PER_WARP | ms/call | GB/s | %peak |
|
||||
|---------------------------|---------|------|-------|
|
||||
| base (0021) | 4.021 | 200.3| 73.4 |
|
||||
| 4 x 1 (control == base) | 4.034 | 199.7| 73.1 |
|
||||
| 4 x 2 | 3.887 | 207.2| 75.9 |
|
||||
| 4 x 4 | 3.775 | 213.3| 78.1 |
|
||||
| 8 x 1 | 3.837 | 209.9| 76.9 |
|
||||
| 8 x 2 | 3.749 | 214.8| 78.7 |
|
||||
| 8 x 4 | 3.699 | 217.7| 79.9 |
|
||||
| 8 x 8 | 3.586 | 224.6| 82.3 |
|
||||
| 16 x 2 | 3.665 | 219.8| 80.5 |
|
||||
| 16 x 4 | 3.585 | 224.7| 82.3 |
|
||||
| **16 x 8 (WINNER/default)** | **3.488** | **230.9** | **84.6** |
|
||||
| 32 x 4 | 3.489 | 230.8| 84.6 |
|
||||
|
||||
Plateau ~84.5% at the grid.z=1 tiles; (16,8) picked as default (512-thread block, no spill, no
|
||||
1024-thread .minnctapersm warning). **84.6% > vLLM 82.4%.**
|
||||
|
||||
## Gates (both PASS, non-negotiable)
|
||||
|
||||
- **md5 BYTE-IDENTICAL to the 0021 baseline**, greedy `--temp 0 --seed 1 -n 48`, both models, winner
|
||||
(16,8 default) AND (4,1 control):
|
||||
- q36-27b-nvfp4 (dense): `5951a5b4d624ce891e22ab5fca9bc439` (baseline == winner == control)
|
||||
- q36-35b-a3b-nvfp4 (MoE): `07db32c2bcb78d17a43ed18bc22705cd` (baseline == winner == control)
|
||||
- **test-backend-ops -o GATED_DELTA_NET: 36/36 PASS** (covers head_size=128, kda=0/1, prefill K>1).
|
||||
|
||||
## Decode throughput — base vs flag(16,8), llama-batched-bench -npp128 -ntg128 -fa on
|
||||
|
||||
| model | npl | base S_TG t/s | flag S_TG t/s | gain |
|
||||
|-------|-----|---------------|---------------|------|
|
||||
| dense 27b | 32 | 199.2 | 207.6 | +4.2% |
|
||||
| dense 27b | 128 | 335.9 | 373.2 | +11.1% |
|
||||
| MoE 35b-a3b | 32 | 420.6 | 440.0 | +4.6% |
|
||||
| MoE 35b-a3b | 128 | 688.4 | 745.7 | +8.3% |
|
||||
|
||||
Prefill S_PP unchanged (dense ~930, MoE ~2185 t/s) — no regression. Stable across 3 samples.
|
||||
|
||||
## Parity vs vLLM (recurrence kernel)
|
||||
|
||||
Recurrence kernel BW: before 200.3 GB/s = 89.0% of vLLM's 225.1; **after 230.9 GB/s = 102.6% of vLLM**
|
||||
(3.488 ms/call < vLLM 3.578 ms/call). The recurrence bandwidth gap that this workflow set out to close
|
||||
is closed and slightly exceeded; the remaining decode-parity delta lives in the non-recurrence path
|
||||
(matmul/attn), not in gated-DeltaNet.
|
||||
|
||||
Shipped: patch 0022, committed on the DGX dev tree and the LocalAI worktree. No push.
|
||||
Reference in New Issue
Block a user