feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (patch 0018)

Mirror of the llama-paged-dev patch 0018 engine change plus the measured
results. Per SSM layer per step decode no longer D2D-copies the full ~225 MB
recurrent state into the cache: the fused gated_delta_net op writes the final
state in place at the active sequences cache slot (new
ggml_gated_delta_net_inplace, src[6] = state_dst), mirroring vLLM
fused_recurrent_gated_delta_rule. SSM math unchanged (bit-identical greedy).

Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on):
  q36-27b-nvfp4 dense: npl32 113.74 -> 136.39 (+19.9 percent),
    npl128 146.23 -> 180.53 (+23.5 percent, = predicted copy-removal ceiling).
  q36-35b-a3b-nvfp4 MoE: npl128 313.36 -> 372.62 (+18.9 percent).
nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share
(391 @128) 37.4 -> 46.2 percent. See SSM_DECODE_FIX_RESULTS.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-24 22:45:49 +00:00
parent 34cadb64af
commit 5ce2f1df51
2 changed files with 447 additions and 0 deletions

View File

@@ -0,0 +1,349 @@
From 17f16e8f6d8dbc689d5151c44759792d683c957b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 00:44:13 +0200
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state
write-back (patch 0018)
Decode on the Qwen3.6 hybrid-SSM models (arch qwen35, 48 gated-DeltaNet :
16 full-attention layers) was dominated by recurrent-state plumbing, not the
FP4 GEMM. Per SSM layer per step the fused gated_delta_net op wrote its new
recurrent state into graph scratch, then a separate ggml_cpy persisted it into
the recurrent-state cache. nsys attributed 18.9% of decode GPU time to that
~225 MB/copy D2D memcpy (1584 ops, 356 GB over the A2 decompose window).
This mirrors vLLM fused_recurrent_gated_delta_rule (state kept in place):
ggml_gated_delta_net_inplace writes the final recurrent state directly into the
active sequences contiguous cache slot (at kv_head), removing the copy-back. The
op output then carries only the attention scores; the SSM arithmetic is
unchanged (bit-identical greedy output vs the copy-back baseline).
- new op builder ggml_gated_delta_net_inplace (src[6] = state_dst cache view)
- CUDA + CPU honor src[6]; final-state (K==1, keep_rs off) write redirected there
- delta-net-base build_recurrent_attn uses it on the fused decode/prefill path,
dropping the ggml_cpy; rollback (n_rs_seq>0) path unchanged
Measured (q36-27b-nvfp4, decode_agg S_TG, npp128 ntg128, -fa on, paged on):
npl 32 : 113.74 -> 136.39 t/s (+19.9 percent)
npl 128: 146.23 -> 180.53 t/s (+23.5 percent, = predicted copy-removal ceiling)
MoE q36-35b-a3b-nvfp4: npl128 313.36 -> 372.62 t/s (+18.9 percent).
nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share
(391 @128) 37.4 -> 46.2 percent. get_rows state gather (now 18.8 percent) is the
next lever.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/include/ggml.h | 14 ++++++
ggml/src/ggml-cpu/ops.cpp | 13 ++++-
ggml/src/ggml-cuda/gated_delta_net.cu | 39 ++++++++++-----
ggml/src/ggml.c | 68 +++++++++++++++++++++++++++
src/models/delta-net-base.cpp | 30 ++++++++++++
5 files changed, 152 insertions(+), 12 deletions(-)
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 823f5a9..4e7ab32 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2579,6 +2579,20 @@ extern "C" {
struct ggml_tensor * state,
int64_t K);
+ // same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+ // in place into state_dst (a view into the recurrent-state cache) instead of being appended to
+ // the op output, eliminating the per-step state copy-back during decode. state_dst must be a
+ // contiguous [S_v*S_v*H, n_seqs] view (per-seq stride == dense state size).
+ GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace(
+ struct ggml_context * ctx,
+ struct ggml_tensor * q,
+ struct ggml_tensor * k,
+ struct ggml_tensor * v,
+ struct ggml_tensor * g,
+ struct ggml_tensor * beta,
+ struct ggml_tensor * state,
+ struct ggml_tensor * state_dst);
+
// custom operators
typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 63c07a2..9457add 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -10600,6 +10600,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
ggml_tensor * src_g = dst->src[3];
ggml_tensor * src_beta = dst->src[4];
ggml_tensor * src_state = dst->src[5];
+ ggml_tensor * src_state_dst = dst->src[6]; // optional in-place final-state write-back target
const int64_t S_v = src_v->ne[0];
const int64_t H = src_v->ne[1];
@@ -10660,6 +10661,16 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
const float scale = 1.0f / sqrtf((float) S_v);
+ // when src_state_dst is provided (in-place decode write-back) the final state is written
+ // directly into the persistent cache view, removing the separate state copy-back node.
+ float * inplace_state_base = nullptr;
+ if (src_state_dst != nullptr) {
+ GGML_ASSERT(K == 1);
+ GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+ GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+ inplace_state_base = (float *) src_state_dst->data;
+ }
+
for (int64_t ir = ir0; ir < ir1; ++ir) {
const int64_t iv1 = ir % H; // head_index
const int64_t iv3 = ir / H; // sequence
@@ -10674,7 +10685,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
// For K>1, work in scratch and copy out per-token when the slot is in range.
float * s_out = (K > 1)
? state_work
- : state_out_base + (iv3 * H + iv1) * S_v * S_v;
+ : (inplace_state_base ? inplace_state_base : state_out_base) + (iv3 * H + iv1) * S_v * S_v;
// copy input state into the working buffer and operate in-place
// state layout [S_v, S_v, H, n_seqs]: seq iv3 starts at iv3 * state_seq_stride.
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index a547360..61a2b91 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -25,7 +25,8 @@ gated_delta_net_cuda(const float * q,
const uint3 neqk1_magic,
const uint3 rq3_magic,
float scale,
- int K) {
+ int K,
+ float * state_dst) {
const uint32_t h_idx = blockIdx.x;
const uint32_t sequence = blockIdx.y;
// each warp owns one column, using warp-level primitives to reduce across rows
@@ -37,7 +38,10 @@ gated_delta_net_cuda(const float * q,
const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
float * attn_data = dst;
- float * state = dst + attn_score_elems;
+ // when state_dst is provided (in-place decode write-back) the final recurrent state is written
+ // directly into the persistent cache view instead of being appended to the op output; this
+ // eliminates the per-layer per-step D2D state copy-back. Only used when keep_rs_t == false.
+ float * state = (state_dst != nullptr) ? state_dst : (dst + attn_score_elems);
// input state holds s0 only: [S_v, S_v, H, n_seqs] — seq stride is D = H * S_v * S_v.
// output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before.
@@ -171,7 +175,7 @@ template <bool KDA, bool keep_rs_t>
static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
const float * g_d, const float * b_d, const float * s_d,
- float * dst_d,
+ float * dst_d, float * state_dst_d,
int64_t S_v, int64_t H, int64_t n_tokens, int64_t n_seqs,
int64_t sq1, int64_t sq2, int64_t sq3,
int64_t sv1, int64_t sv2, int64_t sv3,
@@ -195,26 +199,26 @@ static void launch_gated_delta_net(
ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
case 32:
ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
case 64: {
ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
}
case 128: {
ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
}
default:
@@ -230,6 +234,7 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
ggml_tensor * src_g = dst->src[3];
ggml_tensor * src_beta = dst->src[4];
ggml_tensor * src_state = dst->src[5];
+ ggml_tensor * src_state_dst = dst->src[6]; // optional in-place state write-back target
GGML_TENSOR_LOCALS(int64_t, neq, src_q, ne);
GGML_TENSOR_LOCALS(size_t , nbq, src_q, nb);
@@ -260,6 +265,15 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
const float * s_d = (const float *) src_state->data;
float * dst_d = (float *) dst->data;
+ float * state_dst_d = nullptr;
+ if (src_state_dst != nullptr) {
+ // in-place final-state cache view: per-seq stride must be the dense state size D = S_v*S_v*H
+ GGML_ASSERT(src_state_dst->type == GGML_TYPE_F32);
+ GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+ GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+ state_dst_d = (float *) src_state_dst->data;
+ }
+
GGML_ASSERT(ggml_is_contiguous_rows(src_q));
GGML_ASSERT(ggml_is_contiguous_rows(src_k));
GGML_ASSERT(ggml_is_contiguous_rows(src_v));
@@ -288,23 +302,26 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
const int K = ggml_get_op_params_i32(dst, 0);
const bool keep_rs = K > 1;
+ // in-place write-back is only valid for the single-snapshot (final-state) case
+ GGML_ASSERT(state_dst_d == nullptr || !keep_rs);
+
if (kda) {
if (keep_rs) {
- launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
- launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
} else {
if (keep_rs) {
- launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
- launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index adbe52b..b8d34bf 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -6285,6 +6285,74 @@ struct ggml_tensor * ggml_gated_delta_net(
return result;
}
+// ggml_gated_delta_net_inplace
+//
+// Same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+// in place into `state_dst` (a view into the persistent recurrent-state cache) instead of being
+// appended to the op output. This removes the per-layer per-step D2D state copy-back during decode.
+// The op output holds ONLY the attention scores; the state region is still allocated (unused) so
+// the attention-output view layout is identical to ggml_gated_delta_net.
+struct ggml_tensor * ggml_gated_delta_net_inplace(
+ struct ggml_context * ctx,
+ struct ggml_tensor * q,
+ struct ggml_tensor * k,
+ struct ggml_tensor * v,
+ struct ggml_tensor * g,
+ struct ggml_tensor * beta,
+ struct ggml_tensor * state,
+ struct ggml_tensor * state_dst) {
+ GGML_ASSERT(ggml_is_contiguous_rows(q));
+ GGML_ASSERT(ggml_is_contiguous_rows(k));
+ GGML_ASSERT(ggml_is_contiguous_rows(v));
+ GGML_ASSERT(ggml_is_contiguous(g));
+ GGML_ASSERT(ggml_is_contiguous(beta));
+ GGML_ASSERT(ggml_is_contiguous(state));
+
+ GGML_ASSERT(q->type == GGML_TYPE_F32);
+ GGML_ASSERT(k->type == GGML_TYPE_F32);
+ GGML_ASSERT(v->type == GGML_TYPE_F32);
+ GGML_ASSERT(g->type == GGML_TYPE_F32);
+ GGML_ASSERT(beta->type == GGML_TYPE_F32);
+ GGML_ASSERT(state->type == GGML_TYPE_F32);
+ GGML_ASSERT(state_dst != NULL);
+ GGML_ASSERT(state_dst->type == GGML_TYPE_F32);
+
+ const int64_t S_v = v->ne[0];
+ const int64_t H = v->ne[1];
+ const int64_t n_tokens = v->ne[2];
+ const int64_t n_seqs = v->ne[3];
+
+ GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
+ GGML_ASSERT(beta->ne[0] == 1);
+
+ GGML_ASSERT(state->ne[0] == S_v);
+ GGML_ASSERT(state->ne[1] == S_v);
+ GGML_ASSERT(state->ne[2] == H);
+ GGML_ASSERT(state->ne[3] == n_seqs);
+
+ // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
+ GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
+ GGML_ASSERT(state_dst->ne[1] >= n_seqs);
+ GGML_ASSERT(state_dst->nb[0] == sizeof(float));
+
+ const int64_t state_rows = S_v * n_seqs; // K == 1
+ const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
+ struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+ ggml_set_op_params_i32(result, 0, 1); // K == 1
+
+ result->op = GGML_OP_GATED_DELTA_NET;
+ result->src[0] = q;
+ result->src[1] = k;
+ result->src[2] = v;
+ result->src[3] = g;
+ result->src[4] = beta;
+ result->src[5] = state;
+ result->src[6] = state_dst;
+
+ return result;
+}
+
////////////////////////////////////////////////////////////////////////////////
struct ggml_hash_set ggml_hash_set_new(size_t size) {
diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
index ad9ce77..26a718b 100644
--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
@@ -546,6 +546,36 @@ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
const bool keep = cparams.n_rs_seq > 0;
if (!keep) {
+ const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
+
+ if (fused) {
+ // In-place state write-back: the fused gated-DeltaNet op writes the new recurrent state
+ // directly into the persistent cache slot for the active sequences (a contiguous block
+ // at kv_head), eliminating the per-layer per-step ~full-state D2D copy-back that
+ // dominated decode. The op output then carries only the attention scores.
+ ggml_tensor * state_dst = ggml_view_2d(ctx0, ssm_states_all, hparams.n_embd_s(), n_seqs,
+ ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
+
+ ggml_tensor * result = ggml_gated_delta_net_inplace(ctx0, q, k, v, g, b, s, state_dst);
+ if (n_seq_tokens == 1) {
+ cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
+ } else {
+ cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
+ }
+
+ ggml_tensor * output = ggml_view_4d(ctx0, result,
+ S_v, H_v, n_seq_tokens, n_seqs,
+ ggml_row_size(result->type, S_v),
+ ggml_row_size(result->type, S_v * H_v),
+ ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
+ cb(output, "attn_output", il);
+
+ // the state write is a side effect of the op; pull the op into the graph via the output
+ ggml_build_forward_expand(gf, output);
+
+ return output;
+ }
+
auto attn_out = build_delta_net(q, k, v, g, b, s, il);
ggml_tensor * output = attn_out.first;
ggml_tensor * new_state = attn_out.second;
--
2.43.0

View File

@@ -0,0 +1,98 @@
# SSM decode fix - qwen35 gated-DeltaNet in-place recurrent-state write-back (patch 0018)
Follow-up to `A2_CUDAGRAPH_DECODE.md`. That analysis located the real decode lever
on the Qwen3.6 hybrid-SSM models (arch `qwen35`, 48 gated-DeltaNet linear-attn
layers : 16 full-attn layers) and ruled out the FP4 GEMM, CUDA graphs, the host
loop, and attention. The corrected per-kernel + per-memcpy decode decomposition
attributed ~67% of decode GPU time to SSM-state plumbing:
gated_delta_net 23.4% | get_rows state-gather 21.9% | D2D state-copy 18.9% (= ~67%)
FP4 matmul ~28% | full attention 0.4%
Root cause: per SSM layer per step the fused `gated_delta_net` op wrote its new
recurrent state into graph scratch, then a **separate `ggml_cpy` persisted the
full ~225 MB state into the recurrent-state cache** (1584 D2D ops, 356 GB, 18.9%
of decode over the profile window). vLLM's `fused_recurrent_gated_delta_rule`
keeps the state in place (no copy).
## STEP 1 (this patch): kill the per-layer D2D state copy-back
`ggml_gated_delta_net_inplace` (new builder, `src[6] = state_dst`) makes the op
write its final recurrent state **directly into the active sequences' contiguous
cache slot** (at `kv_head`), eliminating the copy-back. The op output then carries
only the attention scores. SSM arithmetic is unchanged - only the destination
pointer of the final-state write moved.
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new `ggml_gated_delta_net_inplace` op
builder. `dst` retains the same `[attn | state]` layout so the attention-output
view is identical; the state region is left unused.
- `ggml/src/ggml-cuda/gated_delta_net.cu`: kernel/launch/op-handler thread an
optional `state_dst`; final-state (`!keep_rs`) write targets it when present.
- `ggml/src/ggml-cpu/ops.cpp`: K==1 path operates in place on the `state_dst`
cache view (kept CPU-correct for non-CUDA runs / CI).
- `src/models/delta-net-base.cpp`: `build_recurrent_attn` uses the in-place op on
the fused decode/prefill path and drops the `ggml_cpy`. The rollback path
(`n_rs_seq > 0`) is unchanged. The get_rows state gather is unchanged (STEP 2).
### Correctness gate
- **Bit-identical**: greedy (`--temp 0 --seed 1`) `llama-completion` output on
`q36-27b-nvfp4` is byte-for-byte identical between the copy-back baseline and the
in-place build (`diff` -> IDENTICAL).
- **Coherent**: dense + MoE multi-paragraph greedy generations are on-topic and
correct (Rayleigh scattering; Roman Empire 27 BCE / Actium 31 BCE; primes;
additive vs subtractive color).
- Gated to the `qwen35` / gated-DeltaNet fused path; rollback and all non-SSM
archs untouched (they never construct the in-place op).
### Measured decode_agg (`S_TG t/s`, npp 128, ntg 128, -fa on, paged on, fusion off)
Dense `q36-27b-nvfp4`:
| npl | baseline | in-place | delta | % of vLLM (391 @128) |
|-----|----------|----------|---------|----------------------|
| 32 | 113.74 | 136.39 | +19.9% | - |
| 128 | 146.23 | 180.53 | +23.5% | 37.4% -> 46.2% |
The npl-128 result lands on the predicted copy-removal ceiling (~180 t/s).
MoE `q36-35b-a3b-nvfp4`:
| npl | baseline | in-place | delta |
|-----|----------|----------|---------|
| 32 | 246.79 | 279.41 | +13.2% |
| 128 | 313.36 | 372.62 | +18.9% |
### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
The D2D state-copy bucket collapsed:
| bucket | before | after |
|-------------------|---------------------|----------------------|
| MEMCPY D2D | 18.9% / 356 GB / 1584 ops | 0.23% / 2.93 GB / 734 ops |
The ~225 MB/copy recurrent-state copy-back is gone (122x fewer D2D bytes); the
residual D2D is the small conv-state copies. With it removed, the remaining decode
buckets are `gated_delta_net` 26.0%, FP4 matmul ~37.5%, and `get_rows` state
gather 18.8%.
## STEP 2 (not in this patch): fuse the get_rows state gather
The state gather is now the largest single non-GEMM bucket (18.8%). It is a pure
materialization: `build_rs` calls `ggml_get_rows(cache, s_copy_main)` to copy each
sequence's previous state into a contiguous scratch tensor before the op reads it.
`ggml_ssm_scan` already avoids this by taking the `ids` tensor (`src[6]`) and
reading the per-seq state directly from the full cache. The same fusion applies
here: give `ggml_gated_delta_net` an `ids` source, read `curr_state` from
`cache + ids[seq]*D` in the kernel, and pass the full cache via the `build_rs`
`get_state_rows` lambda (mirroring `mamba-base.cpp`). Predicted ceiling with both
steps: ~247 t/s (~63% of vLLM dense @128), GEMM untouched.
## Verdict on the path to parity
STEP 1 removes ~half of the SSM plumbing overhead and is the dominant, lowest-risk
lever; it is bit-exact and shipped here. STEP 2 (gather fusion) has a proven ggml
precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLLM
after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel
track. No paged/graph/block-table change can move decode on this model (full
attention is 0.4% of decode).