feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (patch 0018)

Mirror of the llama-paged-dev patch 0018 engine change plus the measured results. Per SSM layer per step decode no longer D2D-copies the full ~225 MB recurrent state into the cache: the fused gated_delta_net op writes the final state in place at the active sequences cache slot (new ggml_gated_delta_net_inplace, src[6] = state_dst), mirroring vLLM fused_recurrent_gated_delta_rule. SSM math unchanged (bit-identical greedy). Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on): q36-27b-nvfp4 dense: npl32 113.74 -> 136.39 (+19.9 percent), npl128 146.23 -> 180.53 (+23.5 percent, = predicted copy-removal ceiling). q36-35b-a3b-nvfp4 MoE: npl128 313.36 -> 372.62 (+18.9 percent). nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share (391 @128) 37.4 -> 46.2 percent. See SSM_DECODE_FIX_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 09:09:07 -04:00 · 2026-06-24 22:45:49 +00:00
parent 34cadb64af
commit 5ce2f1df51
2 changed files with 447 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch
@@ -0,0 +1,349 @@
+From 17f16e8f6d8dbc689d5151c44759792d683c957b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 00:44:13 +0200
+Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state
+ write-back (patch 0018)
+
+Decode on the Qwen3.6 hybrid-SSM models (arch qwen35, 48 gated-DeltaNet :
+16 full-attention layers) was dominated by recurrent-state plumbing, not the
+FP4 GEMM. Per SSM layer per step the fused gated_delta_net op wrote its new
+recurrent state into graph scratch, then a separate ggml_cpy persisted it into
+the recurrent-state cache. nsys attributed 18.9% of decode GPU time to that
+~225 MB/copy D2D memcpy (1584 ops, 356 GB over the A2 decompose window).
+
+This mirrors vLLM fused_recurrent_gated_delta_rule (state kept in place):
+ggml_gated_delta_net_inplace writes the final recurrent state directly into the
+active sequences contiguous cache slot (at kv_head), removing the copy-back. The
+op output then carries only the attention scores; the SSM arithmetic is
+unchanged (bit-identical greedy output vs the copy-back baseline).
+
+- new op builder ggml_gated_delta_net_inplace (src[6] = state_dst cache view)
+- CUDA + CPU honor src[6]; final-state (K==1, keep_rs off) write redirected there
+- delta-net-base build_recurrent_attn uses it on the fused decode/prefill path,
+  dropping the ggml_cpy; rollback (n_rs_seq>0) path unchanged
+
+Measured (q36-27b-nvfp4, decode_agg S_TG, npp128 ntg128, -fa on, paged on):
+  npl 32 : 113.74 -> 136.39 t/s (+19.9 percent)
+  npl 128: 146.23 -> 180.53 t/s (+23.5 percent, = predicted copy-removal ceiling)
+MoE q36-35b-a3b-nvfp4: npl128 313.36 -> 372.62 t/s (+18.9 percent).
+nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share
+(391 @128) 37.4 -> 46.2 percent. get_rows state gather (now 18.8 percent) is the
+next lever.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/include/ggml.h                   | 14 ++++++
+ ggml/src/ggml-cpu/ops.cpp             | 13 ++++-
+ ggml/src/ggml-cuda/gated_delta_net.cu | 39 ++++++++++-----
+ ggml/src/ggml.c                       | 68 +++++++++++++++++++++++++++
+ src/models/delta-net-base.cpp         | 30 ++++++++++++
+ 5 files changed, 152 insertions(+), 12 deletions(-)
+
+diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
+index 823f5a9..4e7ab32 100644
+--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
+@@ -2579,6 +2579,20 @@ extern "C" {
+             struct ggml_tensor  * state,
+             int64_t               K);
+ 
+    // same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+    // in place into state_dst (a view into the recurrent-state cache) instead of being appended to
+    // the op output, eliminating the per-step state copy-back during decode. state_dst must be a
+    // contiguous [S_v*S_v*H, n_seqs] view (per-seq stride == dense state size).
+    GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * q,
+            struct ggml_tensor  * k,
+            struct ggml_tensor  * v,
+            struct ggml_tensor  * g,
+            struct ggml_tensor  * beta,
+            struct ggml_tensor  * state,
+            struct ggml_tensor  * state_dst);
+
+     // custom operators
+ 
+     typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
+diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
+index 63c07a2..9457add 100644
+--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
+@@ -10600,6 +10600,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+     ggml_tensor * src_g     = dst->src[3];
+     ggml_tensor * src_beta  = dst->src[4];
+     ggml_tensor * src_state = dst->src[5];
+    ggml_tensor * src_state_dst = dst->src[6]; // optional in-place final-state write-back target
+ 
+     const int64_t S_v      = src_v->ne[0];
+     const int64_t H        = src_v->ne[1];
+@@ -10660,6 +10661,16 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+ 
+     const float scale = 1.0f / sqrtf((float) S_v);
+ 
+    // when src_state_dst is provided (in-place decode write-back) the final state is written
+    // directly into the persistent cache view, removing the separate state copy-back node.
+    float * inplace_state_base = nullptr;
+    if (src_state_dst != nullptr) {
+        GGML_ASSERT(K == 1);
+        GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+        GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+        inplace_state_base = (float *) src_state_dst->data;
+    }
+
+     for (int64_t ir = ir0; ir < ir1; ++ir) {
+         const int64_t iv1 = ir % H; // head_index
+         const int64_t iv3 = ir / H; // sequence
+@@ -10674,7 +10685,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+         // For K>1, work in scratch and copy out per-token when the slot is in range.
+         float * s_out = (K > 1)
+             ? state_work
+-            : state_out_base + (iv3 * H + iv1) * S_v * S_v;
+            : (inplace_state_base ? inplace_state_base : state_out_base) + (iv3 * H + iv1) * S_v * S_v;
+ 
+         // copy input state into the working buffer and operate in-place
+         // state layout [S_v, S_v, H, n_seqs]: seq iv3 starts at iv3 * state_seq_stride.
+diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
+index a547360..61a2b91 100644
+--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
+@@ -25,7 +25,8 @@ gated_delta_net_cuda(const float * q,
+                                      const uint3   neqk1_magic,
+                                      const uint3   rq3_magic,
+                                      float         scale,
+-                                     int           K) {
+                                     int           K,
+                                     float *       state_dst) {
+     const uint32_t h_idx    = blockIdx.x;
+     const uint32_t sequence = blockIdx.y;
+     // each warp owns one column, using warp-level primitives to reduce across rows
+@@ -37,7 +38,10 @@ gated_delta_net_cuda(const float * q,
+ 
+     const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
+     float *       attn_data        = dst;
+-    float *       state            = dst + attn_score_elems;
+    // when state_dst is provided (in-place decode write-back) the final recurrent state is written
+    // directly into the persistent cache view instead of being appended to the op output; this
+    // eliminates the per-layer per-step D2D state copy-back. Only used when keep_rs_t == false.
+    float *       state            = (state_dst != nullptr) ? state_dst : (dst + attn_score_elems);
+ 
+     // input state holds s0 only: [S_v, S_v, H, n_seqs] — seq stride is D = H * S_v * S_v.
+     // output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before.
+@@ -171,7 +175,7 @@ template <bool KDA, bool keep_rs_t>
+ static void launch_gated_delta_net(
+         const float * q_d, const float * k_d, const float * v_d,
+         const float * g_d, const float * b_d, const float * s_d,
+-        float * dst_d,
+        float * dst_d, float * state_dst_d,
+         int64_t S_v,   int64_t H, int64_t n_tokens, int64_t n_seqs,
+         int64_t sq1,   int64_t sq2, int64_t sq3,
+         int64_t sv1,   int64_t sv2, int64_t sv3,
+@@ -195,26 +199,26 @@ static void launch_gated_delta_net(
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         case 32:
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         case 64: {
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         }
+         case 128: {
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         }
+         default:
+@@ -230,6 +234,7 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     ggml_tensor * src_g     = dst->src[3];
+     ggml_tensor * src_beta  = dst->src[4];
+     ggml_tensor * src_state = dst->src[5];
+    ggml_tensor * src_state_dst = dst->src[6]; // optional in-place state write-back target
+ 
+     GGML_TENSOR_LOCALS(int64_t, neq, src_q, ne);
+     GGML_TENSOR_LOCALS(size_t , nbq, src_q, nb);
+@@ -260,6 +265,15 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     const float * s_d   = (const float *) src_state->data;
+     float *       dst_d = (float *) dst->data;
+ 
+    float * state_dst_d = nullptr;
+    if (src_state_dst != nullptr) {
+        // in-place final-state cache view: per-seq stride must be the dense state size D = S_v*S_v*H
+        GGML_ASSERT(src_state_dst->type == GGML_TYPE_F32);
+        GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+        GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+        state_dst_d = (float *) src_state_dst->data;
+    }
+
+     GGML_ASSERT(ggml_is_contiguous_rows(src_q));
+     GGML_ASSERT(ggml_is_contiguous_rows(src_k));
+     GGML_ASSERT(ggml_is_contiguous_rows(src_v));
+@@ -288,23 +302,26 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     const int K = ggml_get_op_params_i32(dst, 0);
+     const bool keep_rs = K > 1;
+ 
+    // in-place write-back is only valid for the single-snapshot (final-state) case
+    GGML_ASSERT(state_dst_d == nullptr || !keep_rs);
+
+     if (kda) {
+         if (keep_rs) {
+-            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         } else {
+-            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         }
+     } else {
+         if (keep_rs) {
+-            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         } else {
+-            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         }
+diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
+index adbe52b..b8d34bf 100644
+--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
+@@ -6285,6 +6285,74 @@ struct ggml_tensor * ggml_gated_delta_net(
+     return result;
+ }
+ 
+// ggml_gated_delta_net_inplace
+//
+// Same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+// in place into `state_dst` (a view into the persistent recurrent-state cache) instead of being
+// appended to the op output. This removes the per-layer per-step D2D state copy-back during decode.
+// The op output holds ONLY the attention scores; the state region is still allocated (unused) so
+// the attention-output view layout is identical to ggml_gated_delta_net.
+struct ggml_tensor * ggml_gated_delta_net_inplace(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * q,
+        struct ggml_tensor  * k,
+        struct ggml_tensor  * v,
+        struct ggml_tensor  * g,
+        struct ggml_tensor  * beta,
+        struct ggml_tensor  * state,
+        struct ggml_tensor  * state_dst) {
+    GGML_ASSERT(ggml_is_contiguous_rows(q));
+    GGML_ASSERT(ggml_is_contiguous_rows(k));
+    GGML_ASSERT(ggml_is_contiguous_rows(v));
+    GGML_ASSERT(ggml_is_contiguous(g));
+    GGML_ASSERT(ggml_is_contiguous(beta));
+    GGML_ASSERT(ggml_is_contiguous(state));
+
+    GGML_ASSERT(q->type == GGML_TYPE_F32);
+    GGML_ASSERT(k->type == GGML_TYPE_F32);
+    GGML_ASSERT(v->type == GGML_TYPE_F32);
+    GGML_ASSERT(g->type == GGML_TYPE_F32);
+    GGML_ASSERT(beta->type == GGML_TYPE_F32);
+    GGML_ASSERT(state->type == GGML_TYPE_F32);
+    GGML_ASSERT(state_dst != NULL);
+    GGML_ASSERT(state_dst->type == GGML_TYPE_F32);
+
+    const int64_t S_v      = v->ne[0];
+    const int64_t H        = v->ne[1];
+    const int64_t n_tokens = v->ne[2];
+    const int64_t n_seqs   = v->ne[3];
+
+    GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
+    GGML_ASSERT(beta->ne[0] == 1);
+
+    GGML_ASSERT(state->ne[0] == S_v);
+    GGML_ASSERT(state->ne[1] == S_v);
+    GGML_ASSERT(state->ne[2] == H);
+    GGML_ASSERT(state->ne[3] == n_seqs);
+
+    // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
+    GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
+    GGML_ASSERT(state_dst->ne[1] >= n_seqs);
+    GGML_ASSERT(state_dst->nb[0] == sizeof(float));
+
+    const int64_t state_rows = S_v * n_seqs; // K == 1
+    const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
+    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+    ggml_set_op_params_i32(result, 0, 1); // K == 1
+
+    result->op     = GGML_OP_GATED_DELTA_NET;
+    result->src[0] = q;
+    result->src[1] = k;
+    result->src[2] = v;
+    result->src[3] = g;
+    result->src[4] = beta;
+    result->src[5] = state;
+    result->src[6] = state_dst;
+
+    return result;
+}
+
+ ////////////////////////////////////////////////////////////////////////////////
+ 
+ struct ggml_hash_set ggml_hash_set_new(size_t size) {
+diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
+index ad9ce77..26a718b 100644
+--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
+@@ -546,6 +546,36 @@ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
+     const bool keep = cparams.n_rs_seq > 0;
+ 
+     if (!keep) {
+        const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
+
+        if (fused) {
+            // In-place state write-back: the fused gated-DeltaNet op writes the new recurrent state
+            // directly into the persistent cache slot for the active sequences (a contiguous block
+            // at kv_head), eliminating the per-layer per-step ~full-state D2D copy-back that
+            // dominated decode. The op output then carries only the attention scores.
+            ggml_tensor * state_dst = ggml_view_2d(ctx0, ssm_states_all, hparams.n_embd_s(), n_seqs,
+                    ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
+
+            ggml_tensor * result = ggml_gated_delta_net_inplace(ctx0, q, k, v, g, b, s, state_dst);
+            if (n_seq_tokens == 1) {
+                cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
+            } else {
+                cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
+            }
+
+            ggml_tensor * output = ggml_view_4d(ctx0, result,
+                    S_v, H_v, n_seq_tokens, n_seqs,
+                    ggml_row_size(result->type, S_v),
+                    ggml_row_size(result->type, S_v * H_v),
+                    ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
+            cb(output, "attn_output", il);
+
+            // the state write is a side effect of the op; pull the op into the graph via the output
+            ggml_build_forward_expand(gf, output);
+
+            return output;
+        }
+
+         auto attn_out = build_delta_net(q, k, v, g, b, s, il);
+         ggml_tensor * output    = attn_out.first;
+         ggml_tensor * new_state = attn_out.second;
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/SSM_DECODE_FIX_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/SSM_DECODE_FIX_RESULTS.md
@@ -0,0 +1,98 @@
+# SSM decode fix - qwen35 gated-DeltaNet in-place recurrent-state write-back (patch 0018)
+
+Follow-up to `A2_CUDAGRAPH_DECODE.md`. That analysis located the real decode lever
+on the Qwen3.6 hybrid-SSM models (arch `qwen35`, 48 gated-DeltaNet linear-attn
+layers : 16 full-attn layers) and ruled out the FP4 GEMM, CUDA graphs, the host
+loop, and attention. The corrected per-kernel + per-memcpy decode decomposition
+attributed ~67% of decode GPU time to SSM-state plumbing:
+
+    gated_delta_net 23.4% | get_rows state-gather 21.9% | D2D state-copy 18.9% (= ~67%)
+    FP4 matmul ~28% | full attention 0.4%
+
+Root cause: per SSM layer per step the fused `gated_delta_net` op wrote its new
+recurrent state into graph scratch, then a **separate `ggml_cpy` persisted the
+full ~225 MB state into the recurrent-state cache** (1584 D2D ops, 356 GB, 18.9%
+of decode over the profile window). vLLM's `fused_recurrent_gated_delta_rule`
+keeps the state in place (no copy).
+
+## STEP 1 (this patch): kill the per-layer D2D state copy-back
+
+`ggml_gated_delta_net_inplace` (new builder, `src[6] = state_dst`) makes the op
+write its final recurrent state **directly into the active sequences' contiguous
+cache slot** (at `kv_head`), eliminating the copy-back. The op output then carries
+only the attention scores. SSM arithmetic is unchanged - only the destination
+pointer of the final-state write moved.
+
+- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new `ggml_gated_delta_net_inplace` op
+  builder. `dst` retains the same `[attn | state]` layout so the attention-output
+  view is identical; the state region is left unused.
+- `ggml/src/ggml-cuda/gated_delta_net.cu`: kernel/launch/op-handler thread an
+  optional `state_dst`; final-state (`!keep_rs`) write targets it when present.
+- `ggml/src/ggml-cpu/ops.cpp`: K==1 path operates in place on the `state_dst`
+  cache view (kept CPU-correct for non-CUDA runs / CI).
+- `src/models/delta-net-base.cpp`: `build_recurrent_attn` uses the in-place op on
+  the fused decode/prefill path and drops the `ggml_cpy`. The rollback path
+  (`n_rs_seq > 0`) is unchanged. The get_rows state gather is unchanged (STEP 2).
+
+### Correctness gate
+
+- **Bit-identical**: greedy (`--temp 0 --seed 1`) `llama-completion` output on
+  `q36-27b-nvfp4` is byte-for-byte identical between the copy-back baseline and the
+  in-place build (`diff` -> IDENTICAL).
+- **Coherent**: dense + MoE multi-paragraph greedy generations are on-topic and
+  correct (Rayleigh scattering; Roman Empire 27 BCE / Actium 31 BCE; primes;
+  additive vs subtractive color).
+- Gated to the `qwen35` / gated-DeltaNet fused path; rollback and all non-SSM
+  archs untouched (they never construct the in-place op).
+
+### Measured decode_agg (`S_TG t/s`, npp 128, ntg 128, -fa on, paged on, fusion off)
+
+Dense `q36-27b-nvfp4`:
+
+| npl | baseline | in-place | delta   | % of vLLM (391 @128) |
+|-----|----------|----------|---------|----------------------|
+| 32  | 113.74   | 136.39   | +19.9%  | -                    |
+| 128 | 146.23   | 180.53   | +23.5%  | 37.4% -> 46.2%       |
+
+The npl-128 result lands on the predicted copy-removal ceiling (~180 t/s).
+
+MoE `q36-35b-a3b-nvfp4`:
+
+| npl | baseline | in-place | delta   |
+|-----|----------|----------|---------|
+| 32  | 246.79   | 279.41   | +13.2%  |
+| 128 | 313.36   | 372.62   | +18.9%  |
+
+### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
+
+The D2D state-copy bucket collapsed:
+
+| bucket            | before              | after                |
+|-------------------|---------------------|----------------------|
+| MEMCPY D2D        | 18.9% / 356 GB / 1584 ops | 0.23% / 2.93 GB / 734 ops |
+
+The ~225 MB/copy recurrent-state copy-back is gone (122x fewer D2D bytes); the
+residual D2D is the small conv-state copies. With it removed, the remaining decode
+buckets are `gated_delta_net` 26.0%, FP4 matmul ~37.5%, and `get_rows` state
+gather 18.8%.
+
+## STEP 2 (not in this patch): fuse the get_rows state gather
+
+The state gather is now the largest single non-GEMM bucket (18.8%). It is a pure
+materialization: `build_rs` calls `ggml_get_rows(cache, s_copy_main)` to copy each
+sequence's previous state into a contiguous scratch tensor before the op reads it.
+`ggml_ssm_scan` already avoids this by taking the `ids` tensor (`src[6]`) and
+reading the per-seq state directly from the full cache. The same fusion applies
+here: give `ggml_gated_delta_net` an `ids` source, read `curr_state` from
+`cache + ids[seq]*D` in the kernel, and pass the full cache via the `build_rs`
+`get_state_rows` lambda (mirroring `mamba-base.cpp`). Predicted ceiling with both
+steps: ~247 t/s (~63% of vLLM dense @128), GEMM untouched.
+
+## Verdict on the path to parity
+
+STEP 1 removes ~half of the SSM plumbing overhead and is the dominant, lowest-risk
+lever; it is bit-exact and shipped here. STEP 2 (gather fusion) has a proven ggml
+precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLLM
+after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel
+track. No paged/graph/block-table change can move decode on this model (full
+attention is 0.4% of decode).