From 5ce2f1df51f5d3953426497cfd7080f821803c1d Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 24 Jun 2026 22:45:49 +0000 Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (patch 0018) Mirror of the llama-paged-dev patch 0018 engine change plus the measured results. Per SSM layer per step decode no longer D2D-copies the full ~225 MB recurrent state into the cache: the fused gated_delta_net op writes the final state in place at the active sequences cache slot (new ggml_gated_delta_net_inplace, src[6] = state_dst), mirroring vLLM fused_recurrent_gated_delta_rule. SSM math unchanged (bit-identical greedy). Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on): q36-27b-nvfp4 dense: npl32 113.74 -> 136.39 (+19.9 percent), npl128 146.23 -> 180.53 (+23.5 percent, = predicted copy-removal ceiling). q36-35b-a3b-nvfp4 MoE: npl128 313.36 -> 372.62 (+18.9 percent). nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share (391 @128) 37.4 -> 46.2 percent. See SSM_DECODE_FIX_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- ...0018-qwen35-ssm-decode-inplace-state.patch | 349 ++++++++++++++++++ .../patches/paged/SSM_DECODE_FIX_RESULTS.md | 98 +++++ 2 files changed, 447 insertions(+) create mode 100644 backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch create mode 100644 backend/cpp/llama-cpp/patches/paged/SSM_DECODE_FIX_RESULTS.md diff --git a/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch b/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch new file mode 100644 index 000000000..2db002a66 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch @@ -0,0 +1,349 @@ +From 17f16e8f6d8dbc689d5151c44759792d683c957b Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Thu, 25 Jun 2026 00:44:13 +0200 +Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state + write-back (patch 0018) + +Decode on the Qwen3.6 hybrid-SSM models (arch qwen35, 48 gated-DeltaNet : +16 full-attention layers) was dominated by recurrent-state plumbing, not the +FP4 GEMM. Per SSM layer per step the fused gated_delta_net op wrote its new +recurrent state into graph scratch, then a separate ggml_cpy persisted it into +the recurrent-state cache. nsys attributed 18.9% of decode GPU time to that +~225 MB/copy D2D memcpy (1584 ops, 356 GB over the A2 decompose window). + +This mirrors vLLM fused_recurrent_gated_delta_rule (state kept in place): +ggml_gated_delta_net_inplace writes the final recurrent state directly into the +active sequences contiguous cache slot (at kv_head), removing the copy-back. The +op output then carries only the attention scores; the SSM arithmetic is +unchanged (bit-identical greedy output vs the copy-back baseline). + +- new op builder ggml_gated_delta_net_inplace (src[6] = state_dst cache view) +- CUDA + CPU honor src[6]; final-state (K==1, keep_rs off) write redirected there +- delta-net-base build_recurrent_attn uses it on the fused decode/prefill path, + dropping the ggml_cpy; rollback (n_rs_seq>0) path unchanged + +Measured (q36-27b-nvfp4, decode_agg S_TG, npp128 ntg128, -fa on, paged on): + npl 32 : 113.74 -> 136.39 t/s (+19.9 percent) + npl 128: 146.23 -> 180.53 t/s (+23.5 percent, = predicted copy-removal ceiling) +MoE q36-35b-a3b-nvfp4: npl128 313.36 -> 372.62 t/s (+18.9 percent). +nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share +(391 @128) 37.4 -> 46.2 percent. get_rows state gather (now 18.8 percent) is the +next lever. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/include/ggml.h | 14 ++++++ + ggml/src/ggml-cpu/ops.cpp | 13 ++++- + ggml/src/ggml-cuda/gated_delta_net.cu | 39 ++++++++++----- + ggml/src/ggml.c | 68 +++++++++++++++++++++++++++ + src/models/delta-net-base.cpp | 30 ++++++++++++ + 5 files changed, 152 insertions(+), 12 deletions(-) + +diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h +index 823f5a9..4e7ab32 100644 +--- a/ggml/include/ggml.h ++++ b/ggml/include/ggml.h +@@ -2579,6 +2579,20 @@ extern "C" { + struct ggml_tensor * state, + int64_t K); + ++ // same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written ++ // in place into state_dst (a view into the recurrent-state cache) instead of being appended to ++ // the op output, eliminating the per-step state copy-back during decode. state_dst must be a ++ // contiguous [S_v*S_v*H, n_seqs] view (per-seq stride == dense state size). ++ GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace( ++ struct ggml_context * ctx, ++ struct ggml_tensor * q, ++ struct ggml_tensor * k, ++ struct ggml_tensor * v, ++ struct ggml_tensor * g, ++ struct ggml_tensor * beta, ++ struct ggml_tensor * state, ++ struct ggml_tensor * state_dst); ++ + // custom operators + + typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata); +diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp +index 63c07a2..9457add 100644 +--- a/ggml/src/ggml-cpu/ops.cpp ++++ b/ggml/src/ggml-cpu/ops.cpp +@@ -10600,6 +10600,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk( + ggml_tensor * src_g = dst->src[3]; + ggml_tensor * src_beta = dst->src[4]; + ggml_tensor * src_state = dst->src[5]; ++ ggml_tensor * src_state_dst = dst->src[6]; // optional in-place final-state write-back target + + const int64_t S_v = src_v->ne[0]; + const int64_t H = src_v->ne[1]; +@@ -10660,6 +10661,16 @@ static void ggml_compute_forward_gated_delta_net_one_chunk( + + const float scale = 1.0f / sqrtf((float) S_v); + ++ // when src_state_dst is provided (in-place decode write-back) the final state is written ++ // directly into the persistent cache view, removing the separate state copy-back node. ++ float * inplace_state_base = nullptr; ++ if (src_state_dst != nullptr) { ++ GGML_ASSERT(K == 1); ++ GGML_ASSERT(src_state_dst->nb[0] == sizeof(float)); ++ GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float)); ++ inplace_state_base = (float *) src_state_dst->data; ++ } ++ + for (int64_t ir = ir0; ir < ir1; ++ir) { + const int64_t iv1 = ir % H; // head_index + const int64_t iv3 = ir / H; // sequence +@@ -10674,7 +10685,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk( + // For K>1, work in scratch and copy out per-token when the slot is in range. + float * s_out = (K > 1) + ? state_work +- : state_out_base + (iv3 * H + iv1) * S_v * S_v; ++ : (inplace_state_base ? inplace_state_base : state_out_base) + (iv3 * H + iv1) * S_v * S_v; + + // copy input state into the working buffer and operate in-place + // state layout [S_v, S_v, H, n_seqs]: seq iv3 starts at iv3 * state_seq_stride. +diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu +index a547360..61a2b91 100644 +--- a/ggml/src/ggml-cuda/gated_delta_net.cu ++++ b/ggml/src/ggml-cuda/gated_delta_net.cu +@@ -25,7 +25,8 @@ gated_delta_net_cuda(const float * q, + const uint3 neqk1_magic, + const uint3 rq3_magic, + float scale, +- int K) { ++ int K, ++ float * state_dst) { + const uint32_t h_idx = blockIdx.x; + const uint32_t sequence = blockIdx.y; + // each warp owns one column, using warp-level primitives to reduce across rows +@@ -37,7 +38,10 @@ gated_delta_net_cuda(const float * q, + + const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs; + float * attn_data = dst; +- float * state = dst + attn_score_elems; ++ // when state_dst is provided (in-place decode write-back) the final recurrent state is written ++ // directly into the persistent cache view instead of being appended to the op output; this ++ // eliminates the per-layer per-step D2D state copy-back. Only used when keep_rs_t == false. ++ float * state = (state_dst != nullptr) ? state_dst : (dst + attn_score_elems); + + // input state holds s0 only: [S_v, S_v, H, n_seqs] — seq stride is D = H * S_v * S_v. + // output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before. +@@ -171,7 +175,7 @@ template + static void launch_gated_delta_net( + const float * q_d, const float * k_d, const float * v_d, + const float * g_d, const float * b_d, const float * s_d, +- float * dst_d, ++ float * dst_d, float * state_dst_d, + int64_t S_v, int64_t H, int64_t n_tokens, int64_t n_seqs, + int64_t sq1, int64_t sq2, int64_t sq3, + int64_t sv1, int64_t sv2, int64_t sv3, +@@ -195,26 +199,26 @@ static void launch_gated_delta_net( + ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); + break; + case 32: + ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); + break; + case 64: { + ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); + break; + } + case 128: { + ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); + break; + } + default: +@@ -230,6 +234,7 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * + ggml_tensor * src_g = dst->src[3]; + ggml_tensor * src_beta = dst->src[4]; + ggml_tensor * src_state = dst->src[5]; ++ ggml_tensor * src_state_dst = dst->src[6]; // optional in-place state write-back target + + GGML_TENSOR_LOCALS(int64_t, neq, src_q, ne); + GGML_TENSOR_LOCALS(size_t , nbq, src_q, nb); +@@ -260,6 +265,15 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * + const float * s_d = (const float *) src_state->data; + float * dst_d = (float *) dst->data; + ++ float * state_dst_d = nullptr; ++ if (src_state_dst != nullptr) { ++ // in-place final-state cache view: per-seq stride must be the dense state size D = S_v*S_v*H ++ GGML_ASSERT(src_state_dst->type == GGML_TYPE_F32); ++ GGML_ASSERT(src_state_dst->nb[0] == sizeof(float)); ++ GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float)); ++ state_dst_d = (float *) src_state_dst->data; ++ } ++ + GGML_ASSERT(ggml_is_contiguous_rows(src_q)); + GGML_ASSERT(ggml_is_contiguous_rows(src_k)); + GGML_ASSERT(ggml_is_contiguous_rows(src_v)); +@@ -288,23 +302,26 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * + const int K = ggml_get_op_params_i32(dst, 0); + const bool keep_rs = K > 1; + ++ // in-place write-back is only valid for the single-snapshot (final-state) case ++ GGML_ASSERT(state_dst_d == nullptr || !keep_rs); ++ + if (kda) { + if (keep_rs) { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } else { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } + } else { + if (keep_rs) { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } else { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } +diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c +index adbe52b..b8d34bf 100644 +--- a/ggml/src/ggml.c ++++ b/ggml/src/ggml.c +@@ -6285,6 +6285,74 @@ struct ggml_tensor * ggml_gated_delta_net( + return result; + } + ++// ggml_gated_delta_net_inplace ++// ++// Same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written ++// in place into `state_dst` (a view into the persistent recurrent-state cache) instead of being ++// appended to the op output. This removes the per-layer per-step D2D state copy-back during decode. ++// The op output holds ONLY the attention scores; the state region is still allocated (unused) so ++// the attention-output view layout is identical to ggml_gated_delta_net. ++struct ggml_tensor * ggml_gated_delta_net_inplace( ++ struct ggml_context * ctx, ++ struct ggml_tensor * q, ++ struct ggml_tensor * k, ++ struct ggml_tensor * v, ++ struct ggml_tensor * g, ++ struct ggml_tensor * beta, ++ struct ggml_tensor * state, ++ struct ggml_tensor * state_dst) { ++ GGML_ASSERT(ggml_is_contiguous_rows(q)); ++ GGML_ASSERT(ggml_is_contiguous_rows(k)); ++ GGML_ASSERT(ggml_is_contiguous_rows(v)); ++ GGML_ASSERT(ggml_is_contiguous(g)); ++ GGML_ASSERT(ggml_is_contiguous(beta)); ++ GGML_ASSERT(ggml_is_contiguous(state)); ++ ++ GGML_ASSERT(q->type == GGML_TYPE_F32); ++ GGML_ASSERT(k->type == GGML_TYPE_F32); ++ GGML_ASSERT(v->type == GGML_TYPE_F32); ++ GGML_ASSERT(g->type == GGML_TYPE_F32); ++ GGML_ASSERT(beta->type == GGML_TYPE_F32); ++ GGML_ASSERT(state->type == GGML_TYPE_F32); ++ GGML_ASSERT(state_dst != NULL); ++ GGML_ASSERT(state_dst->type == GGML_TYPE_F32); ++ ++ const int64_t S_v = v->ne[0]; ++ const int64_t H = v->ne[1]; ++ const int64_t n_tokens = v->ne[2]; ++ const int64_t n_seqs = v->ne[3]; ++ ++ GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v); ++ GGML_ASSERT(beta->ne[0] == 1); ++ ++ GGML_ASSERT(state->ne[0] == S_v); ++ GGML_ASSERT(state->ne[1] == S_v); ++ GGML_ASSERT(state->ne[2] == H); ++ GGML_ASSERT(state->ne[3] == n_seqs); ++ ++ // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs] ++ GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H); ++ GGML_ASSERT(state_dst->ne[1] >= n_seqs); ++ GGML_ASSERT(state_dst->nb[0] == sizeof(float)); ++ ++ const int64_t state_rows = S_v * n_seqs; // K == 1 ++ const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 }; ++ struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne); ++ ++ ggml_set_op_params_i32(result, 0, 1); // K == 1 ++ ++ result->op = GGML_OP_GATED_DELTA_NET; ++ result->src[0] = q; ++ result->src[1] = k; ++ result->src[2] = v; ++ result->src[3] = g; ++ result->src[4] = beta; ++ result->src[5] = state; ++ result->src[6] = state_dst; ++ ++ return result; ++} ++ + //////////////////////////////////////////////////////////////////////////////// + + struct ggml_hash_set ggml_hash_set_new(size_t size) { +diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp +index ad9ce77..26a718b 100644 +--- a/src/models/delta-net-base.cpp ++++ b/src/models/delta-net-base.cpp +@@ -546,6 +546,36 @@ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn( + const bool keep = cparams.n_rs_seq > 0; + + if (!keep) { ++ const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch; ++ ++ if (fused) { ++ // In-place state write-back: the fused gated-DeltaNet op writes the new recurrent state ++ // directly into the persistent cache slot for the active sequences (a contiguous block ++ // at kv_head), eliminating the per-layer per-step ~full-state D2D copy-back that ++ // dominated decode. The op output then carries only the attention scores. ++ ggml_tensor * state_dst = ggml_view_2d(ctx0, ssm_states_all, hparams.n_embd_s(), n_seqs, ++ ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all)); ++ ++ ggml_tensor * result = ggml_gated_delta_net_inplace(ctx0, q, k, v, g, b, s, state_dst); ++ if (n_seq_tokens == 1) { ++ cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il); ++ } else { ++ cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il); ++ } ++ ++ ggml_tensor * output = ggml_view_4d(ctx0, result, ++ S_v, H_v, n_seq_tokens, n_seqs, ++ ggml_row_size(result->type, S_v), ++ ggml_row_size(result->type, S_v * H_v), ++ ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0); ++ cb(output, "attn_output", il); ++ ++ // the state write is a side effect of the op; pull the op into the graph via the output ++ ggml_build_forward_expand(gf, output); ++ ++ return output; ++ } ++ + auto attn_out = build_delta_net(q, k, v, g, b, s, il); + ggml_tensor * output = attn_out.first; + ggml_tensor * new_state = attn_out.second; +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp/patches/paged/SSM_DECODE_FIX_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/SSM_DECODE_FIX_RESULTS.md new file mode 100644 index 000000000..2e7c8c203 --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/SSM_DECODE_FIX_RESULTS.md @@ -0,0 +1,98 @@ +# SSM decode fix - qwen35 gated-DeltaNet in-place recurrent-state write-back (patch 0018) + +Follow-up to `A2_CUDAGRAPH_DECODE.md`. That analysis located the real decode lever +on the Qwen3.6 hybrid-SSM models (arch `qwen35`, 48 gated-DeltaNet linear-attn +layers : 16 full-attn layers) and ruled out the FP4 GEMM, CUDA graphs, the host +loop, and attention. The corrected per-kernel + per-memcpy decode decomposition +attributed ~67% of decode GPU time to SSM-state plumbing: + + gated_delta_net 23.4% | get_rows state-gather 21.9% | D2D state-copy 18.9% (= ~67%) + FP4 matmul ~28% | full attention 0.4% + +Root cause: per SSM layer per step the fused `gated_delta_net` op wrote its new +recurrent state into graph scratch, then a **separate `ggml_cpy` persisted the +full ~225 MB state into the recurrent-state cache** (1584 D2D ops, 356 GB, 18.9% +of decode over the profile window). vLLM's `fused_recurrent_gated_delta_rule` +keeps the state in place (no copy). + +## STEP 1 (this patch): kill the per-layer D2D state copy-back + +`ggml_gated_delta_net_inplace` (new builder, `src[6] = state_dst`) makes the op +write its final recurrent state **directly into the active sequences' contiguous +cache slot** (at `kv_head`), eliminating the copy-back. The op output then carries +only the attention scores. SSM arithmetic is unchanged - only the destination +pointer of the final-state write moved. + +- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new `ggml_gated_delta_net_inplace` op + builder. `dst` retains the same `[attn | state]` layout so the attention-output + view is identical; the state region is left unused. +- `ggml/src/ggml-cuda/gated_delta_net.cu`: kernel/launch/op-handler thread an + optional `state_dst`; final-state (`!keep_rs`) write targets it when present. +- `ggml/src/ggml-cpu/ops.cpp`: K==1 path operates in place on the `state_dst` + cache view (kept CPU-correct for non-CUDA runs / CI). +- `src/models/delta-net-base.cpp`: `build_recurrent_attn` uses the in-place op on + the fused decode/prefill path and drops the `ggml_cpy`. The rollback path + (`n_rs_seq > 0`) is unchanged. The get_rows state gather is unchanged (STEP 2). + +### Correctness gate + +- **Bit-identical**: greedy (`--temp 0 --seed 1`) `llama-completion` output on + `q36-27b-nvfp4` is byte-for-byte identical between the copy-back baseline and the + in-place build (`diff` -> IDENTICAL). +- **Coherent**: dense + MoE multi-paragraph greedy generations are on-topic and + correct (Rayleigh scattering; Roman Empire 27 BCE / Actium 31 BCE; primes; + additive vs subtractive color). +- Gated to the `qwen35` / gated-DeltaNet fused path; rollback and all non-SSM + archs untouched (they never construct the in-place op). + +### Measured decode_agg (`S_TG t/s`, npp 128, ntg 128, -fa on, paged on, fusion off) + +Dense `q36-27b-nvfp4`: + +| npl | baseline | in-place | delta | % of vLLM (391 @128) | +|-----|----------|----------|---------|----------------------| +| 32 | 113.74 | 136.39 | +19.9% | - | +| 128 | 146.23 | 180.53 | +23.5% | 37.4% -> 46.2% | + +The npl-128 result lands on the predicted copy-removal ceiling (~180 t/s). + +MoE `q36-35b-a3b-nvfp4`: + +| npl | baseline | in-place | delta | +|-----|----------|----------|---------| +| 32 | 246.79 | 279.41 | +13.2% | +| 128 | 313.36 | 372.62 | +18.9% | + +### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager) + +The D2D state-copy bucket collapsed: + +| bucket | before | after | +|-------------------|---------------------|----------------------| +| MEMCPY D2D | 18.9% / 356 GB / 1584 ops | 0.23% / 2.93 GB / 734 ops | + +The ~225 MB/copy recurrent-state copy-back is gone (122x fewer D2D bytes); the +residual D2D is the small conv-state copies. With it removed, the remaining decode +buckets are `gated_delta_net` 26.0%, FP4 matmul ~37.5%, and `get_rows` state +gather 18.8%. + +## STEP 2 (not in this patch): fuse the get_rows state gather + +The state gather is now the largest single non-GEMM bucket (18.8%). It is a pure +materialization: `build_rs` calls `ggml_get_rows(cache, s_copy_main)` to copy each +sequence's previous state into a contiguous scratch tensor before the op reads it. +`ggml_ssm_scan` already avoids this by taking the `ids` tensor (`src[6]`) and +reading the per-seq state directly from the full cache. The same fusion applies +here: give `ggml_gated_delta_net` an `ids` source, read `curr_state` from +`cache + ids[seq]*D` in the kernel, and pass the full cache via the `build_rs` +`get_state_rows` lambda (mirroring `mamba-base.cpp`). Predicted ceiling with both +steps: ~247 t/s (~63% of vLLM dense @128), GEMM untouched. + +## Verdict on the path to parity + +STEP 1 removes ~half of the SSM plumbing overhead and is the dominant, lowest-risk +lever; it is bit-exact and shipped here. STEP 2 (gather fusion) has a proven ggml +precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLLM +after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel +track. No paged/graph/block-table change can move decode on this model (full +attention is 0.4% of decode).