patches(paged): track 0044 GatedRMSNorm patch, sync LocalAI series to fork 51168c5

The fork mudler/llama.cpp branch localai-paged is the canonical source of
truth for the paged-backend patch series. This file is the git format-patch
of fork commit 51168c5ee ("feat(paged): fused gated RMSNorm + SiLU gate-mul
CUDA op (patch 0044)"), verified byte-identical to that commit's format-patch
output. The full on-disk series applies clean in numeric order on the pinned
base and the resulting tree is byte-identical to the fork commit tree (tree
hash a73d759350277532a14e853e1fe78f08bbb74ce8), so the LocalAI series is a
drift-free 1:1 mirror of the fork branch.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-30 15:10:13 +00:00
parent 8bb47e5a8a
commit 2033086f60

View File

@@ -0,0 +1,470 @@
From 51168c5eee2e35348d9006f0b2fab3dc6e7c01cc Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 30 Jun 2026 10:57:05 +0200
Subject: [PATCH] feat(paged): fused gated RMSNorm + SiLU gate-mul CUDA op
(patch 0044)
The Qwen3.6 gated-DeltaNet output norm self.norm(core_attn_out, z)
(qwen35 / qwen35moe build_norm_gated) runs as (rms_norm(x) * w) * silu(z):
on CUDA that was rms_norm_mul + silu_mul, two fused launches with the
normalized intermediate round-tripping through HBM. Fuse the whole chain
into one kernel so it stays in registers. This is the gated-RMSNorm fusion
the vLLM decode-gap analysis ranked #1 (the easy, bit-exact prefill win),
a direct sibling of patch 0042 (add-RMSNorm).
The chain is NOT naturally consecutive in the graph: the gate z-projection
(a MUL_MAT) is scheduled between the weight MUL and the SILU, so the default
mul(normalized, silu(z)) order leaves a GEMM between them and cannot be
fused. build_norm_gated now emits the gate multiply as mul(silu(z),
normalized) (commutative, so bit-exact), which lays the chain out as the
consecutive subgraph { SILU, RMS_NORM, MUL, MUL } that ggml-cuda can fuse.
- New kernel rms_norm_gate_mul_f32 (ggml/src/ggml-cuda/norm.cu): same
block_reduce<SUM> over x^2, same 256/1024 block-size thresholds and
rsqrtf(mean+eps) as rms_norm / patch 0042; the final write computes
dst = scale * x * w * silu(z) with silu(z) = z/(1+expf(-z)) (the exact
ggml_cuda_op_silu_single form). w (the RMS weight) and z (the gate) both
broadcast via the packed-modulo helper.
- ggml_cuda_can_fuse recognizes { GGML_OP_UNARY(SILU), RMS_NORM, MUL, MUL }
via ggml_can_fuse_subgraph with the final MUL as the only output (the SILU
reads an external gate; RMS_NORM and the weight MUL are single-use within).
- Gated by LLAMA_FUSE_GATE_RMSNORM (default ON) for a clean single-build A/B;
OFF keeps the original operand order AND the unfused kernels, so OFF is
byte- and kernel-identical to the pre-patch path.
BIT-EXACT (per-path canonical greedy md5, n=48 --temp 0 --seed 1):
dense q36-27b-nvfp4 : 5951a5b4d624ce891e22ab5fca9bc439 (ON == OFF == canonical, paged and non-paged)
MoE q36-35b-a3b : 8cb0ce23777bf55f92f63d0292c756b0 (ON == OFF == canonical, paged)
Multiply is commutative, so ((scale*x)*w)*silu(z) is byte-identical to the
unfused silu(z)*((scale*x)*w); the sum(x^2) reduction and rsqrt scale are
unchanged. test-backend-ops 12979/12979 (CUDA0 vs CPU).
PROFILE (dense prefill, nsys --cuda-graph-trace=node, npp512 ntg4 npl8):
rms_norm_f32<256,1,0> 560 -> 224 launches
unary_gated_op_kernel<op_silu> 784 -> 448 launches
rms_norm_gate_mul_f32 (new) 336 launches / 69.7M ns
-> the 336 gated-norm rms_norm_mul + 336 silu_mul launches (672) fold into
336 fused launches, removing the normalized HBM round-trip.
S_PP (npp512 ntg4 npl32, 3x interleaved A/B, every ON beats every OFF):
dense q36-27b : 1002.5 -> 1013.4 t/s (+1.1%, ~+10 us/tok)
MoE q36-35b : 2626.9 -> 2651.8 t/s (+0.9%)
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/ggml-cuda.cu | 67 ++++++++++
ggml/src/ggml-cuda/norm.cu | 215 ++++++++++++++++++++++++++++++++
ggml/src/ggml-cuda/norm.cuh | 6 +
src/models/qwen35.cpp | 16 +++
src/models/qwen35moe.cpp | 16 +++
5 files changed, 320 insertions(+)
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 42bcd4a77..374949f25 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -3816,6 +3816,60 @@ static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph,
return true;
}
+ // Fused gated RMS norm: SiLU gate multiply over (RMS norm * weight), the
+ // gated-DeltaNet output norm `out = (rms_norm(x) * w) * silu(z)` of the Qwen3.6
+ // hybrid models (qwen35 / qwen35moe build_norm_gated). The model emits the gate
+ // multiply as mul(silu(z), normalized) (default; see LLAMA_FUSE_GATE_RMSNORM in
+ // build_norm_gated) so the chain forms the consecutive subgraph
+ // { SILU, RMS_NORM, MUL, MUL } - the gate z-projection is scheduled before the
+ // SILU, so the natural mul(normalized, silu) order leaves a GEMM between the
+ // weight MUL and the SILU and cannot be fused. The SILU (node_idx) reads an
+ // external gate and the final gate MUL (node_idx + 3) feeds the o_proj, so mark
+ // node_idx + 3 as the only output; the RMS_NORM (node_idx + 1) and weight MUL
+ // (node_idx + 2) are single-use within the subgraph.
+ std::initializer_list<enum ggml_op> rms_norm_gate_mul_ops = { GGML_OP_UNARY, GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_MUL };
+ if (is_equal(rms_norm_gate_mul_ops, ops) && unary_ops.size() == 1 && unary_ops.begin()[0] == GGML_UNARY_OP_SILU &&
+ ggml_can_fuse_subgraph(cgraph, node_idx, ops, { node_idx + 3 })) {
+ const ggml_tensor * silu = cgraph->nodes[node_idx];
+ const ggml_tensor * rms_norm = cgraph->nodes[node_idx + 1];
+ const ggml_tensor * mul = cgraph->nodes[node_idx + 2];
+ const ggml_tensor * gate_mul = cgraph->nodes[node_idx + 3];
+
+ if (ggml_get_unary_op(silu) != GGML_UNARY_OP_SILU) {
+ return false;
+ }
+ // The weight MUL must consume the RMS norm output; the gate MUL must
+ // consume both the weight MUL and the SILU output.
+ if (mul->src[0] != rms_norm && mul->src[1] != rms_norm) {
+ return false;
+ }
+ if ((gate_mul->src[0] != mul && gate_mul->src[1] != mul) ||
+ (gate_mul->src[0] != silu && gate_mul->src[1] != silu)) {
+ return false;
+ }
+ // All operands F32 (rms norm / fused mul / silu kernel only support F32).
+ if (rms_norm->src[0]->type != GGML_TYPE_F32 || rms_norm->type != GGML_TYPE_F32 ||
+ mul->src[0]->type != GGML_TYPE_F32 || mul->src[1]->type != GGML_TYPE_F32 || mul->type != GGML_TYPE_F32 ||
+ silu->src[0]->type != GGML_TYPE_F32 || silu->type != GGML_TYPE_F32 ||
+ gate_mul->src[0]->type != GGML_TYPE_F32 || gate_mul->src[1]->type != GGML_TYPE_F32 ||
+ gate_mul->type != GGML_TYPE_F32) {
+ return false;
+ }
+ // If rms_norm is the B operand of the weight mul, broadcast of A is unsupported.
+ if (rms_norm == mul->src[1] && !ggml_are_same_shape(mul->src[0], rms_norm)) {
+ return false;
+ }
+ // The fused kernel reads contiguous rows for the norm input, the weight,
+ // and the gate, and writes a contiguous output.
+ if (!ggml_is_contiguous_rows(rms_norm->src[0]) ||
+ !ggml_is_contiguous_rows(mul->src[0]) || !ggml_is_contiguous_rows(mul->src[1]) ||
+ !ggml_is_contiguous_rows(silu->src[0]) ||
+ !ggml_is_contiguous_rows(gate_mul->src[0]) || !ggml_is_contiguous_rows(gate_mul->src[1])) {
+ return false;
+ }
+ return true;
+ }
+
if (!ggml_can_fuse(cgraph, node_idx, ops)) {
return false;
}
@@ -4350,6 +4404,19 @@ static int ggml_cuda_try_fuse(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph
return 2;
}
+ // Fused gated RMS norm: RMS norm + weight multiply + SiLU-gated multiply
+ // (bit-exact). The Qwen3.6 gated-DeltaNet output norm. Default ON; set
+ // LLAMA_FUSE_GATE_RMSNORM=0 for a clean A/B against the unfused path.
+ static const bool fuse_gate_rmsnorm = [] {
+ const char * e = getenv("LLAMA_FUSE_GATE_RMSNORM");
+ return e == nullptr || atoi(e) != 0;
+ }();
+ if (fuse_gate_rmsnorm &&
+ ggml_cuda_can_fuse(cgraph, i, { GGML_OP_UNARY, GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_MUL }, { GGML_UNARY_OP_SILU })) {
+ ggml_cuda_op_rms_norm_gate_mul(*cuda_ctx, cgraph->nodes[i + 1], cgraph->nodes[i + 2], node, cgraph->nodes[i + 3]);
+ return 3;
+ }
+
if (ggml_cuda_can_fuse(cgraph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_ADD }, {})) {
ggml_cuda_op_rms_norm_fused_add(*cuda_ctx, node, cgraph->nodes[i + 1], cgraph->nodes[i + 2]);
return 2;
diff --git a/ggml/src/ggml-cuda/norm.cu b/ggml/src/ggml-cuda/norm.cu
index a07d02276..e776c67d2 100644
--- a/ggml/src/ggml-cuda/norm.cu
+++ b/ggml/src/ggml-cuda/norm.cu
@@ -235,6 +235,95 @@ static __global__ void rms_norm_pre_add_mul_f32(const float * a,
}
}
+// Fused gated RMS norm: RMS norm + weight multiply + SiLU gate multiply.
+// dst = (rsqrt(mean(x^2)+eps) * x * w) * silu(z) with silu(z) = z/(1+expf(-z))
+// This is the gated-DeltaNet output norm `self.norm(core_attn_out, z)` of the
+// Qwen3.6 hybrid models (build_norm_gated): rms_norm(x) scaled by the per-head
+// ssm_norm weight `w`, then gated by silu of the gate activation `z`. Unfused it
+// runs as rms_norm_mul (scale*x*w) -> silu(z) -> mul; fusing it keeps the
+// normalized intermediate in registers so it never round-trips to HBM.
+//
+// Bit-exactness: the sum(x^2) reduction uses the same block_reduce<SUM> with the
+// same 256/1024 block-size thresholds and the same rsqrtf(mean+eps) as rms_norm,
+// the weight multiply reproduces rms_norm_mul's `scale*x[col]*w[col]` order, and
+// silu reuses the exact `z/(1+expf(-z))` of ggml_cuda_op_silu_single. Float
+// multiply is commutative, so `(scale*x*w) * silu(z)` is byte-identical to the
+// unfused `mul(rms_norm_mul, silu(z))` (whether or not silu+mul was itself fused).
+// `w` (the RMS weight) and `z` (the gate) both broadcast via the packed-modulo path.
+template <int block_size>
+static __global__ void rms_norm_gate_mul_f32(const float * x,
+ float * dst,
+ const int ncols,
+ const int64_t stride_row,
+ const int64_t stride_channel,
+ const int64_t stride_sample,
+ const float eps,
+ const float * mul,
+ const int64_t mul_stride_row,
+ const int64_t mul_stride_channel,
+ const int64_t mul_stride_sample,
+ const uint3 mul_ncols_packed,
+ const uint3 mul_nrows_packed,
+ const uint3 mul_nchannels_packed,
+ const uint3 mul_nsamples_packed,
+ const float * gate,
+ const int64_t gate_stride_row,
+ const int64_t gate_stride_channel,
+ const int64_t gate_stride_sample,
+ const uint3 gate_ncols_packed,
+ const uint3 gate_nrows_packed,
+ const uint3 gate_nchannels_packed,
+ const uint3 gate_nsamples_packed) {
+ ggml_cuda_pdl_lc();
+ const int nrows = gridDim.x;
+ const int nchannels = gridDim.y;
+
+ const int row = blockIdx.x;
+ const int channel = blockIdx.y;
+ const int sample = blockIdx.z;
+ const int tid = threadIdx.x;
+
+ x += sample*stride_sample + channel*stride_channel + row*stride_row;
+ // dst is laid out contiguously by the scheduler for the (final) MUL output
+ dst += ((sample*nchannels + channel)*nrows + row)*ncols;
+
+ {
+ const uint32_t mul_row = fastmodulo(row, mul_nrows_packed);
+ const uint32_t mul_channel = fastmodulo(channel, mul_nchannels_packed);
+ const uint32_t mul_sample = fastmodulo(sample, mul_nsamples_packed);
+ mul += mul_sample * mul_stride_sample + mul_channel * mul_stride_channel + mul_row * mul_stride_row;
+ }
+ {
+ const uint32_t gate_row = fastmodulo(row, gate_nrows_packed);
+ const uint32_t gate_channel = fastmodulo(channel, gate_nchannels_packed);
+ const uint32_t gate_sample = fastmodulo(sample, gate_nsamples_packed);
+ gate += gate_sample * gate_stride_sample + gate_channel * gate_stride_channel + gate_row * gate_stride_row;
+ }
+
+ float tmp = 0.0f; // partial sum for thread in warp
+
+ ggml_cuda_pdl_sync();
+ for (int col = tid; col < ncols; col += block_size) {
+ const float xi = x[col];
+ tmp += xi * xi;
+ }
+
+ // sum up partial sums
+ extern __shared__ float s_sum[];
+ tmp = block_reduce<block_reduce_method::SUM, block_size>(tmp, s_sum);
+
+ const float mean = tmp / ncols;
+ const float scale = rsqrtf(mean + eps);
+
+ for (int col = tid; col < ncols; col += block_size) {
+ const int mul_col = fastmodulo(col, mul_ncols_packed);
+ const int gate_col = fastmodulo(col, gate_ncols_packed);
+ const float zi = gate[gate_col];
+ const float silu_z = zi / (1.0f + expf(-zi));
+ dst[col] = scale * x[col] * mul[mul_col] * silu_z;
+ }
+}
+
template <int block_size>
static __global__ void rms_norm_back_f32(
const float * grad, const float * xf, float * dst, const int ncols, const float eps) {
@@ -532,6 +621,65 @@ static void rms_norm_pre_add_mul_f32_cuda(const float * a,
}
}
+static void rms_norm_gate_mul_f32_cuda(const float * x,
+ float * dst,
+ const int ncols,
+ const int nrows,
+ const int nchannels,
+ const int nsamples,
+ const int64_t stride_row,
+ const int64_t stride_channel,
+ const int64_t stride_sample,
+ const float * mul,
+ const int64_t mul_stride_row,
+ const int64_t mul_stride_channel,
+ const int64_t mul_stride_sample,
+ const uint32_t mul_ncols,
+ const uint32_t mul_nrows,
+ const uint32_t mul_nchannels,
+ const uint32_t mul_nsamples,
+ const float * gate,
+ const int64_t gate_stride_row,
+ const int64_t gate_stride_channel,
+ const int64_t gate_stride_sample,
+ const uint32_t gate_ncols,
+ const uint32_t gate_nrows,
+ const uint32_t gate_nchannels,
+ const uint32_t gate_nsamples,
+ const float eps,
+ cudaStream_t stream) {
+ const dim3 blocks_num(nrows, nchannels, nsamples);
+ GGML_ASSERT(mul != nullptr);
+ GGML_ASSERT(gate != nullptr);
+ const uint3 mul_ncols_packed = init_fastdiv_values(mul_ncols);
+ const uint3 mul_nrows_packed = init_fastdiv_values(mul_nrows);
+ const uint3 mul_nchannels_packed = init_fastdiv_values(mul_nchannels);
+ const uint3 mul_nsamples_packed = init_fastdiv_values(mul_nsamples);
+ const uint3 gate_ncols_packed = init_fastdiv_values(gate_ncols);
+ const uint3 gate_nrows_packed = init_fastdiv_values(gate_nrows);
+ const uint3 gate_nchannels_packed = init_fastdiv_values(gate_nchannels);
+ const uint3 gate_nsamples_packed = init_fastdiv_values(gate_nsamples);
+ if (ncols < 1024) {
+ const dim3 block_dims(256, 1, 1);
+ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream};
+ ggml_cuda_kernel_launch(rms_norm_gate_mul_f32<256>, launch_params,
+ x, dst, ncols, stride_row, stride_channel, stride_sample, eps,
+ mul, mul_stride_row, mul_stride_channel, mul_stride_sample,
+ mul_ncols_packed, mul_nrows_packed, mul_nchannels_packed, mul_nsamples_packed,
+ gate, gate_stride_row, gate_stride_channel, gate_stride_sample,
+ gate_ncols_packed, gate_nrows_packed, gate_nchannels_packed, gate_nsamples_packed);
+ } else {
+ const dim3 block_dims(1024, 1, 1);
+ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream};
+ ggml_cuda_kernel_launch(rms_norm_gate_mul_f32<1024>, launch_params,
+ x, dst, ncols, stride_row, stride_channel, stride_sample, eps,
+ mul, mul_stride_row, mul_stride_channel, mul_stride_sample,
+ mul_ncols_packed, mul_nrows_packed, mul_nchannels_packed, mul_nsamples_packed,
+ gate, gate_stride_row, gate_stride_channel, gate_stride_sample,
+ gate_ncols_packed, gate_nrows_packed, gate_nchannels_packed, gate_nsamples_packed);
+ }
+}
+
static void rms_norm_back_f32_cuda(const float * grad, const float * xf, float * dst, const int ncols, const int nrows, const float eps, cudaStream_t stream) {
if (ncols < 1024) {
const dim3 block_dims(WARP_SIZE, 1, 1);
@@ -843,6 +991,73 @@ void ggml_cuda_op_rms_norm_pre_add_mul(ggml_backend_cuda_context & ctx,
eps, stream);
}
+void ggml_cuda_op_rms_norm_gate_mul(ggml_backend_cuda_context & ctx,
+ ggml_tensor * rms_norm_tensor,
+ ggml_tensor * mul_tensor,
+ ggml_tensor * silu_tensor,
+ ggml_tensor * gate_mul_tensor) {
+ // mul = rms_norm(x) * w ; silu = silu(z) ; gate_mul = mul * silu
+ GGML_ASSERT(mul_tensor->src[0] == rms_norm_tensor || mul_tensor->src[1] == rms_norm_tensor);
+ GGML_ASSERT(gate_mul_tensor->src[0] == silu_tensor || gate_mul_tensor->src[1] == silu_tensor);
+
+ const ggml_tensor * x_src = rms_norm_tensor->src[0];
+ const ggml_tensor * w_src = (mul_tensor->src[0] == rms_norm_tensor) ? mul_tensor->src[1] : mul_tensor->src[0];
+ const ggml_tensor * gate_src = silu_tensor->src[0];
+
+ float eps = 0.0f;
+ memcpy(&eps, rms_norm_tensor->op_params, sizeof(float));
+ GGML_ASSERT(eps >= 0.0f);
+
+ const float * x_d = (const float *) x_src->data;
+ const float * w_d = (const float *) w_src->data;
+ const float * gate_d = (const float *) gate_src->data;
+ float * dst_d = (float *) gate_mul_tensor->data;
+ cudaStream_t stream = ctx.stream();
+
+ GGML_ASSERT(x_src->type == GGML_TYPE_F32);
+ GGML_ASSERT(w_src->type == GGML_TYPE_F32);
+ GGML_ASSERT(gate_src->type == GGML_TYPE_F32);
+ GGML_ASSERT(rms_norm_tensor->type == GGML_TYPE_F32);
+ GGML_ASSERT(mul_tensor->type == GGML_TYPE_F32);
+ GGML_ASSERT(silu_tensor->type == GGML_TYPE_F32);
+ GGML_ASSERT(gate_mul_tensor->type == GGML_TYPE_F32);
+
+ const int64_t ne00 = rms_norm_tensor->ne[0];
+ const int64_t ne01 = rms_norm_tensor->ne[1];
+ const int64_t ne02 = rms_norm_tensor->ne[2];
+ const int64_t ne03 = rms_norm_tensor->ne[3];
+
+ // x (the rms-norm input) strides; cols must be contiguous
+ const size_t ts0 = ggml_type_size(x_src->type);
+ GGML_ASSERT(x_src->nb[0] == ts0);
+ const int64_t s01 = x_src->nb[1] / ts0;
+ const int64_t s02 = x_src->nb[2] / ts0;
+ const int64_t s03 = x_src->nb[3] / ts0;
+
+ // weight (the RMS scale) strides + broadcast extents
+ const size_t ts_mul = ggml_type_size(w_src->type);
+ GGML_ASSERT(w_src->nb[0] == ts_mul);
+ const int64_t mul_s01 = w_src->nb[1] / ts_mul;
+ const int64_t mul_s02 = w_src->nb[2] / ts_mul;
+ const int64_t mul_s03 = w_src->nb[3] / ts_mul;
+
+ // gate (the silu activation) strides + broadcast extents
+ const size_t ts_gate = ggml_type_size(gate_src->type);
+ GGML_ASSERT(gate_src->nb[0] == ts_gate);
+ const int64_t gate_s01 = gate_src->nb[1] / ts_gate;
+ const int64_t gate_s02 = gate_src->nb[2] / ts_gate;
+ const int64_t gate_s03 = gate_src->nb[3] / ts_gate;
+
+ rms_norm_gate_mul_f32_cuda(x_d, dst_d,
+ ne00, ne01, ne02, ne03,
+ /*s00*/ s01, s02, s03,
+ w_d, /*mul_s00*/ mul_s01, mul_s02, mul_s03,
+ w_src->ne[0], w_src->ne[1], w_src->ne[2], w_src->ne[3],
+ gate_d, /*gate_s00*/ gate_s01, gate_s02, gate_s03,
+ gate_src->ne[0], gate_src->ne[1], gate_src->ne[2], gate_src->ne[3],
+ eps, stream);
+}
+
void ggml_cuda_op_rms_norm_back(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
const ggml_tensor * grad = dst->src[0]; // gradients
const ggml_tensor * src0f = dst->src[1]; // src0 from forward pass
diff --git a/ggml/src/ggml-cuda/norm.cuh b/ggml/src/ggml-cuda/norm.cuh
index 05396cdf0..4d6dba6fa 100644
--- a/ggml/src/ggml-cuda/norm.cuh
+++ b/ggml/src/ggml-cuda/norm.cuh
@@ -17,6 +17,12 @@ void ggml_cuda_op_rms_norm_pre_add_mul(ggml_backend_cuda_context & ctx,
ggml_tensor * add_tensor,
ggml_tensor * rms_norm_tensor,
ggml_tensor * mul_tensor);
+void ggml_cuda_op_rms_norm_gate_mul(ggml_backend_cuda_context & ctx,
+ ggml_tensor * rms_norm_tensor,
+ ggml_tensor * mul_tensor,
+ ggml_tensor * silu_tensor,
+ ggml_tensor * gate_mul_tensor);
+
void ggml_cuda_op_rms_norm_back(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
index 66064869e..98751f7cc 100644
--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
@@ -1,4 +1,5 @@
#include "models.h"
+#include <cstdlib>
#include "llama-memory-recurrent.h"
void llama_model_qwen35::load_arch_hparams(llama_model_loader & ml) {
@@ -251,6 +252,21 @@ ggml_tensor * llama_model_qwen35::graph::build_norm_gated(
ggml_tensor * normalized = build_norm(input, weights, nullptr, LLM_NORM_RMS, layer);
ggml_tensor * gated_silu = ggml_silu(ctx0, gate);
+ // Emit the gate multiply as mul(silu(z), normalized) so the gated-DeltaNet
+ // output-norm chain forms the consecutive subgraph { SILU, RMS_NORM, MUL, MUL }
+ // that the CUDA backend fuses into one rms_norm_gate_mul kernel (the normalized
+ // intermediate then never round-trips to HBM). The gate z-projection is scheduled
+ // before the SILU, so the natural mul(normalized, silu) order leaves a GEMM
+ // between the weight MUL and the SILU and is not fusable. Multiplication is
+ // commutative, so this is bit-exact vs mul(normalized, silu).
+ // LLAMA_FUSE_GATE_RMSNORM=0 keeps the original operand order (kernel fusion off).
+ static const bool fuse_gate_rmsnorm = [] {
+ const char * e = getenv("LLAMA_FUSE_GATE_RMSNORM");
+ return e == nullptr || atoi(e) != 0;
+ }();
+ if (fuse_gate_rmsnorm) {
+ return ggml_mul(ctx0, gated_silu, normalized);
+ }
return ggml_mul(ctx0, normalized, gated_silu);
}
diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
index a79917628..071b88daa 100644
--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
@@ -1,4 +1,5 @@
#include "models.h"
+#include <cstdlib>
#include "llama-memory-recurrent.h"
void llama_model_qwen35moe::load_arch_hparams(llama_model_loader & ml) {
@@ -275,6 +276,21 @@ ggml_tensor * llama_model_qwen35moe::graph::build_norm_gated(
ggml_tensor * normalized = build_norm(input, weights, nullptr, LLM_NORM_RMS, layer);
ggml_tensor * gated_silu = ggml_silu(ctx0, gate);
+ // Emit the gate multiply as mul(silu(z), normalized) so the gated-DeltaNet
+ // output-norm chain forms the consecutive subgraph { SILU, RMS_NORM, MUL, MUL }
+ // that the CUDA backend fuses into one rms_norm_gate_mul kernel (the normalized
+ // intermediate then never round-trips to HBM). The gate z-projection is scheduled
+ // before the SILU, so the natural mul(normalized, silu) order leaves a GEMM
+ // between the weight MUL and the SILU and is not fusable. Multiplication is
+ // commutative, so this is bit-exact vs mul(normalized, silu).
+ // LLAMA_FUSE_GATE_RMSNORM=0 keeps the original operand order (kernel fusion off).
+ static const bool fuse_gate_rmsnorm = [] {
+ const char * e = getenv("LLAMA_FUSE_GATE_RMSNORM");
+ return e == nullptr || atoi(e) != 0;
+ }();
+ if (fuse_gate_rmsnorm) {
+ return ggml_mul(ctx0, gated_silu, normalized);
+ }
return ggml_mul(ctx0, normalized, gated_silu);
}
--
2.43.0