feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020)

Lever 1 is the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
(patches 0018 + 0019), dense decode sat at 255 t/s, 65% of vLLM's 391 t/s.
Profiling both engines pinned the largest llama-specific overage on the
gated-DeltaNet output projection (ssm_out).

The GDN op left its output in SSM layout and the graph reshaped it to 3D
[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
ssm_out weight read across the 128 sequences. vLLM packs the same projection into
one M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.

The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
routes to the MMQ M=128 tensor-core GEMM. The result is then already 2D, so the
redundant post-matmul reshape_2d is dropped. Same contiguous data, just a 2D vs
3D view: bit-identical. Gated to the gated-DeltaNet path (qwen35 / qwen35moe /
qwen3next); other archs untouched.

Greedy output (--temp 0 --seed 1) is byte- and md5-identical to the post-SSM
baseline on both q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE).
test-backend-ops MUL_MAT and MUL_MAT_ID OK.

decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
  dense q36-27b:     170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
  MoE   q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%, clearing the 82-85%
target).

nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) at a LOWER
per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
vs 2.77 ms/call for the old GEMV.

Mirrors DGX dev-tree commit df1cc97.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-25 10:41:38 +00:00
parent c0e0ed3865
commit b895f4dff8
2 changed files with 302 additions and 0 deletions


@@ -0,0 +1,225 @@
From df1cc97b68df048834ab735c944b71c3a2e8737e Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 12:40:49 +0200
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape
(patch 0020)
Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
both engines pinned the largest llama-specific overage to the gated-DeltaNet
OUTPUT projection (ssm_out).
The GDN op left its output in SSM layout and the graph reshaped it to 3D
[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
ssm_out weight read across the 128 sequences (one 5120x128 grid, 48 calls/step,
the 40%-vs-62% GPU-utilization gap). vLLM packs the same projection into one
M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.
The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
routes to the MMQ M=128 tensor-core GEMM (which amortizes the weight read across
all 128 tokens). The result is then already 2D, so the redundant post-matmul
reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical.
Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs
untouched.
Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
test-backend-ops MUL_MAT and MUL_MAT_ID OK.
decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
dense q36-27b: 170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
MoE q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).
nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) with a LOWER
per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
vs 2.77 ms/call for the old GEMV.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
LEVER1_OPROJ_MMQ_RESULTS.md | 77 +++++++++++++++++++++++++++++++++++++
src/models/qwen35.cpp | 13 ++++---
src/models/qwen35moe.cpp | 13 ++++---
src/models/qwen3next.cpp | 13 ++++---
4 files changed, 98 insertions(+), 18 deletions(-)
create mode 100644 LEVER1_OPROJ_MMQ_RESULTS.md
diff --git a/LEVER1_OPROJ_MMQ_RESULTS.md b/LEVER1_OPROJ_MMQ_RESULTS.md
new file mode 100644
index 0000000..9a5721f
--- /dev/null
+++ b/LEVER1_OPROJ_MMQ_RESULTS.md
@@ -0,0 +1,77 @@
+# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
+
+The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
+(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
+bit-exact tensor reshape that re-routes the per-layer SSM output projection
+from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
+
+## The mechanism (profiled, both engines)
+
+Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
+The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
+(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
+to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
+`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
+128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
+the ssm_out weight read across the 128 sequences. vLLM packs the same projection
+into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
+only the output projection was in 3D SSM layout.
+
+## The fix
+
+In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
+the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
+decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
+MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
+so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
+2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
+proven by the in-projection.
+
+```
+- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+ ...
+ cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+```
+
+## Gates (all PASS)
+
+- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
+ post-SSM baseline build:
+ - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
+ - MoE q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
+- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
+- Coherent dense + MoE output (greedy text inspected).
+
+## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
+
+S_TG t/s (decode aggregate):
+
+| model | npl | baseline | Lever 1 | delta |
+|------------------|-----|----------|---------|---------|
+| dense q36-27b | 32 | 170.52 | 200.00 | +17.3% |
+| dense q36-27b | 128 | 254.92 | 335.80 | +31.7% |
+| MoE q36-35b-a3b| 32 | 373.28 | 420.77 | +12.7% |
+| MoE q36-35b-a3b| 128 | 560.66 | 691.24 | +23.3% |
+
+Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
+up from 65% post-SSM).
+
+## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
+
+The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
+
+| kernel | baseline | Lever 1 |
+|-------------------------------------|--------------------|------------------|
+| mul_mat_vec_q<NVFP4, m=1> (o_proj) | 132.8 ms / 48 inst | 0 ms / 0 inst |
+| mul_mat_q<NVFP4, m=128> | 5463 ms / 8800 inst| 5827 ms /10000 inst|
+
+The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
+(+1200 instances, +363 ms over the window), and its per-call average DROPS
+(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
+than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
+~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
+old GEMV: the amortized weight read is the win.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
index 0be3247..0874c43 100644
--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
@@ -449,17 +449,18 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
// Apply gated normalization: self.norm(core_attn_out, z)
ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+ // data, just a 2D vs 3D view, so the result is bit-identical.
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cb(final_output, "final_output", il);
- // Output projection
+ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
cb(cur, "linear_attn_out", il);
- // Reshape back to original dimensions
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
return cur;
}
diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
index 2995f04..1f6f643 100644
--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
@@ -473,17 +473,18 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
// Apply gated normalization: self.norm(core_attn_out, z)
ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+ // data, just a 2D vs 3D view, so the result is bit-identical.
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cb(final_output, "final_output", il);
- // Output projection
+ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
cb(cur, "linear_attn_out", il);
- // Reshape back to original dimensions
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
return cur;
}
diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
index 97200a4..bfdf026 100644
--- a/src/models/qwen3next.cpp
+++ b/src/models/qwen3next.cpp
@@ -519,17 +519,18 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
// Apply gated normalization: self.norm(core_attn_out, z)
ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+ // data, just a 2D vs 3D view, so the result is bit-identical.
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cb(final_output, "final_output", il);
- // Output projection
+ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
cur = build_lora_mm(model.layers[il].ssm_out, final_output);
cb(cur, "linear_attn_out", il);
- // Reshape back to original dimensions
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
return cur;
}
--
2.43.0


@@ -0,0 +1,77 @@
# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
bit-exact tensor reshape that re-routes the per-layer SSM output projection
from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
## The mechanism (profiled, both engines)
Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
the ssm_out weight read across the 128 sequences. vLLM packs the same projection
into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
only the output projection was in 3D SSM layout.
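
The routing described above can be pictured with a toy model. This is a simplified sketch, not the ggml-cuda dispatch code: `Kernel`, `pick_kernel`, and `MMVQ_MAX_BATCH` are hypothetical names, and the real dispatch looks at more than one dimension. Only the `ne[1] <= 8` threshold and the two shapes are taken from the text.

```cpp
#include <array>
#include <cstdint>

// Toy model of the routing described above. NOT the real ggml-cuda code:
// Kernel, pick_kernel and MMVQ_MAX_BATCH are hypothetical names.
enum class Kernel { MMVQ, MMQ };

constexpr std::int64_t MMVQ_MAX_BATCH = 8;  // the "ne[1] <= 8" threshold from the text

// src1_ne is the activation shape in ggml order: ne[0] is the row length,
// ne[1] is the batch dimension the dispatch inspects, ne[2]/ne[3] are outer dims.
constexpr Kernel pick_kernel(const std::array<std::int64_t, 4> & src1_ne) {
    return src1_ne[1] <= MMVQ_MAX_BATCH ? Kernel::MMVQ : Kernel::MMQ;
}

// Decode with 128 sequences, one token each, value_dim = 6144.
constexpr std::array<std::int64_t, 4> shape_3d = {6144, 1, 128, 1};  // before the fix
constexpr std::array<std::int64_t, 4> shape_2d = {6144, 128, 1, 1};  // after the fix

static_assert(pick_kernel(shape_3d) == Kernel::MMVQ, "3D layout hides the batch in ne[2]");
static_assert(pick_kernel(shape_2d) == Kernel::MMQ,  "2D layout exposes M=128 in ne[1]");
```

In this model the element count is the same for both shapes; only the position of the 128 changes, which is what flips the kernel choice.
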
## The fix
In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
proven by the in-projection.
```
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
...
cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
```
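
The "same contiguous data" argument can be checked with plain index arithmetic. The sketch below assumes a contiguous tensor with `ne[0]` as the fastest-varying dimension; the helper names are illustrative and not part of ggml. It shows only that the 3D and 2D views address the same elements; the identical output is the measured result reported under Gates.

```cpp
#include <cstdint>

// Linear element offset in a contiguous tensor, with ne[0] fastest-varying.
// Helper names are illustrative only, not ggml functions.
constexpr std::int64_t offset_3d(std::int64_t i0, std::int64_t i1, std::int64_t i2,
                                 std::int64_t ne0, std::int64_t ne1) {
    return i0 + ne0 * (i1 + ne1 * i2);
}

constexpr std::int64_t offset_2d(std::int64_t i0, std::int64_t col, std::int64_t ne0) {
    return i0 + ne0 * col;
}

// Compare every element of a [value_dim, n_seq_tokens, n_seqs] view with the
// [value_dim, n_seq_tokens * n_seqs] view, using col = token + n_seq_tokens * seq.
constexpr bool views_alias(std::int64_t value_dim, std::int64_t n_seq_tokens,
                           std::int64_t n_seqs) {
    for (std::int64_t s = 0; s < n_seqs; ++s) {
        for (std::int64_t t = 0; t < n_seq_tokens; ++t) {
            for (std::int64_t i = 0; i < value_dim; ++i) {
                if (offset_3d(i, t, s, value_dim, n_seq_tokens) !=
                    offset_2d(i, t + n_seq_tokens * s, value_dim)) {
                    return false;
                }
            }
        }
    }
    return true;
}
```

Calling `views_alias(6144, 1, 128)` at runtime covers the decode shape quoted above; small shapes such as `views_alias(64, 2, 8)` are cheap enough to check in a `static_assert`.
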
## Gates (all PASS)
- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
post-SSM baseline build:
- dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
- MoE q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
- Coherent dense + MoE output (greedy text inspected).
## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
S_TG t/s (decode aggregate):

| model             | npl | baseline | Lever 1 | delta  |
|-------------------|-----|----------|---------|--------|
| dense q36-27b     | 32  | 170.52   | 200.00  | +17.3% |
| dense q36-27b     | 128 | 254.92   | 335.80  | +31.7% |
| MoE   q36-35b-a3b | 32  | 373.28   | 420.77  | +12.7% |
| MoE   q36-35b-a3b | 128 | 560.66   | 691.24  | +23.3% |

Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
up from 65% post-SSM).
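
The delta column and the vLLM ratio follow directly from the S_TG figures. A minimal sketch for re-deriving them; `pct_gain` and `pct_of` are hypothetical helper names, and the inputs are the table values above.

```cpp
// Percentage gain of `after` over `baseline`, as used for the delta column.
constexpr double pct_gain(double baseline, double after) {
    return (after / baseline - 1.0) * 100.0;
}

// `value` expressed as a percentage of `reference` (here: the vLLM 391 t/s figure).
constexpr double pct_of(double value, double reference) {
    return value / reference * 100.0;
}

// Dense q36-27b at npl = 128: 254.92 -> 335.80 t/s, against vLLM at 391 t/s.
constexpr double dense_128_gain    = pct_gain(254.92, 335.80);
constexpr double dense_128_vs_vllm = pct_of(335.80, 391.0);
```
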
## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:

| kernel                              | baseline            | Lever 1              |
|-------------------------------------|---------------------|----------------------|
| mul_mat_vec_q<NVFP4, m=1> (o_proj)  | 132.8 ms / 48 inst  | 0 ms / 0 inst        |
| mul_mat_q<NVFP4, m=128>             | 5463 ms / 8800 inst | 5827 ms / 10000 inst |

The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
(+1200 instances, +363 ms over the window), and its per-call average DROPS
(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
old GEMV: the amortized weight read is the win.
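
The per-call figures above can be reproduced from the table totals. A small sketch, with hypothetical helper names; note that the table rounds totals to whole milliseconds, so the marginal cost comes out at about 303 us here rather than the 363.5 ms / 1200 quoted from the unrounded totals.

```cpp
// Average kernel time in microseconds, from a total in ms and an instance count.
constexpr double us_per_call(double total_ms, double instances) {
    return total_ms * 1000.0 / instances;
}

// Marginal cost of the instances added between two profiles, in microseconds.
constexpr double marginal_us_per_call(double ms_before, double n_before,
                                      double ms_after, double n_after) {
    return (ms_after - ms_before) * 1000.0 / (n_after - n_before);
}

constexpr double mmq_avg_before = us_per_call(5463.0, 8800.0);   // about 620.8 us
constexpr double mmq_avg_after  = us_per_call(5827.0, 10000.0);  // about 582.7 us
constexpr double gemv_avg       = us_per_call(132.8, 48.0);      // about 2767 us
constexpr double oproj_marginal =
    marginal_us_per_call(5463.0, 8800.0, 5827.0, 10000.0);       // about 303 us
```
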
Assisted-by: Claude:opus-4.8 [Claude Code]