feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020)

Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
both engines pinned the largest llama-specific overage to the gated-DeltaNet
output projection (ssm_out).

The GDN op left its output in SSM layout and the graph reshaped it to 3D
[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
ssm_out weight read across the 128 sequences. vLLM packs the same projection
into one M=128 GEMM. The in-projection was already 2D -> MMQ; only the output
was 3D.

The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
routes to the MMQ M=128 tensor-core GEMM. The result is then already 2D, so the
redundant post-matmul reshape_2d is dropped. Same contiguous data, just a 2D vs
3D view: bit-identical. Gated to the gated-DeltaNet path (qwen35 / qwen35moe /
qwen3next); other archs untouched.

Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
test-backend-ops MUL_MAT and MUL_MAT_ID OK.

decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
  dense q36-27b:    170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
  MoE   q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).

nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) at a LOWER
per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
vs 2.77 ms/call for the old GEMV.

Mirrors DGX dev-tree commit df1cc97.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-25 09:09:07 -04:00 · 2026-06-25 10:41:38 +00:00
parent c0e0ed3865
commit b895f4dff8
2 changed files with 302 additions and 0 deletions
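The routing effect described in the commit message can be illustrated with a toy model. This is a sketch under stated assumptions, not the ggml-cuda dispatch code: `MatmulKernel`, `pick_kernel`, `kBefore` and `kAfter` are hypothetical names, and the only facts taken from the commit message are the `ne[1] <= 8` cutoff and the decode shapes (value_dim 6144, 1 token per sequence, 128 sequences). The real dispatch presumably checks more conditions than this single threshold.

```cpp
#include <array>
#include <cstdint>

// Toy model of the batch-size routing described in the commit message.
// Hypothetical names throughout; only the "ne[1] <= 8" cutoff is from the source.
enum class MatmulKernel { MMVQ, MMQ };

constexpr int64_t kMmvqMaxBatch = 8; // cutoff quoted in the commit message

// src1_ne is the activation shape in ggml order (ne[0] is the fastest-varying dim).
constexpr MatmulKernel pick_kernel(const std::array<int64_t, 4> & src1_ne) {
    return src1_ne[1] <= kMmvqMaxBatch ? MatmulKernel::MMVQ : MatmulKernel::MMQ;
}

// Total element count; a reshape must leave this unchanged.
constexpr int64_t n_elements(const std::array<int64_t, 4> & ne) {
    return ne[0] * ne[1] * ne[2] * ne[3];
}

// Decode shapes from the commit message.
constexpr std::array<int64_t, 4> kBefore = {6144, 1, 128, 1}; // reshape_3d view
constexpr std::array<int64_t, 4> kAfter  = {6144, 128, 1, 1}; // reshape_2d view

static_assert(n_elements(kBefore) == n_elements(kAfter), "same data, different view");
static_assert(pick_kernel(kBefore) == MatmulKernel::MMVQ, "3D view: batch-1 GEMV path");
static_assert(pick_kernel(kAfter)  == MatmulKernel::MMQ,  "2D view: M=128 GEMM path");
```

In this model the two views hold the same 6144 x 128 elements; only the dimension that carries the 128 sequences changes, and that dimension is what the threshold inspects.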
--- a/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
@@ -0,0 +1,225 @@
+From df1cc97b68df048834ab735c944b71c3a2e8737e Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 12:40:49 +0200
+Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape
+ (patch 0020)
+
+Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
+models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
+(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
+both engines pinned the largest llama-specific overage to the gated-DeltaNet
+OUTPUT projection (ssm_out).
+
+The GDN op left its output in SSM layout and the graph reshaped it to 3D
+[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
+src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
+sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
+ssm_out weight read across the 128 sequences (one 5120x128 grid, 48 calls/step,
+the 40%-vs-62% GPU-utilization gap). vLLM packs the same projection into one
+M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.
+
+The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
+(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
+routes to the MMQ M=128 tensor-core GEMM (which amortizes the weight read across
+all 128 tokens). The result is then already 2D, so the redundant post-matmul
+reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical.
+Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs
+untouched.
+
+Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
+q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
+test-backend-ops MUL_MAT and MUL_MAT_ID OK.
+
+decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
+  dense q36-27b:    170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
+  MoE   q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
+Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).
+
+nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
+to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) with a LOWER
+per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
+vs 2.77 ms/call for the old GEMV.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ LEVER1_OPROJ_MMQ_RESULTS.md | 77 +++++++++++++++++++++++++++++++++++++
+ src/models/qwen35.cpp       | 13 ++++---
+ src/models/qwen35moe.cpp    | 13 ++++---
+ src/models/qwen3next.cpp    | 13 ++++---
+ 4 files changed, 98 insertions(+), 18 deletions(-)
+ create mode 100644 LEVER1_OPROJ_MMQ_RESULTS.md
+
+diff --git a/LEVER1_OPROJ_MMQ_RESULTS.md b/LEVER1_OPROJ_MMQ_RESULTS.md
+new file mode 100644
+index 0000000..9a5721f
+--- /dev/null
+++ b/LEVER1_OPROJ_MMQ_RESULTS.md
+@@ -0,0 +1,77 @@
+# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
+
+The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
+(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
+bit-exact tensor reshape that re-routes the per-layer SSM output projection
+from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
+
+## The mechanism (profiled, both engines)
+
+Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
+The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
+(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
+to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
+`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
+128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
+the ssm_out weight read across the 128 sequences. vLLM packs the same projection
+into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
+only the output projection was in 3D SSM layout.
+
+## The fix
+
+In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
+the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
+decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
+MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
+so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
+2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
+proven by the in-projection.
+
+```
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     ...
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+```
+
+## Gates (all PASS)
+
+- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
+  post-SSM baseline build:
+  - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
+  - MoE   q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
+- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
+- Coherent dense + MoE output (greedy text inspected).
+
+## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
+
+S_TG t/s (decode aggregate):
+
+| model            | npl | baseline | Lever 1 | delta   |
+|------------------|-----|----------|---------|---------|
+| dense q36-27b    |  32 |   170.52 |  200.00 | +17.3%  |
+| dense q36-27b    | 128 |   254.92 |  335.80 | +31.7%  |
+| MoE   q36-35b-a3b|  32 |   373.28 |  420.77 | +12.7%  |
+| MoE   q36-35b-a3b| 128 |   560.66 |  691.24 | +23.3%  |
+
+Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
+up from 65% post-SSM).
+
+## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
+
+The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
+
+| kernel                              | baseline           | Lever 1          |
+|-------------------------------------|--------------------|------------------|
+| mul_mat_vec_q<NVFP4, m=1> (o_proj)  | 132.8 ms / 48 inst | 0 ms / 0 inst    |
+| mul_mat_q<NVFP4, m=128>             | 5463 ms / 8800 inst| 5827 ms /10000 inst|
+
+The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
+(+1200 instances, +363 ms over the window), and its per-call average DROPS
+(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
+than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
+~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
+old GEMV: the amortized weight read is the win.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
+index 0be3247..0874c43 100644
+--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
+@@ -449,17 +449,18 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
+     // Apply gated normalization: self.norm(core_attn_out, z)
+     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
+ 
+-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
++    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
++    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
++    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
++    // data, just a 2D vs 3D view, so the result is bit-identical.
++    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     cb(final_output, "final_output", il);
+ 
+-    // Output projection
++    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+     cb(cur, "linear_attn_out", il);
+ 
+-    // Reshape back to original dimensions
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+-
+     return cur;
+ }
+ 
+diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
+index 2995f04..1f6f643 100644
+--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
+@@ -473,17 +473,18 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
+     // Apply gated normalization: self.norm(core_attn_out, z)
+     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
+ 
+-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
++    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
++    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
++    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
++    // data, just a 2D vs 3D view, so the result is bit-identical.
++    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     cb(final_output, "final_output", il);
+ 
+-    // Output projection
++    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+     cb(cur, "linear_attn_out", il);
+ 
+-    // Reshape back to original dimensions
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+-
+     return cur;
+ }
+ 
+diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
+index 97200a4..bfdf026 100644
+--- a/src/models/qwen3next.cpp
+++ b/src/models/qwen3next.cpp
+@@ -519,17 +519,18 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
+     // Apply gated normalization: self.norm(core_attn_out, z)
+     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
+ 
+-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
++    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
++    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
++    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
++    // data, just a 2D vs 3D view, so the result is bit-identical.
++    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     cb(final_output, "final_output", il);
+ 
+-    // Output projection
++    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output);
+     cb(cur, "linear_attn_out", il);
+ 
+-    // Reshape back to original dimensions
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+-
+     return cur;
+ }
+ 
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md
@@ -0,0 +1,77 @@
+# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
+
+The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
+(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
+bit-exact tensor reshape that re-routes the per-layer SSM output projection
+from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
+
+## The mechanism (profiled, both engines)
+
+Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
+The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
+(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
+to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
+`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
+128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
+the ssm_out weight read across the 128 sequences. vLLM packs the same projection
+into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
+only the output projection was in 3D SSM layout.
+
+## The fix
+
+In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
+the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
+decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
+MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
+so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
+2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
+proven by the in-projection.
+
+```
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     ...
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+```
+
+## Gates (all PASS)
+
+- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
+  post-SSM baseline build:
+  - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
+  - MoE   q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
+- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
+- Coherent dense + MoE output (greedy text inspected).
+
+## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
+
+S_TG t/s (decode aggregate):
+
+| model            | npl | baseline | Lever 1 | delta   |
+|------------------|-----|----------|---------|---------|
+| dense q36-27b    |  32 |   170.52 |  200.00 | +17.3%  |
+| dense q36-27b    | 128 |   254.92 |  335.80 | +31.7%  |
+| MoE   q36-35b-a3b|  32 |   373.28 |  420.77 | +12.7%  |
+| MoE   q36-35b-a3b| 128 |   560.66 |  691.24 | +23.3%  |
+
+Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
+up from 65% post-SSM).
+
+## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
+
+The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
+
+| kernel                              | baseline           | Lever 1          |
+|-------------------------------------|--------------------|------------------|
+| mul_mat_vec_q<NVFP4, m=1> (o_proj)  | 132.8 ms / 48 inst | 0 ms / 0 inst    |
+| mul_mat_q<NVFP4, m=128>             | 5463 ms / 8800 inst| 5827 ms /10000 inst|
+
+The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
+(+1200 instances, +363 ms over the window), and its per-call average DROPS
+(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
+than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
+~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
+old GEMV: the amortized weight read is the win.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
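The percentages and per-call figures quoted in the results can be re-derived from the raw table values. The following is a minimal sketch for checking that arithmetic at compile time; the helper names `pct_gain`, `us_per_call` and `within` are hypothetical and are not part of the patch. The marginal-cost check uses the rounded table totals (5827 - 5463 = 364 ms), which differs slightly from the 363.5 ms quoted in the text, presumably because the table values are rounded.

```cpp
// Compile-time check of the figures reported in the results above.
// Helper names are hypothetical; all numeric inputs are copied from the tables.

// Relative throughput change in percent.
constexpr double pct_gain(double baseline, double patched) {
    return (patched / baseline - 1.0) * 100.0;
}

// Average kernel time per call in microseconds, from an nsys bucket.
constexpr double us_per_call(double total_ms, double instances) {
    return total_ms * 1000.0 / instances;
}

// True when value lies strictly inside expected +/- tol.
constexpr bool within(double value, double expected, double tol) {
    return value > expected - tol && value < expected + tol;
}

// decode_agg S_TG deltas
static_assert(within(pct_gain(170.52, 200.00), 17.3, 0.1), "dense npl=32");
static_assert(within(pct_gain(254.92, 335.80), 31.7, 0.1), "dense npl=128");
static_assert(within(pct_gain(373.28, 420.77), 12.7, 0.1), "MoE npl=32");
static_assert(within(pct_gain(560.66, 691.24), 23.3, 0.1), "MoE npl=128");

// Share of the vLLM reference throughput (391 t/s)
static_assert(within(335.80 / 391.0 * 100.0, 85.9, 0.1), "dense @128 vs vLLM");

// nsys per-call averages
static_assert(within(us_per_call(5463.0, 8800.0), 620.8, 0.1), "MMQ baseline");
static_assert(within(us_per_call(5827.0, 10000.0), 582.7, 0.1), "MMQ with Lever 1");
static_assert(within(us_per_call(132.8, 48.0), 2766.7, 0.1), "old o_proj GEMV");

// Marginal cost of the o_proj GEMMs added to the MMQ bucket (ms per call)
static_assert(within((5827.0 - 5463.0) / (10000.0 - 8800.0), 0.30, 0.01), "o_proj as MMQ");
```

Compiling the snippet with any C++14 compiler is enough; a failing `static_assert` would flag a figure that does not follow from the table values.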