From b895f4dff8ce0c076e9478b63ab04de335337a18 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Thu, 25 Jun 2026 10:41:38 +0000
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape
 (patch 0020)

Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
both engines pinned the largest llama-specific overage to the gated-DeltaNet
output projection (ssm_out).

The GDN op left its output in SSM layout and the graph reshaped it to 3D
[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
ssm_out weight read across the 128 sequences. vLLM packs the same projection
into one M=128 GEMM. The in-projection was already 2D -> MMQ; only the output
was 3D.

The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
routes to the MMQ M=128 tensor-core GEMM. The result is then already 2D, so the
redundant post-matmul reshape_2d is dropped. Same contiguous data, just a 2D vs
3D view: bit-identical. Gated to the gated-DeltaNet path (qwen35 / qwen35moe /
qwen3next); other archs untouched.

Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
test-backend-ops MUL_MAT and MUL_MAT_ID OK.

decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
  dense q36-27b:     170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
  MoE   q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).

nsys: the o_proj mul_mat_vec_q bucket (132.8 ms / 48 inst) collapses to zero;
mul_mat_q absorbs it (+1200 inst, +363 ms) at a LOWER per-call average
(620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call vs 2.77 ms/call
for the old GEMV.

Mirrors DGX dev-tree commit df1cc97.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../0020-qwen35-gdn-oproj-mmq-reshape.patch   | 225 ++++++++++++++++++
 .../patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md |  77 ++++++
 2 files changed, 302 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
 create mode 100644 backend/cpp/llama-cpp/patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md

diff --git a/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch b/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
new file mode 100644
index 000000000..811061137
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
@@ -0,0 +1,225 @@
+From df1cc97b68df048834ab735c944b71c3a2e8737e Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Thu, 25 Jun 2026 12:40:49 +0200
+Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape
+ (patch 0020)
+
+Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
+models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
+(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
+both engines pinned the largest llama-specific overage to the gated-DeltaNet
+OUTPUT projection (ssm_out).
+
+The GDN op left its output in SSM layout and the graph reshaped it to 3D
+[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
+src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
+sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
+ssm_out weight read across the 128 sequences (one 5120x128 grid, 48 calls/step,
+the 40%-vs-62% GPU-utilization gap). vLLM packs the same projection into one
+M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.
+
+The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
+(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
+routes to the MMQ M=128 tensor-core GEMM (which amortizes the weight read across
+all 128 tokens). The result is then already 2D, so the redundant post-matmul
+reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical.
+Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs
+untouched.
+
+Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
+q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
+test-backend-ops MUL_MAT and MUL_MAT_ID OK.
+
+decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
+  dense q36-27b:     170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
+  MoE   q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
+Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).
+
+nsys: the o_proj mul_mat_vec_q bucket (132.8 ms / 48 inst) collapses
+to zero; mul_mat_q absorbs it (+1200 inst, +363 ms) with a LOWER
+per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
+vs 2.77 ms/call for the old GEMV.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto
+---
+ LEVER1_OPROJ_MMQ_RESULTS.md | 77 +++++++++++++++++++++++++++++++++++++
+ src/models/qwen35.cpp       | 13 ++++---
+ src/models/qwen35moe.cpp    | 13 ++++---
+ src/models/qwen3next.cpp    | 13 ++++---
+ 4 files changed, 98 insertions(+), 18 deletions(-)
+ create mode 100644 LEVER1_OPROJ_MMQ_RESULTS.md
+
+diff --git a/LEVER1_OPROJ_MMQ_RESULTS.md b/LEVER1_OPROJ_MMQ_RESULTS.md
+new file mode 100644
+index 0000000..9a5721f
+--- /dev/null
++++ b/LEVER1_OPROJ_MMQ_RESULTS.md
+@@ -0,0 +1,77 @@
++# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
++
++The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
++(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
++bit-exact tensor reshape that re-routes the per-layer SSM output projection
++from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
++
++## The mechanism (profiled, both engines)
++
++Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
++The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
++(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
++to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
++`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
++128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
++the ssm_out weight read across the 128 sequences. vLLM packs the same projection
++into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
++only the output projection was in 3D SSM layout.
++
++## The fix
++
++In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
++the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
++decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
++MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
++so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
++2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
++proven by the in-projection.
++
++```
++- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+++ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
++  ...
++  cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
++- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
++```
++
++## Gates (all PASS)
++
++- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
++  post-SSM baseline build:
++  - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
++  - MoE q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
++- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
++- Coherent dense + MoE output (greedy text inspected).
++
++## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
++
++S_TG t/s (decode aggregate):
++
++| model            | npl | baseline | Lever 1 | delta   |
++|------------------|-----|----------|---------|---------|
++| dense q36-27b    | 32  | 170.52   | 200.00  | +17.3%  |
++| dense q36-27b    | 128 | 254.92   | 335.80  | +31.7%  |
++| MoE   q36-35b-a3b| 32  | 373.28   | 420.77  | +12.7%  |
++| MoE   q36-35b-a3b| 128 | 560.66   | 691.24  | +23.3%  |
++
++Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
++up from 65% post-SSM).
++
++## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
++
++The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
++
++| kernel                              | baseline           | Lever 1          |
++|-------------------------------------|--------------------|------------------|
++| mul_mat_vec_q (o_proj)              | 132.8 ms / 48 inst | 0 ms / 0 inst    |
++| mul_mat_q                           | 5463 ms / 8800 inst| 5827 ms /10000 inst|
++
++The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
++(+1200 instances, +363 ms over the window), and its per-call average DROPS
++(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
++than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
++~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
++old GEMV: the amortized weight read is the win.
++
++Assisted-by: Claude:opus-4.8 [Claude Code]
+diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
+index 0be3247..0874c43 100644
+--- a/src/models/qwen35.cpp
++++ b/src/models/qwen35.cpp
+@@ -449,17 +449,18 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
+     // Apply gated normalization: self.norm(core_attn_out, z)
+     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
+ 
+-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
++    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
++    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
++    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
++    // data, just a 2D vs 3D view, so the result is bit-identical.
++    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     cb(final_output, "final_output", il);
+ 
+-    // Output projection
++    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+     cb(cur, "linear_attn_out", il);
+ 
+-    // Reshape back to original dimensions
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+-
+     return cur;
+ }
+ 
+diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
+index 2995f04..1f6f643 100644
+--- a/src/models/qwen35moe.cpp
++++ b/src/models/qwen35moe.cpp
+@@ -473,17 +473,18 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
+     // Apply gated normalization: self.norm(core_attn_out, z)
+     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
+ 
+-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
++    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
++    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
++    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
++    // data, just a 2D vs 3D view, so the result is bit-identical.
++    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     cb(final_output, "final_output", il);
+ 
+-    // Output projection
++    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+     cb(cur, "linear_attn_out", il);
+ 
+-    // Reshape back to original dimensions
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+-
+     return cur;
+ }
+ 
+diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
+index 97200a4..bfdf026 100644
+--- a/src/models/qwen3next.cpp
++++ b/src/models/qwen3next.cpp
+@@ -519,17 +519,18 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
+     // Apply gated normalization: self.norm(core_attn_out, z)
+     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
+ 
+-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
++    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
++    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
++    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
++    // data, just a 2D vs 3D view, so the result is bit-identical.
++    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     cb(final_output, "final_output", il);
+ 
+-    // Output projection
++    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output);
+     cb(cur, "linear_attn_out", il);
+ 
+-    // Reshape back to original dimensions
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+-
+     return cur;
+ }
+ 
+-- 
+2.43.0
+
diff --git a/backend/cpp/llama-cpp/patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md b/backend/cpp/llama-cpp/patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md
new file mode 100644
index 000000000..9a5721f28
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md
@@ -0,0 +1,77 @@
+# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
+
+The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
+(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
+bit-exact tensor reshape that re-routes the per-layer SSM output projection
+from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
+
+## The mechanism (profiled, both engines)
+
+Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
+The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
+(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
+to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
+`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
+128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
+the ssm_out weight read across the 128 sequences. vLLM packs the same projection
+into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
+only the output projection was in 3D SSM layout.
+
+## The fix
+
+In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
+the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
+decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
+MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
+so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
+2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
+proven by the in-projection.
+
+```
+- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+  ...
+  cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+```
+
+## Gates (all PASS)
+
+- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
+  post-SSM baseline build:
+  - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
+  - MoE q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
+- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
+- Coherent dense + MoE output (greedy text inspected).
+
+## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
+
+S_TG t/s (decode aggregate):
+
+| model            | npl | baseline | Lever 1 | delta   |
+|------------------|-----|----------|---------|---------|
+| dense q36-27b    | 32  | 170.52   | 200.00  | +17.3%  |
+| dense q36-27b    | 128 | 254.92   | 335.80  | +31.7%  |
+| MoE   q36-35b-a3b| 32  | 373.28   | 420.77  | +12.7%  |
+| MoE   q36-35b-a3b| 128 | 560.66   | 691.24  | +23.3%  |
+
+Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
+up from 65% post-SSM).
+
+## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
+
+The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
+
+| kernel                              | baseline           | Lever 1          |
+|-------------------------------------|--------------------|------------------|
+| mul_mat_vec_q (o_proj)              | 132.8 ms / 48 inst | 0 ms / 0 inst    |
+| mul_mat_q                           | 5463 ms / 8800 inst| 5827 ms /10000 inst|
+
+The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
+(+1200 instances, +363 ms over the window), and its per-call average DROPS
+(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
+than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
+~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
+old GEMV: the amortized weight read is the win.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]