diff --git a/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch b/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch index 0a57d5270..a7e653d70 100644 --- a/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch +++ b/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch @@ -54,7 +54,6 @@ is now the FP4 GEMM (~48 percent of decode), a separate kernel track. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- - SSM_DECODE_FIX_RESULTS.md | 86 +++++++++++++++++++++++++++ ggml/include/ggml.h | 17 ++++++ ggml/src/ggml-cpu/ops.cpp | 49 ++++++++++++++- ggml/src/ggml-cuda/gated_delta_net.cu | 85 ++++++++++++++++++++++---- @@ -63,102 +62,8 @@ Signed-off-by: Ettore Di Giacinto src/models/models.h | 13 ++++ src/models/qwen35.cpp | 6 +- src/models/qwen35moe.cpp | 6 +- - 9 files changed, 378 insertions(+), 23 deletions(-) + 8 files changed, 292 insertions(+), 23 deletions(-) -diff --git a/SSM_DECODE_FIX_RESULTS.md b/SSM_DECODE_FIX_RESULTS.md -index 2e7c8c2..77879e4 100644 ---- a/SSM_DECODE_FIX_RESULTS.md -+++ b/SSM_DECODE_FIX_RESULTS.md -@@ -96,3 +96,89 @@ precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLL - after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel - track. No paged/graph/block-table change can move decode on this model (full - attention is 0.4% of decode). -+ -+## STEP 2 (patch 0019): fuse the recurrent-state gather into the op -+ -+After Step 1 the largest non-GEMM decode bucket was the recurrent-state -+`get_rows` gather (18.8% of decode GPU time): `build_rs` materialized each -+sequence's prior state into a contiguous scratch via `ggml_get_rows` before the -+gated-DeltaNet op read it. Step 2 eliminates that materialization, mirroring -+`ggml_ssm_scan`'s `ids` source. 
-+ -+`ggml_gated_delta_net_inplace_ids` takes the FULL recurrent-state cache plus the -+`s_copy` ids (`src[5]` = full cache `[S_v, S_v, H, n_rs_slots]`, `src[7]` = ids, -+`op_param[1]` = `rs_head`) and reads each sequence's prior state directly from -+`cache[ids[seq]]`. Combined with Step 1's in-place write the op now reads AND -+writes the cache directly: no recurrent-state materialization at all. The -+`build_recurrent_attn` fused path feeds the full cache and ids through the -+`build_rs` `get_state_rows` lambda exactly like `mamba-base.cpp`, keeping the -+`rs_zero` clear and the extra-states copy around the op. -+ -+### Race-free by construction (CUDA) -+ -+In-place write plus an ids read of the same cache is only safe when the read slot -+equals the write slot. `s_copy(s) = cells[s + head].src0`, which is identity -+(`rs_head + s`) for stable continuing sequences (the entire AR decode path) but -+can remap on sequence reorder or `rs_zero` (e.g. multiple new sequences in one -+prefill ubatch). The kernel handles both per (seq, head) block on device: -+ -+- identity sequences read `s0` in place from the destination slot `state_dst` -+ (the kernel loads all of `s0` into registers before it writes the new state, -+ so reading and writing the same slot is race-free) -- no materialization; -+- non-identity sequences read from a disjoint scratch that a small -+ `gdn_gather_nonident_kernel` copies from `cache[ids[seq]]` first, so the -+ recurrence never reads a slot another block writes. -+ -+`ids` stays a device pointer (dereferenced only in the kernels; the input is -+device-resident at op-execute time, so a host read segfaults). The CPU op -+mirrors the same logic (host identity check + a serial gather in the dispatcher -+for the non-identity case). The math is unchanged, so the result is bit-identical -+to the `get_rows` path in every case. 
-+ -+Gated to the `qwen35` / `qwen35moe` fused decode/prefill path; `qwen3next`, -+`kimi-linear`, the non-fused path and the rollback (`n_rs_seq > 0`) path are -+untouched (they keep the materialized-state overload). -+ -+### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off) -+ -+Dense `q36-27b-nvfp4`: -+ -+| npl | Step 1 (baseline) | Step 2 | delta | % of vLLM (391 @128) | -+|-----|-------------------|----------|---------|----------------------| -+| 32 | 137.64 | 170.68 | +24.0% | - | -+| 128 | 186.25 | 256.57 | +37.8% | 47.6% -> 65.6% | -+ -+The npl-128 result (256.57 t/s) beats the predicted ~247 t/s Step-2 ceiling. -+ -+MoE `q36-35b-a3b-nvfp4`: -+ -+| npl | Step 1 (baseline) | Step 2 | delta | -+|-----|-------------------|----------|---------| -+| 32 | 299.68 | 366.69 | +22.4% | -+| 128 | 409.30 | 553.63 | +35.3% | -+ -+(Step-1 baselines re-measured in the same session; the brief's reference figures -+were 136 / 180 dense and 279 / 373 MoE.) -+ -+### Bit-exact gate -+ -+Greedy (`--temp 0 --seed 1`) `llama-completion` output (256 tokens, paged on, -+fusion off) vs the Step-1 build: -+ -+- dense `q36-27b-nvfp4`: model text byte-identical (md5 match); -+- MoE `q36-35b-a3b-nvfp4`: byte-identical; -+- Step-2 dense run1 == run2 (deterministic, no race). -+ -+### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager) -+ -+The recurrent-state gather bucket collapsed: -+ -+| kernel | Step 1 | Step 2 | -+|----------------------------|----------|-----------------------------------------| -+| `k_get_rows_float` | 18.8% | 0.7% (residual: embeddings / conv-state)| -+| `gdn_gather_nonident` | - | 1.7% (no-op at decode, median ~1.2 us) | -+| `gated_delta_net_cuda` | 26.0% | 22.5% | -+| FP4 GEMM family | ~37.5% | ~48% (now the dominant residual) | -+ -+The SSM state gather is effectively eliminated. The residual decode gap to vLLM -+is now the FP4 GEMM (~48% of decode), a separate kernel track. 
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h index 4e7ab32..951dd21 100644 --- a/ggml/include/ggml.h diff --git a/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch b/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch index 811061137..67333913c 100644 --- a/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch +++ b/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch @@ -43,96 +43,11 @@ vs 2.77 ms/call for the old GEMV. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- - LEVER1_OPROJ_MMQ_RESULTS.md | 77 +++++++++++++++++++++++++++++++++++++ src/models/qwen35.cpp | 13 ++++--- src/models/qwen35moe.cpp | 13 ++++--- src/models/qwen3next.cpp | 13 ++++--- - 4 files changed, 98 insertions(+), 18 deletions(-) - create mode 100644 LEVER1_OPROJ_MMQ_RESULTS.md + 3 files changed, 21 insertions(+), 18 deletions(-) -diff --git a/LEVER1_OPROJ_MMQ_RESULTS.md b/LEVER1_OPROJ_MMQ_RESULTS.md -new file mode 100644 -index 0000000..9a5721f ---- /dev/null -+++ b/LEVER1_OPROJ_MMQ_RESULTS.md -@@ -0,0 +1,77 @@ -+# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020) -+ -+The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models -+(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line, -+bit-exact tensor reshape that re-routes the per-layer SSM output projection -+from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ). -+ -+## The mechanism (profiled, both engines) -+ -+Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391. -+The largest llama-specific overage was the gated-DeltaNet OUTPUT projection -+(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it -+to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so -+`src1->ne[1] = 1`. 
That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the -+128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize -+the ssm_out weight read across the 128 sequences. vLLM packs the same projection -+into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ); -+only the output projection was in 3D SSM layout. -+ -+## The fix -+ -+In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse -+the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at -+decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the -+MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`, -+so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a -+2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already -+proven by the in-projection. -+ -+``` -+- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs); -++ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs); -+ ... -+ cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s); -+- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs); -+``` -+ -+## Gates (all PASS) -+ -+- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the -+ post-SSM baseline build: -+ - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL) -+ - MoE q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL) -+- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK -+- Coherent dense + MoE output (greedy text inspected). 
-+ -+## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000) -+ -+S_TG t/s (decode aggregate): -+ -+| model | npl | baseline | Lever 1 | delta | -+|------------------|-----|----------|---------|---------| -+| dense q36-27b | 32 | 170.52 | 200.00 | +17.3% | -+| dense q36-27b | 128 | 254.92 | 335.80 | +31.7% | -+| MoE q36-35b-a3b| 32 | 373.28 | 420.77 | +12.7% | -+| MoE q36-35b-a3b| 128 | 560.66 | 691.24 | +23.3% | -+ -+Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded; -+up from 65% post-SSM). -+ -+## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128) -+ -+The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128: -+ -+| kernel | baseline | Lever 1 | -+|-------------------------------------|--------------------|------------------| -+| mul_mat_vec_q (o_proj) | 132.8 ms / 48 inst | 0 ms / 0 inst | -+| mul_mat_q | 5463 ms / 8800 inst| 5827 ms /10000 inst| -+ -+The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it -+(+1200 instances, +363 ms over the window), and its per-call average DROPS -+(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper -+than the average projection GEMM. Realized o_proj-as-MMQ marginal cost -+~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the -+old GEMV: the amortized weight read is the win. -+ -+Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp index 0be3247..0874c43 100644 --- a/src/models/qwen35.cpp diff --git a/backend/cpp/llama-cpp/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch b/backend/cpp/llama-cpp/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch index a7f0c7d41..f61183cde 100644 --- a/backend/cpp/llama-cpp/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch +++ b/backend/cpp/llama-cpp/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch @@ -56,7 +56,6 @@ conv-cache plumbing. 
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- - CONV_STATE_FUSION_RESULTS.md | 106 +++++++++++++++++++++++++++++++ ggml/include/ggml.h | 16 +++++ ggml/src/ggml-cpu/ops.cpp | 73 ++++++++++++++++++++- ggml/src/ggml-cuda/ssm-conv.cu | 112 +++++++++++++++++++++++++++++++++ @@ -67,121 +66,8 @@ Signed-off-by: Ettore Di Giacinto src/models/qwen35moe.cpp | 23 +++++-- src/models/qwen3next.cpp | 29 ++++++--- tests/test-backend-ops.cpp | 47 ++++++++++++++ - 11 files changed, 526 insertions(+), 22 deletions(-) - create mode 100644 CONV_STATE_FUSION_RESULTS.md + 10 files changed, 420 insertions(+), 22 deletions(-) -diff --git a/CONV_STATE_FUSION_RESULTS.md b/CONV_STATE_FUSION_RESULTS.md -new file mode 100644 -index 0000000..f59b6e5 ---- /dev/null -+++ b/CONV_STATE_FUSION_RESULTS.md -@@ -0,0 +1,106 @@ -+# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact) -+ -+The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3). -+After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at -+the f32 byte floor), the conv path was the only remaining bit-exact decode lever. 
-+ -+## What changed -+ -+A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a -+non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain: -+ -+ qkv_mixed transpose -> ggml_concat (build width-K window) [concat_cont 8.14 ms/step] -+ -> ggml_ssm_conv (depthwise conv) [ssm_conv_f32 ~8.6 ms/step] -+ -> ggml_silu [folded into ssm_conv on CUDA] -+ -> ggml_cpy of the shifted ring state into the conv cache [cpy_scalar 5.76 ms/step] -+ -+with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from -+the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME -+ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes -+the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot -+the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is -+vLLM's `causal_conv1d_update`. -+ -+Files: -+- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace` -+ (src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs], -+ src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu). -+- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32` (one thread per -+ (channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of -+ `ggml_cuda_op_ssm_conv`. -+- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over -+ channels) + branch in `ggml_compute_forward_ssm_conv`. -+- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs -+ conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write -+ target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed. 
-+- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path -+ (`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/ -+ rollback keep the existing concat+ssm_conv+silu+cpy chain. -+- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output -+ vs the CPU reference across backends. -+ -+## Gate: test-backend-ops (CUDA0 vs CPU reference) -+ -+- SSM_CONV: 45/45 OK (unchanged path intact) -+- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128) -+- SSM_CONV_BIAS_SILU: 90/90 OK -+ -+## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on) -+ -+Byte-identical to the clean Lever-1 (0019/0020) baseline, both models: -+ -+| model | baseline md5 | fused md5 | result | -+|--------------------|----------------------------------|----------------------------------|-----------------| -+| q36-27b-nvfp4 | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL | -+| q36-35b-a3b-nvfp4 | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL | -+ -+## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after -+ -+Dense q36-27b-nvfp4: -+ -+| mode | npl | baseline | fused | delta | -+|-----------|-----|----------|--------|---------| -+| CUDA-graph| 32 | 199.76 | 202.99 | +1.6% | -+| CUDA-graph| 128 | 336.35 | 347.14 | +3.2% | -+| eager | 32 | 196.07 | 197.61 | +0.8% | -+| eager | 128 | 333.62 | 342.97 | +2.8% | -+ -+MoE q36-35b-a3b-nvfp4: -+ -+| mode | npl | baseline | fused | delta | -+|-----------|-----|----------|--------|---------| -+| CUDA-graph| 32 | 421.72 | 432.39 | +2.5% | -+| CUDA-graph| 128 | 689.74 | 713.54 | +3.5% | -+| eager | 32 | 421.05 | 432.46 | +2.7% | -+| eager | 128 | 689.15 | 713.87 | +3.6% | -+ -+Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at -+**88.8% of 
vLLM 391** (up from 86.0%). The lift holds in BOTH graph and eager modes. -+ -+## Step time + nsys kernel delta -+ -+Per-step decode time (dense npl128, T_TG / ntg=128): -+- baseline 48.711 s / 128 = 380.6 ms/step -+- fused 47.197 s / 128 = 368.7 ms/step -> **-11.9 ms/step** (matches the predicted +12-14 ms) -+- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step) -+ -+nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels: -+ -+| kernel | baseline calls | fused calls | per-step (eager) | -+|---------------------|----------------|-------------|------------------| -+| concat_cont (decode)| 1152 | 0 (GONE) | 7.95 -> 0 ms | -+| cpy_scalar (decode) | 1152 of 3648 | 0 (GONE) | 4.29 -> 0 ms | -+| ssm_conv_f32 (decode)| 1152 of 2736 | 0 (prefill-only) | 8.65 -> 0 ms | -+| ssm_conv_update | 0 | 1152 | 0 -> 7.56 ms | -+ -+Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont -+and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel. -+prefill keeps the original chain (concat_non_cont 1584, ssm_conv_f32 1584 unchanged). -+ -+## Verdict -+ -+Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM -+391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free; -+de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches. 
-+ -+Assisted-by: Claude:opus-4.8 [Claude Code] diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h index 951dd21..76fa401 100644 --- a/ggml/include/ggml.h diff --git a/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch b/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch index e29f38c4b..a37395f92 100644 --- a/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch +++ b/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch @@ -46,219 +46,14 @@ MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- - LEVER1_GATHER_PROGRESS.md | 26 ++++++ - LEVER1_GATHER_RESULTS.md | 163 +++++++++++++++++++++++++++++++++ ggml/include/ggml.h | 20 ++++ ggml/src/ggml-cpu/ops.cpp | 90 +++++++++++++++++- ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++- ggml/src/ggml.c | 62 +++++++++++++ src/models/delta-net-base.cpp | 26 ++++-- tests/test-backend-ops.cpp | 69 ++++++++++++++ - 8 files changed, 600 insertions(+), 11 deletions(-) - create mode 100644 LEVER1_GATHER_PROGRESS.md - create mode 100644 LEVER1_GATHER_RESULTS.md + 6 files changed, 411 insertions(+), 11 deletions(-) -diff --git a/LEVER1_GATHER_PROGRESS.md b/LEVER1_GATHER_PROGRESS.md -new file mode 100644 -index 0000000..e4d14b9 ---- /dev/null -+++ b/LEVER1_GATHER_PROGRESS.md -@@ -0,0 +1,26 @@ -+# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE -+ -+STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees. -+ -+## What -+Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV -+update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) + -+0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. 
Bit-exact by construction -+(read path gather -> indexed in-kernel read; values + reduction order unchanged). -+ -+## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026) -+- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline; -+ MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical). -+- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45, -+ GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK. -+ -+## Bench (S_TG t/s, npp128 ntg128 npl 32/128) -+- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39. -+- MoE npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56. -+- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms; -+ step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel. -+ -+## Artifacts -+- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree) -+- Docs: LEVER1_GATHER_RESULTS.md (full bench tables) -+- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt -diff --git a/LEVER1_GATHER_RESULTS.md b/LEVER1_GATHER_RESULTS.md -new file mode 100644 -index 0000000..afced02 ---- /dev/null -+++ b/LEVER1_GATHER_RESULTS.md -@@ -0,0 +1,163 @@ -+# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact) -+ -+The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest -+kernel vLLM has no equivalent of (~5.2 ms/step MoE decode; also present in dense): vLLM updates its -+gated-DeltaNet recurrent state in-place inside the fused decode kernel, while llama ran a separate -+`ggml_get_rows` gather. Patch 0019 fused the SSM recurrent-state gather; patch 0021 fused the conv -+compute/write-back but KEPT a `build_rs` gather for the conv taps ("tiny; not one of the eliminated -+buckets"). 
This patch closes that residual. -+ -+## Which gather was still firing (nsys-located, DGX GB10 sm_121) -+ -+Profiled MoE `q36-35b-a3b-nvfp4` at batch-128 decode (`llama-batched-bench -npp128 -ntg24 -npl128 -+-fa on`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`). The decode-window `k_get_rows_float` -+distribution was bimodal: a BIG cluster of **~720 instances (= 30 GDN layers x 24 decode steps) at -+~115 us each** plus small embedding/router gathers. -+ -+The big gather's geometry (`grid=(ne10=128, block_num_y=96, 1)`) decodes to **128 rows (= n_seqs -+active sequences) of ne00 = 24576 floats**. With the model's real dims (`d_conv=4, d_inner=4096, -+n_group=16, d_state=128`): -+- `n_embd_r = (d_conv-1) * (d_inner + 2*n_group*d_state) = 3 * 8192 = 24576` -> `block_num_y=96` EXACT match. -+- `n_embd_s = d_state * d_inner = 524288` (the SSM state, gridY 2048 - already fused by 0019). -+ -+So the residual `k_get_rows` is the **conv-state tap gather** in `build_conv_state_fused` -+(`src/models/delta-net-base.cpp`), which called the plain 4-arg `build_rs` -> `ggml_get_rows` of the -+24576-float conv-state row x 128 sequences, once per GDN layer per decode step (~3.4 ms/step here, -+~5.2 ms/step at steady ntg=128). The SSM-state gather is already fused, so this conv gather is the -+last `k_get_rows` in the GDN decode path. -+ -+## What changed (mirror of the 0019 SSM gather fusion; bit-exact by construction) -+ -+New op `ggml_ssm_conv_update_inplace_ids` (reuses `GGML_OP_SSM_CONV`, discriminated by a non-null -+`src[4]` = ids). Instead of a pre-gathered tap scratch, it takes the FULL conv-state cache (`src[0]`) -+plus the per-sequence `ids` (= the recurrent-state `s_copy`, `src[4]`; `op_params[1]=rs_head`) and -+reads each active sequence's prior K-1 taps directly from `cache[ids[s]]` in the kernel. This removes -+the separate `k_get_rows` launch. 
-+ -+Race-free, exactly mirroring 0019: -+- **Identity** sequences (`ids[s] == rs_head + s`, the whole AR-decode path) read the taps in place -+ from the `conv_state_dst` write slot. The kernel loads the full conv window into registers before -+ it writes the 1-token-shifted ring back, so read==write slot is race-free per (channel, seq) thread. -+- **Non-identity** sequences (reorder / `rs_zero` remap at a prefill->decode boundary) are gathered -+ into a disjoint scratch by a small `ssm_conv_gather_nonident_kernel` first (no-op at steady decode), -+ so the update kernel never reads a slot another block writes. -+ -+The read VALUES are unchanged (identity in-place taps == the gathered taps == `cache[ids[s]]`); only -+the read PATH changes from a `ggml_get_rows` materialization to an indexed in-kernel read. The conv -+math, ascending-tap FMA order, silu and the ring write-back are byte-identical to 0021. -+ -+Files: -+- `ggml/include/ggml.h`, `ggml/src/ggml.c`: `ggml_ssm_conv_update_inplace_ids` builder -+ (src[0]=full cache [K-1,channels,n_cells], src[1]=conv_kernel, src[2]=x_cur, src[3]=conv_state_dst, -+ src[4]=ids; op_params[0]=fuse_silu, op_params[1]=rs_head). -+- `ggml/src/ggml-cuda/ssm-conv.cu`: `ssm_conv_gather_nonident_kernel` + `ssm_conv_update_ids_f32` -+ kernel + `ggml_cuda_op_ssm_conv_update_ids` + a `src[4]`-discriminated branch in `ggml_cuda_op_ssm_conv`. -+- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_ids_f32` (window copied to a -+ local before the possibly-aliasing write) + dispatch branch. -+- `src/models/delta-net-base.cpp`: `build_conv_state_fused` now feeds the FULL cache + ids through the -+ `build_rs` `get_state_rows` lambda (the rs_zero clear + extra-states copy still run around it), -+ exactly like the 0019 recurrent-attn fusion. The `qwen35` / `qwen35moe` / `qwen3next` callers are -+ unchanged (they already route the single-token decode path here). 
-+- `tests/test-backend-ops.cpp`: `test_ssm_conv_update_ids` (16 cases) - ids = a shuffled permutation -+ with `rs_head=0`, so each case exercises BOTH the identity in-place read and the non-identity cache -+ read; validates the conv+silu output vs the CPU reference. -+ -+## GATE: test-backend-ops (CUDA0 vs CPU, 2/2 backends) -+ -+- SSM_CONV_UPDATE_IDS: OK (NEW; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128) -+- SSM_CONV_UPDATE: OK (0021 path intact) -+- SSM_CONV: OK -+- GATED_DELTA_NET: OK -+- GET_ROWS: OK -+ -+## GATE: greedy bit-exactness (--temp 0 --seed 1 -n 48, -fa on) - BOTH models BYTE-IDENTICAL -+ -+| model | baseline md5 | 0028 md5 | result | -+|--------------------|----------------------------------|----------------------------------|-----------------| -+| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | BYTE-IDENTICAL | -+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | BYTE-IDENTICAL | -+ -+(Built on the `paged` branch f32-default = 0026 hybrid default is f32; the baseline was re-confirmed -+on the same build before the edit.) -+ -+## nsys proof - the gather is eliminated (MoE decode, npp128 ntg24 npl128, same window) -+ -+| kernel | before | after | -+|-------------------------------------|---------------|-------------------------------| -+| `k_get_rows_float` cnt | 10174 | 9454 (720 fewer = 30 GDN x 24)| -+| `k_get_rows_float` sum | 186.3 ms | 102.8 ms (-83.5 ms) | -+| conv update kernel | `ssm_conv_update_f32` 720 | `ssm_conv_update_ids_f32` 720 | -+| `ssm_conv_gather_nonident_kernel` | - | 720 x ~1.1 us = 0.8 ms (no-op at decode) | -+ -+The 720 big ~115 us conv gathers are gone; the only added work is a ~1.1 us no-op gather kernel per -+layer-step (all sequences identity during steady AR decode). This matches 0019's "no-op at decode, -+median ~1.2 us" non-identity gather. 
-+ -+## Preliminary throughput (post-fusion, single point; rigorous A/B is the bench phase) -+ -+- MoE `q36-35b-a3b-nvfp4` npl128 (`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`): **783.9 t/s**, step -+ 163.3 ms/step (MOE_GAP @0025 was 752.3 t/s / 169.8 ms/step => -6.5 ms/step in this stack). -+- dense `q36-27b-nvfp4` npl128: **377.3 t/s** (~96% of vLLM 391; includes 0022/0026 base gains). -+- npl128 ran clean (EXIT=0) on both - the non-identity boundary path does not crash. -+ -+## Verdict -+ -+Bit-exact (both md5 gates byte-identical, all test-backend-ops pass), the residual `k_get_rows` conv -+gather is eliminated (nsys-confirmed), and decode throughput improves. Helps BOTH dense and MoE (the -+shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path (after 0019 SSM-state -++ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench. -+ -+Assisted-by: Claude:opus-4.8 [Claude Code] -+ -+## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121) -+ -+Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65; -+NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back. -+ -+### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline -+ -+| model | base (0026) | lever1 (0028) | recorded baseline | -+|-------------------|----------------------------------|----------------------------------|----------------------------------| -+| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | -+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | -+ -+test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45, -+GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK. 
-+ -+### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000 -+ -+dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1): -+ -+| npl | base S_TG | lever1 S_TG | delta | vs vLLM 391 | -+|-----|-----------|-------------|--------|----------------| -+| 32 | 208.56 | 209.39 | +0.40% | - | -+| 128 | 369.95 | 377.83 | +2.13% | 94.6% -> 96.6% | -+ -+MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1): -+ -+| npl | base S_TG | lever1 S_TG | delta | vs vLLM 901 | -+|-----|-----------|-------------|--------|----------------| -+| 32 | 456.85 | 459.56 | +0.59% | - | -+| 128 | 763.47 | 777.95 | +1.90% | 84.7% -> 86.3% | -+ -+Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step). -+ -+### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated -+ -+| kernel | base (0026) | lever1 (0028) | -+|---------------------------------|------------------------|----------------------------------------------| -+| k_get_rows_float | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms | -+| delta | | -1920 inst (= 30 GDN x 64 steps), -224.85 ms | -+| ssm_conv_update(_ids)_f32 | 219.71 ms (update) | 225.75 ms (update_ids, +6 ms) | -+| ssm_conv_gather_nonident_kernel | - | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) | -+ -+The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded -+into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching -+the -3.13 ms/step throughput delta at npl128. -+ -+### Verdict (gather-bench) -+ -+Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv -+gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode -+throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM), -+MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it. 
 diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
 index 2a5cbce..5fa220a 100644
 --- a/ggml/include/ggml.h
diff --git a/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md b/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
index 5ba621116..cc846e8f5 100644
--- a/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
+++ b/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
@@ -213,17 +213,87 @@ all 23 patches, and the resulting tree is **byte-identical to the gate-green
 shipped `.patch` series reproduces exactly the tree that passed test-backend-ops,
 the md5 bit-exact gate, and the bench.
 
-## Pre-existing finding (NOT introduced by this pin-sync, NOT fixed here)
-Committed patch `0019` carries a *modify* hunk against the dev-only doc
-`SSM_DECODE_FIX_RESULTS.md` (`index 2e7c8c2..77879e4 100644`), a file that exists
-only because of an unshipped docs commit on the dev tree and is absent from a
-clean llama.cpp checkout. Under strict `git apply` that hunk fails ("No such file
-or directory"). This is pin-independent (the file is upstream-absent on both
-`8be759e6` and `9d5d882d`) and present identically in the old and new `0019`
-(LINENUM class), so it is left untouched to keep the pin-sync faithful. (`0021`'s
-`CONV_STATE_FUSION_RESULTS.md` is a *create* hunk and applies fine.) Stripping the
-stray dev-doc hunks from the shipped patches is a separate cleanup, out of scope
-for the pin-sync.
+## Shipped-build bug FIXED: stray dev-doc hunks stripped from the patch series
+
+The pin-sync export captured dev-only result/progress docs that live in the DGX
+dev tree (`~/llama-paged-dev`) but are ABSENT from a clean `ggml-org/llama.cpp`
+checkout. The shipped build applies the paged series with **strict `git apply`**
+(the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
+`git apply --verbose "$p" || { echo "paged patch failed"; exit 1; }`), which is
+atomic: a single hunk against a missing file REJECTS the entire patch and the
+`exit 1` fails the build. (`prepare.sh` uses tolerant `patch -pN -N ... || true`,
+but it is guarded by the `src/paged-kv-manager.cpp` sentinel and skipped at build
+time once the Makefile has applied the series, so the strict `git apply` is the
+real shipped path.)
+
+Root failure was patch `0019`'s *modify* hunk against `SSM_DECODE_FIX_RESULTS.md`
+(`index 2e7c8c2..77879e4 100644`): on a clean tree `git apply` cannot find the
+file to modify ("No such file or directory") and rejects all of `0019`, which
+then cascades to `0021`/`0022`/`0026`/`0028` (they build on `0019`'s code). The
+build therefore only succeeded on the DGX (where the doc exists) and FAILED on CI
+/ any clean checkout.
+
+Fixed by stripping every stray non-source hunk so the patches contain ONLY
+llama.cpp source changes. Stripped hunks (dev docs absent from a clean
+`9d5d882d` checkout):
+
+| patch | stripped dev-doc hunk(s) | hunk kind |
+|-------|--------------------------|-----------|
+| `0019` | `SSM_DECODE_FIX_RESULTS.md` | modify (the root reject) |
+| `0020` | `LEVER1_OPROJ_MMQ_RESULTS.md` | create |
+| `0021` | `CONV_STATE_FUSION_RESULTS.md` | create |
+| `0028` | `LEVER1_GATHER_PROGRESS.md`, `LEVER1_GATHER_RESULTS.md` | create |
+
+(The `create` hunks did not reject on their own - `git apply` will create a new
+file even on a clean tree - but they polluted the build tree with stray dev docs
+and violated the source-only invariant, so they were stripped too.) For each
+patch the `diff --git a/ ...` section was removed along with its diffstat
+per-file line, any `create mode` trailer, and the `N files changed, ...` summary
+was corrected; **every llama.cpp SOURCE hunk is byte-identical** (verified by
+sha256 of each patch's source-diff tail before vs after the strip).
+
+Verified on a fresh `git clone` of `ggml-org/llama.cpp` at this pin `9d5d882d`:
+- BEFORE the strip, strict `git apply` of the series: OK through `0018`, then
+  `0019` FAILS ("SSM_DECODE_FIX_RESULTS.md: No such file or directory") -> the
+  Makefile `exit 1`s; continue-mode shows the full cascade `0019` `0021` `0022`
+  `0026` `0028` failing.
+- AFTER the strip, strict `git apply` of the full series `0001..0030` reaches
+  **exit 0** (every patch OK, sentinel `src/paged-kv-manager.cpp` created, zero
+  stray `*_RESULTS.md`/`*_PROGRESS.md` in the tree). The tolerant `patch -p1`
+  path (prepare.sh fallback) also applies with zero rejects.
+
+## Durable fix: keep patch exports SOURCE-ONLY
+
+The pin-sync / re-export step MUST NOT capture dev-only artifacts into the shipped
+`.patch` files. A clean `ggml-org/llama.cpp` checkout contains its own real docs
+(`README.md`, `docs/`, `AGENTS.md`, ...) but NOT LocalAI dev notes - anything
+matching `*_RESULTS.md`, `*_PROGRESS.md`, `*.diff`, `final_benchmark.csv`,
+`LEVER*`, `BENCH*`, `paged-*-bench.cpp`, or any path that does not exist at the
+pin is a dev artifact and must be excluded. Concretely, when re-exporting:
+
+- prefer `git format-patch -1 -- ':!*.md' ':!*.diff' ':!*.csv'` (or an
+  explicit pathspec of the llama.cpp source dirs `src/ ggml/ common/ include/
+  tools/ tests/ cmake/`) so dev docs never enter the patch body;
+- keep the dev-notes commits SEPARATE from the code commits on the dev branch, so
+  a per-commit export is naturally source-only;
+- after export, gate with: clone the pin, `git apply` the full series with strict
+  (no-`--exclude`, no `|| true`) `git apply` - it MUST reach exit 0. The weekly
+  canary (`.github/workflows/llama-cpp-paged-canary.yml`) does this against
+  upstream HEAD; now that the patches are source-only its `0019`
+  `SSM_DECODE_FIX_RESULTS.md` `--exclude` workaround
+  (`.github/scripts/paged-canary-apply.sh`) is no longer needed and can be removed
+  on the next canary touch.
+
+The upcoming `c299a92c` pin-bump re-export MUST follow this: produce source-only
+patches and pass the strict-`git apply` gate on a clean checkout before advancing
+the pin.
+
+## Historical note (pre-strip)
+Before this cleanup, `0019` carried the `SSM_DECODE_FIX_RESULTS.md` modify hunk
+identically in the old and new exports (LINENUM class) and was left untouched
+during the pin-sync to keep the rebase faithful; `0021`'s
+`CONV_STATE_FUSION_RESULTS.md` was a create hunk that applied but still leaked a
+dev doc. Both are now removed by the source-only strip above.
 
 ## Source of truth
 The rebased branch on the DGX dev tree (`~/llama-paged-dev`, branch `paged`, HEAD