fix(paged): strip stray dev-doc hunks so patch series applies on a clean checkout

The shipped from-patches build applies the paged series with strict `git apply` (backend/cpp/llama-cpp/Makefile `llama.cpp` target: `git apply --verbose "$p" || { ...; exit 1; }`), which is atomic: a hunk against a file missing from the tree rejects the whole patch and fails the build.

Four patches carried hunks against dev-only docs that live in the DGX dev tree but are absent from a clean ggml-org/llama.cpp checkout, so the build only succeeded on the DGX and FAILED on CI / any clean checkout:

  0019 -> SSM_DECODE_FIX_RESULTS.md (modify hunk = the root reject)
  0020 -> LEVER1_OPROJ_MMQ_RESULTS.md (create)
  0021 -> CONV_STATE_FUSION_RESULTS.md (create)
  0028 -> LEVER1_GATHER_PROGRESS.md, LEVER1_GATHER_RESULTS.md (create)

0019's reject cascaded to 0021/0022/0026/0028 (which build on 0019's code).

Strip each `diff --git a/<devdoc>` section together with its diffstat line and `create mode` trailer, and correct the summary count. Every llama.cpp SOURCE hunk is left byte-identical (verified by sha256 of each patch's source-diff tail).

Verified on a fresh clone of ggml-org/llama.cpp at the pin 9d5d882d: BEFORE, strict `git apply` failed at 0019 (cascade 0019/0021/0022/0026/0028); AFTER, the full series 0001-0030 applies with exit 0 (sentinel created, zero stray docs). The tolerant `patch -p1` fallback in prepare.sh also applies with zero rejects.

PIN_SYNC_9d5d882d.md documents the durable fix: re-exports/pin-syncs must keep patches source-only (export with a source pathspec / `:!*.md`, gate with a strict `git apply` on a clean checkout). The upcoming c299a92c pin-bump re-export must produce source-only patches too.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
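
The strip step described above (drop each dev-doc `diff --git` section, its diffstat line and `create mode` trailer, then correct the summary count) can be sketched as follows. This is a minimal illustration, not the tooling used for this commit: `strip_dev_docs` and the `is_dev_doc` predicate are hypothetical names, and the sketch assumes `git format-patch`-style input in which every stripped section is followed by another `diff --git` section and paths contain no spaces.

```python
import re

DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")
STAT_LINE = re.compile(r"^ (\S+)\s+\|\s")
CREATE_MODE = re.compile(r"^ create mode \d+ (\S+)$")
SUMMARY = re.compile(
    r"^ (\d+) files? changed"
    r"(?:, (\d+) insertions?\(\+\))?"
    r"(?:, (\d+) deletions?\(-\))?$"
)


def _plural(n, word):
    return f"{n} {word}" + ("" if n == 1 else "s")


def strip_dev_docs(patch_text, is_dev_doc):
    """Return patch_text without the file sections selected by is_dev_doc.

    Hypothetical helper: removes the matching diffstat lines, `create mode`
    trailers and `diff --git` sections, then rewrites the summary line.
    Sections that are kept are copied through unchanged.
    """
    out = []
    state = "msg"        # "msg" -> "stat" (after "---") -> "diff"
    skipping = False     # inside a section that is being dropped
    in_hunk = False      # past the first "@@" of the dropped section
    summary_at = None    # index of the summary line in `out`
    files = ins = dels = 0

    for line in patch_text.split("\n"):
        header = DIFF_HEADER.match(line)
        if header:
            state = "diff"
            in_hunk = False
            skipping = is_dev_doc(header.group(2))
            if skipping:
                files += 1
                continue
        elif skipping:
            # Count only hunk body lines, so the "---"/"+++" file headers
            # of the dropped section are not mistaken for changes.
            if line.startswith("@@"):
                in_hunk = True
            elif in_hunk and line.startswith("+"):
                ins += 1
            elif in_hunk and line.startswith("-"):
                dels += 1
            continue
        elif state == "msg" and line == "---":
            state = "stat"
        elif state == "stat":
            entry = STAT_LINE.match(line) or CREATE_MODE.match(line)
            if entry and is_dev_doc(entry.group(1)):
                continue
            if SUMMARY.match(line):
                summary_at = len(out)
        out.append(line)

    if summary_at is not None and files:
        m = SUMMARY.match(out[summary_at])
        n_files = int(m.group(1)) - files
        n_ins = int(m.group(2) or 0) - ins
        n_dels = int(m.group(3) or 0) - dels
        parts = [_plural(n_files, "file") + " changed"]
        if n_ins:
            parts.append(_plural(n_ins, "insertion") + "(+)")
        if n_dels:
            parts.append(_plural(n_dels, "deletion") + "(-)")
        out[summary_at] = " " + ", ".join(parts)
    return "\n".join(out)
```

A call such as `strip_dev_docs(text, lambda path: path.endswith(".md"))` would treat every Markdown file as a dev doc. Whatever tool produces the stripped patch, the strict `git apply` run on a clean checkout remains the actual gate.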
2026-06-27 09:57:14 -04:00 · 2026-06-27 08:39:27 +00:00
parent 2bee7a5ab1
commit 7e1832b868
5 changed files with 85 additions and 514 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
@@ -54,7 +54,6 @@ is now the FP4 GEMM (~48 percent of decode), a separate kernel track.
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
- SSM_DECODE_FIX_RESULTS.md             | 86 +++++++++++++++++++++++++++
 ggml/include/ggml.h                   | 17 ++++++
 ggml/src/ggml-cpu/ops.cpp             | 49 ++++++++++++++-
 ggml/src/ggml-cuda/gated_delta_net.cu | 85 ++++++++++++++++++++++----
@@ -63,102 +62,8 @@ Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 src/models/models.h                   | 13 ++++
 src/models/qwen35.cpp                 |  6 +-
 src/models/qwen35moe.cpp              |  6 +-
- 9 files changed, 378 insertions(+), 23 deletions(-)
+ 8 files changed, 292 insertions(+), 23 deletions(-)

-diff --git a/SSM_DECODE_FIX_RESULTS.md b/SSM_DECODE_FIX_RESULTS.md
-index 2e7c8c2..77879e4 100644
--- a/SSM_DECODE_FIX_RESULTS.md
-+++ b/SSM_DECODE_FIX_RESULTS.md
-@@ -96,3 +96,89 @@ precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLL
- after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel
- track. No paged/graph/block-table change can move decode on this model (full
- attention is 0.4% of decode).
-+
-+## STEP 2 (patch 0019): fuse the recurrent-state gather into the op
-+
-+After Step 1 the largest non-GEMM decode bucket was the recurrent-state
-+`get_rows` gather (18.8% of decode GPU time): `build_rs` materialized each
-+sequence's prior state into a contiguous scratch via `ggml_get_rows` before the
-+gated-DeltaNet op read it. Step 2 eliminates that materialization, mirroring
-+`ggml_ssm_scan`'s `ids` source.
-+
-+`ggml_gated_delta_net_inplace_ids` takes the FULL recurrent-state cache plus the
-+`s_copy` ids (`src[5]` = full cache `[S_v, S_v, H, n_rs_slots]`, `src[7]` = ids,
-+`op_param[1]` = `rs_head`) and reads each sequence's prior state directly from
-+`cache[ids[seq]]`. Combined with Step 1's in-place write the op now reads AND
-+writes the cache directly: no recurrent-state materialization at all. The
-+`build_recurrent_attn` fused path feeds the full cache and ids through the
-+`build_rs` `get_state_rows` lambda exactly like `mamba-base.cpp`, keeping the
-+`rs_zero` clear and the extra-states copy around the op.
-+
-+### Race-free by construction (CUDA)
-+
-+In-place write plus an ids read of the same cache is only safe when the read slot
-+equals the write slot. `s_copy(s) = cells[s + head].src0`, which is identity
-+(`rs_head + s`) for stable continuing sequences (the entire AR decode path) but
-+can remap on sequence reorder or `rs_zero` (e.g. multiple new sequences in one
-+prefill ubatch). The kernel handles both per (seq, head) block on device:
-+
-+- identity sequences read `s0` in place from the destination slot `state_dst`
-+  (the kernel loads all of `s0` into registers before it writes the new state,
-+  so reading and writing the same slot is race-free) -- no materialization;
-+- non-identity sequences read from a disjoint scratch that a small
-+  `gdn_gather_nonident_kernel` copies from `cache[ids[seq]]` first, so the
-+  recurrence never reads a slot another block writes.
-+
-+`ids` stays a device pointer (dereferenced only in the kernels; the input is
-+device-resident at op-execute time, so a host read segfaults). The CPU op
-+mirrors the same logic (host identity check + a serial gather in the dispatcher
-+for the non-identity case). The math is unchanged, so the result is bit-identical
-+to the `get_rows` path in every case.
-+
-+Gated to the `qwen35` / `qwen35moe` fused decode/prefill path; `qwen3next`,
-+`kimi-linear`, the non-fused path and the rollback (`n_rs_seq > 0`) path are
-+untouched (they keep the materialized-state overload).
-+
-+### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off)
-+
-+Dense `q36-27b-nvfp4`:
-+
-+| npl | Step 1 (baseline) | Step 2   | delta   | % of vLLM (391 @128) |
-+|-----|-------------------|----------|---------|----------------------|
-+| 32  | 137.64            | 170.68   | +24.0%  | -                    |
-+| 128 | 186.25            | 256.57   | +37.8%  | 47.6% -> 65.6%       |
-+
-+The npl-128 result (256.57 t/s) beats the predicted ~247 t/s Step-2 ceiling.
-+
-+MoE `q36-35b-a3b-nvfp4`:
-+
-+| npl | Step 1 (baseline) | Step 2   | delta   |
-+|-----|-------------------|----------|---------|
-+| 32  | 299.68            | 366.69   | +22.4%  |
-+| 128 | 409.30            | 553.63   | +35.3%  |
-+
-+(Step-1 baselines re-measured in the same session; the brief's reference figures
-+were 136 / 180 dense and 279 / 373 MoE.)
-+
-+### Bit-exact gate
-+
-+Greedy (`--temp 0 --seed 1`) `llama-completion` output (256 tokens, paged on,
-+fusion off) vs the Step-1 build:
-+
-+- dense `q36-27b-nvfp4`: model text byte-identical (md5 match);
-+- MoE `q36-35b-a3b-nvfp4`: byte-identical;
-+- Step-2 dense run1 == run2 (deterministic, no race).
-+
-+### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
-+
-+The recurrent-state gather bucket collapsed:
-+
-+| kernel                     | Step 1   | Step 2                                  |
-+|----------------------------|----------|-----------------------------------------|
-+| `k_get_rows_float`         | 18.8%    | 0.7% (residual: embeddings / conv-state)|
-+| `gdn_gather_nonident`      | -        | 1.7% (no-op at decode, median ~1.2 us)  |
-+| `gated_delta_net_cuda`     | 26.0%    | 22.5%                                    |
-+| FP4 GEMM family            | ~37.5%   | ~48% (now the dominant residual)        |
-+
-+The SSM state gather is effectively eliminated. The residual decode gap to vLLM
-+is now the FP4 GEMM (~48% of decode), a separate kernel track.
 diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
 index 4e7ab32..951dd21 100644
 --- a/ggml/include/ggml.h
--- a/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
@@ -43,96 +43,11 @@ vs 2.77 ms/call for the old GEMV.
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
- LEVER1_OPROJ_MMQ_RESULTS.md | 77 +++++++++++++++++++++++++++++++++++++
 src/models/qwen35.cpp       | 13 ++++---
 src/models/qwen35moe.cpp    | 13 ++++---
 src/models/qwen3next.cpp    | 13 ++++---
- 4 files changed, 98 insertions(+), 18 deletions(-)
- create mode 100644 LEVER1_OPROJ_MMQ_RESULTS.md
+ 3 files changed, 21 insertions(+), 18 deletions(-)

-diff --git a/LEVER1_OPROJ_MMQ_RESULTS.md b/LEVER1_OPROJ_MMQ_RESULTS.md
-new file mode 100644
-index 0000000..9a5721f
--- /dev/null
-+++ b/LEVER1_OPROJ_MMQ_RESULTS.md
-@@ -0,0 +1,77 @@
-+# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
-+
-+The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
-+(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
-+bit-exact tensor reshape that re-routes the per-layer SSM output projection
-+from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
-+
-+## The mechanism (profiled, both engines)
-+
-+Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
-+The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
-+(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
-+to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
-+`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
-+128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
-+the ssm_out weight read across the 128 sequences. vLLM packs the same projection
-+into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
-+only the output projection was in 3D SSM layout.
-+
-+## The fix
-+
-+In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
-+the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
-+decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
-+MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
-+so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
-+2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
-+proven by the in-projection.
-+
-+```
-+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
-++    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-+     ...
-+     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
-+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-+```
-+
-+## Gates (all PASS)
-+
-+- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
-+  post-SSM baseline build:
-+  - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
-+  - MoE   q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
-+- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
-+- Coherent dense + MoE output (greedy text inspected).
-+
-+## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
-+
-+S_TG t/s (decode aggregate):
-+
-+| model            | npl | baseline | Lever 1 | delta   |
-+|------------------|-----|----------|---------|---------|
-+| dense q36-27b    |  32 |   170.52 |  200.00 | +17.3%  |
-+| dense q36-27b    | 128 |   254.92 |  335.80 | +31.7%  |
-+| MoE   q36-35b-a3b|  32 |   373.28 |  420.77 | +12.7%  |
-+| MoE   q36-35b-a3b| 128 |   560.66 |  691.24 | +23.3%  |
-+
-+Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
-+up from 65% post-SSM).
-+
-+## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
-+
-+The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
-+
-+| kernel                              | baseline           | Lever 1          |
-+|-------------------------------------|--------------------|------------------|
-+| mul_mat_vec_q<NVFP4, m=1> (o_proj)  | 132.8 ms / 48 inst | 0 ms / 0 inst    |
-+| mul_mat_q<NVFP4, m=128>             | 5463 ms / 8800 inst| 5827 ms /10000 inst|
-+
-+The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
-+(+1200 instances, +363 ms over the window), and its per-call average DROPS
-+(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
-+than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
-+~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
-+old GEMV: the amortized weight read is the win.
-+
-+Assisted-by: Claude:opus-4.8 [Claude Code]
 diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
 index 0be3247..0874c43 100644
 --- a/src/models/qwen35.cpp
--- a/backend/cpp/llama-cpp/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch
@@ -56,7 +56,6 @@ conv-cache plumbing.
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
- CONV_STATE_FUSION_RESULTS.md   | 106 +++++++++++++++++++++++++++++++
 ggml/include/ggml.h            |  16 +++++
 ggml/src/ggml-cpu/ops.cpp      |  73 ++++++++++++++++++++-
 ggml/src/ggml-cuda/ssm-conv.cu | 112 +++++++++++++++++++++++++++++++++
@@ -67,121 +66,8 @@ Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 src/models/qwen35moe.cpp       |  23 +++++--
 src/models/qwen3next.cpp       |  29 ++++++---
 tests/test-backend-ops.cpp     |  47 ++++++++++++++
- 11 files changed, 526 insertions(+), 22 deletions(-)
- create mode 100644 CONV_STATE_FUSION_RESULTS.md
+ 10 files changed, 420 insertions(+), 22 deletions(-)

-diff --git a/CONV_STATE_FUSION_RESULTS.md b/CONV_STATE_FUSION_RESULTS.md
-new file mode 100644
-index 0000000..f59b6e5
--- /dev/null
-+++ b/CONV_STATE_FUSION_RESULTS.md
-@@ -0,0 +1,106 @@
-+# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact)
-+
-+The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3).
-+After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at
-+the f32 byte floor), the conv path was the only remaining bit-exact decode lever.
-+
-+## What changed
-+
-+A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a
-+non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain:
-+
-+    qkv_mixed transpose -> ggml_concat (build width-K window)   [concat_cont 8.14 ms/step]
-+    -> ggml_ssm_conv (depthwise conv)                           [ssm_conv_f32 ~8.6 ms/step]
-+    -> ggml_silu                                                [folded into ssm_conv on CUDA]
-+    -> ggml_cpy of the shifted ring state into the conv cache   [cpy_scalar 5.76 ms/step]
-+
-+with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from
-+the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME
-+ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes
-+the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot
-+the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is
-+vLLM's `causal_conv1d_update`.
-+
-+Files:
-+- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace`
-+  (src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs],
-+  src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu).
-+- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32<apply_silu,d_conv>` (one thread per
-+  (channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of
-+  `ggml_cuda_op_ssm_conv`.
-+- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over
-+  channels) + branch in `ggml_compute_forward_ssm_conv`.
-+- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs
-+  conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write
-+  target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed.
-+- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path
-+  (`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/
-+  rollback keep the existing concat+ssm_conv+silu+cpy chain.
-+- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output
-+  vs the CPU reference across backends.
-+
-+## Gate: test-backend-ops (CUDA0 vs CPU reference)
-+
-+- SSM_CONV: 45/45 OK (unchanged path intact)
-+- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
-+- SSM_CONV_BIAS_SILU: 90/90 OK
-+
-+## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on)
-+
-+Byte-identical to the clean Lever-1 (0019/0020) baseline, both models:
-+
-+| model              | baseline md5                     | fused md5                        | result          |
-+|--------------------|----------------------------------|----------------------------------|-----------------|
-+| q36-27b-nvfp4      | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL  |
-+| q36-35b-a3b-nvfp4  | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL  |
-+
-+## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after
-+
-+Dense q36-27b-nvfp4:
-+
-+| mode      | npl | baseline | fused  | delta   |
-+|-----------|-----|----------|--------|---------|
-+| CUDA-graph| 32  | 199.76   | 202.99 | +1.6%   |
-+| CUDA-graph| 128 | 336.35   | 347.14 | +3.2%   |
-+| eager     | 32  | 196.07   | 197.61 | +0.8%   |
-+| eager     | 128 | 333.62   | 342.97 | +2.8%   |
-+
-+MoE q36-35b-a3b-nvfp4:
-+
-+| mode      | npl | baseline | fused  | delta   |
-+|-----------|-----|----------|--------|---------|
-+| CUDA-graph| 32  | 421.72   | 432.39 | +2.5%   |
-+| CUDA-graph| 128 | 689.74   | 713.54 | +3.5%   |
-+| eager     | 32  | 421.05   | 432.46 | +2.7%   |
-+| eager     | 128 | 689.15   | 713.87 | +3.6%   |
-+
-+Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at
-+**88.8% of vLLM 391** (up from 86.0%). The lift holds in BOTH graph and eager modes.
-+
-+## Step time + nsys kernel delta
-+
-+Per-step decode time (dense npl128, T_TG / ntg=128):
-+- baseline 48.711 s / 128 = 380.6 ms/step
-+- fused    47.197 s / 128 = 368.7 ms/step  -> **-11.9 ms/step** (matches the predicted +12-14 ms)
-+- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step)
-+
-+nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels:
-+
-+| kernel              | baseline calls | fused calls | per-step (eager) |
-+|---------------------|----------------|-------------|------------------|
-+| concat_cont (decode)| 1152           | 0 (GONE)    | 7.95 -> 0 ms     |
-+| cpy_scalar (decode) | 1152 of 3648   | 0 (GONE)    | 4.29 -> 0 ms     |
-+| ssm_conv_f32 (decode)| 1152 of 2736  | 0 (prefill-only) | 8.65 -> 0 ms |
-+| ssm_conv_update     | 0              | 1152        | 0 -> 7.56 ms     |
-+
-+Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont
-+and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel.
-+prefill keeps the original chain (concat_non_cont 1584, ssm_conv_f32 1584 unchanged).
-+
-+## Verdict
-+
-+Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM
-+391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free;
-+de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches.
-+
-+Assisted-by: Claude:opus-4.8 [Claude Code]
 diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
 index 951dd21..76fa401 100644
 --- a/ggml/include/ggml.h
--- a/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch
@@ -46,219 +46,14 @@ MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s.
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
- LEVER1_GATHER_PROGRESS.md      |  26 ++++++
- LEVER1_GATHER_RESULTS.md       | 163 +++++++++++++++++++++++++++++++++
 ggml/include/ggml.h            |  20 ++++
 ggml/src/ggml-cpu/ops.cpp      |  90 +++++++++++++++++-
 ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++-
 ggml/src/ggml.c                |  62 +++++++++++++
 src/models/delta-net-base.cpp  |  26 ++++--
 tests/test-backend-ops.cpp     |  69 ++++++++++++++
- 8 files changed, 600 insertions(+), 11 deletions(-)
- create mode 100644 LEVER1_GATHER_PROGRESS.md
- create mode 100644 LEVER1_GATHER_RESULTS.md
+ 6 files changed, 411 insertions(+), 11 deletions(-)

-diff --git a/LEVER1_GATHER_PROGRESS.md b/LEVER1_GATHER_PROGRESS.md
-new file mode 100644
-index 0000000..e4d14b9
--- /dev/null
-+++ b/LEVER1_GATHER_PROGRESS.md
-@@ -0,0 +1,26 @@
-+# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
-+
-+STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.
-+
-+## What
-+Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
-+update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
-+0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
-+(read path gather -> indexed in-kernel read; values + reduction order unchanged).
-+
-+## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
-+- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
-+  MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
-+- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
-+  GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
-+
-+## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
-+- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
-+- MoE   npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
-+- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
-+  step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
-+
-+## Artifacts
-+- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
-+- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
-+- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
-diff --git a/LEVER1_GATHER_RESULTS.md b/LEVER1_GATHER_RESULTS.md
-new file mode 100644
-index 0000000..afced02
--- /dev/null
-+++ b/LEVER1_GATHER_RESULTS.md
-@@ -0,0 +1,163 @@
-+# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact)
-+
-+The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest
-+kernel vLLM has no equivalent of (~5.2 ms/step MoE decode; also present in dense): vLLM updates its
-+gated-DeltaNet recurrent state in-place inside the fused decode kernel, while llama ran a separate
-+`ggml_get_rows` gather. Patch 0019 fused the SSM recurrent-state gather; patch 0021 fused the conv
-+compute/write-back but KEPT a `build_rs` gather for the conv taps ("tiny; not one of the eliminated
-+buckets"). This patch closes that residual.
-+
-+## Which gather was still firing (nsys-located, DGX GB10 sm_121)
-+
-+Profiled MoE `q36-35b-a3b-nvfp4` at batch-128 decode (`llama-batched-bench -npp128 -ntg24 -npl128
-+-fa on`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`). The decode-window `k_get_rows_float<float,float>`
-+distribution was bimodal: a BIG cluster of **~720 instances (= 30 GDN layers x 24 decode steps) at
-+~115 us each** plus small embedding/router gathers.
-+
-+The big gather's geometry (`grid=(ne10=128, block_num_y=96, 1)`) decodes to **128 rows (= n_seqs
-+active sequences) of ne00 = 24576 floats**. With the model's real dims (`d_conv=4, d_inner=4096,
-+n_group=16, d_state=128`):
-+- `n_embd_r = (d_conv-1) * (d_inner + 2*n_group*d_state) = 3 * 8192 = 24576` -> `block_num_y=96` EXACT match.
-+- `n_embd_s = d_state * d_inner = 524288` (the SSM state, gridY 2048 - already fused by 0019).
-+
-+So the residual `k_get_rows` is the **conv-state tap gather** in `build_conv_state_fused`
-+(`src/models/delta-net-base.cpp`), which called the plain 4-arg `build_rs` -> `ggml_get_rows` of the
-+24576-float conv-state row x 128 sequences, once per GDN layer per decode step (~3.4 ms/step here,
-+~5.2 ms/step at steady ntg=128). The SSM-state gather is already fused, so this conv gather is the
-+last `k_get_rows` in the GDN decode path.
-+
-+## What changed (mirror of the 0019 SSM gather fusion; bit-exact by construction)
-+
-+New op `ggml_ssm_conv_update_inplace_ids` (reuses `GGML_OP_SSM_CONV`, discriminated by a non-null
-+`src[4]` = ids). Instead of a pre-gathered tap scratch, it takes the FULL conv-state cache (`src[0]`)
-+plus the per-sequence `ids` (= the recurrent-state `s_copy`, `src[4]`; `op_params[1]=rs_head`) and
-+reads each active sequence's prior K-1 taps directly from `cache[ids[s]]` in the kernel. This removes
-+the separate `k_get_rows` launch.
-+
-+Race-free, exactly mirroring 0019:
-+- **Identity** sequences (`ids[s] == rs_head + s`, the whole AR-decode path) read the taps in place
-+  from the `conv_state_dst` write slot. The kernel loads the full conv window into registers before
-+  it writes the 1-token-shifted ring back, so read==write slot is race-free per (channel, seq) thread.
-+- **Non-identity** sequences (reorder / `rs_zero` remap at a prefill->decode boundary) are gathered
-+  into a disjoint scratch by a small `ssm_conv_gather_nonident_kernel` first (no-op at steady decode),
-+  so the update kernel never reads a slot another block writes.
-+
-+The read VALUES are unchanged (identity in-place taps == the gathered taps == `cache[ids[s]]`); only
-+the read PATH changes from a `ggml_get_rows` materialization to an indexed in-kernel read. The conv
-+math, ascending-tap FMA order, silu and the ring write-back are byte-identical to 0021.
-+
-+Files:
-+- `ggml/include/ggml.h`, `ggml/src/ggml.c`: `ggml_ssm_conv_update_inplace_ids` builder
-+  (src[0]=full cache [K-1,channels,n_cells], src[1]=conv_kernel, src[2]=x_cur, src[3]=conv_state_dst,
-+  src[4]=ids; op_params[0]=fuse_silu, op_params[1]=rs_head).
-+- `ggml/src/ggml-cuda/ssm-conv.cu`: `ssm_conv_gather_nonident_kernel` + `ssm_conv_update_ids_f32`
-+  kernel + `ggml_cuda_op_ssm_conv_update_ids` + a `src[4]`-discriminated branch in `ggml_cuda_op_ssm_conv`.
-+- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_ids_f32` (window copied to a
-+  local before the possibly-aliasing write) + dispatch branch.
-+- `src/models/delta-net-base.cpp`: `build_conv_state_fused` now feeds the FULL cache + ids through the
-+  `build_rs` `get_state_rows` lambda (the rs_zero clear + extra-states copy still run around it),
-+  exactly like the 0019 recurrent-attn fusion. The `qwen35` / `qwen35moe` / `qwen3next` callers are
-+  unchanged (they already route the single-token decode path here).
-+- `tests/test-backend-ops.cpp`: `test_ssm_conv_update_ids` (16 cases) - ids = a shuffled permutation
-+  with `rs_head=0`, so each case exercises BOTH the identity in-place read and the non-identity cache
-+  read; validates the conv+silu output vs the CPU reference.
-+
-+## GATE: test-backend-ops (CUDA0 vs CPU, 2/2 backends)
-+
-+- SSM_CONV_UPDATE_IDS: OK (NEW; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
-+- SSM_CONV_UPDATE: OK (0021 path intact)
-+- SSM_CONV: OK
-+- GATED_DELTA_NET: OK
-+- GET_ROWS: OK
-+
-+## GATE: greedy bit-exactness (--temp 0 --seed 1 -n 48, -fa on) - BOTH models BYTE-IDENTICAL
-+
-+| model              | baseline md5                     | 0028 md5                         | result          |
-+|--------------------|----------------------------------|----------------------------------|-----------------|
-+| q36-27b-nvfp4      | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | BYTE-IDENTICAL  |
-+| q36-35b-a3b-nvfp4  | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | BYTE-IDENTICAL  |
-+
-+(Built on the `paged` branch f32-default = 0026 hybrid default is f32; the baseline was re-confirmed
-+on the same build before the edit.)
-+
-+## nsys proof - the gather is eliminated (MoE decode, npp128 ntg24 npl128, same window)
-+
-+| kernel                              | before        | after                         |
-+|-------------------------------------|---------------|-------------------------------|
-+| `k_get_rows_float<float,float>` cnt | 10174         | 9454 (720 fewer = 30 GDN x 24)|
-+| `k_get_rows_float<float,float>` sum | 186.3 ms      | 102.8 ms (-83.5 ms)           |
-+| conv update kernel                  | `ssm_conv_update_f32` 720 | `ssm_conv_update_ids_f32` 720 |
-+| `ssm_conv_gather_nonident_kernel`   | -             | 720 x ~1.1 us = 0.8 ms (no-op at decode) |
-+
-+The 720 big ~115 us conv gathers are gone; the only added work is a ~1.1 us no-op gather kernel per
-+layer-step (all sequences identity during steady AR decode). This matches 0019's "no-op at decode,
-+median ~1.2 us" non-identity gather.
-+
-+## Preliminary throughput (post-fusion, single point; rigorous A/B is the bench phase)
-+
-+- MoE `q36-35b-a3b-nvfp4` npl128 (`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`): **783.9 t/s**, step
-+  163.3 ms/step (MOE_GAP @0025 was 752.3 t/s / 169.8 ms/step => -6.5 ms/step in this stack).
-+- dense `q36-27b-nvfp4` npl128: **377.3 t/s** (~96% of vLLM 391; includes 0022/0026 base gains).
-+- npl128 ran clean (EXIT=0) on both - the non-identity boundary path does not crash.
-+
-+## Verdict
-+
-+Bit-exact (both md5 gates byte-identical, all test-backend-ops pass), the residual `k_get_rows` conv
-+gather is eliminated (nsys-confirmed), and decode throughput improves. Helps BOTH dense and MoE (the
-+shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path (after 0019 SSM-state
-++ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
-+
-+Assisted-by: Claude:opus-4.8 [Claude Code]
-+
-+## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
-+
-+Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
-+NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
-+
-+### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
-+
-+| model             | base (0026)                      | lever1 (0028)                    | recorded baseline                |
-+|-------------------|----------------------------------|----------------------------------|----------------------------------|
-+| q36-27b-nvfp4     | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
-+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
-+
-+test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
-+GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
-+
-+### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
-+
-+dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
-+
-+| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 391    |
-+|-----|-----------|-------------|--------|----------------|
-+| 32  | 208.56    | 209.39      | +0.40% | -              |
-+| 128 | 369.95    | 377.83      | +2.13% | 94.6% -> 96.6% |
-+
-+MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
-+
-+| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 901    |
-+|-----|-----------|-------------|--------|----------------|
-+| 32  | 456.85    | 459.56      | +0.59% | -              |
-+| 128 | 763.47    | 777.95      | +1.90% | 84.7% -> 86.3% |
-+
-+Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
-+
-+### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
-+
-+| kernel                          | base (0026)            | lever1 (0028)                                |
-+|---------------------------------|------------------------|----------------------------------------------|
-+| k_get_rows_float<float,float>   | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms                       |
-+| delta                           |                        | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
-+| ssm_conv_update(_ids)_f32       | 219.71 ms (update)     | 225.75 ms (update_ids, +6 ms)                |
-+| ssm_conv_gather_nonident_kernel | -                      | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
-+
-+The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
-+into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
-+the -3.13 ms/step throughput delta at npl128.
-+
-+### Verdict (gather-bench)
-+
-+Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
-+gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
-+throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
-+MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
 diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
 index 2a5cbce..5fa220a 100644
 --- a/ggml/include/ggml.h
--- a/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
+++ b/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
@@ -213,17 +213,87 @@ all 23 patches, and the resulting tree is **byte-identical to the gate-green
 shipped `.patch` series reproduces exactly the tree that passed test-backend-ops,
 the md5 bit-exact gate, and the bench.

-## Pre-existing finding (NOT introduced by this pin-sync, NOT fixed here)
-Committed patch `0019` carries a *modify* hunk against the dev-only doc
-`SSM_DECODE_FIX_RESULTS.md` (`index 2e7c8c2..77879e4 100644`), a file that exists
-only because of an unshipped docs commit on the dev tree and is absent from a
-clean llama.cpp checkout. Under strict `git apply` that hunk fails ("No such file
-or directory"). This is pin-independent (the file is upstream-absent on both
-`8be759e6` and `9d5d882d`) and present identically in the old and new `0019`
-(LINENUM class), so it is left untouched to keep the pin-sync faithful. (`0021`'s
-`CONV_STATE_FUSION_RESULTS.md` is a *create* hunk and applies fine.) Stripping the
-stray dev-doc hunks from the shipped patches is a separate cleanup, out of scope
-for the pin-sync.
+## Shipped-build bug FIXED: stray dev-doc hunks stripped from the patch series
+
+The pin-sync export captured dev-only result/progress docs that live in the DGX
+dev tree (`~/llama-paged-dev`) but are ABSENT from a clean `ggml-org/llama.cpp`
+checkout. The shipped build applies the paged series with **strict `git apply`**
+(the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
+`git apply --verbose "$p" || { echo "paged patch failed"; exit 1; }`), which is
+atomic: a single hunk against a missing file REJECTS the entire patch and the
+`exit 1` fails the build. (`prepare.sh` uses tolerant `patch -pN -N ... || true`,
+but it is guarded by the `src/paged-kv-manager.cpp` sentinel and skipped at build
+time once the Makefile has applied the series, so the strict `git apply` is the
+real shipped path.)
+
+Root failure was patch `0019`'s *modify* hunk against `SSM_DECODE_FIX_RESULTS.md`
+(`index 2e7c8c2..77879e4 100644`): on a clean tree `git apply` cannot find the
+file to modify ("No such file or directory") and rejects all of `0019`, which
+then cascades to `0021`/`0022`/`0026`/`0028` (they build on `0019`'s code). The
+build therefore only succeeded on the DGX (where the doc exists) and FAILED on CI
+/ any clean checkout.
+
+Fixed by stripping every stray non-source hunk so the patches contain ONLY
+llama.cpp source changes. Stripped hunks (dev docs absent from a clean
+`9d5d882d` checkout):
+
+| patch | stripped dev-doc hunk(s) | hunk kind |
+|-------|--------------------------|-----------|
+| `0019` | `SSM_DECODE_FIX_RESULTS.md` | modify (the root reject) |
+| `0020` | `LEVER1_OPROJ_MMQ_RESULTS.md` | create |
+| `0021` | `CONV_STATE_FUSION_RESULTS.md` | create |
+| `0028` | `LEVER1_GATHER_PROGRESS.md`, `LEVER1_GATHER_RESULTS.md` | create |
+
+(The `create` hunks did not reject on their own - `git apply` will create a new
+file even on a clean tree - but they polluted the build tree with stray dev docs
+and violated the source-only invariant, so they were stripped too.) For each
+patch the `diff --git a/<devdoc> ...` section was removed along with its diffstat
+per-file line and any `create mode` trailer, and the `N files changed, ...` summary
+was corrected; **every llama.cpp SOURCE hunk is byte-identical** (verified by
+sha256 of each patch's source-diff tail before vs after the strip).
+
+Verified on a fresh `git clone` of `ggml-org/llama.cpp` at this pin `9d5d882d`:
+- BEFORE the strip, strict `git apply` of the series: OK through `0018`, then
+  `0019` FAILS ("SSM_DECODE_FIX_RESULTS.md: No such file or directory") -> the
+  Makefile `exit 1`s; continue-mode shows the full cascade `0019` `0021` `0022`
+  `0026` `0028` failing.
+- AFTER the strip, strict `git apply` of the full series `0001..0030` reaches
+  **exit 0** (every patch OK, sentinel `src/paged-kv-manager.cpp` created, zero
+  stray `*_RESULTS.md`/`*_PROGRESS.md` in the tree). The tolerant `patch -p1`
+  path (prepare.sh fallback) also applies with zero rejects.
+
+## Durable fix: keep patch exports SOURCE-ONLY
+
+The pin-sync / re-export step MUST NOT capture dev-only artifacts into the shipped
+`.patch` files. A clean `ggml-org/llama.cpp` checkout contains its own real docs
+(`README.md`, `docs/`, `AGENTS.md`, ...) but NOT LocalAI dev notes - anything
+matching `*_RESULTS.md`, `*_PROGRESS.md`, `*.diff`, `final_benchmark.csv`,
+`LEVER*`, `BENCH*`, `paged-*-bench.cpp`, or any non-source path that does not exist at
+the pin is a dev artifact and must be excluded. Concretely, when re-exporting:
+
+- prefer `git format-patch -1 <commit> -- ':!*.md' ':!*.diff' ':!*.csv'` (or an
+  explicit pathspec of the llama.cpp source dirs `src/ ggml/ common/ include/
+  tools/ tests/ cmake/`) so dev docs never enter the patch body;
+- keep the dev-notes commits SEPARATE from the code commits on the dev branch, so
+  a per-commit export is naturally source-only;
+- after export, gate with: clone the pin and apply the full series with strict
+  (no `--exclude`, no `|| true`) `git apply` - it MUST reach exit 0. The weekly
+  canary (`.github/workflows/llama-cpp-paged-canary.yml`) does this against
+  upstream HEAD; now that the patches are source-only its `0019`
+  `SSM_DECODE_FIX_RESULTS.md` `--exclude` workaround
+  (`.github/scripts/paged-canary-apply.sh`) is no longer needed and can be removed
+  on the next canary touch.
+
+The upcoming `c299a92c` pin-bump re-export MUST follow this: produce source-only
+patches and pass the strict-`git apply` gate on a clean checkout before advancing
+the pin.
+
+## Historical note (pre-strip)
+Before this cleanup, `0019` carried the `SSM_DECODE_FIX_RESULTS.md` modify hunk
+identically in the old and new exports (LINENUM class) and was left untouched
+during the pin-sync to keep the rebase faithful; `0021`'s
+`CONV_STATE_FUSION_RESULTS.md` was a create hunk that applied but still leaked a
+dev doc. Both are now removed by the source-only strip above.

 ## Source of truth
 The rebased branch on the DGX dev tree (`~/llama-paged-dev`, branch `paged`, HEAD