fix(paged): strip stray dev-doc hunks so patch series applies on a clean checkout

The shipped from-patches build applies the paged series with strict `git apply`
(backend/cpp/llama-cpp/Makefile `llama.cpp` target:
`git apply --verbose "$p" || { ...; exit 1; }`), which is atomic: a hunk against
a file missing from the tree rejects the whole patch and fails the build. Four
patches carried hunks against dev-only docs that live in the DGX dev tree but are
absent from a clean ggml-org/llama.cpp checkout, so the build only succeeded on
the DGX and FAILED on CI / any clean checkout:

  0019 -> SSM_DECODE_FIX_RESULTS.md   (modify hunk = the root reject)
  0020 -> LEVER1_OPROJ_MMQ_RESULTS.md (create)
  0021 -> CONV_STATE_FUSION_RESULTS.md (create)
  0028 -> LEVER1_GATHER_PROGRESS.md, LEVER1_GATHER_RESULTS.md (create)

0019's reject cascaded to 0021/0022/0026/0028 (which build on 0019's code). Strip
each `diff --git a/<devdoc>` section together with its diffstat line and
`create mode` trailer, and correct the summary count. Every llama.cpp SOURCE hunk
is left byte-identical (verified by sha256 of each patch's source-diff tail).
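
The stripping step described above can be sketched as follows. This is a
minimal illustration, not the script used for this commit: the function names
(`strip_dev_docs`, `source_tail_sha256`) are hypothetical, and it assumes
`git format-patch` output with unabbreviated diffstat paths, no spaces in
paths, and no commit-message line of the form `<devdoc> | ...`. It classifies
hunk lines by prefix rather than by the `@@` line counts, so a stripped section
should not be the last one in the file (the mail signature would be miscounted).

```python
import hashlib
import re

DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")
SUMMARY = re.compile(
    r"^ (\d+) files? changed"
    r"(?:, (\d+) insertions?\(\+\))?(?:, (\d+) deletions?\(-\))?$"
)


def _plural(n, word):
    return f"{n} {word}{'' if n == 1 else 's'}"


def strip_dev_docs(patch_text, dev_docs):
    """Return (stripped_patch, source_tail) for one format-patch file.

    dev_docs is a set of paths whose per-file sections must be removed.
    source_tail is the concatenation of every section that was kept.
    """
    lines = patch_text.splitlines(keepends=True)
    start = next(
        (i for i, l in enumerate(lines) if DIFF_HEADER.match(l.rstrip("\n"))),
        len(lines),
    )
    header, body = lines[:start], lines[start:]

    # Pass 1: drop dev-doc sections and count the +/- lines they carried.
    kept, n_files, n_add, n_del = [], 0, 0, 0
    dropping = in_hunk = False
    for line in body:
        m = DIFF_HEADER.match(line.rstrip("\n"))
        if m:
            dropping, in_hunk = m.group(2) in dev_docs, False
            n_files += dropping
        elif line.startswith("@@"):
            in_hunk = True
        if not dropping:
            kept.append(line)
        elif in_hunk and line.startswith("+"):
            n_add += 1
        elif in_hunk and line.startswith("-"):
            n_del += 1

    # Pass 2: drop the matching diffstat rows and `create mode` trailers,
    # then correct the "N files changed" summary.
    out_header = []
    for line in header:
        text = line.rstrip("\n")
        stat_path = text.split("|")[0].strip() if "|" in text else None
        if stat_path in dev_docs or (
            text.startswith(" create mode ") and text.split()[-1] in dev_docs
        ):
            continue
        s = SUMMARY.match(text)
        if s and n_files:
            files, add, dele = (int(g or 0) for g in s.groups())
            parts = [_plural(files - n_files, "file") + " changed"]
            if add - n_add:
                parts.append(_plural(add - n_add, "insertion") + "(+)")
            if dele - n_del:
                parts.append(_plural(dele - n_del, "deletion") + "(-)")
            line = " " + ", ".join(parts) + "\n"
        out_header.append(line)
    return "".join(out_header + kept), "".join(kept)


def source_tail_sha256(patch_text, dev_docs):
    """sha256 of the source-only sections, for a before/after comparison."""
    tail = strip_dev_docs(patch_text, dev_docs)[1]
    return hashlib.sha256(tail.encode()).hexdigest()
```

Comparing `source_tail_sha256` of the original patch with that of the stripped
patch gives the same kind of byte-identity check the message refers to.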

Verified on a fresh clone of ggml-org/llama.cpp at the pin 9d5d882d: BEFORE,
strict `git apply` failed at 0019 (cascade 0019/0021/0022/0026/0028); AFTER, the
full series 0001-0030 applies with exit 0 (sentinel created, zero stray docs).
The tolerant `patch -p1` fallback in prepare.sh also applies with zero rejects.

PIN_SYNC_9d5d882d.md documents the durable fix: re-exports/pin-syncs must keep
patches source-only (export with a source pathspec / `:!*.md`, gate with a strict
`git apply` on a clean checkout). The upcoming c299a92c pin-bump re-export must
produce source-only patches too.
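
A cheap pre-check for the source-only rule could look like the sketch below.
It is an illustration only: `non_source_paths` is a hypothetical helper, it
treats every `*.md` path as a doc (mirroring the `:!*.md` pathspec above), and
it does not replace the strict `git apply` gate on a clean checkout.

```python
import re

# One `diff --git` header per file section; paths with spaces are not handled.
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def non_source_paths(patch_text, doc_suffixes=(".md",)):
    """List the paths a patch touches whose suffix marks them as docs.

    doc_suffixes must be a tuple, because str.endswith accepts a tuple
    of suffixes but not a list.
    """
    return [
        new_path
        for _old_path, new_path in DIFF_HEADER.findall(patch_text)
        if new_path.endswith(doc_suffixes)
    ]
```

An empty result for every patch in the series means no `*.md` hunk slipped
into the export.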

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-27 08:39:27 +00:00
parent 2bee7a5ab1
commit 7e1832b868
5 changed files with 85 additions and 514 deletions


@@ -54,7 +54,6 @@ is now the FP4 GEMM (~48 percent of decode), a separate kernel track.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
SSM_DECODE_FIX_RESULTS.md | 86 +++++++++++++++++++++++++++
ggml/include/ggml.h | 17 ++++++
ggml/src/ggml-cpu/ops.cpp | 49 ++++++++++++++-
ggml/src/ggml-cuda/gated_delta_net.cu | 85 ++++++++++++++++++++++----
@@ -63,102 +62,8 @@ Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
src/models/models.h | 13 ++++
src/models/qwen35.cpp | 6 +-
src/models/qwen35moe.cpp | 6 +-
9 files changed, 378 insertions(+), 23 deletions(-)
8 files changed, 292 insertions(+), 23 deletions(-)
diff --git a/SSM_DECODE_FIX_RESULTS.md b/SSM_DECODE_FIX_RESULTS.md
index 2e7c8c2..77879e4 100644
--- a/SSM_DECODE_FIX_RESULTS.md
+++ b/SSM_DECODE_FIX_RESULTS.md
@@ -96,3 +96,89 @@ precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLL
after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel
track. No paged/graph/block-table change can move decode on this model (full
attention is 0.4% of decode).
+
+## STEP 2 (patch 0019): fuse the recurrent-state gather into the op
+
+After Step 1 the largest non-GEMM decode bucket was the recurrent-state
+`get_rows` gather (18.8% of decode GPU time): `build_rs` materialized each
+sequence's prior state into a contiguous scratch via `ggml_get_rows` before the
+gated-DeltaNet op read it. Step 2 eliminates that materialization, mirroring
+`ggml_ssm_scan`'s `ids` source.
+
+`ggml_gated_delta_net_inplace_ids` takes the FULL recurrent-state cache plus the
+`s_copy` ids (`src[5]` = full cache `[S_v, S_v, H, n_rs_slots]`, `src[7]` = ids,
+`op_param[1]` = `rs_head`) and reads each sequence's prior state directly from
+`cache[ids[seq]]`. Combined with Step 1's in-place write the op now reads AND
+writes the cache directly: no recurrent-state materialization at all. The
+`build_recurrent_attn` fused path feeds the full cache and ids through the
+`build_rs` `get_state_rows` lambda exactly like `mamba-base.cpp`, keeping the
+`rs_zero` clear and the extra-states copy around the op.
+
+### Race-free by construction (CUDA)
+
+In-place write plus an ids read of the same cache is only safe when the read slot
+equals the write slot. `s_copy(s) = cells[s + head].src0`, which is identity
+(`rs_head + s`) for stable continuing sequences (the entire AR decode path) but
+can remap on sequence reorder or `rs_zero` (e.g. multiple new sequences in one
+prefill ubatch). The kernel handles both per (seq, head) block on device:
+
+- identity sequences read `s0` in place from the destination slot `state_dst`
+ (the kernel loads all of `s0` into registers before it writes the new state,
+ so reading and writing the same slot is race-free) -- no materialization;
+- non-identity sequences read from a disjoint scratch that a small
+ `gdn_gather_nonident_kernel` copies from `cache[ids[seq]]` first, so the
+ recurrence never reads a slot another block writes.
+
+`ids` stays a device pointer (dereferenced only in the kernels; the input is
+device-resident at op-execute time, so a host read segfaults). The CPU op
+mirrors the same logic (host identity check + a serial gather in the dispatcher
+for the non-identity case). The math is unchanged, so the result is bit-identical
+to the `get_rows` path in every case.
+
+Gated to the `qwen35` / `qwen35moe` fused decode/prefill path; `qwen3next`,
+`kimi-linear`, the non-fused path and the rollback (`n_rs_seq > 0`) path are
+untouched (they keep the materialized-state overload).
+
+### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off)
+
+Dense `q36-27b-nvfp4`:
+
+| npl | Step 1 (baseline) | Step 2 | delta | % of vLLM (391 @128) |
+|-----|-------------------|----------|---------|----------------------|
+| 32 | 137.64 | 170.68 | +24.0% | - |
+| 128 | 186.25 | 256.57 | +37.8% | 47.6% -> 65.6% |
+
+The npl-128 result (256.57 t/s) beats the predicted ~247 t/s Step-2 ceiling.
+
+MoE `q36-35b-a3b-nvfp4`:
+
+| npl | Step 1 (baseline) | Step 2 | delta |
+|-----|-------------------|----------|---------|
+| 32 | 299.68 | 366.69 | +22.4% |
+| 128 | 409.30 | 553.63 | +35.3% |
+
+(Step-1 baselines re-measured in the same session; the brief's reference figures
+were 136 / 180 dense and 279 / 373 MoE.)
+
+### Bit-exact gate
+
+Greedy (`--temp 0 --seed 1`) `llama-completion` output (256 tokens, paged on,
+fusion off) vs the Step-1 build:
+
+- dense `q36-27b-nvfp4`: model text byte-identical (md5 match);
+- MoE `q36-35b-a3b-nvfp4`: byte-identical;
+- Step-2 dense run1 == run2 (deterministic, no race).
+
+### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
+
+The recurrent-state gather bucket collapsed:
+
+| kernel | Step 1 | Step 2 |
+|----------------------------|----------|-----------------------------------------|
+| `k_get_rows_float` | 18.8% | 0.7% (residual: embeddings / conv-state)|
+| `gdn_gather_nonident` | - | 1.7% (no-op at decode, median ~1.2 us) |
+| `gated_delta_net_cuda` | 26.0% | 22.5% |
+| FP4 GEMM family | ~37.5% | ~48% (now the dominant residual) |
+
+The SSM state gather is effectively eliminated. The residual decode gap to vLLM
+is now the FP4 GEMM (~48% of decode), a separate kernel track.
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 4e7ab32..951dd21 100644
--- a/ggml/include/ggml.h


@@ -43,96 +43,11 @@ vs 2.77 ms/call for the old GEMV.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
LEVER1_OPROJ_MMQ_RESULTS.md | 77 +++++++++++++++++++++++++++++++++++++
src/models/qwen35.cpp | 13 ++++---
src/models/qwen35moe.cpp | 13 ++++---
src/models/qwen3next.cpp | 13 ++++---
4 files changed, 98 insertions(+), 18 deletions(-)
create mode 100644 LEVER1_OPROJ_MMQ_RESULTS.md
3 files changed, 21 insertions(+), 18 deletions(-)
diff --git a/LEVER1_OPROJ_MMQ_RESULTS.md b/LEVER1_OPROJ_MMQ_RESULTS.md
new file mode 100644
index 0000000..9a5721f
--- /dev/null
+++ b/LEVER1_OPROJ_MMQ_RESULTS.md
@@ -0,0 +1,77 @@
+# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
+
+The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
+(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
+bit-exact tensor reshape that re-routes the per-layer SSM output projection
+from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
+
+## The mechanism (profiled, both engines)
+
+Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
+The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
+(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
+to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
+`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
+128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
+the ssm_out weight read across the 128 sequences. vLLM packs the same projection
+into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
+only the output projection was in 3D SSM layout.
+
+## The fix
+
+In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
+the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
+decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
+MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
+so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
+2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
+proven by the in-projection.
+
+```
+- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
++ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+ ...
+ cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+```
+
+## Gates (all PASS)
+
+- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
+ post-SSM baseline build:
+ - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
+ - MoE q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
+- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
+- Coherent dense + MoE output (greedy text inspected).
+
+## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
+
+S_TG t/s (decode aggregate):
+
+| model | npl | baseline | Lever 1 | delta |
+|------------------|-----|----------|---------|---------|
+| dense q36-27b | 32 | 170.52 | 200.00 | +17.3% |
+| dense q36-27b | 128 | 254.92 | 335.80 | +31.7% |
+| MoE q36-35b-a3b| 32 | 373.28 | 420.77 | +12.7% |
+| MoE q36-35b-a3b| 128 | 560.66 | 691.24 | +23.3% |
+
+Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
+up from 65% post-SSM).
+
+## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
+
+The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
+
+| kernel | baseline | Lever 1 |
+|-------------------------------------|--------------------|------------------|
+| mul_mat_vec_q<NVFP4, m=1> (o_proj) | 132.8 ms / 48 inst | 0 ms / 0 inst |
+| mul_mat_q<NVFP4, m=128> | 5463 ms / 8800 inst| 5827 ms /10000 inst|
+
+The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
+(+1200 instances, +363 ms over the window), and its per-call average DROPS
+(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
+than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
+~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
+old GEMV: the amortized weight read is the win.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
index 0be3247..0874c43 100644
--- a/src/models/qwen35.cpp


@@ -56,7 +56,6 @@ conv-cache plumbing.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
CONV_STATE_FUSION_RESULTS.md | 106 +++++++++++++++++++++++++++++++
ggml/include/ggml.h | 16 +++++
ggml/src/ggml-cpu/ops.cpp | 73 ++++++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 112 +++++++++++++++++++++++++++++++++
@@ -67,121 +66,8 @@ Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
src/models/qwen35moe.cpp | 23 +++++--
src/models/qwen3next.cpp | 29 ++++++---
tests/test-backend-ops.cpp | 47 ++++++++++++++
11 files changed, 526 insertions(+), 22 deletions(-)
create mode 100644 CONV_STATE_FUSION_RESULTS.md
10 files changed, 420 insertions(+), 22 deletions(-)
diff --git a/CONV_STATE_FUSION_RESULTS.md b/CONV_STATE_FUSION_RESULTS.md
new file mode 100644
index 0000000..f59b6e5
--- /dev/null
+++ b/CONV_STATE_FUSION_RESULTS.md
@@ -0,0 +1,106 @@
+# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact)
+
+The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3).
+After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at
+the f32 byte floor), the conv path was the only remaining bit-exact decode lever.
+
+## What changed
+
+A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a
+non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain:
+
+ qkv_mixed transpose -> ggml_concat (build width-K window) [concat_cont 8.14 ms/step]
+ -> ggml_ssm_conv (depthwise conv) [ssm_conv_f32 ~8.6 ms/step]
+ -> ggml_silu [folded into ssm_conv on CUDA]
+ -> ggml_cpy of the shifted ring state into the conv cache [cpy_scalar 5.76 ms/step]
+
+with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from
+the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME
+ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes
+the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot
+the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is
+vLLM's `causal_conv1d_update`.
+
+Files:
+- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace`
+ (src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs],
+ src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu).
+- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32<apply_silu,d_conv>` (one thread per
+ (channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of
+ `ggml_cuda_op_ssm_conv`.
+- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over
+ channels) + branch in `ggml_compute_forward_ssm_conv`.
+- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs
+ conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write
+ target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed.
+- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path
+ (`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/
+ rollback keep the existing concat+ssm_conv+silu+cpy chain.
+- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output
+ vs the CPU reference across backends.
+
+## Gate: test-backend-ops (CUDA0 vs CPU reference)
+
+- SSM_CONV: 45/45 OK (unchanged path intact)
+- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
+- SSM_CONV_BIAS_SILU: 90/90 OK
+
+## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on)
+
+Byte-identical to the clean Lever-1 (0019/0020) baseline, both models:
+
+| model | baseline md5 | fused md5 | result |
+|--------------------|----------------------------------|----------------------------------|-----------------|
+| q36-27b-nvfp4 | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL |
+| q36-35b-a3b-nvfp4 | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL |
+
+## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after
+
+Dense q36-27b-nvfp4:
+
+| mode | npl | baseline | fused | delta |
+|-----------|-----|----------|--------|---------|
+| CUDA-graph| 32 | 199.76 | 202.99 | +1.6% |
+| CUDA-graph| 128 | 336.35 | 347.14 | +3.2% |
+| eager | 32 | 196.07 | 197.61 | +0.8% |
+| eager | 128 | 333.62 | 342.97 | +2.8% |
+
+MoE q36-35b-a3b-nvfp4:
+
+| mode | npl | baseline | fused | delta |
+|-----------|-----|----------|--------|---------|
+| CUDA-graph| 32 | 421.72 | 432.39 | +2.5% |
+| CUDA-graph| 128 | 689.74 | 713.54 | +3.5% |
+| eager | 32 | 421.05 | 432.46 | +2.7% |
+| eager | 128 | 689.15 | 713.87 | +3.6% |
+
+Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at
+**88.8% of vLLM 391** (up from 86.0%). The lift holds in BOTH graph and eager modes.
+
+## Step time + nsys kernel delta
+
+Per-step decode time (dense npl128, T_TG / ntg=128):
+- baseline 48.711 s / 128 = 380.6 ms/step
+- fused 47.197 s / 128 = 368.7 ms/step -> **-11.9 ms/step** (matches the predicted +12-14 ms)
+- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step)
+
+nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels:
+
+| kernel | baseline calls | fused calls | per-step (eager) |
+|---------------------|----------------|-------------|------------------|
+| concat_cont (decode)| 1152 | 0 (GONE) | 7.95 -> 0 ms |
+| cpy_scalar (decode) | 1152 of 3648 | 0 (GONE) | 4.29 -> 0 ms |
+| ssm_conv_f32 (decode)| 1152 of 2736 | 0 (prefill-only) | 8.65 -> 0 ms |
+| ssm_conv_update | 0 | 1152 | 0 -> 7.56 ms |
+
+Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont
+and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel.
+prefill keeps the original chain (concat_non_cont 1584, ssm_conv_f32 1584 unchanged).
+
+## Verdict
+
+Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM
+391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free;
+de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 951dd21..76fa401 100644
--- a/ggml/include/ggml.h


@@ -46,219 +46,14 @@ MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
LEVER1_GATHER_PROGRESS.md | 26 ++++++
LEVER1_GATHER_RESULTS.md | 163 +++++++++++++++++++++++++++++++++
ggml/include/ggml.h | 20 ++++
ggml/src/ggml-cpu/ops.cpp | 90 +++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++-
ggml/src/ggml.c | 62 +++++++++++++
src/models/delta-net-base.cpp | 26 ++++--
tests/test-backend-ops.cpp | 69 ++++++++++++++
8 files changed, 600 insertions(+), 11 deletions(-)
create mode 100644 LEVER1_GATHER_PROGRESS.md
create mode 100644 LEVER1_GATHER_RESULTS.md
6 files changed, 411 insertions(+), 11 deletions(-)
diff --git a/LEVER1_GATHER_PROGRESS.md b/LEVER1_GATHER_PROGRESS.md
new file mode 100644
index 0000000..e4d14b9
--- /dev/null
+++ b/LEVER1_GATHER_PROGRESS.md
@@ -0,0 +1,26 @@
+# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
+
+STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.
+
+## What
+Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
+update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
+0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
+(read path gather -> indexed in-kernel read; values + reduction order unchanged).
+
+## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
+- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
+ MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
+- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+ GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
+- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
+- MoE npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
+- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
+ step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
+
+## Artifacts
+- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
+- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
+- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
diff --git a/LEVER1_GATHER_RESULTS.md b/LEVER1_GATHER_RESULTS.md
new file mode 100644
index 0000000..afced02
--- /dev/null
+++ b/LEVER1_GATHER_RESULTS.md
@@ -0,0 +1,163 @@
+# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact)
+
+The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest
+kernel vLLM has no equivalent of (~5.2 ms/step MoE decode; also present in dense): vLLM updates its
+gated-DeltaNet recurrent state in-place inside the fused decode kernel, while llama ran a separate
+`ggml_get_rows` gather. Patch 0019 fused the SSM recurrent-state gather; patch 0021 fused the conv
+compute/write-back but KEPT a `build_rs` gather for the conv taps ("tiny; not one of the eliminated
+buckets"). This patch closes that residual.
+
+## Which gather was still firing (nsys-located, DGX GB10 sm_121)
+
+Profiled MoE `q36-35b-a3b-nvfp4` at batch-128 decode (`llama-batched-bench -npp128 -ntg24 -npl128
+-fa on`, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`). The decode-window `k_get_rows_float<float,float>`
+distribution was bimodal: a BIG cluster of **~720 instances (= 30 GDN layers x 24 decode steps) at
+~115 us each** plus small embedding/router gathers.
+
+The big gather's geometry (`grid=(ne10=128, block_num_y=96, 1)`) decodes to **128 rows (= n_seqs
+active sequences) of ne00 = 24576 floats**. With the model's real dims (`d_conv=4, d_inner=4096,
+n_group=16, d_state=128`):
+- `n_embd_r = (d_conv-1) * (d_inner + 2*n_group*d_state) = 3 * 8192 = 24576` -> `block_num_y=96` EXACT match.
+- `n_embd_s = d_state * d_inner = 524288` (the SSM state, gridY 2048 - already fused by 0019).
+
+So the residual `k_get_rows` is the **conv-state tap gather** in `build_conv_state_fused`
+(`src/models/delta-net-base.cpp`), which called the plain 4-arg `build_rs` -> `ggml_get_rows` of the
+24576-float conv-state row x 128 sequences, once per GDN layer per decode step (~3.4 ms/step here,
+~5.2 ms/step at steady ntg=128). The SSM-state gather is already fused, so this conv gather is the
+last `k_get_rows` in the GDN decode path.
+
+## What changed (mirror of the 0019 SSM gather fusion; bit-exact by construction)
+
+New op `ggml_ssm_conv_update_inplace_ids` (reuses `GGML_OP_SSM_CONV`, discriminated by a non-null
+`src[4]` = ids). Instead of a pre-gathered tap scratch, it takes the FULL conv-state cache (`src[0]`)
+plus the per-sequence `ids` (= the recurrent-state `s_copy`, `src[4]`; `op_params[1]=rs_head`) and
+reads each active sequence's prior K-1 taps directly from `cache[ids[s]]` in the kernel. This removes
+the separate `k_get_rows` launch.
+
+Race-free, exactly mirroring 0019:
+- **Identity** sequences (`ids[s] == rs_head + s`, the whole AR-decode path) read the taps in place
+ from the `conv_state_dst` write slot. The kernel loads the full conv window into registers before
+ it writes the 1-token-shifted ring back, so read==write slot is race-free per (channel, seq) thread.
+- **Non-identity** sequences (reorder / `rs_zero` remap at a prefill->decode boundary) are gathered
+ into a disjoint scratch by a small `ssm_conv_gather_nonident_kernel` first (no-op at steady decode),
+ so the update kernel never reads a slot another block writes.
+
+The read VALUES are unchanged (identity in-place taps == the gathered taps == `cache[ids[s]]`); only
+the read PATH changes from a `ggml_get_rows` materialization to an indexed in-kernel read. The conv
+math, ascending-tap FMA order, silu and the ring write-back are byte-identical to 0021.
+
+Files:
+- `ggml/include/ggml.h`, `ggml/src/ggml.c`: `ggml_ssm_conv_update_inplace_ids` builder
+ (src[0]=full cache [K-1,channels,n_cells], src[1]=conv_kernel, src[2]=x_cur, src[3]=conv_state_dst,
+ src[4]=ids; op_params[0]=fuse_silu, op_params[1]=rs_head).
+- `ggml/src/ggml-cuda/ssm-conv.cu`: `ssm_conv_gather_nonident_kernel` + `ssm_conv_update_ids_f32`
+ kernel + `ggml_cuda_op_ssm_conv_update_ids` + a `src[4]`-discriminated branch in `ggml_cuda_op_ssm_conv`.
+- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_ids_f32` (window copied to a
+ local before the possibly-aliasing write) + dispatch branch.
+- `src/models/delta-net-base.cpp`: `build_conv_state_fused` now feeds the FULL cache + ids through the
+ `build_rs` `get_state_rows` lambda (the rs_zero clear + extra-states copy still run around it),
+ exactly like the 0019 recurrent-attn fusion. The `qwen35` / `qwen35moe` / `qwen3next` callers are
+ unchanged (they already route the single-token decode path here).
+- `tests/test-backend-ops.cpp`: `test_ssm_conv_update_ids` (16 cases) - ids = a shuffled permutation
+ with `rs_head=0`, so each case exercises BOTH the identity in-place read and the non-identity cache
+ read; validates the conv+silu output vs the CPU reference.
+
+## GATE: test-backend-ops (CUDA0 vs CPU, 2/2 backends)
+
+- SSM_CONV_UPDATE_IDS: OK (NEW; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
+- SSM_CONV_UPDATE: OK (0021 path intact)
+- SSM_CONV: OK
+- GATED_DELTA_NET: OK
+- GET_ROWS: OK
+
+## GATE: greedy bit-exactness (--temp 0 --seed 1 -n 48, -fa on) - BOTH models BYTE-IDENTICAL
+
+| model | baseline md5 | 0028 md5 | result |
+|--------------------|----------------------------------|----------------------------------|-----------------|
+| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | BYTE-IDENTICAL |
+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | BYTE-IDENTICAL |
+
+(Built on the `paged` branch f32-default = 0026 hybrid default is f32; the baseline was re-confirmed
+on the same build before the edit.)
+
+## nsys proof - the gather is eliminated (MoE decode, npp128 ntg24 npl128, same window)
+
+| kernel | before | after |
+|-------------------------------------|---------------|-------------------------------|
+| `k_get_rows_float<float,float>` cnt | 10174 | 9454 (720 fewer = 30 GDN x 24)|
+| `k_get_rows_float<float,float>` sum | 186.3 ms | 102.8 ms (-83.5 ms) |
+| conv update kernel | `ssm_conv_update_f32` 720 | `ssm_conv_update_ids_f32` 720 |
+| `ssm_conv_gather_nonident_kernel` | - | 720 x ~1.1 us = 0.8 ms (no-op at decode) |
+
+The 720 big ~115 us conv gathers are gone; the only added work is a ~1.1 us no-op gather kernel per
+layer-step (all sequences identity during steady AR decode). This matches 0019's "no-op at decode,
+median ~1.2 us" non-identity gather.
+
+## Preliminary throughput (post-fusion, single point; rigorous A/B is the bench phase)
+
+- MoE `q36-35b-a3b-nvfp4` npl128 (`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`): **783.9 t/s**, step
+ 163.3 ms/step (MOE_GAP @0025 was 752.3 t/s / 169.8 ms/step => -6.5 ms/step in this stack).
+- dense `q36-27b-nvfp4` npl128: **377.3 t/s** (~96% of vLLM 391; includes 0022/0026 base gains).
+- npl128 ran clean (EXIT=0) on both - the non-identity boundary path does not crash.
+
+## Verdict
+
+Bit-exact (both md5 gates byte-identical, all test-backend-ops pass), the residual `k_get_rows` conv
+gather is eliminated (nsys-confirmed), and decode throughput improves. Helps BOTH dense and MoE (the
+shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path (after 0019 SSM-state
++ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
+
+Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
+NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
+
+### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
+
+| model | base (0026) | lever1 (0028) | recorded baseline |
+|-------------------|----------------------------------|----------------------------------|----------------------------------|
+| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
+
+test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
+
+dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
+
+| npl | base S_TG | lever1 S_TG | delta | vs vLLM 391 |
+|-----|-----------|-------------|--------|----------------|
+| 32 | 208.56 | 209.39 | +0.40% | - |
+| 128 | 369.95 | 377.83 | +2.13% | 94.6% -> 96.6% |
+
+MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
+
+| npl | base S_TG | lever1 S_TG | delta | vs vLLM 901 |
+|-----|-----------|-------------|--------|----------------|
+| 32 | 456.85 | 459.56 | +0.59% | - |
+| 128 | 763.47 | 777.95 | +1.90% | 84.7% -> 86.3% |
+
+Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
+
+### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
+
+| kernel | base (0026) | lever1 (0028) |
+|---------------------------------|------------------------|----------------------------------------------|
+| k_get_rows_float<float,float> | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms |
+| delta | | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
+| ssm_conv_update(_ids)_f32 | 219.71 ms (update) | 225.75 ms (update_ids, +6 ms) |
+| ssm_conv_gather_nonident_kernel | - | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
+
+The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
+into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
+the -3.13 ms/step throughput delta at npl128.
+
+### Verdict (gather-bench)
+
+Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
+gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
+throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
+MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 2a5cbce..5fa220a 100644
--- a/ggml/include/ggml.h


@@ -213,17 +213,87 @@ all 23 patches, and the resulting tree is **byte-identical to the gate-green
shipped `.patch` series reproduces exactly the tree that passed test-backend-ops,
the md5 bit-exact gate, and the bench.

## Pre-existing finding (NOT introduced by this pin-sync, NOT fixed here)

Committed patch `0019` carries a *modify* hunk against the dev-only doc
`SSM_DECODE_FIX_RESULTS.md` (`index 2e7c8c2..77879e4 100644`), a file that exists
only because of an unshipped docs commit on the dev tree and is absent from a
clean llama.cpp checkout. Under strict `git apply` that hunk fails ("No such file
or directory"). This is pin-independent (the file is upstream-absent on both
`8be759e6` and `9d5d882d`) and present identically in the old and new `0019`
(LINENUM class), so it is left untouched to keep the pin-sync faithful. (`0021`'s
`CONV_STATE_FUSION_RESULTS.md` is a *create* hunk and applies fine.) Stripping the
stray dev-doc hunks from the shipped patches is a separate cleanup, out of scope
for the pin-sync.

## Shipped-build bug FIXED: stray dev-doc hunks stripped from the patch series

The pin-sync export captured dev-only result/progress docs that live in the DGX
dev tree (`~/llama-paged-dev`) but are ABSENT from a clean `ggml-org/llama.cpp`
checkout. The shipped build applies the paged series with **strict `git apply`**
(the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
`git apply --verbose "$p" || { echo "paged patch failed"; exit 1; }`), which is
atomic: a single hunk against a missing file REJECTS the entire patch and the
`exit 1` fails the build. (`prepare.sh` uses tolerant `patch -pN -N ... || true`,
but it is guarded by the `src/paged-kv-manager.cpp` sentinel and skipped at build
time once the Makefile has applied the series, so the strict `git apply` is the
real shipped path.)

Root failure was patch `0019`'s *modify* hunk against `SSM_DECODE_FIX_RESULTS.md`
(`index 2e7c8c2..77879e4 100644`): on a clean tree `git apply` cannot find the
file to modify ("No such file or directory") and rejects all of `0019`, which
then cascades to `0021`/`0022`/`0026`/`0028` (they build on `0019`'s code). The
build therefore only succeeded on the DGX (where the doc exists) and FAILED on CI
/ any clean checkout.
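
The all-or-nothing behaviour and the resulting cascade can be illustrated with a
toy model. This is a sketch only, not how `git apply` is implemented: a patch is
reduced to a list of `(kind, path)` hunks, the tree to a set of paths, and the
code dependency of `0021` on `0019` is approximated by a file that `0019` creates.
The `src/hypothetical-*.cpp` and `src/existing.cpp` names are invented for
illustration.

```python
def apply_strict(tree, patch):
    """Apply every hunk or none: raise before touching the tree if any hunk cannot apply."""
    for kind, path in patch:
        if kind == "modify" and path not in tree:
            raise FileNotFoundError(f"{path}: No such file or directory")
    # Only reached when every hunk is applicable, so a reject leaves the tree as it was.
    return tree | {path for kind, path in patch if kind == "create"}


def apply_series(tree, series):
    """Continue-mode run: keep going after a reject and report which patches failed."""
    failed = []
    for name, patch in series:
        try:
            tree = apply_strict(tree, patch)
        except FileNotFoundError:
            failed.append(name)
    return tree, failed


# Hypothetical file layout, for illustration only.
series = [
    ("0019", [("create", "src/hypothetical-ssm-fix.cpp"),
              ("modify", "SSM_DECODE_FIX_RESULTS.md")]),
    ("0020", [("modify", "src/existing.cpp"),
              ("create", "LEVER1_OPROJ_MMQ_RESULTS.md")]),
    ("0021", [("modify", "src/hypothetical-ssm-fix.cpp")]),
]
clean_tree = frozenset({"src/existing.cpp"})
dev_tree = clean_tree | {"SSM_DECODE_FIX_RESULTS.md"}

clean_result, clean_failed = apply_series(clean_tree, series)
dev_result, dev_failed = apply_series(dev_tree, series)
```

In this model the clean tree rejects `0019` and then `0021`, while `0020` applies
and leaves a stray doc behind; the dev tree, which already holds the doc, applies
everything.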

Fixed by stripping every stray non-source hunk so the patches contain ONLY
llama.cpp source changes. Stripped hunks (dev docs absent from a clean
`9d5d882d` checkout):

| patch | stripped dev-doc hunk(s) | hunk kind |
|-------|--------------------------|-----------|
| `0019` | `SSM_DECODE_FIX_RESULTS.md` | modify (the root reject) |
| `0020` | `LEVER1_OPROJ_MMQ_RESULTS.md` | create |
| `0021` | `CONV_STATE_FUSION_RESULTS.md` | create |
| `0028` | `LEVER1_GATHER_PROGRESS.md`, `LEVER1_GATHER_RESULTS.md` | create |

(The `create` hunks did not reject on their own - `git apply` will create a new
file even on a clean tree - but they polluted the build tree with stray dev docs
and violated the source-only invariant, so they were stripped too.) For each
patch the `diff --git a/<devdoc> ...` section was removed along with its diffstat
per-file line, any `create mode` trailer, and the `N files changed, ...` summary
was corrected; **every llama.cpp SOURCE hunk is byte-identical** (verified by
sha256 of each patch's source-diff tail before vs after the strip).
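
A minimal sketch of the strip and the byte-identity check, under stated
assumptions: dev docs are recognised by a `_RESULTS.md` / `_PROGRESS.md` suffix
(which covers the five files in the table), paths contain no spaces, and the
diffstat line, `create mode` trailer and summary count are NOT rewritten here.
The helper names are hypothetical and this is not the script used for the commit.

```python
import hashlib
import re

DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")
DEV_DOC = re.compile(r"(_RESULTS|_PROGRESS)\.md$")  # assumed naming rule


def sections(patch_text):
    """Yield (path, text) per 'diff --git' section; path is None for the mail header and diffstat."""
    path, buf = None, []
    for line in patch_text.splitlines(keepends=True):
        match = DIFF_HEADER.match(line.rstrip("\n"))
        if match:
            yield path, "".join(buf)
            path, buf = match.group(2), []
        buf.append(line)
    yield path, "".join(buf)


def strip_dev_docs(patch_text):
    """Drop every section whose target path looks like a dev doc; keep the rest verbatim."""
    return "".join(text for path, text in sections(patch_text)
                   if path is None or not DEV_DOC.search(path))


def source_digests(patch_text):
    """sha256 per source section, for comparing a patch before and after the strip."""
    return {path: hashlib.sha256(text.encode()).hexdigest()
            for path, text in sections(patch_text)
            if path is not None and not DEV_DOC.search(path)}


# Made-up patch text: one dev-doc create section followed by one source section.
sample = (
    "From 0000 Mon Sep 17 00:00:00 2001\n"
    "Subject: [PATCH] example\n"
    "---\n"
    "diff --git a/LEVER1_GATHER_RESULTS.md b/LEVER1_GATHER_RESULTS.md\n"
    "new file mode 100644\n"
    "--- /dev/null\n"
    "+++ b/LEVER1_GATHER_RESULTS.md\n"
    "@@ -0,0 +1 @@\n"
    "+Ship it.\n"
    "diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h\n"
    "--- a/ggml/include/ggml.h\n"
    "+++ b/ggml/include/ggml.h\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
)
stripped = strip_dev_docs(sample)
```

Comparing `source_digests(sample)` with `source_digests(stripped)` mirrors the
before/after sha256 check. One caveat of this sketch: if a dev-doc section were
the last one in a patch, the `git format-patch` signature trailer would be
dropped along with it.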

Verified on a fresh `git clone` of `ggml-org/llama.cpp` at this pin `9d5d882d`:

- BEFORE the strip, strict `git apply` of the series: OK through `0018`, then
  `0019` FAILS ("SSM_DECODE_FIX_RESULTS.md: No such file or directory") -> the
  Makefile `exit 1`s; continue-mode shows the full cascade `0019` `0021` `0022`
  `0026` `0028` failing.
- AFTER the strip, strict `git apply` of the full series `0001..0030` reaches
  **exit 0** (every patch OK, sentinel `src/paged-kv-manager.cpp` created, zero
  stray `*_RESULTS.md`/`*_PROGRESS.md` in the tree). The tolerant `patch -p1`
  path (prepare.sh fallback) also applies with zero rejects.

## Durable fix: keep patch exports SOURCE-ONLY

The pin-sync / re-export step MUST NOT capture dev-only artifacts into the shipped
`.patch` files. A clean `ggml-org/llama.cpp` checkout contains its own real docs
(`README.md`, `docs/`, `AGENTS.md`, ...) but NOT LocalAI dev notes - anything
matching `*_RESULTS.md`, `*_PROGRESS.md`, `*.diff`, `final_benchmark.csv`,
`LEVER*`, `BENCH*`, `paged-*-bench.cpp`, or any other non-source path that does
not exist at the pin is a dev artifact and must be excluded. Concretely, when
re-exporting:

- prefer `git format-patch -1 <commit> -- ':!*.md' ':!*.diff' ':!*.csv'` (or an
  explicit pathspec of the llama.cpp source dirs `src/ ggml/ common/ include/
  tools/ tests/ cmake/`) so dev docs never enter the patch body;
- keep the dev-notes commits SEPARATE from the code commits on the dev branch, so
  a per-commit export is naturally source-only;
- after export, gate the series: clone the pin and apply the full series with
  strict `git apply` (no `--exclude`, no `|| true`) - it MUST reach exit 0. The
  weekly canary (`.github/workflows/llama-cpp-paged-canary.yml`) does this against
  upstream HEAD; now that the patches are source-only its `0019`
  `SSM_DECODE_FIX_RESULTS.md` `--exclude` workaround
  (`.github/scripts/paged-canary-apply.sh`) is no longer needed and can be removed
  on the next canary touch.

The upcoming `c299a92c` pin-bump re-export MUST follow this: produce source-only
patches and pass the strict-`git apply` gate on a clean checkout before advancing
the pin.
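
As a cheap complement to the strict-apply gate, the naming rules above can be
checked against the paths a patch touches before anything is cloned. The sketch
below is an assumption about how such a check might look, not an existing script
in the repo: it matches the basename against the listed patterns and requires the
path to sit under one of the listed source dirs.

```python
from fnmatch import fnmatchcase
from posixpath import basename

# Patterns and source dirs copied from the list above.
ARTIFACT_PATTERNS = ("*_RESULTS.md", "*_PROGRESS.md", "*.diff", "final_benchmark.csv",
                     "LEVER*", "BENCH*", "paged-*-bench.cpp")
SOURCE_DIRS = ("src/", "ggml/", "common/", "include/", "tools/", "tests/", "cmake/")


def is_dev_artifact(path):
    """True if the file name matches one of the dev-artifact patterns."""
    name = basename(path)
    return any(fnmatchcase(name, pattern) for pattern in ARTIFACT_PATTERNS)


def is_shippable(path):
    """True if the path lives under a source dir and is not a dev artifact."""
    return path.startswith(SOURCE_DIRS) and not is_dev_artifact(path)


def offending_paths(paths):
    """Paths that should not appear in a shipped patch, in input order."""
    return [path for path in paths if not is_shippable(path)]


touched = [
    "src/paged-kv-manager.cpp",
    "ggml/include/ggml.h",
    "SSM_DECODE_FIX_RESULTS.md",
    "LEVER1_GATHER_PROGRESS.md",
    "tools/paged-kv-bench.cpp",  # hypothetical path matching paged-*-bench.cpp
]
bad = offending_paths(touched)
```

A non-empty `bad` list would fail the export early; the strict `git apply` on a
clean checkout remains the authoritative gate.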

## Historical note (pre-strip)

Before this cleanup, `0019` carried the `SSM_DECODE_FIX_RESULTS.md` modify hunk
identically in the old and new exports (LINENUM class) and was left untouched
during the pin-sync to keep the rebase faithful; `0021`'s
`CONV_STATE_FUSION_RESULTS.md` was a create hunk that applied but still leaked a
dev doc. Both are now removed by the source-only strip above.

## Source of truth

The rebased branch on the DGX dev tree (`~/llama-paged-dev`, branch `paged`, HEAD