Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention

# Conflicts:
#	gallery/index.yaml
Ettore Di Giacinto
2026-06-26 21:38:56 +00:00
11 changed files with 330 additions and 50 deletions

@@ -1,4 +1,4 @@
From 944636cf34b486d4035575e48845840368de0743 Mon Sep 17 00:00:00 2001
From fafe8785c8595f53a51efec20cf84f9146437e0c Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 26 Jun 2026 22:58:47 +0200
Subject: [PATCH] feat(paged): qwen35 recurrent-state gather fusion (patch
@@ -46,22 +46,56 @@ MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
LEVER1_GATHER_RESULTS.md | 110 +++++++++++++++++++++++
ggml/include/ggml.h | 20 +++++
ggml/src/ggml-cpu/ops.cpp | 90 ++++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++++-
LEVER1_GATHER_PROGRESS.md | 26 ++++++
LEVER1_GATHER_RESULTS.md | 163 +++++++++++++++++++++++++++++++++
ggml/include/ggml.h | 20 ++++
ggml/src/ggml-cpu/ops.cpp | 90 +++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++-
ggml/src/ggml.c | 62 +++++++++++++
src/models/delta-net-base.cpp | 26 ++++--
tests/test-backend-ops.cpp | 69 +++++++++++++++
7 files changed, 521 insertions(+), 11 deletions(-)
tests/test-backend-ops.cpp | 69 ++++++++++++++
8 files changed, 600 insertions(+), 11 deletions(-)
create mode 100644 LEVER1_GATHER_PROGRESS.md
create mode 100644 LEVER1_GATHER_RESULTS.md
diff --git a/LEVER1_GATHER_PROGRESS.md b/LEVER1_GATHER_PROGRESS.md
new file mode 100644
index 0000000..e4d14b9
--- /dev/null
+++ b/LEVER1_GATHER_PROGRESS.md
@@ -0,0 +1,26 @@
+# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
+
+STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.
+
+## What
+Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
+update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
+0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
+(read path gather -> indexed in-kernel read; values + reduction order unchanged).
+
+## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
+- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
+ MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
+- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+ GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
+- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
+- MoE npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
+- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
+ step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
+
+## Artifacts
+- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
+- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
+- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
diff --git a/LEVER1_GATHER_RESULTS.md b/LEVER1_GATHER_RESULTS.md
new file mode 100644
index 0000000..c78e3c0
index 0000000..afced02
--- /dev/null
+++ b/LEVER1_GATHER_RESULTS.md
@@ -0,0 +1,110 @@
@@ -0,0 +1,163 @@
+# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact)
+
+The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest
@@ -172,6 +206,59 @@ index 0000000..c78e3c0
++ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
+
+Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
+NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
+
+### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
+
+| model | base (0026) | lever1 (0028) | recorded baseline |
+|-------------------|----------------------------------|----------------------------------|----------------------------------|
+| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
+
+test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
+
+dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
+
+| npl | base S_TG | lever1 S_TG | delta | vs vLLM 391 |
+|-----|-----------|-------------|--------|----------------|
+| 32 | 208.56 | 209.39 | +0.40% | - |
+| 128 | 369.95 | 377.83 | +2.13% | 94.6% -> 96.6% |
+
+MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
+
+| npl | base S_TG | lever1 S_TG | delta | vs vLLM 901 |
+|-----|-----------|-------------|--------|----------------|
+| 32 | 456.85 | 459.56 | +0.59% | - |
+| 128 | 763.47 | 777.95 | +1.90% | 84.7% -> 86.3% |
+
+Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
+
+### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
+
+| kernel | base (0026) | lever1 (0028) |
+|---------------------------------|------------------------|----------------------------------------------|
+| k_get_rows_float<float,float> | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms |
+| delta | | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
+| ssm_conv_update(_ids)_f32 | 219.71 ms (update) | 225.75 ms (update_ids, +6 ms) |
+| ssm_conv_gather_nonident_kernel | - | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
+
+The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
+into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
+the -3.13 ms/step throughput delta at npl128.
+
+### Verdict (gather-bench)
+
+Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
+gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
+throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
+MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 2a5cbce..5fa220a 100644
--- a/ggml/include/ggml.h

@@ -1,42 +1,26 @@
# LEVER1_GATHER_PROGRESS.md - gather-build GPU agent checkpoint
# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
Status: **DONE.** Residual k_get_rows fused in-place, bit-exact, both gates pass. Patch 0028.
STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.
## Lever
Fuse the residual `k_get_rows_float` in the GDN decode path (the biggest single kernel vLLM lacks,
~5.2 ms/step MoE per MOE_GAP_VS_VLLM.md). 0019 fused the SSM-state gather; 0021 fused the conv
compute but kept a `build_rs` gather for the conv taps. This patch closes that last gather.
## What
Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
(read path gather -> indexed in-kernel read; values + reduction order unchanged).
## Located (nsys, DGX GB10, MoE q36-35b-a3b-nvfp4, npp128 ntg24 npl128)
The residual is the **conv-state tap gather** in `build_conv_state_fused`
(`src/models/delta-net-base.cpp`): the plain 4-arg `build_rs` -> `ggml_get_rows` of n_embd_r = 24576
floats (= (d_conv-1)*(d_inner + 2*n_group*d_state) = 3*8192) x 128 seqs, once per GDN layer per step.
Decode-window `k_get_rows_float<float,float>` had a BIG cluster of ~720 instances (30 GDN x 24) at
~115 us = ~3.4 ms/step (5.2 ms/step at steady ntg=128). grid (ne10=128, block_num_y=96) confirmed
ne00=24576 == n_embd_r (the SSM n_embd_s=524288 gather is already fused by 0019).
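
The sizes quoted above can be re-derived in a few lines. This is only an arithmetic sketch: the 8192 conv-channel total is taken as given from the note (its split into d_inner, n_group and d_state is not restated here), f32 storage is assumed from "floats", and the variable names are illustrative.

```python
# Re-derive the conv-tap gather size and its per-step cost from the quoted figures.
D_CONV_MINUS_1 = 3      # taps kept per channel, (d_conv - 1)
CONV_CHANNELS = 8192    # d_inner + 2*n_group*d_state, as quoted
N_SEQS = 128            # npl128
N_GDN_LAYERS = 30       # GDN layers, one gather each per step
GATHER_US = 115.0       # ~115 us per gather (nsys)
F32_BYTES = 4           # assumption: f32 conv state

n_embd_r = D_CONV_MINUS_1 * CONV_CHANNELS
bytes_per_gather = n_embd_r * N_SEQS * F32_BYTES
ms_per_step = N_GDN_LAYERS * GATHER_US / 1000.0

print("n_embd_r floats per seq:", n_embd_r)
print("MiB copied per gather:", bytes_per_gather / 2**20)
print("gather ms per decode step:", round(ms_per_step, 2))
```

Under these assumptions each gather copies about 12 MiB of taps, which is consistent with it showing up as a large kernel in the decode window.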
## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
## Built (paged branch f32 default = 0026 hybrid default is f32)
New op `ggml_ssm_conv_update_inplace_ids` (src[4]=ids, op_params[1]=rs_head): reads each seq's prior
taps from cache[ids[s]] in-kernel (identity -> in place from conv_state_dst; non-identity -> disjoint
scratch via ssm_conv_gather_nonident_kernel). Mirrors 0019. Files: ggml.h, ggml.c, ssm-conv.cu,
ggml-cpu/ops.cpp, delta-net-base.cpp, tests/test-backend-ops.cpp. Build EXIT=0.
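
Why the fusion is bit-exact by construction can be shown with a toy model. The sketch below is not the CUDA kernel: it is a pure-Python, single-channel simplification with hypothetical function names, and it leaves out the in-place write-back and the non-identity scratch path. It only illustrates that replacing the gather copy with an indexed read feeds the same values through the same reduction order.

```python
def conv_update(prev_taps, x, weights):
    """One decode step for one channel: prev_taps holds d_conv-1 prior inputs."""
    window = prev_taps + [x]            # new list, the cache row is not mutated
    y = 0.0
    for w, v in zip(weights, window):   # fixed reduction order
        y += w * v
    return y, window[1:]                # output and shifted taps

def step_unfused(cache, ids, xs, weights):
    # Separate gather (the get_rows copy), then the conv update on the copy.
    gathered = [list(cache[i]) for i in ids]
    return [conv_update(t, x, weights) for t, x in zip(gathered, xs)]

def step_fused(cache, ids, xs, weights):
    # Indexed read inside the update: no intermediate copy.
    return [conv_update(cache[i], x, weights) for i, x in zip(ids, xs)]

cache = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0], [-1.0, 0.5, 0.25]]
ids = [2, 0]                            # each sequence's cache slot
xs = [0.7, -0.4]                        # new input per sequence
weights = [0.5, -0.25, 0.125, 1.0]      # d_conv = 4 taps

unfused = step_unfused(cache, ids, xs, weights)
fused = step_fused(cache, ids, xs, weights)
print(unfused == fused)
```

The comparison uses exact float equality on purpose: both paths perform identical operations on identical operands, so no tolerance is needed.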
## GATE - PASS
- test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS OK (new), SSM_CONV_UPDATE OK, SSM_CONV OK,
GATED_DELTA_NET OK, GET_ROWS OK.
- greedy md5 (-temp 0 -seed 1 -n 48) BYTE-IDENTICAL both models:
dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd (== baseline).
- nsys: k_get_rows<float,float> 10174 -> 9454 (720 fewer), 186.3 -> 102.8 ms; conv gathers replaced
by 720 x ~1.1 us no-op gather. MoE npl128 783.9 t/s (step 163.3 ms vs 169.8 @0025), dense 377.3 t/s.
## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
- MoE npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
## Artifacts
- DGX: commit `944636c` on branch `paged`; LEVER1_GATHER_RESULTS.md in llama tree; nsys
`/tmp/kgr_moe.nsys-rep` (before) + `/tmp/kgr_moe_after.nsys-rep` (after).
- LocalAI worktree: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch + LEVER1_GATHER_RESULTS.md.
- BOTH trees committed (-s). NOT pushed.
## Next
Ready for the rigorous same-session A/B decode bench (npl 32/128, dense + MoE, before/after on the
same 0026 base). The kernel-elimination and bit-exactness are proven; the bench quantifies the lift.
Assisted-by: Claude:opus-4.8 [Claude Code]
- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt

@@ -108,3 +108,56 @@ shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path
+ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
Assisted-by: Claude:opus-4.8 [Claude Code]
## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
| model | base (0026) | lever1 (0028) | recorded baseline |
|-------------------|----------------------------------|----------------------------------|----------------------------------|
| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
| npl | base S_TG | lever1 S_TG | delta | vs vLLM 391 |
|-----|-----------|-------------|--------|----------------|
| 32 | 208.56 | 209.39 | +0.40% | - |
| 128 | 369.95 | 377.83 | +2.13% | 94.6% -> 96.6% |
MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
| npl | base S_TG | lever1 S_TG | delta | vs vLLM 901 |
|-----|-----------|-------------|--------|----------------|
| 32 | 456.85 | 459.56 | +0.59% | - |
| 128 | 763.47 | 777.95 | +1.90% | 84.7% -> 86.3% |
Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
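
The deltas, vLLM ratios and step times in the tables can be re-derived from the S_TG columns. A minimal sketch, assuming each decode step emits one token per parallel sequence (so step time = npl / S_TG); helper names are illustrative.

```python
def pct_delta(base, new):
    """Relative change in percent."""
    return (new - base) / base * 100.0

def step_ms(npl, s_tg):
    """Step time in ms, assuming one token per sequence per decode step."""
    return npl / s_tg * 1000.0

NPL = 128
rows = {
    "dense": {"base": 369.95, "lever1": 377.83, "vllm": 391.0},
    "moe":   {"base": 763.47, "lever1": 777.95, "vllm": 901.0},
}

for name, r in rows.items():
    print(
        name,
        f"delta {pct_delta(r['base'], r['lever1']):+.2f}%",
        f"vs vLLM {r['base'] / r['vllm'] * 100:.1f}% -> {r['lever1'] / r['vllm'] * 100:.1f}%",
        f"step {step_ms(NPL, r['base']):.1f} -> {step_ms(NPL, r['lever1']):.1f} ms",
    )
```

Recomputing from the rounded S_TG values gives a MoE step delta of about 3.12 ms, within rounding of the quoted -3.13 ms/step.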
### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
| kernel | base (0026) | lever1 (0028) |
|---------------------------------|------------------------|----------------------------------------------|
| k_get_rows_float<float,float> | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms |
| delta | | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
| ssm_conv_update(_ids)_f32 | 219.71 ms (update) | 225.75 ms (update_ids, +6 ms) |
| ssm_conv_gather_nonident_kernel | - | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
the -3.13 ms/step throughput delta at npl128.
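
The net saving quoted above follows from the nsys table. A small accounting sketch using only the table's figures (variable names are illustrative):

```python
# nsys figures from the table (MoE decode, 64 steps)
base_inst, lever_inst = 17334, 15414          # k_get_rows_float instances
base_ms, lever_ms = 358.37, 133.52            # k_get_rows_float total time
update_base_ms, update_ids_ms = 219.71, 225.75
noop_count, noop_us = 1920, 1.13              # ssm_conv_gather_nonident_kernel
n_layers, n_steps = 30, 64

removed_inst = base_inst - lever_inst
gather_saved_ms = base_ms - lever_ms
added_ms = (update_ids_ms - update_base_ms) + noop_count * noop_us / 1000.0
net_saved_ms = gather_saved_ms - added_ms
net_saved_per_step_ms = net_saved_ms / n_steps

print("removed instances:", removed_inst, "expected:", n_layers * n_steps)
print("gather time saved (ms):", round(gather_saved_ms, 2))
print("added by update_ids + no-op kernel (ms):", round(added_ms, 2))
print("net saved (ms):", round(net_saved_ms, 2))
print("net saved per step (ms):", round(net_saved_per_step_ms, 2))
```

The per-step figure is kernel time only, so it is expected to land near, not exactly on, the throughput-derived step delta.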
### Verdict (gather-bench)
Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.