Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention

# Conflicts:
#	gallery/index.yaml
2026-06-27 09:57:14 -04:00 · 2026-06-26 21:38:56 +00:00
parent 6dd8a3d895 56600eec3e
commit c1f1d1e8ea
11 changed files with 330 additions and 50 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch
@@ -1,4 +1,4 @@
-From 944636cf34b486d4035575e48845840368de0743 Mon Sep 17 00:00:00 2001
+From fafe8785c8595f53a51efec20cf84f9146437e0c Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Fri, 26 Jun 2026 22:58:47 +0200
 Subject: [PATCH] feat(paged): qwen35 recurrent-state gather fusion (patch
@@ -46,22 +46,56 @@ MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s.
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
- LEVER1_GATHER_RESULTS.md       | 110 +++++++++++++++++++++++
- ggml/include/ggml.h            |  20 +++++
- ggml/src/ggml-cpu/ops.cpp      |  90 ++++++++++++++++++-
- ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++++-
+ LEVER1_GATHER_PROGRESS.md      |  26 ++++++
+ LEVER1_GATHER_RESULTS.md       | 163 +++++++++++++++++++++++++++++++++
+ ggml/include/ggml.h            |  20 ++++
+ ggml/src/ggml-cpu/ops.cpp      |  90 +++++++++++++++++-
+ ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++-
 ggml/src/ggml.c                |  62 +++++++++++++
 src/models/delta-net-base.cpp  |  26 ++++--
- tests/test-backend-ops.cpp     |  69 +++++++++++++++
- 7 files changed, 521 insertions(+), 11 deletions(-)
+ tests/test-backend-ops.cpp     |  69 ++++++++++++++
+ 8 files changed, 600 insertions(+), 11 deletions(-)
+ create mode 100644 LEVER1_GATHER_PROGRESS.md
 create mode 100644 LEVER1_GATHER_RESULTS.md

+diff --git a/LEVER1_GATHER_PROGRESS.md b/LEVER1_GATHER_PROGRESS.md
+new file mode 100644
+index 0000000..e4d14b9
+--- /dev/null
+++ b/LEVER1_GATHER_PROGRESS.md
+@@ -0,0 +1,26 @@
+# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
+
+STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.
+
+## What
+Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
+update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
+0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
+(read path gather -> indexed in-kernel read; values + reduction order unchanged).
+
+## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
+- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
+  MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
+- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+  GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
+- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
+- MoE   npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
+- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
+  step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
+
+## Artifacts
+- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
+- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
+- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
 diff --git a/LEVER1_GATHER_RESULTS.md b/LEVER1_GATHER_RESULTS.md
 new file mode 100644
-index 0000000..c78e3c0
+index 0000000..afced02
 --- /dev/null
 +++ b/LEVER1_GATHER_RESULTS.md
-@@ -0,0 +1,110 @@
+@@ -0,0 +1,163 @@
 +# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact)
 +
 +The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest
@@ -172,6 +206,59 @@ index 0000000..c78e3c0
 ++ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
 +
 +Assisted-by: Claude:opus-4.8 [Claude Code]
+
+## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
+
+Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
+NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
+
+### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
+
+| model             | base (0026)                      | lever1 (0028)                    | recorded baseline                |
+|-------------------|----------------------------------|----------------------------------|----------------------------------|
+| q36-27b-nvfp4     | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
+
+test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
+
+dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
+
+| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 391    |
+|-----|-----------|-------------|--------|----------------|
+| 32  | 208.56    | 209.39      | +0.40% | -              |
+| 128 | 369.95    | 377.83      | +2.13% | 94.6% -> 96.6% |
+
+MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
+
+| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 901    |
+|-----|-----------|-------------|--------|----------------|
+| 32  | 456.85    | 459.56      | +0.59% | -              |
+| 128 | 763.47    | 777.95      | +1.90% | 84.7% -> 86.3% |
+
+Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
+
+### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
+
+| kernel                          | base (0026)            | lever1 (0028)                                |
+|---------------------------------|------------------------|----------------------------------------------|
+| k_get_rows_float<float,float>   | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms                       |
+| delta                           |                        | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
+| ssm_conv_update(_ids)_f32       | 219.71 ms (update)     | 225.75 ms (update_ids, +6 ms)                |
+| ssm_conv_gather_nonident_kernel | -                      | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
+
+The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
+into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
+the -3.13 ms/step throughput delta at npl128.
+
+### Verdict (gather-bench)
+
+Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
+gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
+throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
+MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
 diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
 index 2a5cbce..5fa220a 100644
 --- a/ggml/include/ggml.h
--- a/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_PROGRESS.md
@@ -1,42 +1,26 @@
-# LEVER1_GATHER_PROGRESS.md - gather-build GPU agent checkpoint
+# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE

-Status: **DONE.** Residual k_get_rows fused in-place, bit-exact, both gates pass. Patch 0028.
+STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.

-## Lever
-Fuse the residual `k_get_rows_float` in the GDN decode path (the biggest single kernel vLLM lacks,
-~5.2 ms/step MoE per MOE_GAP_VS_VLLM.md). 0019 fused the SSM-state gather; 0021 fused the conv
-compute but kept a `build_rs` gather for the conv taps. This patch closes that last gather.
+## What
+Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
+update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
+0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
+(read path gather -> indexed in-kernel read; values + reduction order unchanged).

-## Located (nsys, DGX GB10, MoE q36-35b-a3b-nvfp4, npp128 ntg24 npl128)
-The residual is the **conv-state tap gather** in `build_conv_state_fused`
-(`src/models/delta-net-base.cpp`): the plain 4-arg `build_rs` -> `ggml_get_rows` of n_embd_r = 24576
-floats (= (d_conv-1)*(d_inner + 2*n_group*d_state) = 3*8192) x 128 seqs, once per GDN layer per step.
-Decode-window `k_get_rows_float<float,float>` had a BIG cluster of ~720 instances (30 GDN x 24) at
-~115 us = ~3.4 ms/step (5.2 ms/step at steady ntg=128). grid (ne10=128, block_num_y=96) confirmed
-ne00=24576 == n_embd_r (the SSM n_embd_s=524288 gather is already fused by 0019).
+## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
+- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
+  MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
+- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+  GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.

-## Built (paged branch f32 default = 0026 hybrid default is f32)
-New op `ggml_ssm_conv_update_inplace_ids` (src[4]=ids, op_params[1]=rs_head): reads each seq's prior
-taps from cache[ids[s]] in-kernel (identity -> in place from conv_state_dst; non-identity -> disjoint
-scratch via ssm_conv_gather_nonident_kernel). Mirrors 0019. Files: ggml.h, ggml.c, ssm-conv.cu,
-ggml-cpu/ops.cpp, delta-net-base.cpp, tests/test-backend-ops.cpp. Build EXIT=0.
-
-## GATE - PASS
- test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS OK (new), SSM_CONV_UPDATE OK, SSM_CONV OK,
-  GATED_DELTA_NET OK, GET_ROWS OK.
- greedy md5 (-temp 0 -seed 1 -n 48) BYTE-IDENTICAL both models:
-  dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd (== baseline).
- nsys: k_get_rows<float,float> 10174 -> 9454 (720 fewer), 186.3 -> 102.8 ms; conv gathers replaced
-  by 720 x ~1.1 us no-op gather. MoE npl128 783.9 t/s (step 163.3 ms vs 169.8 @0025), dense 377.3 t/s.
+## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
+- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
+- MoE   npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
+- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
+  step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.

 ## Artifacts
- DGX: commit `944636c` on branch `paged`; LEVER1_GATHER_RESULTS.md in llama tree; nsys
-  `/tmp/kgr_moe.nsys-rep` (before) + `/tmp/kgr_moe_after.nsys-rep` (after).
- LocalAI worktree: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch + LEVER1_GATHER_RESULTS.md.
- BOTH trees committed (-s). NOT pushed.
-
-## Next
-Ready for the rigorous same-session A/B decode bench (npl 32/128, dense + MoE, before/after on the
-same 0026 base). The kernel-elimination and bit-exactness are proven; the bench quantifies the lift.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
+- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
+- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
+- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
--- a/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER1_GATHER_RESULTS.md
@@ -108,3 +108,56 @@ shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path
 + 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.

 Assisted-by: Claude:opus-4.8 [Claude Code]
+
+## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
+
+Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
+NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
+
+### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
+
+| model             | base (0026)                      | lever1 (0028)                    | recorded baseline                |
+|-------------------|----------------------------------|----------------------------------|----------------------------------|
+| q36-27b-nvfp4     | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
+
+test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
+
+dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
+
+| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 391    |
+|-----|-----------|-------------|--------|----------------|
+| 32  | 208.56    | 209.39      | +0.40% | -              |
+| 128 | 369.95    | 377.83      | +2.13% | 94.6% -> 96.6% |
+
+MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
+
+| npl | base S_TG | lever1 S_TG | delta  | vs vLLM 901    |
+|-----|-----------|-------------|--------|----------------|
+| 32  | 456.85    | 459.56      | +0.59% | -              |
+| 128 | 763.47    | 777.95      | +1.90% | 84.7% -> 86.3% |
+
+Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
+
+### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
+
+| kernel                          | base (0026)            | lever1 (0028)                                |
+|---------------------------------|------------------------|----------------------------------------------|
+| k_get_rows_float<float,float>   | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms                       |
+| delta                           |                        | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
+| ssm_conv_update(_ids)_f32       | 219.71 ms (update)     | 225.75 ms (update_ids, +6 ms)                |
+| ssm_conv_gather_nonident_kernel | -                      | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
+
+The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
+into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
+the -3.13 ms/step throughput delta at npl128.
+
+### Verdict (gather-bench)
+
+Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
+gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
+throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
+MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
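
Reviewer note: the percentage deltas and the nsys saving quoted in the gather-bench tables can be re-derived from the raw figures with a short script. This is a minimal sketch, not part of the patch. The numbers are copied from the tables above, and the helper name `pct_delta` and the variable names are illustrative assumptions.

```python
# Sanity check of the gather-bench arithmetic (figures copied from the tables).

def pct_delta(base: float, new: float) -> float:
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100.0

# S_TG t/s, (base 0026, lever1 0028)
bench = {
    "dense npl32": (208.56, 209.39),
    "dense npl128": (369.95, 377.83),
    "moe npl32": (456.85, 459.56),
    "moe npl128": (763.47, 777.95),
}
deltas = {name: round(pct_delta(b, n), 2) for name, (b, n) in bench.items()}

# nsys, MoE decode, 64 steps
inst_removed = 17334 - 15414          # k_get_rows_float instances removed
gather_saved_ms = 358.37 - 133.52     # k_get_rows_float time removed
update_cost_ms = 225.75 - 219.71      # extra time folded into update_ids
noop_cost_ms = 1920 * 1.13e-3         # 1920 no-op gathers at ~1.13 us each
net_saved_ms = gather_saved_ms - update_cost_ms - noop_cost_ms
per_step_ms = net_saved_ms / 64

for name, d in deltas.items():
    print(f"{name}: {d:+.2f}%")
print(f"instances removed: {inst_removed}")
print(f"net saving: {net_saved_ms:.2f} ms over 64 steps, {per_step_ms:.2f} ms/step")
```

The per-step figure comes out close to the step-time delta reported in the bench table, which is the consistency the verdict relies on.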