mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention
# Conflicts:
#	gallery/index.yaml
@@ -1,4 +1,4 @@
From 944636cf34b486d4035575e48845840368de0743 Mon Sep 17 00:00:00 2001
From fafe8785c8595f53a51efec20cf84f9146437e0c Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 26 Jun 2026 22:58:47 +0200
Subject: [PATCH] feat(paged): qwen35 recurrent-state gather fusion (patch
@@ -46,22 +46,56 @@ MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
LEVER1_GATHER_RESULTS.md | 110 +++++++++++++++++++++++
ggml/include/ggml.h | 20 +++++
ggml/src/ggml-cpu/ops.cpp | 90 ++++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++++-
LEVER1_GATHER_PROGRESS.md | 26 ++++++
LEVER1_GATHER_RESULTS.md | 163 +++++++++++++++++++++++++++++++++
ggml/include/ggml.h | 20 ++++
ggml/src/ggml-cpu/ops.cpp | 90 +++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++-
ggml/src/ggml.c | 62 +++++++++++++
src/models/delta-net-base.cpp | 26 ++++--
tests/test-backend-ops.cpp | 69 +++++++++++++++
7 files changed, 521 insertions(+), 11 deletions(-)
tests/test-backend-ops.cpp | 69 ++++++++++++++
8 files changed, 600 insertions(+), 11 deletions(-)
create mode 100644 LEVER1_GATHER_PROGRESS.md
create mode 100644 LEVER1_GATHER_RESULTS.md

diff --git a/LEVER1_GATHER_PROGRESS.md b/LEVER1_GATHER_PROGRESS.md
new file mode 100644
index 0000000..e4d14b9
--- /dev/null
+++ b/LEVER1_GATHER_PROGRESS.md
@@ -0,0 +1,26 @@
+# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE
+
+STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.
+
+## What
+Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
+update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
+0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
+(read path gather -> indexed in-kernel read; values + reduction order unchanged).
+
+## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
+- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
+  MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
+- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+  GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
+- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
+- MoE npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
+- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
+  step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.
+
+## Artifacts
+- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
+- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
+- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt
diff --git a/LEVER1_GATHER_RESULTS.md b/LEVER1_GATHER_RESULTS.md
new file mode 100644
index 0000000..c78e3c0
index 0000000..afced02
--- /dev/null
+++ b/LEVER1_GATHER_RESULTS.md
@@ -0,0 +1,110 @@
@@ -0,0 +1,163 @@
+# Patch 0028: qwen35 recurrent-state gather fusion (Lever 1, bit-exact)
+
+The MoE-gap groundtruth (`MOE_GAP_VS_VLLM.md`) found `k_get_rows_float` to be the single biggest
@@ -172,6 +206,59 @@ index 0000000..c78e3c0
++ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+
+## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)
+
+Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
+NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.
+
+### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline
+
+| model | base (0026) | lever1 (0028) | recorded baseline |
+|-------------------|----------------------------------|----------------------------------|----------------------------------|
+| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
+| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |
+
+test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
+GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.
+
+### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000
+
+dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):
+
+| npl | base S_TG | lever1 S_TG | delta | vs vLLM 391 |
+|-----|-----------|-------------|--------|----------------|
+| 32 | 208.56 | 209.39 | +0.40% | - |
+| 128 | 369.95 | 377.83 | +2.13% | 94.6% -> 96.6% |
+
+MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):
+
+| npl | base S_TG | lever1 S_TG | delta | vs vLLM 901 |
+|-----|-----------|-------------|--------|----------------|
+| 32 | 456.85 | 459.56 | +0.59% | - |
+| 128 | 763.47 | 777.95 | +1.90% | 84.7% -> 86.3% |
+
+Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).
+
+### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated
+
+| kernel | base (0026) | lever1 (0028) |
+|---------------------------------|------------------------|----------------------------------------------|
+| k_get_rows_float<float,float> | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms |
+| delta | | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
+| ssm_conv_update(_ids)_f32 | 219.71 ms (update) | 225.75 ms (update_ids, +6 ms) |
+| ssm_conv_gather_nonident_kernel | - | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |
+
+The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
+into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
+the -3.13 ms/step throughput delta at npl128.
+
+### Verdict (gather-bench)
+
+Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
+gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
+throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
+MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 2a5cbce..5fa220a 100644
--- a/ggml/include/ggml.h

@@ -1,42 +1,26 @@
# LEVER1_GATHER_PROGRESS.md - gather-build GPU agent checkpoint
# Lever 1 (residual recurrent-state gather fusion) - PROGRESS / gather-bench DONE

Status: **DONE.** Residual k_get_rows fused in-place, bit-exact, both gates pass. Patch 0028.
STATUS: COMPLETE. Bit-exact, both gates green, rigorous same-session A/B bench done, committed both trees.

## Lever
Fuse the residual `k_get_rows_float` in the GDN decode path (the biggest single kernel vLLM lacks,
~5.2 ms/step MoE per MOE_GAP_VS_VLLM.md). 0019 fused the SSM-state gather; 0021 fused the conv
compute but kept a `build_rs` gather for the conv taps. This patch closes that last gather.
## What
Fused the residual conv-state tap k_get_rows (build_conv_state_fused) in-place into the SSM_CONV
update via ggml_ssm_conv_update_inplace_ids (src[4]=ids discriminator). Mirrors 0019 (SSM-state) +
0018 (in-place). Eliminates the last k_get_rows in the GDN decode path. Bit-exact by construction
(read path gather -> indexed in-kernel read; values + reduction order unchanged).

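The "bit-exact by construction" argument can be illustrated with a toy CPU model. This is a minimal sketch in pure Python, not the ggml kernel: `conv_step`, `update_row`, `update_gather_first` and `update_fused` are hypothetical names, the state layout is simplified to nested lists, and it assumes the updated state for sequence `s` is written back to row `s` (i.e. `rs_head = 0`). It only shows that replacing a separate gather with an indexed read leaves the operands and the reduction order unchanged, so the results are identical.

```python
import copy

def conv_step(taps, x, w):
    """One channel: append the new input, reduce in a fixed order, slide the window."""
    window = taps + [x]
    y = 0.0
    for k in range(len(window)):      # same reduction order in both variants
        y += w[k] * window[k]
    return y, window[1:]

def update_row(row, x_row, weights):
    """Apply conv_step to every channel of one state row (does not mutate `row`)."""
    ys, new_row = [], []
    for c, taps in enumerate(row):
        y, new_taps = conv_step(taps, x_row[c], weights[c])
        ys.append(y)
        new_row.append(new_taps)
    return ys, new_row

def update_gather_first(cache, ids, xs, weights):
    """Baseline shape: materialize cache[ids[s]] for all sequences, then update."""
    gathered = [copy.deepcopy(cache[i]) for i in ids]   # the separate gather pass
    out = []
    for s, row in enumerate(gathered):
        ys, cache[s] = update_row(row, xs[s], weights)
        out.append(ys)
    return out

def update_fused(cache, ids, xs, weights):
    """Fused shape: identity rows are read in place; only non-identity rows
    are copied to scratch first, so in-place writes cannot clobber them."""
    scratch = {s: copy.deepcopy(cache[i]) for s, i in enumerate(ids) if i != s}
    out = []
    for s in range(len(ids)):
        row = scratch.get(s, cache[s])
        ys, cache[s] = update_row(row, xs[s], weights)
        out.append(ys)
    return out

# Toy data: 4 state rows, 2 channels, d_conv = 4 (3 stored taps + 1 new input).
weights = [[0.5, -1.25, 2.0, 0.75], [1.5, 0.1, -0.3, 0.9]]
cache0 = [[[0.1 * (r + c + k) for k in range(3)] for c in range(2)] for r in range(4)]
xs = [[0.3 * (s + 1), -0.7 * (s + 1)] for s in range(4)]
ids = [0, 3, 2, 0]                    # mixed identity / non-identity mapping

cache_a, cache_b = copy.deepcopy(cache0), copy.deepcopy(cache0)
out_a = update_gather_first(cache_a, ids, xs, weights)
out_b = update_fused(cache_b, ids, xs, weights)
print(out_a == out_b and cache_a == cache_b)
```

With an all-identity `ids` the scratch dictionary is empty, which is the toy analogue of the no-op gather kernel reported in the bench notes.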
## Located (nsys, DGX GB10, MoE q36-35b-a3b-nvfp4, npp128 ntg24 npl128)
The residual is the **conv-state tap gather** in `build_conv_state_fused`
(`src/models/delta-net-base.cpp`): the plain 4-arg `build_rs` -> `ggml_get_rows` of n_embd_r = 24576
floats (= (d_conv-1)*(d_inner + 2*n_group*d_state) = 3*8192) x 128 seqs, once per GDN layer per step.
Decode-window `k_get_rows_float<float,float>` had a BIG cluster of ~720 instances (30 GDN x 24) at
~115 us = ~3.4 ms/step (5.2 ms/step at steady ntg=128). grid (ne10=128, block_num_y=96) confirmed
ne00=24576 == n_embd_r (the SSM n_embd_s=524288 gather is already fused by 0019).
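The sizes quoted in the "Located" paragraph follow from simple arithmetic. The sketch below only re-derives them from the figures stated there; the constant names are illustrative, and the byte count assumes f32 state (4 bytes per element), which the notes describe as the default on this branch.

```python
# Figures quoted in the "Located" paragraph (names are illustrative only).
TAPS_PER_CHANNEL = 3        # d_conv - 1
CONV_CHANNELS = 8192        # d_inner + 2*n_group*d_state
N_SEQS = 128                # npl128
N_GDN_LAYERS = 30
N_DECODE_STEPS = 24         # ntg24
GATHER_US = 115.0           # approximate duration of one conv-tap gather
BYTES_PER_F32 = 4           # assumption: f32 recurrent state

n_embd_r = TAPS_PER_CHANNEL * CONV_CHANNELS
gather_instances = N_GDN_LAYERS * N_DECODE_STEPS
gather_ms_per_step = N_GDN_LAYERS * GATHER_US / 1000.0
bytes_per_gather = n_embd_r * N_SEQS * BYTES_PER_F32

print(f"n_embd_r           = {n_embd_r}")
print(f"gather instances   = {gather_instances}")
print(f"gather ms per step = {gather_ms_per_step:.2f}")
print(f"MiB per gather     = {bytes_per_gather / 2**20:.0f}")
```

Under these assumptions each gather copies 12 MiB of conv taps per GDN layer per step, which is consistent with it showing up as a single large kernel in the profile.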
## Gates (lever1 build = build-cuda, base = build-cuda-base = 0026)
- md5 greedy --temp 0 --seed 1 -n 48: dense 5951a5b4d624ce891e22ab5fca9bc439 == baseline;
  MoE 07db32c2bcb78d17a43ed18bc22705cd == baseline; base == lever1 (byte-identical).
- test-backend-ops CUDA0: SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
  GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.

## Built (paged branch f32 default = 0026 hybrid default is f32)
New op `ggml_ssm_conv_update_inplace_ids` (src[4]=ids, op_params[1]=rs_head): reads each seq's prior
taps from cache[ids[s]] in-kernel (identity -> in place from conv_state_dst; non-identity -> disjoint
scratch via ssm_conv_gather_nonident_kernel). Mirrors 0019. Files: ggml.h, ggml.c, ssm-conv.cu,
ggml-cpu/ops.cpp, delta-net-base.cpp, tests/test-backend-ops.cpp. Build EXIT=0.

## GATE - PASS
- test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS OK (new), SSM_CONV_UPDATE OK, SSM_CONV OK,
  GATED_DELTA_NET OK, GET_ROWS OK.
- greedy md5 (-temp 0 -seed 1 -n 48) BYTE-IDENTICAL both models:
  dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd (== baseline).
- nsys: k_get_rows<float,float> 10174 -> 9454 (720 fewer), 186.3 -> 102.8 ms; conv gathers replaced
  by 720 x ~1.1 us no-op gather. MoE npl128 783.9 t/s (step 163.3 ms vs 169.8 @0025), dense 377.3 t/s.
## Bench (S_TG t/s, npp128 ntg128 npl 32/128)
- dense npl128 369.95 -> 377.83 (+2.13%, 94.6 -> 96.6% of vLLM 391); npl32 208.56 -> 209.39.
- MoE npl128 763.47 -> 777.95 (+1.90%, 84.7 -> 86.3% of vLLM 901); npl32 456.85 -> 459.56.
- nsys MoE decode: k_get_rows_float 17334 -> 15414 inst (-1920 = 30 GDN x 64 steps), 358.37 -> 133.52 ms;
  step 167.7 -> 164.5 ms (-3.13 ms/step). gather eliminated, replaced by no-op nonident kernel.

## Artifacts
- DGX: commit `944636c` on branch `paged`; LEVER1_GATHER_RESULTS.md in llama tree; nsys
  `/tmp/kgr_moe.nsys-rep` (before) + `/tmp/kgr_moe_after.nsys-rep` (after).
- LocalAI worktree: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch + LEVER1_GATHER_RESULTS.md.
- BOTH trees committed (-s). NOT pushed.

## Next
Ready for the rigorous same-session A/B decode bench (npl 32/128, dense + MoE, before/after on the
same 0026 base). The kernel-elimination and bit-exactness are proven; the bench quantifies the lift.

Assisted-by: Claude:opus-4.8 [Claude Code]
- Patch: patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch (LocalAI worktree)
- Docs: LEVER1_GATHER_RESULTS.md (full bench tables)
- DGX bench outs: ab_{dense,moe}_{base,lever1}.out, nab_{base,lever1}.kern.csv, md5{d,m}_{base,lever1}.txt

@@ -108,3 +108,56 @@ shared GDN conv path). This closes the last `k_get_rows` in the GDN decode path
+ 0021 conv compute). Additive and risk-free; ready for the rigorous same-session A/B bench.

Assisted-by: Claude:opus-4.8 [Claude Code]

## Rigorous same-session A/B bench (label gather-bench, DGX GB10 sm_121)

Independently re-validated on a fresh GPU session. `build-cuda-base` = pre-lever-1 (0026, 33e7c65;
NO `ssm_conv_update_ids` symbol) vs `build-cuda` = lever-1 (this commit; WITH it). Same env, back-to-back.

### Gate re-confirm (greedy --temp 0 --seed 1 -n 48, -fa on) - base == lever1 == recorded baseline

| model | base (0026) | lever1 (0028) | recorded baseline |
|-------------------|----------------------------------|----------------------------------|----------------------------------|
| q36-27b-nvfp4 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 | 5951a5b4d624ce891e22ab5fca9bc439 |
| q36-35b-a3b-nvfp4 | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd | 07db32c2bcb78d17a43ed18bc22705cd |

test-backend-ops (CUDA0): SSM_CONV_UPDATE_IDS 16/16, SSM_CONV_UPDATE 16/16, SSM_CONV 45/45,
GATED_DELTA_NET 84/84, GET_ROWS 47/47 - all OK.

### decode_agg (S_TG t/s) before/after, npp128 ntg128 -npl 32,128 -c 33000

dense q36-27b-nvfp4 (LLAMA_KV_PAGED=1):

| npl | base S_TG | lever1 S_TG | delta | vs vLLM 391 |
|-----|-----------|-------------|--------|----------------|
| 32 | 208.56 | 209.39 | +0.40% | - |
| 128 | 369.95 | 377.83 | +2.13% | 94.6% -> 96.6% |

MoE q36-35b-a3b-nvfp4 (LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1):

| npl | base S_TG | lever1 S_TG | delta | vs vLLM 901 |
|-----|-----------|-------------|--------|----------------|
| 32 | 456.85 | 459.56 | +0.59% | - |
| 128 | 763.47 | 777.95 | +1.90% | 84.7% -> 86.3% |

Step time npl128: dense 346.0 -> 338.8 ms/batch-step, MoE 167.7 -> 164.5 ms/step (-3.13 ms/step).

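The derived columns in the two tables (delta, share of the vLLM figure, step time) can be recomputed from the raw S_TG values. The snippet below is a small cross-check sketch, not part of the bench tooling; the helper names are made up, and the step-time formula assumes one decode step emits exactly one token for each of the `npl` sequences. Under that assumption the MoE step-time difference comes out at roughly 3.1 ms, in line with the reported -3.13 ms/step.

```python
# S_TG (tokens/s) copied from the tables: (base, lever1) per (model, npl).
S_TG = {
    ("dense", 32): (208.56, 209.39),
    ("dense", 128): (369.95, 377.83),
    ("moe", 32): (456.85, 459.56),
    ("moe", 128): (763.47, 777.95),
}
VLLM_S_TG = {"dense": 391.0, "moe": 901.0}   # vLLM reference figures as quoted

def pct_delta(base, new):
    """Relative throughput change in percent."""
    return 100.0 * (new / base - 1.0)

def pct_of_vllm(s_tg, model):
    """Throughput as a percentage of the quoted vLLM figure."""
    return 100.0 * s_tg / VLLM_S_TG[model]

def step_ms(s_tg, npl):
    """Milliseconds per decode step, assuming npl tokens are produced per step."""
    return 1000.0 * npl / s_tg

for (model, npl), (base, new) in S_TG.items():
    print(f"{model:5s} npl{npl:<3d} {pct_delta(base, new):+.2f}%  "
          f"{pct_of_vllm(base, model):.1f}% -> {pct_of_vllm(new, model):.1f}% of vLLM  "
          f"step {step_ms(base, npl):.1f} -> {step_ms(new, npl):.1f} ms")
```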
### nsys (MoE decode, npp128 ntg64 npl128, same env) - k_get_rows eliminated

| kernel | base (0026) | lever1 (0028) |
|---------------------------------|------------------------|----------------------------------------------|
| k_get_rows_float<float,float> | 17334 inst / 358.37 ms | 15414 inst / 133.52 ms |
| delta | | -1920 inst (= 30 GDN x 64 steps), -224.85 ms |
| ssm_conv_update(_ids)_f32 | 219.71 ms (update) | 225.75 ms (update_ids, +6 ms) |
| ssm_conv_gather_nonident_kernel | - | 1920 x ~1.13 us = 2.17 ms (no-op, all ident) |

The 1920 big ~114 us conv-tap gathers are gone; only the ~1.13 us no-op gather kernel + ~6 ms folded
into the update kernel are added. Net GDN get_rows saving ~216 ms / 64 steps = ~3.4 ms/step, matching
the -3.13 ms/step throughput delta at npl128.

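The net-saving figure can be reproduced from the nsys table alone. This is a back-of-the-envelope sketch with illustrative variable names; it simply subtracts the two added costs (the extra time in the update kernel and the no-op gather launches) from the removed `k_get_rows_float` time and divides by the 64 profiled decode steps.

```python
# Values copied from the nsys table (MoE decode, ntg64, npl128).
base_inst, base_ms = 17334, 358.37        # k_get_rows_float on base (0026)
lever_inst, lever_ms = 15414, 133.52      # k_get_rows_float on lever1 (0028)
update_base_ms, update_lever_ms = 219.71, 225.75
noop_launches, noop_us = 1920, 1.13       # ssm_conv_gather_nonident_kernel
n_gdn_layers, n_steps = 30, 64

removed_inst = base_inst - lever_inst                  # gathers that disappeared
gather_saved_ms = base_ms - lever_ms                   # time no longer spent gathering
added_ms = (update_lever_ms - update_base_ms) + noop_launches * noop_us / 1000.0
net_saved_ms = gather_saved_ms - added_ms
net_saved_per_step_ms = net_saved_ms / n_steps

print(f"removed instances : {removed_inst} (= {n_gdn_layers} x {n_steps})")
print(f"gather time saved : {gather_saved_ms:.2f} ms")
print(f"added cost        : {added_ms:.2f} ms")
print(f"net saving        : {net_saved_ms:.1f} ms, {net_saved_per_step_ms:.2f} ms/step")
```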
### Verdict (gather-bench)

Bit-exact (gate re-confirmed, both md5 byte-identical to baseline), the residual k_get_rows conv
gather is independently nsys-confirmed eliminated (-1920 inst, -224.85 ms over 64 steps), and decode
throughput lifts BOTH models in the same session: dense npl128 +2.13% (94.6 -> 96.6% of vLLM),
MoE npl128 +1.90% (84.7 -> 86.3% of vLLM). Ship it.
