From 1b5ae227eb97f3668daa4e1e0bd4ae83f2a44a1c Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 01:29:44 +0000 Subject: [PATCH] docs(paged): reject GDN M5 QS-early phase Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact. Assisted-by: Codex:gpt-5 --- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 55 ++++++++++++ .../docs/PARITY_HANDOFF.md | 1 + .../docs/VLLM_PARITY_FINAL.md | 1 + .../docs/VLLM_PARITY_LEVER_MAP.md | 28 ++++++ ...026-07-01-gdn-m5-state-boundary-phase11.md | 88 +++++++++++++++---- 5 files changed, 157 insertions(+), 16 deletions(-) diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 1d978c691..1af131907 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -928,3 +928,58 @@ Conclusion: - A future GDN prefill attempt should either share the `A/T` work across value slabs or switch to a different FLA-style chunk design; it should not repeat this env-gated two-slab M5 variant. + +## Phase 11 GDN M5 QS-Early Rejection + +Phase 11 tested a smaller C=16 M5 scheduling shortcut instead of reopening C32: +move the `QS = Qc * S0` state-boundary tensor-core pass earlier and keep it +default-off behind `GDN_M5_QS_EARLY=1`. 
+ +Correctness artifacts: + +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_default.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_qs_early.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_moe_default.md5` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_dense_default.md5` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_moe_qs_early.md5` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_dense_qs_early.md5` + +Correctness result: + +- Default and QS-early paths matched canonical md5 exactly: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- KL was not needed. + +Performance artifacts: + +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_base.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_qs_early.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_base.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_qs_early.txt` + +Performance A/B: + +| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|------|----|----|---|----------|----------|-------| +| MoE | M5 base | 512 | 4 | 32 | 2325.67 | 355.60 | 2229.90 | +| MoE | QS-early | 512 | 4 | 32 | 2315.77 | 353.27 | 2220.16 | +| MoE | M5 base | 2048 | 4 | 32 | 2441.54 | 390.53 | 2416.80 | +| MoE | QS-early | 2048 | 4 | 32 | 2420.26 | 389.89 | 2395.94 | +| Dense | M5 base | 512 | 4 | 32 | 975.15 | 142.71 | 932.97 | +| Dense | QS-early | 512 | 4 | 32 | 968.23 | 144.24 | 927.17 | +| Dense | M5 base | 2048 | 4 | 32 | 1021.06 | 183.34 | 1012.04 | +| Dense | QS-early | 2048 | 4 | 32 | 1015.77 | 183.73 | 1006.88 | + +Rejected diff: + +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff` + +Conclusion: + +- Do not ship Phase 11 QS-early as implemented. 
+- Merely moving the QS state-boundary product earlier is not enough; it remains + an extra MMA pass and does not reduce the M5 critical path. +- The next GDN attempt should skip local scheduling-only changes and scope a + true shared-A/Ai blocked-solve or global-scratch design, with an explicit + scratch/synchronization cost model before coding. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 897f3bc46..dd4d52de3 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -174,6 +174,7 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga | BV block-occupancy A/B (tf32) | raise blocks/SM | REJECTED (occupancy NOT the bound) | 1844 vs 1814 S_PP (-1.04%, within noise) | | bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates | | Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 | +| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 | Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation. 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md index 438855fd8..0b7afa6a6 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md @@ -173,6 +173,7 @@ products through tensor cores. The series chased that headroom. | BV block-occupancy A/B (tf32) | raise blocks/SM to test if occupancy is the bound | **REJECTED** (occupancy is NOT the bound; latency is wave-hidden) | two arms statistically equal: **1844 vs 1814 S_PP (-1.04%, within noise)** | GDNAB armA/armB | | bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan | | Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab | +| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab | **Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2) intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index fe6e96d2c..f4dfe78ce 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -493,6 +493,34 @@ Artifacts: - `/home/mudler/bench/phase10_gdn_c32_slab/ab/` - `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff` +### Phase 11 GDN M5 
QS-early update + +Phase 11 tested the smallest possible C=16 follow-up after the C32 slab +rejection: move the `QS = Qc * S0` state-boundary product earlier in the M5 +chunk loop behind `GDN_M5_QS_EARLY=1`. + +Result: + +- The candidate built on DGX and stayed md5-exact: + MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`. +- It regressed S_PP slightly in both families: + MoE 2048 `2441.54 -> 2420.26`, dense 2048 `1021.06 -> 1015.77`. + +Decision: + +- **REJECT** QS-early. +- Do not add it to the LocalAI patch stack. +- A scheduling-only move that still performs the same QS MMA does not close the + GDN gap. The next GDN scope should be a real shared-A/Ai blocked-solve or + global-scratch design, not another local reorder. + +Artifacts: + +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff` + --- # PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026) diff --git a/docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md b/docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md index 5c518c34d..6380e6a25 100644 --- a/docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md +++ b/docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md @@ -25,7 +25,7 @@ - Read-only: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` - Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/` -- [ ] **Step 1: Check DGX is free** +- [x] **Step 1: Check DGX is free** Run: @@ -46,7 +46,7 @@ compute=0 FREE... ``` -- [ ] **Step 2: Record source provenance** +- [x] **Step 2: Record source provenance** Run: @@ -58,7 +58,7 @@ git -C /home/mudler/_git/llama.cpp rev-parse HEAD Expected: clean llama.cpp fork and DGX mirror before source edits. 
-- [ ] **Step 3: Create artifact directory** +- [x] **Step 3: Create artifact directory** Run: @@ -74,7 +74,7 @@ Expected: command exits 0. - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` - Mirror: `/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu` -- [ ] **Step 1: Add an env selector** +- [x] **Step 1: Add an env selector** Add a static env flag near the existing `gdn_tc` selector: @@ -87,7 +87,7 @@ static const bool gdn_m5_qs_early = []{ Route it only for `S_v == 128 && n_tokens >= gdn_chunk_min && gdn_tc >= 4`. -- [ ] **Step 2: Add a template boolean for the candidate** +- [x] **Step 2: Add a template boolean for the candidate** Extend the chunked launch templates with a defaulted boolean, keeping existing call sites source-compatible: @@ -107,7 +107,7 @@ static void launch_gdn_chunked(...) Use the boolean only inside the M3/M5 code path. Existing launches must remain `launch_gdn_chunked<128, 16, TC_>(...)`. -- [ ] **Step 3: Move QS deposition earlier for candidate only** +- [x] **Step 3: Move QS deposition earlier for candidate only** In `gated_delta_net_chunked_cuda`, after the KS/RHS section and before the `solve A U = RHS` section, add a candidate-only QS pass: @@ -157,7 +157,7 @@ if constexpr (TC >= 2 && !QS_EARLY) { This is intentionally conservative. It should not change math order for the deposited QS values, only their scheduling relative to the solve/P build. -- [ ] **Step 4: Add a candidate launch arm** +- [x] **Step 4: Add a candidate launch arm** In `launch_gated_delta_net`, when `gdn_m5_qs_early && gdn_tc >= 4`, call: @@ -168,7 +168,7 @@ return; Default M5 must continue to call `launch_gdn_chunked<128, 16, 4>(...)`. -- [ ] **Step 5: Mirror to DGX and build** +- [x] **Step 5: Mirror to DGX and build** Run: @@ -180,12 +180,20 @@ ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . Expected: build exits 0. 
+Result: + +- Candidate implemented as a default-off `GDN_M5_QS_EARLY=1` path in the + llama.cpp fork. +- The patch only touched `ggml/src/ggml-cuda/gated_delta_net.cu`. +- DGX build passed for `test-backend-ops`, `llama-completion`, and + `llama-batched-bench`. + ## Task 3: Correctness Gates **Files:** - Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/` -- [ ] **Step 1: Run focused op gates** +- [x] **Step 1: Run focused op gates** Run: @@ -198,7 +206,7 @@ GDN_M5_QS_EARLY=1 GDN_TC=5 GDN_CHUNK_MIN=2 ./test-backend-ops test -b CUDA0 -o G Expected: both logs show CUDA0 `OK` for all `GATED_DELTA_NET` cases. -- [ ] **Step 2: Run canonical md5 gates** +- [x] **Step 2: Run canonical md5 gates** Run: @@ -224,18 +232,31 @@ MoE 8cb0ce23777bf55f92f63d0292c756b0 Dense 5951a5b4d624ce891e22ab5fca9bc439 ``` -- [ ] **Step 3: Stop if md5 changes** +- [x] **Step 3: Stop if md5 changes** If either candidate md5 differs, do not benchmark yet. Run the KL gate from `backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md` and accept only if KL is benign and the transcript is sane. +Result: + +- Default op gate: `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_default.txt`. +- QS-early op gate: `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_qs_early.txt`. +- Both focused op logs reported the same OK count. +- Default md5: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- QS-early md5: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- KL was not needed because md5 matched canonical exactly. 
+ ## Task 4: Performance A/B **Files:** - Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/` -- [ ] **Step 1: Run same-session MoE and dense A/B** +- [x] **Step 1: Run same-session MoE and dense A/B** Run: @@ -254,7 +275,7 @@ env $LCAND ./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf -c 131 Expected: candidate improves S_PP for at least the target MoE prefill cases and does not regress dense outside noise. -- [ ] **Step 2: Decide accept/reject** +- [x] **Step 2: Decide accept/reject** Accept only if: @@ -273,6 +294,35 @@ git -C /home/mudler/_git/llama.cpp diff -- ggml/src/ggml-cuda/gated_delta_net.cu Then restore fork and DGX mirror. +Result: + +Artifacts: + +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_base.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_qs_early.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_base.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_qs_early.txt` + +| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|------|----|----|---|----------|----------|-------| +| MoE | M5 base | 512 | 4 | 32 | 2325.67 | 355.60 | 2229.90 | +| MoE | QS-early | 512 | 4 | 32 | 2315.77 | 353.27 | 2220.16 | +| MoE | M5 base | 2048 | 4 | 32 | 2441.54 | 390.53 | 2416.80 | +| MoE | QS-early | 2048 | 4 | 32 | 2420.26 | 389.89 | 2395.94 | +| Dense | M5 base | 512 | 4 | 32 | 975.15 | 142.71 | 932.97 | +| Dense | QS-early | 512 | 4 | 32 | 968.23 | 144.24 | 927.17 | +| Dense | M5 base | 2048 | 4 | 32 | 1021.06 | 183.34 | 1012.04 | +| Dense | QS-early | 2048 | 4 | 32 | 1015.77 | 183.73 | 1006.88 | + +Decision: + +- Reject the QS-early source patch. +- The candidate is correctness-clean but does not improve S_PP and slightly + regresses both model families. +- Rejected diff saved at: + `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff`. +- The llama.cpp fork and DGX mirror were restored to the prior accepted state. 
+ ## Task 5: Mirror Accepted Patch or Record Rejection **Files:** @@ -282,7 +332,7 @@ Then restore fork and DGX mirror. - Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` - Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md` -- [ ] **Step 1: If accepted, commit fork patch** +- [x] **Step 1: If accepted, commit fork patch** Commit in `/home/mudler/_git/llama.cpp` only after gates pass: @@ -291,7 +341,7 @@ git add ggml/src/ggml-cuda/gated_delta_net.cu git commit -m "feat(cuda): add gated delta net M5 QS-early path" ``` -- [ ] **Step 2: Generate LocalAI patch** +- [x] **Step 2: Generate LocalAI patch** Run: @@ -302,7 +352,7 @@ git -C /home/mudler/_git/llama.cpp format-patch -1 HEAD \ Do not hand-edit the generated patch. -- [ ] **Step 3: Update docs and commit LocalAI** +- [x] **Step 3: Update docs and commit LocalAI** Record artifacts, md5/KL results, A/B numbers, and the decision. Commit with: @@ -327,3 +377,9 @@ git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ git commit -m "docs(paged): record GDN M5 QS-early result" \ -m "Assisted-by: Codex:gpt-5" ``` + +Result: + +- No fork commit and no LocalAI `0055` patch were generated because the + candidate failed the performance gate. +- Phase 11 is a docs-only rejection record.
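
Note on the size of the regression: the tables above report absolute S_PP throughput, while the verdict text says "slightly slower". As a minimal sketch (not part of the recorded benchmark tooling), the relative S_PP change per row can be recomputed from the A/B table. The function name `rel_delta_pct` and the `AB_S_PP` dictionary are hypothetical helpers introduced only for this illustration; the numbers are copied from the Phase 11 A/B table.

```python
# Relative S_PP change of QS-early vs. the M5 base, per (model, PP) row.
# Values are copied from the Phase 11 A/B table; the helper names below
# are illustrative assumptions, not part of any existing script.

AB_S_PP = {
    # (model, PP): (M5 base S_PP t/s, QS-early S_PP t/s)
    ("MoE", 512): (2325.67, 2315.77),
    ("MoE", 2048): (2441.54, 2420.26),
    ("Dense", 512): (975.15, 968.23),
    ("Dense", 2048): (1021.06, 1015.77),
}


def rel_delta_pct(base: float, cand: float) -> float:
    """Return the candidate's change relative to base, in percent."""
    return (cand - base) / base * 100.0


# Negative values mean the candidate is slower than the base.
deltas = {key: rel_delta_pct(base, cand) for key, (base, cand) in AB_S_PP.items()}

for (model, pp), pct in deltas.items():
    print(f"{model:5s} PP={pp:4d}  S_PP delta = {pct:+.2f}%")
```

All four deltas come out negative and smaller than 1% in magnitude, which is consistent with the "md5-clean, but slightly slower" verdict recorded in the docs. Whether a sub-1% difference exceeds run-to-run noise is not something this arithmetic can establish; it only restates the table in relative terms.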