docs(paged): reject GDN global Ai32 prototype

Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10.

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 01:51:53 +00:00
parent adabd11919
commit 2074b4fb5b
7 changed files with 215 additions and 30 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1025,3 +1025,63 @@ Decision:
 - Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`.
 - The prototype must be rejected if it is flat or slower; do not iterate into
  f16/BF16 Ai unless f32 proves the schedule can win.
+
+## Phase 13 GDN Global-Ai32 Prototype Rejection
+
+Phase 13 implemented the Phase 12 design in the llama.cpp fork as a default-off
+prototype behind `GDN_GLOBAL_AI32=1`.
+
+Implementation summary:
+
+- Added an f32 Ai precompute kernel.
+- Added C32, `dv_tile=64` slab consumption through the chunked GDN path.
+- Allocated Ai scratch from the ggml CUDA pool only for supported calls.
+- Kept the default C16 M5 path unchanged.
+
+Correctness artifacts:
+
+- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_default.txt`
+- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_global_ai32.txt`
+- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_default.md5`
+- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_default.md5`
+- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_global_ai32.md5`
+- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_global_ai32.md5`
+
+Correctness result:
+
+- Default and Global-Ai32 paths matched canonical md5 exactly:
+  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
+- KL was not needed because neither md5 changed.
+
+Performance artifacts:
+
+- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_base.txt`
+- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_global_ai32.txt`
+- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_base.txt`
+- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_global_ai32.txt`
+
+Performance A/B:
+
+| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
+|-------|------|----|----|---|----------|----------|-------|
+| MoE | M5 base | 512 | 4 | 32 | 2325.86 | 396.05 | 2241.21 |
+| MoE | Global Ai32 | 512 | 4 | 32 | 2106.50 | 398.55 | 2038.78 |
+| MoE | M5 base | 2048 | 4 | 32 | 2425.10 | 389.63 | 2400.66 |
+| MoE | Global Ai32 | 2048 | 4 | 32 | 2097.76 | 388.40 | 2079.92 |
+| Dense | M5 base | 512 | 4 | 32 | 970.62 | 149.89 | 931.10 |
+| Dense | Global Ai32 | 512 | 4 | 32 | 876.51 | 149.29 | 844.62 |
+| Dense | M5 base | 2048 | 4 | 32 | 1016.14 | 182.16 | 1007.15 |
+| Dense | Global Ai32 | 2048 | 4 | 32 | 918.19 | 183.00 | 911.05 |
+
+Rejected diff:
+
+- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`
+
+Conclusion:
+
+- Do not ship Phase 13 Global-Ai32 as implemented.
+- The global scratch split is correctness-safe, but S_PP is roughly 9% to 14% slower than shipped C16 M5; S_TG is flat.
+- Per the Phase 12/13 decision rule, stop GDN kernel work on GB10. The remaining
+  vLLM GDN advantage requires a fuller FLA-style blocked solve or hardware
+  assumptions that do not fit this GB10 patch stack without a regression.
--- a/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md
@@ -140,3 +140,33 @@ Phase 13 constraints:
 - If md5 changes, run KL before benchmarking.
 - If the prototype is flat or slower, reject it and stop GDN kernel work on
  GB10; do not iterate into f16 Ai until f32 proves the schedule can win.
+
+## Phase 13 Result
+
+Phase 13 implemented the f32 Global-Ai32 prototype and rejected it.
+
+Correctness:
+
+- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+
+Performance:
+
+| Model | Mode | PP | S_PP t/s |
+|-------|------|----|----------|
+| MoE | M5 base | 2048 | 2425.10 |
+| MoE | Global Ai32 | 2048 | 2097.76 |
+| Dense | M5 base | 2048 | 1016.14 |
+| Dense | Global Ai32 | 2048 | 918.19 |
+
+Artifacts:
+
+- `/home/mudler/bench/phase13_gdn_global_ai32/gates/`
+- `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
+- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`
+
+Final decision:
+
+- Reject Global-Ai32.
+- Stop GDN kernel work on GB10. The remaining vLLM GDN advantage is not
+  reachable through the low-conflict C16/C32 patch shapes tested here.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -176,12 +176,13 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
 | Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
 | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 |
 | Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic |
+| Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | REJECTED | md5-clean, but slower: MoE 2048 2425.10 -> 2097.76; dense 2048 1016.14 -> 918.19 |

 Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation.

-Phase 12 caveat: this is not a shipped win. It authorizes only a default-off
-`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work
-on GB10 instead of iterating into f16 Ai or more local reorders.
+Phase 13 closes the caveat: the default-off `GDN_GLOBAL_AI32=1` prototype was
+correctness-clean but slower. Stop GDN kernel work on GB10 instead of iterating
+into f16 Ai or more local reorders.

 ### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady)
 | Lever | What | Verdict | Key number |
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -175,6 +175,7 @@ products through tensor cores. The series chased that headroom.
 | Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
 | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab |
 | Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model |
+| Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | **REJECTED** | md5-clean, but S_PP regressed: MoE 2048 **2425.10 -> 2097.76**, dense 2048 **1016.14 -> 918.19** | phase13 gates/ab |

 **Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
 intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
@@ -186,11 +187,10 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly
 that and is the shipped winner; it does not fully close the 2.62x because vLLM's
 mature FLA blocked-solve is a more complete tensor-core implementation.

-Post-record caveat: Phase 12 does not change the shipped verdict. It permits one
-default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large
-but not automatically disqualifying. If that prototype is flat or slower, GDN
-kernel work on GB10 should stop rather than moving to f16 Ai or additional
-local reorders.
+Post-record caveat closed: Phase 13 tested the one permitted
+`GDN_GLOBAL_AI32=1` prototype. It was correctness-clean but slower, so GDN kernel
+work on GB10 should stop rather than moving to f16 Ai or additional local
+reorders.

 ### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact)

--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -549,6 +549,36 @@ Docs:
 - `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`
 - `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`

+### Phase 13 GDN Global-Ai32 update
+
+Phase 13 implemented the Phase 12 prototype behind `GDN_GLOBAL_AI32=1`:
+precompute f32 Ai once per chunk/head, then consume it from two C32
+`dv_tile=64` value slabs.
+
+Result:
+
+- Correctness passed:
+  MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+  `5951a5b4d624ce891e22ab5fca9bc439`.
+- Performance regressed:
+  - MoE 2048 S_PP `2425.10 -> 2097.76`.
+  - Dense 2048 S_PP `1016.14 -> 918.19`.
+
+Decision:
+
+- **REJECT** Global-Ai32.
+- Do not add `0055`.
+- Stop GDN kernel work on GB10. The shortcut space is exhausted by Phase 10,
+  Phase 11, and Phase 13 evidence; further GDN parity work needs a different
+  hardware regime or a larger FLA/CuteDSL-class implementation outside this
+  low-conflict LocalAI patch stack.
+
+Artifacts:
+
+- `/home/mudler/bench/phase13_gdn_global_ai32/gates/`
+- `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
+- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`
+
 ---

 # PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)