docs(paged): reject GDN global Ai32 prototype

Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 01:51:53 +00:00
parent adabd11919
commit 2074b4fb5b
7 changed files with 215 additions and 30 deletions

@@ -1025,3 +1025,63 @@ Decision:
- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`.
- The prototype must be rejected if it is flat or slower; do not iterate into
f16/BF16 Ai unless f32 proves the schedule can win.
## Phase 13 GDN Global-Ai32 Prototype Rejection
Phase 13 implemented the Phase 12 design in the llama.cpp fork as a default-off
prototype behind `GDN_GLOBAL_AI32=1`.
Implementation summary:
- Added an f32 Ai precompute kernel.
- Added C32, `dv_tile=64` slab consumption through the chunked GDN path.
- Allocated Ai scratch from the ggml CUDA pool only for supported calls.
- Kept the default C16 M5 path unchanged.
Correctness artifacts:
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_default.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_global_ai32.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_default.md5`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_default.md5`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_global_ai32.md5`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_global_ai32.md5`
Correctness result:
- Default and Global-Ai32 paths both matched the canonical md5 values exactly:
- MoE `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense `5951a5b4d624ce891e22ab5fca9bc439`.
- KL was not needed.
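The md5 gate above can be sketched as a byte-exact hash comparison. This is a minimal illustration, not the project's gate script: the function names are hypothetical, and it assumes the gate hashes a captured output blob, which the notes do not spell out. Only the two canonical digests come from the record above.

```python
import hashlib

# Canonical digests recorded for this phase.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_hex(data: bytes) -> str:
    """Return the hex md5 digest of a captured output blob."""
    return hashlib.md5(data).hexdigest()


def gate_passes(model: str, output: bytes) -> bool:
    """Hypothetical gate: pass only if the output hashes to the canonical md5."""
    return md5_hex(output) == CANONICAL_MD5[model]
```

An exact match means the candidate path produced the same bytes as the canonical run, which is why no KL comparison was needed here.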
Performance artifacts:
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_base.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_global_ai32.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_base.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_global_ai32.txt`
Performance A/B:
| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|-------|------|----|----|---|----------|----------|-------|
| MoE | M5 base | 512 | 4 | 32 | 2325.86 | 396.05 | 2241.21 |
| MoE | Global Ai32 | 512 | 4 | 32 | 2106.50 | 398.55 | 2038.78 |
| MoE | M5 base | 2048 | 4 | 32 | 2425.10 | 389.63 | 2400.66 |
| MoE | Global Ai32 | 2048 | 4 | 32 | 2097.76 | 388.40 | 2079.92 |
| Dense | M5 base | 512 | 4 | 32 | 970.62 | 149.89 | 931.10 |
| Dense | Global Ai32 | 512 | 4 | 32 | 876.51 | 149.29 | 844.62 |
| Dense | M5 base | 2048 | 4 | 32 | 1016.14 | 182.16 | 1007.15 |
| Dense | Global Ai32 | 2048 | 4 | 32 | 918.19 | 183.00 | 911.05 |
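To make the size of the regression explicit, the S_PP columns of the table can be reduced to relative changes. The sketch below only re-uses the numbers in the table; the dictionary layout and function name are illustrative.

```python
# (model, PP) -> (M5 base S_PP, Global Ai32 S_PP), copied from the A/B table.
AB_S_PP = {
    ("MoE", 512): (2325.86, 2106.50),
    ("MoE", 2048): (2425.10, 2097.76),
    ("Dense", 512): (970.62, 876.51),
    ("Dense", 2048): (1016.14, 918.19),
}


def relative_change_pct(base: float, candidate: float) -> float:
    """Signed percentage change of candidate relative to base."""
    return (candidate - base) / base * 100.0


if __name__ == "__main__":
    for (model, pp), (base, cand) in AB_S_PP.items():
        print(f"{model:5s} PP={pp:4d}: {relative_change_pct(base, cand):+.2f}% S_PP")
```

Every configuration comes out negative, with MoE at PP=2048 showing the largest drop.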
Rejected diff:
- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`
Conclusion:
- Do not ship Phase 13 Global-Ai32 as implemented.
- The global scratch split is correctness-safe but slower than the shipped C16 M5 path: S_PP regressed by roughly 9% to 14% in every A/B configuration, while S_TG was flat.
- Per the Phase 12/13 decision rule, stop GDN kernel work on GB10. The remaining
vLLM GDN advantage requires a fuller FLA-style blocked solve or hardware
assumptions that do not fit this GB10 patch stack without a regression.

@@ -140,3 +140,33 @@ Phase 13 constraints:
- If md5 changes, run KL before benchmarking.
- If the prototype is flat or slower, reject it and stop GDN kernel work on
GB10; do not iterate into f16 Ai until f32 proves the schedule can win.
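The constraints above amount to a small decision procedure. The sketch below is one possible reading of it, not a script from the repository: the function name, the return labels, and the `flat_tolerance` threshold are assumptions, since the notes do not define how close to baseline counts as "flat".

```python
from typing import Optional


def phase13_decision(
    md5_matches: bool,
    base_s_pp: float,
    candidate_s_pp: float,
    kl_ok: Optional[bool] = None,
    flat_tolerance: float = 0.01,  # assumed: within +1% of base counts as "flat"
) -> str:
    """Apply the Phase 13 gate order: correctness first, then throughput."""
    if not md5_matches:
        if kl_ok is None:
            return "run-kl"  # md5 changed: KL must be checked before benchmarking
        if not kl_ok:
            return "reject"  # correctness failure
    if candidate_s_pp <= base_s_pp * (1.0 + flat_tolerance):
        return "reject-and-stop"  # flat or slower: stop GDN kernel work on GB10
    return "continue"  # f32 schedule wins; further iteration is allowed


if __name__ == "__main__":
    # MoE PP=2048 numbers from the Phase 13 A/B run.
    print(phase13_decision(True, 2425.10, 2097.76))
```

With the recorded MoE and dense numbers, both md5-clean, this procedure lands on the reject-and-stop branch.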
## Phase 13 Result
Phase 13 implemented the f32 Global-Ai32 prototype and rejected it.
Correctness:
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
Performance:
| Model | Mode | PP | S_PP t/s |
|-------|------|----|----------|
| MoE | M5 base | 2048 | 2425.10 |
| MoE | Global Ai32 | 2048 | 2097.76 |
| Dense | M5 base | 2048 | 1016.14 |
| Dense | Global Ai32 | 2048 | 918.19 |
Artifacts:
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`
Final decision:
- Reject Global-Ai32.
- Stop GDN kernel work on GB10. The remaining vLLM GDN advantage is not
reachable through the low-conflict C16/C32 patch shapes tested here.

@@ -176,12 +176,13 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 |
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic |
| Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | REJECTED | md5-clean, but slower: MoE 2048 2425.10 -> 2097.76; dense 2048 1016.14 -> 918.19 |
Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close the 2.62x gap because vLLM's FLA blocked-solve is a more complete TC implementation.
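The shared-memory arithmetic behind that constraint can be checked directly. This is a rough sketch that treats "KB" as KiB and looks only at the state tile and one C x C f32 tile; which other buffers the kernel keeps resident is not stated here, so the headroom figure is an upper bound, not a layout.

```python
F32_BYTES = 4
STATE_DIM = 128      # 128x128 f32 recurrent state, per the text
SMEM_CAP_KIB = 99    # GB10 dynamic shared-memory cap, read as KiB (assumption)


def tile_kib(rows: int, cols: int, elem_bytes: int = F32_BYTES) -> float:
    """Size of a dense rows x cols tile in KiB."""
    return rows * cols * elem_bytes / 1024


state_kib = tile_kib(STATE_DIM, STATE_DIM)   # 64.0 KiB for the state alone
headroom_kib = SMEM_CAP_KIB - state_kib      # 35.0 KiB left for everything else

if __name__ == "__main__":
    print(f"state: {state_kib} KiB, headroom: {headroom_kib} KiB")
    for c in (16, 32):
        print(f"C={c}: one C x C f32 tile = {tile_kib(c, c)} KiB")
```

The state consumes about two thirds of the cap before any Q/K/V or A tiles are staged, which is the pressure that keeps the in-kernel chunk size small.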
Phase 12 caveat: this is not a shipped win. It authorizes only a default-off
`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work
on GB10 instead of iterating into f16 Ai or more local reorders.
Phase 13 closes the caveat: the default-off `GDN_GLOBAL_AI32=1` prototype was
correctness-clean but slower. Stop GDN kernel work on GB10 instead of iterating
into f16 Ai or more local reorders.
### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady)
| Lever | What | Verdict | Key number |

@@ -175,6 +175,7 @@ products through tensor cores. The series chased that headroom.
| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab |
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model |
| Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | **REJECTED** | md5-clean, but S_PP regressed: MoE 2048 **2425.10 -> 2097.76**, dense 2048 **1016.14 -> 918.19** | phase13 gates/ab |
**Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
@@ -186,11 +187,10 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly
that and is the shipped winner; it does not fully close the 2.62x gap because vLLM's
mature FLA blocked-solve is a more complete tensor-core implementation.
Post-record caveat: Phase 12 does not change the shipped verdict. It permits one
default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large
but not automatically disqualifying. If that prototype is flat or slower, GDN
kernel work on GB10 should stop rather than moving to f16 Ai or additional
local reorders.
Post-record caveat closed: Phase 13 tested the one permitted
`GDN_GLOBAL_AI32=1` prototype. It was correctness-clean but slower, so GDN kernel
work on GB10 should stop rather than moving to f16 Ai or additional local
reorders.
### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact)

@@ -549,6 +549,36 @@ Docs:
- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`
- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`
### Phase 13 GDN Global-Ai32 update
Phase 13 implemented the Phase 12 prototype behind `GDN_GLOBAL_AI32=1`:
precompute f32 Ai once per chunk/head, then consume it from two C32
`dv_tile=64` value slabs.
Result:
- Correctness passed:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`.
- Performance regressed:
- MoE 2048 S_PP `2425.10 -> 2097.76`.
- Dense 2048 S_PP `1016.14 -> 918.19`.
Decision:
- **REJECT** Global-Ai32.
- Do not add `0055`.
- Stop GDN kernel work on GB10. The shortcut space is exhausted by Phase 10,
Phase 11, and Phase 13 evidence; further GDN parity work needs a different
hardware regime or a larger FLA/CuteDSL-class implementation outside this
low-conflict LocalAI patch stack.
Artifacts:
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`
---
# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)