mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): reject GDN global Ai32 prototype
Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10.

Assisted-by: Codex:gpt-5
@@ -1025,3 +1025,63 @@ Decision:
- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`.
- The prototype must be rejected if it is flat or slower; do not iterate into
  f16/BF16 Ai unless f32 proves the schedule can win.

## Phase 13 GDN Global-Ai32 Prototype Rejection

Phase 13 implemented the Phase 12 design in the llama.cpp fork as a default-off
prototype behind `GDN_GLOBAL_AI32=1`.

Implementation summary:

- Added a f32 Ai precompute kernel.
- Added C32, `dv_tile=64` slab consumption through the chunked GDN path.
- Allocated Ai scratch from the ggml CUDA pool only for supported calls.
- Kept the default C16 M5 path unchanged.

Correctness artifacts:

- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_default.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_global_ai32.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_default.md5`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_default.md5`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_global_ai32.md5`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_global_ai32.md5`

Correctness result:

- Default and Global-Ai32 paths matched canonical md5 exactly:
  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
- KL was not needed.

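The md5 gate above can be re-checked with a few lines of Python. This is a minimal sketch, assuming each `.md5` gate file stores the hex digest as its first whitespace-separated token (the `md5sum` convention); the helper names `read_gate_digest`, `gate_passes`, and `check_gates` are hypothetical and not part of the patch stack.

```python
from pathlib import Path

# Canonical digests quoted in the correctness result above.
CANONICAL = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def read_gate_digest(path):
    """Return the first token of a gate file (assumed md5sum-style layout)."""
    return Path(path).read_text().split()[0].lower()


def gate_passes(model, digest):
    """True when the digest equals the canonical md5 for that model."""
    return digest.lower() == CANONICAL[model]


def check_gates(gate_dir):
    """Check the default and Global-Ai32 gate files for both models."""
    results = {}
    for model in CANONICAL:
        for mode in ("default", "global_ai32"):
            path = Path(gate_dir) / f"gate_{model}_{mode}.md5"
            results[(model, mode)] = gate_passes(model, read_gate_digest(path))
    return results
```

Calling `check_gates` on the `gates/` directory listed above would return four booleans; the gate holds only if all of them are true.
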
Performance artifacts:

- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_base.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_global_ai32.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_base.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_global_ai32.txt`

Performance A/B:

| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|-------|------|----|----|---|----------|----------|-------|
| MoE | M5 base | 512 | 4 | 32 | 2325.86 | 396.05 | 2241.21 |
| MoE | Global Ai32 | 512 | 4 | 32 | 2106.50 | 398.55 | 2038.78 |
| MoE | M5 base | 2048 | 4 | 32 | 2425.10 | 389.63 | 2400.66 |
| MoE | Global Ai32 | 2048 | 4 | 32 | 2097.76 | 388.40 | 2079.92 |
| Dense | M5 base | 512 | 4 | 32 | 970.62 | 149.89 | 931.10 |
| Dense | Global Ai32 | 512 | 4 | 32 | 876.51 | 149.29 | 844.62 |
| Dense | M5 base | 2048 | 4 | 32 | 1016.14 | 182.16 | 1007.15 |
| Dense | Global Ai32 | 2048 | 4 | 32 | 918.19 | 183.00 | 911.05 |

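To make the size of the regression explicit, the S_PP columns of the A/B table can be turned into relative changes. The sketch below only copies numbers from the table; the names `S_PP`, `relative_change`, and `summarize` are illustrative.

```python
# (model, PP) -> (M5 base S_PP, Global-Ai32 S_PP), copied from the A/B table.
S_PP = {
    ("MoE", 512): (2325.86, 2106.50),
    ("MoE", 2048): (2425.10, 2097.76),
    ("Dense", 512): (970.62, 876.51),
    ("Dense", 2048): (1016.14, 918.19),
}


def relative_change(base, candidate):
    """Fractional change of candidate versus base (negative means slower)."""
    return (candidate - base) / base


def summarize(table):
    """Map each (model, PP) key to its relative S_PP change."""
    return {key: relative_change(*pair) for key, pair in table.items()}


if __name__ == "__main__":
    for (model, pp), delta in summarize(S_PP).items():
        print(f"{model:5s} PP={pp:<4d} S_PP change: {delta:+.1%}")
```

On these numbers prefill throughput drops by between about 9% and 13.5% in every row, while the S_TG column moves by less than 1% in either direction.
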
Rejected diff:

- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`

Conclusion:

- Do not ship Phase 13 Global-Ai32 as implemented.
- The global scratch split is correctness-safe but slower than shipped C16 M5.
- Per the Phase 12/13 decision rule, stop GDN kernel work on GB10. The remaining
  vLLM GDN advantage requires a fuller FLA-style blocked solve or hardware
  assumptions that do not fit this GB10 patch stack without a regression.

@@ -140,3 +140,33 @@ Phase 13 constraints:
- If md5 changes, run KL before benchmarking.
- If the prototype is flat or slower, reject it and stop GDN kernel work on
  GB10; do not iterate into f16 Ai until f32 proves the schedule can win.

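The two constraints above amount to a small decision procedure. The following is a sketch under stated assumptions: the source does not quantify "flat", so the `flat_band` tolerance (1% here) is hypothetical, as is the choice to require every benchmarked configuration to beat the base. `Verdict` and `phase13_verdict` are illustrative names.

```python
from enum import Enum


class Verdict(Enum):
    RUN_KL_FIRST = "md5 changed: run KL before benchmarking"
    REJECT = "flat or slower: reject and stop GDN kernel work on GB10"
    PROCEED = "faster than base: f32 shows the schedule can win"


def phase13_verdict(md5_matches, base_tps, proto_tps, kl_done=False, flat_band=0.01):
    """Apply the Phase 13 rule to paired base/prototype throughputs.

    flat_band is an assumed tolerance: a prototype within +1% of base
    counts as "flat" and is rejected like a slower one.
    """
    if not md5_matches and not kl_done:
        return Verdict.RUN_KL_FIRST
    for base, proto in zip(base_tps, proto_tps):
        if proto <= base * (1.0 + flat_band):
            return Verdict.REJECT
    return Verdict.PROCEED
```

For example, `phase13_verdict(True, [2425.10, 1016.14], [2097.76, 918.19])` evaluates the PP=2048 S_PP pairs reported for Phase 13 and yields `Verdict.REJECT`.
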
## Phase 13 Result

Phase 13 implemented the f32 Global-Ai32 prototype and rejected it.

Correctness:

- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.

Performance:

| Model | Mode | PP | S_PP t/s |
|-------|------|----|----------|
| MoE | M5 base | 2048 | 2425.10 |
| MoE | Global Ai32 | 2048 | 2097.76 |
| Dense | M5 base | 2048 | 1016.14 |
| Dense | Global Ai32 | 2048 | 918.19 |

Artifacts:

- `/home/mudler/bench/phase13_gdn_global_ai32/gates/`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`

Final decision:

- Reject Global-Ai32.
- Stop GDN kernel work on GB10. The remaining vLLM GDN advantage is not
  reachable through the low-conflict C16/C32 patch shapes tested here.

@@ -176,12 +176,13 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 |
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic |
| Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | REJECTED | md5-clean, but slower: MoE 2048 2425.10 -> 2097.76; dense 2048 1016.14 -> 918.19 |

Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation.

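The shared-memory argument above is simple arithmetic. A minimal sketch, assuming the quoted "99 KB" and "64 KB" are binary units (99 KiB and 64 KiB); `chunk_tile_bytes` models one hypothetical C x 128 f32 operand tile, since the actual set of buffers the M5 kernel stages is not described in this commit.

```python
KIB = 1024
F32_BYTES = 4
SMEM_CAP_BYTES = 99 * KIB  # GB10 dynamic shared-memory cap quoted above (assumed KiB)


def state_bytes(dk=128, dv=128):
    """Bytes for the dk x dv f32 recurrent state."""
    return dk * dv * F32_BYTES


def remaining_after_state(cap=SMEM_CAP_BYTES):
    """Shared memory left once the state is resident."""
    return cap - state_bytes()


def chunk_tile_bytes(c, d=128):
    """Bytes for one C x d f32 operand tile (hypothetical staging buffer)."""
    return c * d * F32_BYTES


if __name__ == "__main__":
    print("state KiB:", state_bytes() // KIB)
    print("left KiB:", remaining_after_state() // KIB)
    for c in (16, 32):
        print(f"C={c}: one tile = {chunk_tile_bytes(c) // KIB} KiB")
```

With the state resident, 35 KiB remain. A single tile of this shape costs 8 KiB at C=16 and 16 KiB at C=32, which illustrates why wider chunks crowd the budget.
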
Phase 12 caveat: this is not a shipped win. It authorizes only a default-off
`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work
on GB10 instead of iterating into f16 Ai or more local reorders.
Phase 13 closes the caveat: the default-off `GDN_GLOBAL_AI32=1` prototype was
correctness-clean but slower. Stop GDN kernel work on GB10 instead of iterating
into f16 Ai or more local reorders.

### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady)
| Lever | What | Verdict | Key number |

@@ -175,6 +175,7 @@ products through tensor cores. The series chased that headroom.
| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab |
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model |
| Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | **REJECTED** | md5-clean, but S_PP regressed: MoE 2048 **2425.10 -> 2097.76**, dense 2048 **1016.14 -> 918.19** | phase13 gates/ab |

**Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
@@ -186,11 +187,10 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly
that and is the shipped winner; it does not fully close the 2.62x because vLLM's
mature FLA blocked-solve is a more complete tensor-core implementation.

Post-record caveat: Phase 12 does not change the shipped verdict. It permits one
default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large
but not automatically disqualifying. If that prototype is flat or slower, GDN
kernel work on GB10 should stop rather than moving to f16 Ai or additional
local reorders.
Post-record caveat closed: Phase 13 tested the one permitted
`GDN_GLOBAL_AI32=1` prototype. It was correctness-clean but slower, so GDN kernel
work on GB10 should stop rather than moving to f16 Ai or additional local
reorders.

### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact)

@@ -549,6 +549,36 @@ Docs:
- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`
- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`

### Phase 13 GDN Global-Ai32 update

Phase 13 implemented the Phase 12 prototype behind `GDN_GLOBAL_AI32=1`:
precompute f32 Ai once per chunk/head, then consume it from two C32
`dv_tile=64` value slabs.

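The "once per chunk/head, consumed by two slabs" scheme lines up with the Phase 12 cost-model figures quoted in this commit (256 MiB scratch / 768 MiB Ai traffic for MoE, 384 MiB / 1152 MiB for dense at npp2048, npl32). The sketch below shows one decomposition that reproduces those numbers. It rests on assumptions not stated in the source: npp is read as prompt tokens per sequence and npl as the number of parallel sequences, each chunk/head stores one 32 x 32 f32 Ai block, traffic is one write plus one read per slab, and the head counts (32 for MoE, 48 for dense) are chosen only because they match the quoted sizes.

```python
MIB = 1024 * 1024
F32_BYTES = 4


def ai_scratch_bytes(npp, npl, heads, bt=32):
    """Bytes of f32 Ai scratch: one bt x bt block per chunk per head."""
    tokens = npp * npl          # total prompt tokens in the batch
    chunks = tokens // bt       # chunks of bt tokens
    return chunks * heads * bt * bt * F32_BYTES


def ai_traffic_bytes(scratch, slabs=2):
    """One write by the precompute kernel plus one read per value slab."""
    return scratch * (1 + slabs)


if __name__ == "__main__":
    # Head counts are assumptions picked to match the quoted figures.
    for name, heads in (("MoE", 32), ("dense", 48)):
        scratch = ai_scratch_bytes(2048, 32, heads)
        traffic = ai_traffic_bytes(scratch)
        print(f"{name}: {scratch // MIB} MiB scratch, {traffic // MIB} MiB traffic")
```

Under this reading the traffic figure is exactly three times the scratch size, which matches the quoted ratio; whether the real cost model counts traffic this way is not confirmed by the commit.
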
Result:

- Correctness passed:
  MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
  `5951a5b4d624ce891e22ab5fca9bc439`.
- Performance regressed:
  - MoE 2048 S_PP `2425.10 -> 2097.76`.
  - Dense 2048 S_PP `1016.14 -> 918.19`.

Decision:

- **REJECT** Global-Ai32.
- Do not add `0055`.
- Stop GDN kernel work on GB10. The shortcut space is exhausted by Phase 10,
  Phase 11, and Phase 13 evidence; further GDN parity work needs a different
  hardware regime or a larger FLA/CuteDSL-class implementation outside this
  low-conflict LocalAI patch stack.

Artifacts:

- `/home/mudler/bench/phase13_gdn_global_ai32/gates/`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`

---

# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)