docs(paged): scope GDN global Ai32 prototype

Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 01:38:51 +00:00
parent 1b5ae227eb
commit adabd11919
9 changed files with 1159 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -983,3 +983,45 @@ Conclusion:
 - The next GDN attempt should skip local scheduling-only changes and scope a
  true shared-A/Ai blocked-solve or global-scratch design, with an explicit
  scratch/synchronization cost model before coding.
+
+## Phase 12 GDN Shared-A/Ai Cost Model
+
+Phase 12 evaluated whether a real shared-A/Ai design is credible enough to
+prototype after the C32 slab and QS-early shortcut rejections.
+
+Cost-model doc:
+
+- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
+
+Metadata artifact:
+
+- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
+
+Model dimensions:
+
+| Model | GDN layers | H | S_v | Metadata basis |
+|-------|------------|---|-----|----------------|
+| MoE | 30 inferred | 32 inferred | 128 | `ssm.inner_size=4096`, `ssm.state_size=128` |
+| Dense | 48 inferred | 48 inferred | 128 | `ssm.inner_size=6144`, `ssm.state_size=128` |
+
+Dynamic-smem result for `S_v=128`:
+
+| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
+|-------|-------|-----|-------------------------|
+| C16 full-width | 93,376 | 91.19 | yes |
+| C32 full-width | 127,360 | 124.38 | no |
+| C32 slab64 + U staging | 94,592 | 92.38 | yes |
+
+Ai scratch result at `npp=2048,npl=32,BT=32,f32`:
+
+| Model | Ai scratch MiB | 3x Ai traffic MiB |
+|-------|----------------|-------------------|
+| MoE | 256.0 | 768.0 |
+| Dense | 384.0 | 1152.0 |
+
+Decision:
+
+- GO for a default-off Phase 13 global-Ai32 prototype.
+- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`.
+- The prototype must be rejected if it is flat or slower; do not iterate into
+  f16/BF16 Ai unless f32 proves the schedule can win.
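The Ai scratch figures in the hunk above can be cross-checked against the formula recorded in `GDN_SHARED_AI_COST_MODEL.md` (`npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)`), with `H = ssm.inner_size / ssm.state_size`. The following is a minimal sketch of that arithmetic only; the names `ai_scratch_bytes`, `summarize`, and `MODELS` are hypothetical helpers for this note and are not part of the commit or of llama.cpp.

```python
import math

MIB = 1024 * 1024

# (model, ssm.inner_size, ssm.state_size), copied from the metadata table.
MODELS = [("MoE", 4096, 128), ("Dense", 6144, 128)]


def ai_scratch_bytes(npl, heads, npp, bt, dtype_bytes):
    """Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)."""
    chunks = math.ceil(npp / bt)
    return npl * heads * chunks * bt * bt * dtype_bytes


def summarize(npp=2048, npl=32, bt=32, dtype_bytes=4, slabs=2):
    """Return {model: (H, scratch MiB, traffic MiB)} for one GDN layer call.

    Traffic is modeled as one Ai write plus one Ai read per value slab,
    which is the "3x Ai traffic" column when slabs == 2.
    """
    rows = {}
    for name, inner, state in MODELS:
        heads = inner // state
        scratch = ai_scratch_bytes(npl, heads, npp, bt, dtype_bytes) / MIB
        traffic = scratch * (1 + slabs)
        rows[name] = (heads, scratch, traffic)
    return rows


if __name__ == "__main__":
    for name, (heads, scratch, traffic) in summarize().items():
        print(f"{name}: H={heads} scratch={scratch:.1f} MiB traffic={traffic:.1f} MiB")
```

Calling `summarize(dtype_bytes=2)` or `summarize(bt=64)` gives the f16 and `BT=64` rows of the larger table in the cost-model doc under the same assumptions.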
--- a/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md
@@ -0,0 +1,142 @@
+# GDN Shared-A/Ai Cost Model
+
+Phase 12 decides whether the next GDN prefill attempt should implement a
+shared-A/Ai global-scratch prototype or stop GDN kernel work on GB10.
+
+## Reference Points
+
+llama.cpp:
+
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
+  - `gated_delta_net_chunked_cuda`
+  - `launch_gdn_chunked`
+  - `launch_gated_delta_net`
+  - `ggml_cuda_op_gated_delta_net`
+
+vLLM/FLA:
+
+- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py`
+  - `chunk_gated_delta_rule_fwd`
+- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py`
+  - `solve_tril`
+  - `solve_tril_16x16_kernel`
+  - `merge_16x16_to_32x32_inverse_kernel`
+  - `merge_16x16_to_64x64_inverse_kernel`
+- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py`
+  - `recompute_w_u_fwd`
+
+## Metadata
+
+DGX metadata artifact:
+
+- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
+
+GGUF metadata:
+
+| Model | Arch | Blocks | Full-attn interval | GDN layers | SSM inner | SSM state | GDN heads |
+|-------|------|--------|--------------------|------------|-----------|-----------|-----------|
+| MoE | `qwen35moe` | 41 | 4 | 30 inferred | 4096 | 128 | 32 inferred |
+| Dense | `qwen35` | 64 | 4 | 48 inferred | 6144 | 128 | 48 inferred |
+
+Notes:
+
+- `GDN heads = ssm.inner_size / ssm.state_size`.
+- MoE has one `nextn` layer; the serving/prefill stack uses the 40 normal
+  layers, with 30 GDN layers at interval 4.
+- Dense has 64 layers, 48 GDN layers at interval 4.
+
+## Dynamic Shared Memory
+
+Formula:
+
+```text
+C16 full-width current M5 (C=16):
+  floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
+
+C32 full-width (same layout, C=32):
+  floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
+
+C32 slab64 with U staging (C=32, dv_tile=64):
+  floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C
+```
+
+For `S_v=128`:
+
+| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
+|-------|-------|-----|-------------------------|
+| C16 full-width | 93,376 | 91.19 | yes |
+| C32 full-width | 127,360 | 124.38 | no |
+| C32 slab64 + U staging | 94,592 | 92.38 | yes |
+
+Implication:
+
+- C32 full-width cannot be a single current-style CTA on GB10.
+- C32 only fits by splitting value columns or by changing state residency.
+- Splitting value columns must share A/Ai or it repeats the Phase 10 failure.
+
+## Ai Scratch Size
+
+Formula:
+
+```text
+Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)
+```
+
+Benchmark shape: `npl=32`, `S_v=128`.
+
+| Model | H | npp | BT | Ai dtype | Chunks | Ai scratch MiB | 3x Ai traffic MiB |
+|-------|---|-----|----|----------|--------|----------------|-------------------|
+| MoE | 32 | 512 | 32 | f32 | 16 | 64.0 | 192.0 |
+| MoE | 32 | 512 | 32 | f16 | 16 | 32.0 | 96.0 |
+| MoE | 32 | 512 | 64 | f32 | 8 | 128.0 | 384.0 |
+| MoE | 32 | 512 | 64 | f16 | 8 | 64.0 | 192.0 |
+| MoE | 32 | 2048 | 32 | f32 | 64 | 256.0 | 768.0 |
+| MoE | 32 | 2048 | 32 | f16 | 64 | 128.0 | 384.0 |
+| MoE | 32 | 2048 | 64 | f32 | 32 | 512.0 | 1536.0 |
+| MoE | 32 | 2048 | 64 | f16 | 32 | 256.0 | 768.0 |
+| Dense | 48 | 512 | 32 | f32 | 16 | 96.0 | 288.0 |
+| Dense | 48 | 512 | 32 | f16 | 16 | 48.0 | 144.0 |
+| Dense | 48 | 512 | 64 | f32 | 8 | 192.0 | 576.0 |
+| Dense | 48 | 512 | 64 | f16 | 8 | 96.0 | 288.0 |
+| Dense | 48 | 2048 | 32 | f32 | 64 | 384.0 | 1152.0 |
+| Dense | 48 | 2048 | 32 | f16 | 64 | 192.0 | 576.0 |
+| Dense | 48 | 2048 | 64 | f32 | 32 | 768.0 | 2304.0 |
+| Dense | 48 | 2048 | 64 | f16 | 32 | 384.0 | 1152.0 |
+
+`3x Ai traffic` means one Ai write plus two Ai reads, one per value slab.
+
+## Interpretation
+
+The f32 `BT=32` scratch path is large but plausible:
+
+- Peak scratch is 256 MiB for MoE and 384 MiB for dense at `npp=2048,npl=32`.
+- Ai traffic is 768 MiB for MoE and 1152 MiB (1.125 GiB) for dense per GDN layer call.
+- This is not free on LPDDR5x, but it is not automatically worse than
+  recomputing A/Ai in every value slab.
+
+The f16/BF16 Ai path halves traffic but should not be first because Phase 10 and
+Phase 11 showed correctness must be established before performance. The first
+prototype should store Ai in f32, stay default-off, and use md5/KL gates before
+trying a lossy Ai dtype.
+
+## Decision
+
+GO: Phase 13 should implement a default-off global-Ai scratch prototype.
+
+Rationale:
+
+- The only remaining C32 path that addresses Phase 10's failure is sharing A/Ai
+  across value slabs.
+- `BT=32` f32 scratch has acceptable peak memory for the existing GB10
+  benchmark shapes.
+- The implementation can be default-off and rejected cleanly if global scratch
+  traffic or extra launch boundaries dominate.
+
+Phase 13 constraints:
+
+- Prototype only `BT=32`, f32 Ai, two `dv_tile=64` value slabs.
+- Keep decode out of the prototype path via `GDN_CHUNK_MIN > 1`.
+- Gate with `GATED_DELTA_NET`, canonical MoE/dense md5, and same-session A/B.
+- If md5 changes, run KL before benchmarking.
+- If the prototype is flat or slower, reject it and stop GDN kernel work on
+  GB10; do not iterate into f16 Ai until f32 proves the schedule can win.
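The dynamic shared-memory table in the cost-model doc above follows directly from its float-count formulas at `S_v=128`. The sketch below reproduces the byte counts and the fit verdicts. It assumes 4-byte floats and reads the "99 KB dynamic-smem cap" quoted in `PARITY_HANDOFF.md` as 99 KiB (101,376 bytes); both are assumptions of this note. The function names are hypothetical and do not exist in the codebase.

```python
FLOAT_BYTES = 4
# Assumption: the "99 KB" GB10 dynamic-smem cap is 99 KiB.
GB10_DYN_SMEM_CAP = 99 * 1024


def full_width_floats(s_v, c):
    """Full-width layout; the same formula is used for C=16 and C=32."""
    return s_v * s_v + 2 * c * s_v + s_v * c + c * c + 3 * c + 2 * c * c


def slab64_floats(s_v, c, dv_tile=64):
    """Value-slab layout with U staging: state and U hold dv_tile columns."""
    return (s_v * dv_tile + 2 * c * s_v + dv_tile * c
            + c * c + 3 * c + 2 * c * c + dv_tile * c)


def smem_report(s_v=128, cap=GB10_DYN_SMEM_CAP):
    """Return {shape: (bytes, fits_under_cap)} for the three shapes."""
    shapes = {
        "C16 full-width": full_width_floats(s_v, 16),
        "C32 full-width": full_width_floats(s_v, 32),
        "C32 slab64 + U staging": slab64_floats(s_v, 32),
    }
    return {name: (n * FLOAT_BYTES, n * FLOAT_BYTES <= cap)
            for name, n in shapes.items()}


if __name__ == "__main__":
    for name, (nbytes, fits) in smem_report().items():
        verdict = "yes" if fits else "no"
        print(f"{name}: {nbytes:,} B = {nbytes / 1024:.2f} KiB, fits: {verdict}")
```

Under these assumptions only the C32 full-width shape exceeds the cap, which matches the doc's implication that C32 needs either value-column slabs or a different state residency.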
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -175,9 +175,14 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
 | bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
 | Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
 | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 |
+| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic |

 Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation.

+Phase 12 caveat: this is not a shipped win. It authorizes only a default-off
+`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work
+on GB10 instead of iterating into f16 Ai or more local reorders.
+
 ### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady)
 | Lever | What | Verdict | Key number |
 |---|---|---|---|
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -174,6 +174,7 @@ products through tensor cores. The series chased that headroom.
 | bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
 | Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
 | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab |
+| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model |

 **Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
 intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
@@ -185,6 +186,12 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly
 that and is the shipped winner; it does not fully close the 2.62x because vLLM's
 mature FLA blocked-solve is a more complete tensor-core implementation.

+Post-record caveat: Phase 12 does not change the shipped verdict. It permits one
+default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large
+but not automatically disqualifying. If that prototype is flat or slower, GDN
+kernel work on GB10 should stop rather than moving to f16 Ai or additional
+local reorders.
+
 ### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact)

 **Methodology correction - why every earlier decode decomposition was wrong.**
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -521,6 +521,34 @@ Artifacts:
 - `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/`
 - `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff`

+### Phase 12 GDN shared-A/Ai cost-model update
+
+Phase 12 scoped the next non-shortcut GDN path: compute f32 Ai once per
+`(sequence, head, chunk)` and reuse it across two `dv_tile=64` value slabs.
+
+Cost model:
+
+- C16 full-width M5 uses `93,376 B` dynamic smem.
+- C32 full-width would need `127,360 B`, which does not fit GB10.
+- C32 slab64 fits at `94,592 B`, but Phase 10 showed it loses when A/Ai is
+  recomputed per slab.
+- For `BT=32`, f32 Ai scratch at `npp=2048,npl=32` is:
+  - MoE H=32: `256 MiB`, with `768 MiB` total Ai write/read traffic.
+  - Dense H=48: `384 MiB`, with `1152 MiB` total Ai write/read traffic.
+
+Decision:
+
+- **GO** to a default-off Phase 13 prototype, not a shipped patch.
+- Scope: `GDN_GLOBAL_AI32=1`, `BT=32`, f32 Ai, two `dv_tile=64` slabs.
+- Reject if same-session A/B is flat/slower. If rejected, stop GDN kernel work
+  on GB10 rather than iterating into f16 Ai or more local reorders.
+
+Docs:
+
+- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
+- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`
+- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`
+
 ---

 # PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)