docs(paged): scope GDN global Ai32 prototype

Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 01:38:51 +00:00
parent 1b5ae227eb
commit adabd11919
9 changed files with 1159 additions and 0 deletions


@@ -983,3 +983,45 @@ Conclusion:
- The next GDN attempt should skip local scheduling-only changes and scope a
true shared-A/Ai blocked-solve or global-scratch design, with an explicit
scratch/synchronization cost model before coding.
## Phase 12 GDN Shared-A/Ai Cost Model
Phase 12 evaluated whether a real shared-A/Ai design is credible enough to
prototype after the C32 slab and QS-early shortcut rejections.
Cost-model doc:
- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
Metadata artifact:
- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
Model dimensions:
| Model | GDN layers | H | S_v | Metadata basis |
|-------|------------|---|-----|----------------|
| MoE | 30 inferred | 32 inferred | 128 | `ssm.inner_size=4096`, `ssm.state_size=128` |
| Dense | 48 inferred | 48 inferred | 128 | `ssm.inner_size=6144`, `ssm.state_size=128` |
Dynamic-smem result for `S_v=128`:
| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
|-------|-------|-----|-------------------------|
| C16 full-width | 93,376 | 91.19 | yes |
| C32 full-width | 127,360 | 124.38 | no |
| C32 slab64 + U staging | 94,592 | 92.38 | yes |
Ai scratch result at `npp=2048,npl=32,BT=32,f32`:
| Model | Ai scratch MiB | 3x Ai traffic MiB |
|-------|----------------|-------------------|
| MoE | 256.0 | 768.0 |
| Dense | 384.0 | 1152.0 |
Decision:
- GO for a default-off Phase 13 global-Ai32 prototype.
- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`.
- The prototype must be rejected if it is flat or slower; do not iterate into
f16/BF16 Ai unless f32 proves the schedule can win.


@@ -0,0 +1,142 @@
# GDN Shared-A/Ai Cost Model
Phase 12 decides whether the next GDN prefill attempt should implement a
shared-A/Ai global-scratch prototype or stop GDN kernel work on GB10.
## Reference Points
llama.cpp:
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
  - `gated_delta_net_chunked_cuda`
  - `launch_gdn_chunked`
  - `launch_gated_delta_net`
  - `ggml_cuda_op_gated_delta_net`
vLLM/FLA:
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py`
  - `chunk_gated_delta_rule_fwd`
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py`
  - `solve_tril`
  - `solve_tril_16x16_kernel`
  - `merge_16x16_to_32x32_inverse_kernel`
  - `merge_16x16_to_64x64_inverse_kernel`
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py`
  - `recompute_w_u_fwd`
## Metadata
DGX metadata artifact:
- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
GGUF metadata:
| Model | Arch | Blocks | Full-attn interval | GDN layers | SSM inner | SSM state | GDN heads |
|-------|------|--------|--------------------|------------|-----------|-----------|-----------|
| MoE | `qwen35moe` | 41 | 4 | 30 inferred | 4096 | 128 | 32 inferred |
| Dense | `qwen35` | 64 | 4 | 48 inferred | 6144 | 128 | 48 inferred |
Notes:
- `GDN heads = ssm.inner_size / ssm.state_size`.
- MoE has one `nextn` layer; the serving/prefill stack uses the 40 normal
layers. With full attention on every fourth layer, 30 of them are GDN layers.
- Dense has 64 layers; with full attention on every fourth layer, 48 are GDN
layers.
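
As a cross-check, the inferred columns can be reproduced from the GGUF fields in the table. The sketch below is illustrative only: the function names are hypothetical, and the layer count assumes that exactly one layer in every full-attention interval is full attention while the rest are GDN.

```python
def gdn_heads(ssm_inner_size: int, ssm_state_size: int) -> int:
    """GDN heads = ssm.inner_size / ssm.state_size (from the notes above)."""
    return ssm_inner_size // ssm_state_size


def gdn_layers(blocks: int, nextn_layers: int, full_attn_interval: int) -> int:
    """GDN layer count, assuming one full-attention layer per interval."""
    normal = blocks - nextn_layers          # layers used by serving/prefill
    full_attn = normal // full_attn_interval
    return normal - full_attn


# MoE: 41 blocks, one nextn layer, interval 4, inner 4096, state 128.
moe = (gdn_layers(41, 1, 4), gdn_heads(4096, 128))
# Dense: 64 blocks, no nextn layer, interval 4, inner 6144, state 128.
dense = (gdn_layers(64, 0, 4), gdn_heads(6144, 128))

print("MoE   (GDN layers, heads):", moe)
print("Dense (GDN layers, heads):", dense)
```

Both results should agree with the `GDN layers` and `GDN heads` columns marked "inferred".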
## Dynamic Shared Memory
Formula:
```text
C16 full-width current M5 (C=16):
  floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
C32 full-width (same expression, C=32):
  floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
C32 slab64 with U staging (C=32):
  floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C
All buffers are f32:
  bytes = 4 * floats
```
For `S_v=128`:
| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
|-------|-------|-----|-------------------------|
| C16 full-width | 93,376 | 91.19 | yes |
| C32 full-width | 127,360 | 124.38 | no |
| C32 slab64 + U staging | 94,592 | 92.38 | yes |
Implication:
- C32 full-width cannot be a single current-style CTA on GB10.
- C32 only fits by splitting value columns or by changing state residency.
- Splitting value columns must share A/Ai or it repeats the Phase 10 failure.
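
The three table rows can be reproduced with a few lines of arithmetic. This is a sketch rather than code from the kernel: the function names are hypothetical, every buffer is assumed to be f32 (4 bytes), and the cap is taken as 99 KiB (101,376 bytes), which is one reading of the 99 KB GB10 dynamic-smem limit quoted elsewhere in this series.

```python
FLOAT_BYTES = 4             # assumption: every buffer in the formula is f32
SMEM_CAP_BYTES = 99 * 1024  # assumption: the 99 KB cap read as 99 KiB


def full_width_floats(s_v: int, c: int) -> int:
    """Float count for a full-width CTA; C16 and C32 share this expression."""
    return s_v * s_v + 2 * c * s_v + s_v * c + c * c + 3 * c + 2 * c * c


def slab_floats(s_v: int, c: int, dv_tile: int = 64) -> int:
    """Float count for one dv_tile-wide value slab with U staging."""
    return (s_v * dv_tile + 2 * c * s_v + dv_tile * c + c * c + 3 * c
            + 2 * c * c + dv_tile * c)


def smem_bytes(floats: int) -> int:
    """Convert a float count to bytes."""
    return floats * FLOAT_BYTES


shapes = {
    "C16 full-width": smem_bytes(full_width_floats(128, 16)),
    "C32 full-width": smem_bytes(full_width_floats(128, 32)),
    "C32 slab64 + U staging": smem_bytes(slab_floats(128, 32)),
}

for name, nbytes in shapes.items():
    fits = "yes" if nbytes <= SMEM_CAP_BYTES else "no"
    print(f"{name}: {nbytes:,} B, {nbytes / 1024:.2f} KiB, fits: {fits}")
```

Keeping the slab width as a parameter makes it easy to check other `dv_tile` values against the same cap before writing any kernel code.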
## Ai Scratch Size
Formula:
```text
Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)
```
Benchmark shape: `npl=32`, `S_v=128`.
| Model | H | npp | BT | Ai dtype | Chunks | Ai scratch MiB | 3x Ai traffic MiB |
|-------|---|-----|----|----------|--------|----------------|-------------------|
| MoE | 32 | 512 | 32 | f32 | 16 | 64.0 | 192.0 |
| MoE | 32 | 512 | 32 | f16 | 16 | 32.0 | 96.0 |
| MoE | 32 | 512 | 64 | f32 | 8 | 128.0 | 384.0 |
| MoE | 32 | 512 | 64 | f16 | 8 | 64.0 | 192.0 |
| MoE | 32 | 2048 | 32 | f32 | 64 | 256.0 | 768.0 |
| MoE | 32 | 2048 | 32 | f16 | 64 | 128.0 | 384.0 |
| MoE | 32 | 2048 | 64 | f32 | 32 | 512.0 | 1536.0 |
| MoE | 32 | 2048 | 64 | f16 | 32 | 256.0 | 768.0 |
| Dense | 48 | 512 | 32 | f32 | 16 | 96.0 | 288.0 |
| Dense | 48 | 512 | 32 | f16 | 16 | 48.0 | 144.0 |
| Dense | 48 | 512 | 64 | f32 | 8 | 192.0 | 576.0 |
| Dense | 48 | 512 | 64 | f16 | 8 | 96.0 | 288.0 |
| Dense | 48 | 2048 | 32 | f32 | 64 | 384.0 | 1152.0 |
| Dense | 48 | 2048 | 32 | f16 | 64 | 192.0 | 576.0 |
| Dense | 48 | 2048 | 64 | f32 | 32 | 768.0 | 2304.0 |
| Dense | 48 | 2048 | 64 | f16 | 32 | 384.0 | 1152.0 |
`3x Ai traffic` means one Ai write plus two Ai reads, one read per value slab.
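
The scratch formula and the traffic multiplier can be checked with the following sketch. The helper names and the dtype table are hypothetical; the traffic model is the one stated above (one write plus one read per value slab) and ignores caching effects.

```python
import math

MIB = 1024 * 1024
DTYPE_BYTES = {"f32": 4, "f16": 2}


def ai_scratch_bytes(npl: int, h: int, npp: int, bt: int, dtype: str) -> int:
    """Ai scratch = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)."""
    chunks = math.ceil(npp / bt)
    return npl * h * chunks * bt * bt * DTYPE_BYTES[dtype]


def ai_traffic_bytes(scratch_bytes: int, value_slabs: int = 2) -> int:
    """One Ai write plus one Ai read per value slab."""
    return scratch_bytes * (1 + value_slabs)


for model, h in (("MoE", 32), ("Dense", 48)):
    scratch = ai_scratch_bytes(npl=32, h=h, npp=2048, bt=32, dtype="f32")
    traffic = ai_traffic_bytes(scratch)
    print(f"{model}: scratch {scratch / MIB:.1f} MiB, "
          f"traffic {traffic / MIB:.1f} MiB")
```

Because the scratch size grows with `BT * BT` per chunk while the chunk count only falls with `1 / BT`, doubling `BT` doubles the scratch, which matches the `BT=64` rows.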
## Interpretation
The f32 `BT=32` scratch path is large but plausible:
- Peak scratch is 256 MiB for MoE and 384 MiB for dense at `npp=2048,npl=32`.
- Ai traffic is 768 MiB for MoE and 1152 MiB (1.125 GiB) for dense per GDN layer call.
- This is not free on LPDDR5x, but it is not automatically worse than
recomputing A/Ai in every value slab.
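
To put the traffic figures in context, a bandwidth-only lower bound on the added time per GDN layer call can be sketched as below. The function name is hypothetical, and the bandwidth value is a placeholder chosen for illustration, not a measured GB10 number; substitute the sustained bandwidth observed on the target machine.

```python
MIB = 1024 * 1024


def traffic_floor_ms(traffic_mib: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on transfer time if memory bandwidth were the only cost."""
    seconds = traffic_mib * MIB / (bandwidth_gb_per_s * 1e9)
    return seconds * 1e3


# Placeholder bandwidth (assumption): replace with a measured figure.
PLACEHOLDER_BW_GB_PER_S = 200.0

for model, traffic_mib in (("MoE", 768.0), ("Dense", 1152.0)):
    floor = traffic_floor_ms(traffic_mib, PLACEHOLDER_BW_GB_PER_S)
    print(f"{model}: at least {floor:.2f} ms per GDN layer call")
```

This bound ignores cache hits, launch overhead, and overlap with compute, so it is only useful for a first comparison against the measured cost of recomputing A/Ai per slab.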
The f16/BF16 Ai path halves traffic but should not be first because Phase 10 and
Phase 11 showed correctness must be established before performance. The first
prototype should store Ai in f32, stay default-off, and use md5/KL gates before
trying a lossy Ai dtype.
## Decision
GO: Phase 13 should implement a default-off global-Ai scratch prototype.
Rationale:
- The only remaining C32 path that addresses Phase 10's failure is sharing A/Ai
across value slabs.
- `BT=32` f32 scratch has acceptable peak memory for the existing GB10
benchmark shapes.
- The implementation can be default-off and rejected cleanly if global scratch
traffic or extra launch boundaries dominate.
Phase 13 constraints:
- Prototype only `BT=32`, f32 Ai, two `dv_tile=64` value slabs.
- Keep decode out via `GDN_CHUNK_MIN > 1`.
- Gate with `GATED_DELTA_NET`, canonical MoE/dense md5, and same-session A/B.
- If md5 changes, run KL before benchmarking.
- If the prototype is flat or slower, reject it and stop GDN kernel work on
GB10; do not iterate into f16 Ai until f32 proves the schedule can win.
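
The gate order in these constraints can be summarized as a small decision function. This is a sketch under assumptions, not part of the benchmark tooling: the function name, the return labels, and the `min_gain` threshold that defines "flat" are all hypothetical, and the throughput arguments are assumed to be S_PP values where higher is better.

```python
from typing import Optional


def phase13_verdict(md5_matches: bool,
                    kl_passes: Optional[bool],
                    baseline_s_pp: float,
                    candidate_s_pp: float,
                    min_gain: float = 0.01) -> str:
    """Apply the gates in order: md5, then KL if md5 changed, then A/B."""
    if not md5_matches:
        if kl_passes is None:
            return "RUN_KL"      # md5 changed: KL must run before benchmarking
        if not kl_passes:
            return "REJECT"      # output drifted beyond the KL gate
    gain = candidate_s_pp / baseline_s_pp - 1.0
    # Flat or slower is a rejection; min_gain is an assumed noise margin.
    return "KEEP" if gain > min_gain else "REJECT"


# Replaying the Phase 11 MoE 2048 A/B numbers (md5-clean, slightly slower).
print(phase13_verdict(True, None, 2441.54, 2420.26))
```

Fixing the threshold before running the same-session A/B avoids arguing after the fact about whether a small delta counts as a win.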


@@ -175,9 +175,14 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
| bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 |
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic |
Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close the 2.62x gap because vLLM's FLA blocked-solve is a more complete TC implementation.
Phase 12 caveat: this is not a shipped win. It authorizes only a default-off
`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work
on GB10 instead of iterating into f16 Ai or more local reorders.
### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady)
| Lever | What | Verdict | Key number |
|---|---|---|---|


@@ -174,6 +174,7 @@ products through tensor cores. The series chased that headroom.
| bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab |
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model |
**Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
@@ -185,6 +186,12 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly
that and is the shipped winner; it does not fully close the 2.62x because vLLM's
mature FLA blocked-solve is a more complete tensor-core implementation.
Post-record caveat: Phase 12 does not change the shipped verdict. It permits one
default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large
but not automatically disqualifying. If that prototype is flat or slower, GDN
kernel work on GB10 should stop rather than moving to f16 Ai or additional
local reorders.
### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact)
**Methodology correction - why every earlier decode decomposition was wrong.**


@@ -521,6 +521,34 @@ Artifacts:
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff`
### Phase 12 GDN shared-A/Ai cost-model update
Phase 12 scoped the next non-shortcut GDN path: compute f32 Ai once per
`(sequence, head, chunk)` and reuse it across two `dv_tile=64` value slabs.
Cost model:
- C16 full-width M5 uses `93,376 B` dynamic smem.
- C32 full-width would need `127,360 B`, which does not fit GB10.
- C32 slab64 fits at `94,592 B`, but Phase 10 showed it loses when A/Ai is
recomputed per slab.
- For `BT=32`, f32 Ai scratch at `npp=2048,npl=32` is:
  - MoE H=32: `256 MiB`, with `768 MiB` total Ai write/read traffic.
  - Dense H=48: `384 MiB`, with `1152 MiB` total Ai write/read traffic.
Decision:
- **GO** to a default-off Phase 13 prototype, not a shipped patch.
- Scope: `GDN_GLOBAL_AI32=1`, `BT=32`, f32 Ai, two `dv_tile=64` slabs.
- Reject if same-session A/B is flat/slower. If rejected, stop GDN kernel work
on GB10 rather than iterating into f16 Ai or more local reorders.
Docs:
- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`
- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`
---
# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)