mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): scope GDN global Ai32 prototype
Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.

Assisted-by: Codex:gpt-5
@@ -983,3 +983,45 @@ Conclusion:
- The next GDN attempt should skip local scheduling-only changes and scope a
  true shared-A/Ai blocked-solve or global-scratch design, with an explicit
  scratch/synchronization cost model before coding.

## Phase 12 GDN Shared-A/Ai Cost Model

Phase 12 evaluated whether a real shared-A/Ai design is credible enough to
prototype after the C32 slab and QS-early shortcut rejections.

Cost-model doc:

- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`

Metadata artifact:

- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`

Model dimensions:

| Model | GDN layers | H | S_v | Metadata basis |
|-------|------------|---|-----|----------------|
| MoE | 30 inferred | 32 inferred | 128 | `ssm.inner_size=4096`, `ssm.state_size=128` |
| Dense | 48 inferred | 48 inferred | 128 | `ssm.inner_size=6144`, `ssm.state_size=128` |

Dynamic-smem result for `S_v=128`:

| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
|-------|-------|-----|-------------------------|
| C16 full-width | 93,376 | 91.19 | yes |
| C32 full-width | 127,360 | 124.38 | no |
| C32 slab64 + U staging | 94,592 | 92.38 | yes |

Ai scratch result at `npp=2048,npl=32,BT=32,f32`:

| Model | Ai scratch MiB | 3x Ai traffic MiB |
|-------|----------------|-------------------|
| MoE | 256.0 | 768.0 |
| Dense | 384.0 | 1152.0 |

Decision:

- GO for a default-off Phase 13 global-Ai32 prototype.
- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`.
- The prototype must be rejected if it is flat or slower; do not iterate into
  f16/BF16 Ai unless f32 proves the schedule can win.

@@ -0,0 +1,142 @@
# GDN Shared-A/Ai Cost Model

Phase 12 decides whether the next GDN prefill attempt should implement a
shared-A/Ai global-scratch prototype or stop GDN kernel work on GB10.

## Reference Points

llama.cpp:

- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
  - `gated_delta_net_chunked_cuda`
  - `launch_gdn_chunked`
  - `launch_gated_delta_net`
  - `ggml_cuda_op_gated_delta_net`

vLLM/FLA:

- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py`
  - `chunk_gated_delta_rule_fwd`
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py`
  - `solve_tril`
  - `solve_tril_16x16_kernel`
  - `merge_16x16_to_32x32_inverse_kernel`
  - `merge_16x16_to_64x64_inverse_kernel`
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py`
  - `recompute_w_u_fwd`

## Metadata

DGX metadata artifact:

- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`

GGUF metadata:

| Model | Arch | Blocks | Full-attn interval | GDN layers | SSM inner | SSM state | GDN heads |
|-------|------|--------|--------------------|------------|-----------|-----------|-----------|
| MoE | `qwen35moe` | 41 | 4 | 30 inferred | 4096 | 128 | 32 inferred |
| Dense | `qwen35` | 64 | 4 | 48 inferred | 6144 | 128 | 48 inferred |

Notes:

- `GDN heads = ssm.inner_size / ssm.state_size`.
- MoE has 41 blocks, one of which is a `nextn` layer; the serving/prefill stack
  uses the 40 normal layers. With a full-attention interval of 4, 10 of those
  are full-attention layers and the remaining 30 are GDN layers.
- Dense has 64 layers; with the same interval of 4, 16 are full-attention
  layers and the remaining 48 are GDN layers.

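The inference rules in the notes can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the repository: the function names are hypothetical, and the layer rule (one full-attention layer per `full_attn_interval` normal layers, the rest GDN) is an assumption that happens to reproduce the "inferred" columns of the table.

```python
def gdn_heads(ssm_inner_size: int, ssm_state_size: int) -> int:
    """GDN heads = ssm.inner_size / ssm.state_size (rule stated in the notes)."""
    if ssm_inner_size % ssm_state_size != 0:
        raise ValueError("ssm.inner_size must be a multiple of ssm.state_size")
    return ssm_inner_size // ssm_state_size


def gdn_layers(normal_layers: int, full_attn_interval: int) -> int:
    """Assumed rule: every `full_attn_interval`-th layer is full attention."""
    return normal_layers - normal_layers // full_attn_interval


# MoE: 41 blocks, one of which is the `nextn` layer, so 40 normal layers.
moe = (gdn_layers(41 - 1, 4), gdn_heads(4096, 128))
# Dense: all 64 blocks are normal layers.
dense = (gdn_layers(64, 4), gdn_heads(6144, 128))

print("MoE   (GDN layers, GDN heads):", moe)
print("Dense (GDN layers, GDN heads):", dense)
```

If a future model changes the interval or adds more `nextn` layers, only the arguments change; the values should still be confirmed against the GGUF metadata artifact rather than trusted from this sketch.
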
## Dynamic Shared Memory

Formula:

```text
C16 full-width current M5:
floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C

C32 full-width:
floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C

C32 slab64 with U staging:
floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C
```

For `S_v=128`:

| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
|-------|-------|-----|-------------------------|
| C16 full-width | 93,376 | 91.19 | yes |
| C32 full-width | 127,360 | 124.38 | no |
| C32 slab64 + U staging | 94,592 | 92.38 | yes |

Implication:

- C32 full-width cannot be a single current-style CTA on GB10.
- C32 only fits by splitting value columns or by changing state residency.
- Splitting value columns must share A/Ai or it repeats the Phase 10 failure.

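The three formulas and the fit column can be reproduced with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, the literal `64` in the slab formula is parameterized as `dv_tile`, and the cap is taken as the 99 KB GB10 dynamic-smem figure quoted elsewhere in these docs (interpreted here as 99 * 1024 bytes).

```python
FLOAT_BYTES = 4
GB10_DYNAMIC_SMEM_CAP = 99 * 1024  # assumption: "99 KB" read as 99 KiB


def full_width_floats(s_v: int, c: int) -> int:
    """Float count for a full-width CTA (same formula for C16 and C32)."""
    return s_v * s_v + 2 * c * s_v + s_v * c + c * c + 3 * c + 2 * c * c


def slab_floats(s_v: int, c: int, dv_tile: int = 64) -> int:
    """Float count for one value slab with U staging.

    The documented formula hard-codes 64; calling it dv_tile is an assumption.
    """
    return (s_v * dv_tile + 2 * c * s_v + dv_tile * c + c * c + 3 * c
            + 2 * c * c + dv_tile * c)


def smem_report(s_v: int = 128) -> dict:
    """Return dynamic shared-memory bytes per shape for a given S_v."""
    shapes = {
        "C16 full-width": full_width_floats(s_v, 16),
        "C32 full-width": full_width_floats(s_v, 32),
        "C32 slab64 + U staging": slab_floats(s_v, 32),
    }
    return {name: floats * FLOAT_BYTES for name, floats in shapes.items()}


if __name__ == "__main__":
    for name, nbytes in smem_report().items():
        fits = "yes" if nbytes <= GB10_DYNAMIC_SMEM_CAP else "no"
        print(f"{name:24s} {nbytes:>8,d} B  {nbytes / 1024:7.2f} KiB  fits={fits}")
```

Keeping the formulas as functions makes it cheap to re-check the fit if a buffer is added or removed from the kernel, before any CUDA code is written.
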
## Ai Scratch Size

Formula:

```text
Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)
```

Benchmark shape: `npl=32`, `S_v=128`.

| Model | H | npp | BT | Ai dtype | Chunks | Ai scratch MiB | 3x Ai traffic MiB |
|-------|---|-----|----|----------|--------|----------------|-------------------|
| MoE | 32 | 512 | 32 | f32 | 16 | 64.0 | 192.0 |
| MoE | 32 | 512 | 32 | f16 | 16 | 32.0 | 96.0 |
| MoE | 32 | 512 | 64 | f32 | 8 | 128.0 | 384.0 |
| MoE | 32 | 512 | 64 | f16 | 8 | 64.0 | 192.0 |
| MoE | 32 | 2048 | 32 | f32 | 64 | 256.0 | 768.0 |
| MoE | 32 | 2048 | 32 | f16 | 64 | 128.0 | 384.0 |
| MoE | 32 | 2048 | 64 | f32 | 32 | 512.0 | 1536.0 |
| MoE | 32 | 2048 | 64 | f16 | 32 | 256.0 | 768.0 |
| Dense | 48 | 512 | 32 | f32 | 16 | 96.0 | 288.0 |
| Dense | 48 | 512 | 32 | f16 | 16 | 48.0 | 144.0 |
| Dense | 48 | 512 | 64 | f32 | 8 | 192.0 | 576.0 |
| Dense | 48 | 512 | 64 | f16 | 8 | 96.0 | 288.0 |
| Dense | 48 | 2048 | 32 | f32 | 64 | 384.0 | 1152.0 |
| Dense | 48 | 2048 | 32 | f16 | 64 | 192.0 | 576.0 |
| Dense | 48 | 2048 | 64 | f32 | 32 | 768.0 | 2304.0 |
| Dense | 48 | 2048 | 64 | f16 | 32 | 384.0 | 1152.0 |

`3x Ai traffic` means one Ai write plus two Ai reads for two value slabs.

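Any row of the table can be regenerated from the formula. The sketch below is illustrative only: the function names are hypothetical, and the traffic model is the one stated above (one write plus one read per value slab), which ignores caching and any other memory traffic.

```python
import math

DTYPE_BYTES = {"f32": 4, "f16": 2}
MIB = 1024 * 1024


def ai_scratch_bytes(npl: int, heads: int, npp: int, bt: int, dtype: str) -> int:
    """npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)."""
    chunks = math.ceil(npp / bt)
    return npl * heads * chunks * bt * bt * DTYPE_BYTES[dtype]


def ai_traffic_bytes(scratch_bytes: int, value_slabs: int = 2) -> int:
    """One Ai write plus one Ai read per value slab (3x for two slabs)."""
    return scratch_bytes * (1 + value_slabs)


if __name__ == "__main__":
    for model, heads in (("MoE", 32), ("Dense", 48)):
        for npp in (512, 2048):
            for bt in (32, 64):
                for dtype in ("f32", "f16"):
                    scratch = ai_scratch_bytes(32, heads, npp, bt, dtype)
                    traffic = ai_traffic_bytes(scratch)
                    print(f"{model:5s} H={heads} npp={npp:4d} BT={bt} {dtype} "
                          f"scratch={scratch / MIB:6.1f} MiB "
                          f"traffic={traffic / MIB:7.1f} MiB")
```

Note that for a fixed `npp`, doubling `BT` halves the chunk count but quadruples the tile size, so scratch doubles; this is why the `BT=64` rows are twice the `BT=32` rows.
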
## Interpretation

The f32 `BT=32` scratch path is large but plausible:

- Peak scratch is 256 MiB for MoE and 384 MiB for dense at `npp=2048,npl=32`.
- Ai traffic is 768 MiB for MoE and 1152 MiB (1.125 GiB) for dense per GDN
  layer call.
- This is not free on LPDDR5x, but it is not automatically worse than
  recomputing A/Ai in every value slab.

The f16/BF16 Ai path halves scratch and traffic, but it should not come first:
Phase 10 and Phase 11 showed that correctness must be established before
performance. The first prototype should store Ai in f32, stay default-off, and
use md5/KL gates before trying a lossy Ai dtype.

## Decision

GO: Phase 13 should implement a default-off global-Ai scratch prototype.

Rationale:

- The only remaining C32 path that addresses Phase 10's failure is sharing A/Ai
  across value slabs.
- `BT=32` f32 scratch has acceptable peak memory for the existing GB10
  benchmark shapes.
- The implementation can be default-off and rejected cleanly if global scratch
  traffic or extra launch boundaries dominate.

Phase 13 constraints:

- Prototype only `BT=32`, f32 Ai, two `dv_tile=64` value slabs.
- Keep decode out via `GDN_CHUNK_MIN > 1`.
- Gate with `GATED_DELTA_NET`, canonical MoE/dense md5, and same-session A/B.
- If md5 changes, run KL before benchmarking.
- If the prototype is flat or slower, reject it and stop GDN kernel work on
  GB10; do not iterate into f16 Ai until f32 proves the schedule can win.
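The gating order in the Phase 13 constraints can be written down as a small decision function. This is a hypothetical sketch, not tooling from the repository: the `Verdict` names, the `min_gain` threshold used to define "flat", the choice to reject on a failed KL check, and the choice to reject when any single shape fails are all assumptions. The example numbers are the Phase 10 same-session A/B results recorded in these docs.

```python
from enum import Enum
from typing import Mapping, Optional


class Verdict(Enum):
    RUN_KL_FIRST = "md5 changed: run KL before benchmarking"
    REJECT = "reject and stop GDN kernel work on GB10"
    KEEP_DEFAULT_OFF = "keep as a default-off prototype"


def phase13_verdict(md5_matches: bool,
                    kl_passed: Optional[bool],
                    baseline: Mapping[str, float],
                    prototype: Mapping[str, float],
                    min_gain: float = 0.01) -> Verdict:
    """Apply the gates in order: md5, then KL if needed, then A/B throughput."""
    if not md5_matches:
        if kl_passed is None:
            return Verdict.RUN_KL_FIRST
        if not kl_passed:
            return Verdict.REJECT  # assumption: a failed KL gate is a rejection
    for shape, base in baseline.items():
        # "Flat or slower": the prototype must beat baseline by at least min_gain.
        if prototype[shape] < base * (1.0 + min_gain):
            return Verdict.REJECT
    return Verdict.KEEP_DEFAULT_OFF


# Phase 10 C32 slab numbers (higher is better): md5-clean but slower.
phase10_baseline = {"moe_2048": 2430.32, "dense_2048": 1019.25}
phase10_prototype = {"moe_2048": 2054.86, "dense_2048": 903.73}
print(phase13_verdict(True, None, phase10_baseline, phase10_prototype).value)
```

Fixing the threshold before running the A/B keeps the stop rule from being renegotiated after the numbers are in.
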
@@ -175,9 +175,14 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
| bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 |
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic |

Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close the 2.62x gap because vLLM's FLA blocked-solve is a more complete TC implementation.

Phase 12 caveat: this is not a shipped win. It authorizes only a default-off
`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work
on GB10 instead of iterating into f16 Ai or more local reorders.

### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady)
| Lever | What | Verdict | Key number |
|---|---|---|---|
@@ -174,6 +174,7 @@ products through tensor cores. The series chased that headroom.
| bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab |
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model |

**Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
@@ -185,6 +186,12 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly
that and is the shipped winner; it does not fully close the 2.62x because vLLM's
mature FLA blocked-solve is a more complete tensor-core implementation.

Post-record caveat: Phase 12 does not change the shipped verdict. It permits one
default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large
but not automatically disqualifying. If that prototype is flat or slower, GDN
kernel work on GB10 should stop rather than moving to f16 Ai or additional
local reorders.

### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact)

**Methodology correction - why every earlier decode decomposition was wrong.**

@@ -521,6 +521,34 @@ Artifacts:
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff`

### Phase 12 GDN shared-A/Ai cost-model update

Phase 12 scoped the next non-shortcut GDN path: compute f32 Ai once per
`(sequence, head, chunk)` and reuse it across two `dv_tile=64` value slabs.

Cost model:

- C16 full-width M5 uses `93,376 B` dynamic smem.
- C32 full-width would need `127,360 B`, which does not fit GB10.
- C32 slab64 fits at `94,592 B`, but Phase 10 showed it loses when A/T is
  recomputed per slab.
- For `BT=32`, f32 Ai scratch at `npp=2048,npl=32` is:
  - MoE H=32: `256 MiB`, with `768 MiB` total Ai write/read traffic.
  - Dense H=48: `384 MiB`, with `1152 MiB` total Ai write/read traffic.

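One way to picture "compute Ai once per `(sequence, head, chunk)` and reuse it across two slabs" is as a flat scratch buffer of `BT x BT` tiles. The sketch below is purely illustrative: the `[sequence][head][chunk]` tile ordering and all names are assumptions, not the layout the Phase 13 prototype will necessarily use.

```python
from itertools import product

F32_BYTES = 4


def ai_tile_offset(seq: int, head: int, chunk: int,
                   heads: int, chunks: int, bt: int) -> int:
    """Element offset of one BT x BT Ai tile in an assumed
    [sequence][head][chunk][BT][BT] row-major scratch buffer."""
    return ((seq * heads + head) * chunks + chunk) * bt * bt


def count_tile_accesses(npl: int, heads: int, chunks: int, value_slabs: int = 2):
    """Each tile is written once by the solve and read once per value slab."""
    writes = reads = 0
    for _seq, _head, _chunk in product(range(npl), range(heads), range(chunks)):
        writes += 1
        reads += value_slabs
    return writes, reads


# MoE benchmark shape: npl=32, H=32, npp=2048, BT=32 -> 64 chunks per sequence.
npl, heads, chunks, bt = 32, 32, 64, 32
last = ai_tile_offset(npl - 1, heads - 1, chunks - 1, heads, chunks, bt)
total_bytes = (last + bt * bt) * F32_BYTES
writes, reads = count_tile_accesses(npl, heads, chunks)
print(f"scratch = {total_bytes / 2**20:.1f} MiB, "
      f"tile accesses per write = {(writes + reads) / writes:.0f}x")
```

Under this layout each tile is contiguous, so both value slabs of a chunk read the same span of the buffer; whether that matches the eventual kernel is an open design choice for Phase 13.
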
Decision:

- **GO** to a default-off Phase 13 prototype, not a shipped patch.
- Scope: `GDN_GLOBAL_AI32=1`, `BT=32`, f32 Ai, two `dv_tile=64` slabs.
- Reject if same-session A/B is flat/slower. If rejected, stop GDN kernel work
  on GB10 rather than iterating into f16 Ai or more local reorders.

Docs:

- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`
- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`

---

# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)
