From adabd11919a5c7a4590b63a40deb7494440be7cf Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 01:38:51 +0000 Subject: [PATCH] docs(paged): scope GDN global Ai32 prototype Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan. Assisted-by: Codex:gpt-5 --- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 42 ++ .../docs/GDN_SHARED_AI_COST_MODEL.md | 142 +++++++ .../docs/PARITY_HANDOFF.md | 5 + .../docs/VLLM_PARITY_FINAL.md | 7 + .../docs/VLLM_PARITY_LEVER_MAP.md | 28 ++ ...6-07-01-gdn-global-ai-prototype-phase13.md | 398 ++++++++++++++++++ ...-07-01-gdn-shared-ai-cost-model-phase12.md | 332 +++++++++++++++ ...26-07-01-gdn-global-ai-prototype-design.md | 97 +++++ ...6-07-01-gdn-shared-ai-cost-model-design.md | 108 +++++ 9 files changed, 1159 insertions(+) create mode 100644 backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md create mode 100644 docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md create mode 100644 docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md create mode 100644 docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md create mode 100644 docs/superpowers/specs/2026-07-01-gdn-shared-ai-cost-model-design.md diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 1af131907..916c4f463 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -983,3 +983,45 @@ Conclusion: - The next GDN attempt should skip local scheduling-only changes and scope a true shared-A/Ai blocked-solve or global-scratch design, with an explicit scratch/synchronization cost model before coding. + +## Phase 12 GDN Shared-A/Ai Cost Model + +Phase 12 evaluated whether a real shared-A/Ai design is credible enough to +prototype after the C32 slab and QS-early shortcut rejections. + +Cost-model doc: + +- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md` + +Metadata artifact: + +- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt` + +Model dimensions: + +| Model | GDN layers | H | S_v | Metadata basis | +|-------|------------|---|-----|----------------| +| MoE | 30 inferred | 32 inferred | 128 | `ssm.inner_size=4096`, `ssm.state_size=128` | +| Dense | 48 inferred | 48 inferred | 128 | `ssm.inner_size=6144`, `ssm.state_size=128` | + +Dynamic-smem result for `S_v=128`: + +| Shape | Bytes | KiB | Fits GB10 dynamic smem? | +|-------|-------|-----|-------------------------| +| C16 full-width | 93,376 | 91.19 | yes | +| C32 full-width | 127,360 | 124.38 | no | +| C32 slab64 + U staging | 94,592 | 92.38 | yes | + +Ai scratch result at `npp=2048,npl=32,BT=32,f32`: + +| Model | Ai scratch MiB | 3x Ai traffic MiB | +|-------|----------------|-------------------| +| MoE | 256.0 | 768.0 | +| Dense | 384.0 | 1152.0 | + +Decision: + +- GO for a default-off Phase 13 global-Ai32 prototype. +- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`. +- The prototype must be rejected if it is flat or slower; do not iterate into + f16/BF16 Ai unless f32 proves the schedule can win. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md b/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md new file mode 100644 index 000000000..27094f5c8 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md @@ -0,0 +1,142 @@ +# GDN Shared-A/Ai Cost Model + +Phase 12 decides whether the next GDN prefill attempt should implement a +shared-A/Ai global-scratch prototype or stop GDN kernel work on GB10. + +## Reference Points + +llama.cpp: + +- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + - `gated_delta_net_chunked_cuda` + - `launch_gdn_chunked` + - `launch_gated_delta_net` + - `ggml_cuda_op_gated_delta_net` + +vLLM/FLA: + +- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py` + - `chunk_gated_delta_rule_fwd` +- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py` + - `solve_tril` + - `solve_tril_16x16_kernel` + - `merge_16x16_to_32x32_inverse_kernel` + - `merge_16x16_to_64x64_inverse_kernel` +- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py` + - `recompute_w_u_fwd` + +## Metadata + +DGX metadata artifact: + +- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt` + +GGUF metadata: + +| Model | Arch | Blocks | Full-attn interval | GDN layers | SSM inner | SSM state | GDN heads | +|-------|------|--------|--------------------|------------|-----------|-----------|-----------| +| MoE | `qwen35moe` | 41 | 4 | 30 inferred | 4096 | 128 | 32 inferred | +| Dense | `qwen35` | 64 | 4 | 48 inferred | 6144 | 128 | 48 inferred | + +Notes: + +- `GDN heads = ssm.inner_size / ssm.state_size`. +- MoE has one `nextn` layer; the serving/prefill stack uses the 40 normal + layers, with 30 GDN layers at interval 4. +- Dense has 64 layers, 48 GDN layers at interval 4. + +## Dynamic Shared Memory + +Formula: + +```text +C16 full-width current M5: + floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C + +C32 full-width: + floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C + +C32 slab64 with U staging: + floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C +``` + +For `S_v=128`: + +| Shape | Bytes | KiB | Fits GB10 dynamic smem? | +|-------|-------|-----|-------------------------| +| C16 full-width | 93,376 | 91.19 | yes | +| C32 full-width | 127,360 | 124.38 | no | +| C32 slab64 + U staging | 94,592 | 92.38 | yes | + +Implication: + +- C32 full-width cannot be a single current-style CTA on GB10. +- C32 only fits by splitting value columns or by changing state residency. +- Splitting value columns must share A/Ai or it repeats the Phase 10 failure. + +## Ai Scratch Size + +Formula: + +```text +Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype) +``` + +Benchmark shape: `npl=32`, `S_v=128`. + +| Model | H | npp | BT | Ai dtype | Chunks | Ai scratch MiB | 3x Ai traffic MiB | +|-------|---|-----|----|----------|--------|----------------|-------------------| +| MoE | 32 | 512 | 32 | f32 | 16 | 64.0 | 192.0 | +| MoE | 32 | 512 | 32 | f16 | 16 | 32.0 | 96.0 | +| MoE | 32 | 512 | 64 | f32 | 8 | 128.0 | 384.0 | +| MoE | 32 | 512 | 64 | f16 | 8 | 64.0 | 192.0 | +| MoE | 32 | 2048 | 32 | f32 | 64 | 256.0 | 768.0 | +| MoE | 32 | 2048 | 32 | f16 | 64 | 128.0 | 384.0 | +| MoE | 32 | 2048 | 64 | f32 | 32 | 512.0 | 1536.0 | +| MoE | 32 | 2048 | 64 | f16 | 32 | 256.0 | 768.0 | +| Dense | 48 | 512 | 32 | f32 | 16 | 96.0 | 288.0 | +| Dense | 48 | 512 | 32 | f16 | 16 | 48.0 | 144.0 | +| Dense | 48 | 512 | 64 | f32 | 8 | 192.0 | 576.0 | +| Dense | 48 | 512 | 64 | f16 | 8 | 96.0 | 288.0 | +| Dense | 48 | 2048 | 32 | f32 | 64 | 384.0 | 1152.0 | +| Dense | 48 | 2048 | 32 | f16 | 64 | 192.0 | 576.0 | +| Dense | 48 | 2048 | 64 | f32 | 32 | 768.0 | 2304.0 | +| Dense | 48 | 2048 | 64 | f16 | 32 | 384.0 | 1152.0 | + +`3x Ai traffic` means one Ai write plus two Ai reads for two value slabs. + +## Interpretation + +The f32 `BT=32` scratch path is large but plausible: + +- Peak scratch is 256 MiB for MoE and 384 MiB for dense at `npp=2048,npl=32`. +- Ai traffic is 768 MiB for MoE and 1.125 GiB for dense per GDN layer call. +- This is not free on LPDDR5x, but it is not automatically worse than + recomputing A/Ai in every value slab. + +The f16/BF16 Ai path halves traffic but should not be first because Phase 10 and +Phase 11 showed correctness must be established before performance. The first +prototype should store Ai in f32, stay default-off, and use md5/KL gates before +trying a lossy Ai dtype. + +## Decision + +GO: Phase 13 should implement a default-off global-Ai scratch prototype. + +Rationale: + +- The only remaining C32 path that addresses Phase 10's failure is sharing A/Ai + across value slabs. +- `BT=32` f32 scratch has acceptable peak memory for the existing GB10 + benchmark shapes. +- The implementation can be default-off and rejected cleanly if global scratch + traffic or extra launch boundaries dominate. + +Phase 13 constraints: + +- Prototype only `BT=32`, f32 Ai, two `dv_tile=64` value slabs. +- Keep decode out via `GDN_CHUNK_MIN > 1`. +- Gate with `GATED_DELTA_NET`, canonical MoE/dense md5, and same-session A/B. +- If md5 changes, run KL before benchmarking. +- If the prototype is flat or slower, reject it and stop GDN kernel work on + GB10; do not iterate into f16 Ai until f32 proves the schedule can win. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index dd4d52de3..750e7295e 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -175,9 +175,14 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga | bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates | | Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 | | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 | +| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation. +Phase 12 caveat: this is not a shipped win. It authorizes only a default-off +`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work +on GB10 instead of iterating into f16 Ai or more local reorders. + ### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady) | Lever | What | Verdict | Key number | |---|---|---|---| diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md index 0b7afa6a6..f8f955fde 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md @@ -174,6 +174,7 @@ products through tensor cores. The series chased that headroom. | bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan | | Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab | | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab | +| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model | **Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2) intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not @@ -185,6 +186,12 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly that and is the shipped winner; it does not fully close the 2.62x because vLLM's mature FLA blocked-solve is a more complete tensor-core implementation. +Post-record caveat: Phase 12 does not change the shipped verdict. It permits one +default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large +but not automatically disqualifying. If that prototype is flat or slower, GDN +kernel work on GB10 should stop rather than moving to f16 Ai or additional +local reorders. + ### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact) **Methodology correction - why every earlier decode decomposition was wrong.** diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index f4dfe78ce..59d410d9b 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -521,6 +521,34 @@ Artifacts: - `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/` - `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff` +### Phase 12 GDN shared-A/Ai cost-model update + +Phase 12 scoped the next non-shortcut GDN path: compute f32 Ai once per +`(sequence, head, chunk)` and reuse it across two `dv_tile=64` value slabs. + +Cost model: + +- C16 full-width M5 uses `93,376 B` dynamic smem. +- C32 full-width would need `127,360 B`, which does not fit GB10. +- C32 slab64 fits at `94,592 B`, but Phase 10 showed it loses when A/T is + recomputed per slab. +- For `BT=32`, f32 Ai scratch at `npp=2048,npl=32` is: + - MoE H=32: `256 MiB`, with `768 MiB` total Ai write/read traffic. + - Dense H=48: `384 MiB`, with `1152 MiB` total Ai write/read traffic. + +Decision: + +- **GO** to a default-off Phase 13 prototype, not a shipped patch. +- Scope: `GDN_GLOBAL_AI32=1`, `BT=32`, f32 Ai, two `dv_tile=64` slabs. +- Reject if same-session A/B is flat/slower. If rejected, stop GDN kernel work + on GB10 rather than iterating into f16 Ai or more local reorders. + +Docs: + +- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md` +- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md` +- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md` + --- # PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026) diff --git a/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md b/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md new file mode 100644 index 000000000..e7f99e9fc --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md @@ -0,0 +1,398 @@ +# GDN Global-Ai Prototype Phase 13 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Implement and test a default-off C32 GDN prefill prototype that computes f32 Ai once per chunk/head and reuses it across two value slabs. + +**Architecture:** The prototype adds one Ai precompute kernel plus one Ai-consuming chunked kernel in `gated_delta_net.cu`. Scratch is allocated from the existing ggml CUDA pool in `ggml_cuda_op_gated_delta_net`, scoped to the op, and only used when `GDN_GLOBAL_AI32=1`. + +**Tech Stack:** llama.cpp CUDA, ggml CUDA pool allocator, GB10 DGX benchmark harness, Qwen3.6 NVFP4 GGUF gates. + +--- + +## Guardrails + +- Default path remains current C16 M5. +- Candidate engages only with `GDN_GLOBAL_AI32=1`. +- Prototype only supports `S_v=128`, `C=32`, `DV_TILE=64`, f32 Ai. +- Keep `GDN_CHUNK_MIN > 1`; decode must never use this path. +- Do not add f16/BF16 Ai until f32 Ai wins. +- Do not generate a LocalAI patch unless the fork implementation passes gates + and improves S_PP. + +## Task 1: Preflight + +**Files:** +- Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` +- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/` + +- [ ] **Step 1: Check DGX is free** + +Run: + +```bash +ssh dgx.casa 'set -e +echo docker=$(docker ps -q | wc -l) +echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true) +echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l) +if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi' +``` + +Expected: + +```text +docker=0 +local_ai_worker=0 +compute=0 +FREE... +``` + +- [ ] **Step 2: Record provenance** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp status --short +git -C /home/mudler/_git/llama.cpp rev-parse HEAD +ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git rev-parse HEAD' +``` + +Expected: both llama.cpp trees are clean. + +- [ ] **Step 3: Create artifacts** + +Run: + +```bash +ssh dgx.casa 'mkdir -p /home/mudler/bench/phase13_gdn_global_ai32/{gates,ab,rejected}' +``` + +Expected: command exits 0. + +## Task 2: Add Ai Scratch Plumbing + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + +- [ ] **Step 1: Add env selector in `ggml_cuda_op_gated_delta_net`** + +Add after `keep_rs` is computed: + +```cpp +static const bool gdn_global_ai32 = []{ + const char * e = getenv("GDN_GLOBAL_AI32"); + return e && atoi(e) != 0; +}(); +``` + +- [ ] **Step 2: Allocate Ai scratch only for supported calls** + +Add: + +```cpp +float * ai32_d = nullptr; +int64_t ai32_chunks = 0; +ggml_cuda_pool_alloc ai32_scratch(ctx.pool()); +if (gdn_global_ai32 && !kda && !keep_rs && S_v == 128 && n_tokens > 1) { + ai32_chunks = (n_tokens + 31) / 32; + ai32_d = ai32_scratch.alloc((size_t) n_seqs * H * ai32_chunks * 32 * 32); +} +``` + +Pass `ai32_d` and `ai32_chunks` into the non-KDA/non-keep launch call only. +Other launch calls pass `nullptr, 0`. + +- [ ] **Step 3: Extend `launch_gated_delta_net` signature** + +Change the signature to include: + +```cpp +float * ai32_d, int64_t ai32_chunks, +``` + +before `float scale`. Thread these through all four call sites. + +## Task 3: Add Ai Precompute Kernel + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + +- [ ] **Step 1: Add `gdn_ai32_cuda`** + +Add a kernel near `gated_delta_net_chunked_cuda`: + +```cpp +template +__global__ void gdn_ai32_cuda( + const float * __restrict__ k, + const float * __restrict__ g, + const float * __restrict__ beta, + float * __restrict__ ai, + int64_t H, int64_t n_tokens, int64_t n_seqs, + int64_t sq1, int64_t sq2, int64_t sq3, + int64_t sb1, int64_t sb2, int64_t sb3, + uint3 neqk1_magic, uint3 rq3_magic) { + // CTA: blockIdx.x=head, blockIdx.y=seq, blockIdx.z=chunk. + // Shared: Kc[C*S_v], A[C*C], csh[C], gam[C], bet[C], KKsh[C*C]. + // Compute Kc, prefix csh/gam, KK, A, then exact f32 inverse into ai. +} +``` + +The inverse algorithm must match the existing M5 f32 inverse: + +```cpp +if (j < C) { + if (j < Cc) { + float x[C]; + for (int r = 0; r < C; r++) x[r] = 0.0f; + x[j] = 1.0f; + for (int r = j + 1; r < Cc; r++) { + float acc = 0.0f; + for (int m = j; m < r; m++) acc += A[r * C + m] * x[m]; + x[r] = -acc; + } + for (int r = 0; r < C; r++) ai[ai_base + r * C + j] = x[r]; + } else { + for (int r = 0; r < C; r++) ai[ai_base + r * C + j] = 0.0f; + } +} +``` + +Use fixed stride `C` in scratch, zeroing out-of-range tail rows/columns. + +- [ ] **Step 2: Add launcher** + +Add: + +```cpp +template +static void launch_gdn_ai32(..., float * ai32_d, int64_t ai32_chunks, cudaStream_t stream) +``` + +Launch grid: + +```cpp +dim3 grid_dims(H, n_seqs, ai32_chunks); +dim3 block_dims(S_v, 1, 1); +``` + +Dynamic smem: + +```cpp +((size_t) C * S_v + (size_t) C * C + (size_t) 3 * C + (size_t) C * C) * sizeof(float) +``` + +## Task 4: Add Ai-Consuming C32 Slab Kernel + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + +- [ ] **Step 1: Add `gated_delta_net_chunked_ai32_cuda`** + +Add a separate kernel rather than overloading the shipped M5 body: + +```cpp +template +__global__ void gated_delta_net_chunked_ai32_cuda( + const float * __restrict__ q, + const float * __restrict__ k, + const float * __restrict__ v, + const float * __restrict__ g, + const float * __restrict__ beta, + const float * __restrict__ curr_state, + float * __restrict__ dst, + const float * __restrict__ ai, + int64_t H, int64_t n_tokens, int64_t n_seqs, + int64_t sq1, int64_t sq2, int64_t sq3, + int64_t sv1, int64_t sv2, int64_t sv3, + int64_t sb1, int64_t sb2, int64_t sb3, + uint3 neqk1_magic, uint3 rq3_magic, + float scale, float * __restrict__ state_dst, + const int32_t * __restrict__ ids, int rs_head) { + // CTA: blockIdx.x=head, blockIdx.y=seq, blockIdx.z=value slab. + // C=32, DV_TILE=64. + // Load the full source state stride S_v*S_v but own only columns [slab*DV_TILE, +DV_TILE). + // For every chunk, load Kc/Qc/csh/gam/bet, build RHS, load Ai, apply U = Ai*RHS, + // build P from QK, compute O, update owned state columns, write owned state columns. +} +``` + +Use the Phase 10 tail-row fix: + +```cpp +Ud[j * C + t] = (t < Cc) ? staged_value : 0.0f; +``` + +and use full state stride for reads/writes: + +```cpp +(int64_t) seq * H * S_v * S_v + (int64_t) h_idx * S_v * S_v +``` + +- [ ] **Step 2: Add launcher** + +Add: + +```cpp +template +static void launch_gdn_chunked_ai32(..., const float * ai32_d, int64_t ai32_chunks, ...) +``` + +Launch grid: + +```cpp +dim3 grid_dims(H, n_seqs, S_v / DV_TILE); +dim3 block_dims(DV_TILE, 1, 1); +``` + +The smem formula must stay under the C32 slab Phase 10 budget: + +```cpp +((size_t) S_v * DV_TILE + (size_t) 2 * C * S_v + (size_t) DV_TILE * C + + (size_t) C * C + (size_t) 3 * C + (size_t) C * C + + (size_t) DV_TILE * C) * sizeof(float) +``` + +## Task 5: Route Candidate + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + +- [ ] **Step 1: Add route in `launch_gated_delta_net`** + +Before the existing `GDN_CHUNKED_LAUNCH` switch: + +```cpp +if (ai32_d != nullptr && ai32_chunks > 0 && S_v == 128 && n_tokens >= gdn_chunk_min) { + launch_gdn_ai32<128, 32>(...); + launch_gdn_chunked_ai32<128, 32, 64>(...); + return; +} +``` + +The route must require `!KDA && !keep_rs_t` via the existing template branch and +must not trigger for decode-sized calls. + +- [ ] **Step 2: Keep default path unchanged** + +Run: + +```bash +git diff -- ggml/src/ggml-cuda/gated_delta_net.cu +``` + +Check that default `GDN_TC=5` still launches `launch_gdn_chunked<128, 16, 4>`. + +## Task 6: Build and Correctness Gates + +**Files:** +- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/gates/` + +- [ ] **Step 1: Mirror and build** + +Run: + +```bash +rsync -a /home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu \ + dgx.casa:/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target test-backend-ops llama-completion llama-batched-bench -j 8' +``` + +Expected: build exits 0. + +- [ ] **Step 2: Run op gates** + +Run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin +ART=$HOME/bench/phase13_gdn_global_ai32/gates +./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_default.txt" 2>&1 +GDN_GLOBAL_AI32=1 GDN_TC=5 GDN_CHUNK_MIN=2 ./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_global_ai32.txt" 2>&1' +``` + +Expected: both logs show CUDA0 OK for all cases. + +- [ ] **Step 3: Run canonical md5 gates** + +Run default and candidate MoE/dense completion gates. Expected: + +```text +MoE 8cb0ce23777bf55f92f63d0292c756b0 +Dense 5951a5b4d624ce891e22ab5fca9bc439 +``` + +If candidate md5 differs, run the KL gate before benchmarking. + +## Task 7: Performance A/B + +**Files:** +- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/ab/` + +- [ ] **Step 1: Run same-session A/B** + +Run MoE and dense: + +```bash +LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1" +LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_GLOBAL_AI32=1 GGML_NO_BACKTRACE=1" +``` + +Use: + +```bash +./llama-batched-bench -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 +``` + +Expected: candidate improves S_PP without dense regression. + +- [ ] **Step 2: Decide** + +Accept only if: + +- op gate passes, +- md5 is canonical or KL-benign, +- MoE S_PP improves, +- dense S_PP does not regress outside noise. + +Reject if flat or slower. + +## Task 8: Mirror or Reject + +**Files:** +- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [ ] **Step 1: If accepted, commit fork patch and generate LocalAI patch** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp add ggml/src/ggml-cuda/gated_delta_net.cu +git -C /home/mudler/_git/llama.cpp commit -m "feat(cuda): add GDN global Ai32 prefill prototype" +git -C /home/mudler/_git/llama.cpp format-patch -1 HEAD --stdout \ + > backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-GDN-global-Ai32-prefill-prototype.patch +``` + +- [ ] **Step 2: If rejected, save diff and restore** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp diff -- ggml/src/ggml-cuda/gated_delta_net.cu \ + > /home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff +git -C /home/mudler/_git/llama.cpp checkout -- ggml/src/ggml-cuda/gated_delta_net.cu +ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git checkout -- ggml/src/ggml-cuda/gated_delta_net.cu' +``` + +- [ ] **Step 3: Commit LocalAI docs** + +Commit accepted patch/docs or rejected docs with: + +```bash +git commit -m "docs(paged): record GDN global Ai32 result" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md b/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md new file mode 100644 index 000000000..5ecc56fd6 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md @@ -0,0 +1,332 @@ +# GDN Shared-A/Ai Cost Model Phase 12 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Decide whether a shared-A/Ai C32 GDN design is worth implementing on GB10 before touching llama.cpp source. + +**Architecture:** Phase 12 is analysis-first and docs-only unless the cost model proves a credible win. It extracts model dimensions, computes dynamic-smem and global-scratch pressure, estimates traffic saved versus traffic added, and writes a go/no-go decision for a possible Phase 13 global-scratch prototype. + +**Tech Stack:** llama.cpp CUDA GDN kernel geometry, vLLM/FLA chunked GDN references, DGX GB10 benchmark artifacts, LocalAI parity docs. + +--- + +## Guardrails + +- Do not edit llama.cpp source in this phase. +- Do not generate a LocalAI patch file in this phase. +- Treat Phase 10 and Phase 11 as rejected; do not reopen C32 slab or QS-early. +- Use actual model metadata where available; if a dimension is inferred, mark it + as inferred. +- The output is a go/no-go decision, not an implementation patch. + +## Task 1: Gather Current Evidence + +**Files:** +- Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py` +- Read: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Artifact: `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/` + +- [x] **Step 1: Check tree state** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp status --short +git -C /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention status --short +``` + +Expected: + +- llama.cpp fork is clean. +- LocalAI worktree only has this Phase 12 docs work and untracked `.claude/`. + +- [x] **Step 2: Create artifact directory** + +Run: + +```bash +ssh dgx.casa 'mkdir -p /home/mudler/bench/phase12_gdn_shared_ai_cost_model' +``` + +Expected: command exits 0. + +- [x] **Step 3: Record reference function map** + +Record these llama.cpp insertion points in the result doc: + +```text +/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu + gated_delta_net_chunked_cuda + launch_gdn_chunked + launch_gated_delta_net + ggml_cuda_op_gated_delta_net +``` + +Record these vLLM reference functions: + +```text +/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py + chunk_gated_delta_rule_fwd +/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py + solve_tril + solve_tril_16x16_kernel + merge_16x16_to_32x32_inverse_kernel + merge_16x16_to_64x64_inverse_kernel +/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py + recompute_w_u_fwd +``` + +Result: recorded in +`backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`. + +## Task 2: Extract Model Dimensions + +**Files:** +- Artifact: `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt` + +- [x] **Step 1: Extract GGUF metadata** + +Run on DGX: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin +{ + echo "=== MoE ===" + ./llama-show-info -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf 2>/dev/null || ./llama-cli --show-info -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -n 0 2>/dev/null || true + echo "=== Dense ===" + ./llama-show-info -m /home/mudler/bench/q36-27b-nvfp4.gguf 2>/dev/null || ./llama-cli --show-info -m /home/mudler/bench/q36-27b-nvfp4.gguf -n 0 2>/dev/null || true +} > /home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt' +``` + +Expected: metadata file contains head count, layer count, and head dimension +or enough tensor metadata to infer them. + +Result: + +- Metadata artifact: + `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`. +- `llama-show-info` was not present in the DGX build, so a minimal read-only + GGUF metadata parser was used. + +- [x] **Step 2: Summarize GDN dimensions** + +Write a short table in the result doc: + +```text +Model | GDN layers | H | S_v | benchmark npl | npp | chunks at BT=32 | chunks at BT=64 +``` + +Use benchmark shapes: + +- `npl=32` +- `npp=512,2048` +- `S_v=128` + +If H cannot be read directly from metadata, infer it from source/model docs and +mark the row as inferred. + +Result: + +| Model | GDN layers | H | S_v | benchmark npl | npp | chunks at BT=32 | chunks at BT=64 | +|-------|------------|---|-----|---------------|-----|-----------------|-----------------| +| MoE | 30 inferred | 32 inferred | 128 | 32 | 512 | 16 | 8 | +| MoE | 30 inferred | 32 inferred | 128 | 32 | 2048 | 64 | 32 | +| Dense | 48 inferred | 48 inferred | 128 | 32 | 512 | 16 | 8 | +| Dense | 48 inferred | 48 inferred | 128 | 32 | 2048 | 64 | 32 | + +`H = ssm.inner_size / ssm.state_size`. + +## Task 3: Compute Smem and Scratch Costs + +**Files:** +- Create: `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md` + +- [x] **Step 1: Record dynamic-smem formulas** + +Use: + +```text +C16 full-width current M5: + floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C + +C32 full-width: + floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C + +C32 slab64 with U staging: + floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C +``` + +Expected values for `S_v=128`: + +```text +C16 full-width: 93,376 B / 91.19 KiB +C32 full-width: 127,360 B / 124.38 KiB +C32 slab64: 94,592 B / 92.38 KiB +``` + +- [x] **Step 2: Record Ai scratch formulas** + +Use: + +```text +Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype) +``` + +Compute for: + +- `BT=32`, f32 and f16/bf16 Ai. +- `BT=64`, f32 and f16/bf16 Ai. +- `npp=512` and `npp=2048`. + +- [x] **Step 3: Estimate extra global traffic** + +For a two-slab C32 design, estimate: + +```text +Ai write once = npl * H * nchunks * BT * BT * sizeof(Ai) +Ai read per slab = 2 * Ai write once +total Ai traffic = 3 * Ai write once +``` + +Record the estimate in MiB for every benchmark shape. + +- [x] **Step 4: Estimate work saved** + +Record that shared Ai saves duplicated A/T construction per second slab: + +```text +saved per chunk/head = one KK/QK-derived A/T solve/apply setup currently duplicated by C32 slab +not saved = KS, QS, U, P*U, state update, state traffic +``` + +Do not claim a speedup from this estimate alone. The result doc must say whether +the saved work is large enough to justify the scratch traffic and kernel +boundary risk. + +Result: recorded in +`backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`. +The f32 `BT=32` scratch path costs 256 MiB (MoE) and 384 MiB (dense) at +`npp=2048,npl=32`, with 768 MiB and 1.125 GiB of Ai traffic respectively. + +## Task 4: Go/No-Go Decision + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + +- [x] **Step 1: Write the decision** + +Use one of these exact decisions: + +```text +GO: Phase 13 should implement a default-off global-Ai scratch prototype. +``` + +or: + +```text +NO-GO: shared-A/Ai scratch is not credible on GB10; stop GDN kernel work here. +``` + +The decision must cite the scratch size and Ai traffic estimates. + +Decision: + +```text +GO: Phase 13 should implement a default-off global-Ai scratch prototype. +``` + +Rationale: the scratch/traffic cost is high enough to require strict gates, but +not high enough to reject without a default-off prototype. + +- [x] **Step 2: If GO, write Phase 13 scope** + +If GO, create: + +```text +docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md +docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md +``` + +The Phase 13 plan must include: + +- default-off env selector, +- scratch allocation strategy, +- op gate, +- canonical MoE/dense md5 gates, +- same-session A/B, +- rejection path. + +Result: + +- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`. +- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`. + +- [x] **Step 3: If NO-GO, update final records** + +If NO-GO, update: + +- `VLLM_PARITY_FINAL.md` +- `PARITY_HANDOFF.md` + +Record that GDN kernel work on GB10 is exhausted by evidence, not assumption. + +Result: not applicable because Phase 12 is GO. The final/handoff records are +not changed to close GDN work. + +## Task 5: Verification and Commit + +**Files:** +- Modify/create the files from Task 4. + +- [x] **Step 1: Verify docs** + +Run: + +```bash +git diff --check +git status --short +``` + +Expected: + +- no whitespace errors, +- only intended docs are modified plus untracked `.claude/`. + +Result: + +- `git diff --check` exited 0. +- `/home/mudler/_git/llama.cpp` was clean. +- DGX metadata artifact existed and contained MoE/dense GGUF metadata. + +- [ ] **Step 2: Commit docs** + +For GO: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +git add -f docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md \ + docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md \ + docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md +git commit -m "docs(paged): scope GDN shared-Ai prototype" \ + -m "Assisted-by: Codex:gpt-5" +``` + +For NO-GO: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md +git commit -m "docs(paged): close GDN shared-Ai cost model" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md b/docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md new file mode 100644 index 000000000..6ac3ea530 --- /dev/null +++ b/docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md @@ -0,0 +1,97 @@ +# GDN Global-Ai Prototype Design + +## Goal + +Prototype the only remaining plausible C32 GDN prefill path on GB10: compute +the per-chunk triangular inverse once into global f32 Ai scratch, then reuse it +from two `dv_tile=64` value-slab CTAs. + +## Scope + +The prototype is default-off and intentionally narrow: + +- `S_v=128` +- `BT=32` +- f32 Ai scratch +- two `dv_tile=64` value slabs +- non-KDA, final-state-only path matching the existing chunked M5 conditions +- no decode routing; `GDN_CHUNK_MIN` remains greater than 1 + +## Architecture + +The prototype splits current M5 work into two CUDA stages: + +1. `gdn_ai32_cuda`: one CTA per `(sequence, head, chunk)` computes the C32 + chunk-local triangular inverse `Ai = A^-1` and writes `[BT, BT]` f32 scratch. +2. `gdn_chunked_ai32_cuda`: one CTA per `(sequence, head, value slab)` loads Ai + for each chunk and performs the value-dependent work for its 64 output + columns. + +This mirrors the portable scheduling idea from vLLM/FLA without importing +CuteDSL, TMA, or BF16 storage. It directly tests whether sharing A/Ai across +slabs can beat the duplicated work that rejected Phase 10. + +## Scratch + +Ai scratch is sized: + +```text +n_seqs * H * ceil(n_tokens / 32) * 32 * 32 * sizeof(float) +``` + +At `npp=2048,npl=32`, this is: + +- MoE H=32: 256 MiB. +- Dense H=48: 384 MiB. + +Scratch allocation must use the existing ggml CUDA pool, be scoped to the op, +and be default-off behind an explicit env selector. + +## Selector + +Use: + +```text +GDN_GLOBAL_AI32=1 +``` + +The default path remains current C16 M5. The candidate only engages when: + +- `S_v == 128` +- `n_tokens >= GDN_CHUNK_MIN` +- `!KDA && !keep_rs_t` +- `GDN_GLOBAL_AI32=1` + +## Correctness + +The first implementation uses f32 Ai to maximize chances of md5 stability. It +must pass: + +- `test-backend-ops -b CUDA0 -o GATED_DELTA_NET` +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- Dense md5 `5951a5b4d624ce891e22ab5fca9bc439` + +If md5 changes, the prototype must stop for KL before any performance claim. + +## Performance + +Compare same-session against current M5: + +```text +LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 +``` + +versus: + +```text +LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_GLOBAL_AI32=1 +``` + +Run MoE and dense at `npp=512,2048`, `ntg=4`, `npl=32`. + +## Decision Rule + +Accept only if the prototype is correctness-safe and improves end-to-end S_PP. +Reject if it is flat or slower. If rejected, save the diff under +`/home/mudler/bench/phase13_gdn_global_ai32/rejected/` and do not add a LocalAI +patch. diff --git a/docs/superpowers/specs/2026-07-01-gdn-shared-ai-cost-model-design.md b/docs/superpowers/specs/2026-07-01-gdn-shared-ai-cost-model-design.md new file mode 100644 index 000000000..165dbe049 --- /dev/null +++ b/docs/superpowers/specs/2026-07-01-gdn-shared-ai-cost-model-design.md @@ -0,0 +1,108 @@ +# GDN Shared-A/Ai Cost Model Design + +## Context + +The last two GDN experiments closed the low-conflict shortcut space: + +- Phase 10 C32 slab M5 was md5-clean after tail-row zeroing but slower because + each value slab recomputed the per-chunk triangular work. +- Phase 11 QS-early M5 was md5-clean but still slower because moving `QS` did + not remove a tensor-core pass. + +The remaining algorithmic gap to vLLM/FLA is not another local reorder. vLLM +builds the per-chunk triangular object once, solves/inverts it once, and reuses +that result across the WY transform. llama.cpp's current C=16 M5 already +computes A/T once for the full value width inside one CTA. A wider chunk only +fits on GB10 if value columns are split into slabs, and slabs lose unless A/T +is shared across them. + +## Current Geometry + +For `S_v = 128` and f32 state: + +| Shape | Dynamic smem | +|-------|--------------| +| C16 full value width | 93,376 B / 91.19 KiB | +| C32 full value width | 127,360 B / 124.38 KiB | +| C32 with `dv_tile=64` plus U staging | 94,592 B / 92.38 KiB | + +GB10's available dynamic smem leaves enough room for C16 full-width and C32 +half-width, but not for C32 full-width. That makes a shared-A/Ai design the only +plausible C32 path. + +## Candidate Approaches + +### A. Global A/Ai Scratch Precompute + +Add a first kernel that computes `A` and `Ai` once per `(sequence, head, chunk)` +and materializes `Ai` in global scratch. A second kernel consumes `Ai` across +value slabs. + +Pros: + +- Directly targets the Phase 10 failure mode. +- Mirrors the portable part of vLLM/FLA's schedule. +- Keeps each value-slab CTA within the GB10 smem limit. + +Cons: + +- Adds at least one extra kernel boundary. +- Requires scratch allocation and lifetime management in ggml CUDA. +- Scratch is large at real batch sizes. At `npl=32`, `BT=32`, f32 Ai costs: + - H=40, T=2048: 320 MiB. + - H=48, T=2048: 384 MiB. + - H=64, T=2048: 512 MiB. +- Needs careful profiling because global scratch traffic can erase the saved + triangular recomputation. + +### B. Shared A/Ai Inside One CTA With Reduced State Residency + +Keep C32 in one CTA by moving some state or value scratch out of shared memory. + +Pros: + +- Avoids global Ai scratch and cross-kernel synchronization. +- Could keep the current single-kernel structure. + +Cons: + +- The f32 state alone is 64 KiB. Removing enough shared memory for C32 full + width likely means reading state from global during MMA tiles or reducing + state residency, which attacks the current M5 strength. +- Higher risk of lowering achieved bandwidth and breaking md5 via new ordering. + +### C. Stay C16 and Stop GDN Kernel Work on GB10 + +Accept C16 M5 as the local GB10 ceiling and redirect parity work to another +bucket or different hardware. + +Pros: + +- Avoids high-risk scratch and synchronization work. +- Matches Phase 10/11 evidence that shortcuts are now exhausted. + +Cons: + +- Leaves the GDN prefill gap open. +- Does not move toward vLLM prefill parity on GB10. + +## Recommended Phase 12 + +Run a cost-model and dry-design phase before any source patch. The phase should +produce a go/no-go decision for Approach A: + +1. Extract actual GDN head counts and chunk counts for the MoE and dense GGUFs. +2. Compute scratch sizes for `BT=32` and `BT=64` at the benchmark shapes. +3. Estimate extra global traffic: Ai write + Ai read per value slab. +4. Compare that traffic against the triangular recomputation saved by sharing + A/Ai across slabs. +5. Only if the model is plausible, write a Phase 13 implementation plan for a + default-off global-scratch prototype. + +## Decision Rule + +Proceed to implementation only if the model shows a credible net win at +`npp=2048, npl=32` without unreasonable memory growth. If the estimated scratch +traffic or kernel-boundary overhead is close to the saved work, record a no-go +and stop GDN kernel work on GB10 rather than adding a large patch that is likely +to be rejected.