docs(paged): scope GDN global Ai32 prototype

Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan. Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 01:38:51 +00:00
parent 1b5ae227eb
commit adabd11919
9 changed files with 1159 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -983,3 +983,45 @@ Conclusion:
 - The next GDN attempt should skip local scheduling-only changes and scope a
  true shared-A/Ai blocked-solve or global-scratch design, with an explicit
  scratch/synchronization cost model before coding.
+
+## Phase 12 GDN Shared-A/Ai Cost Model
+
+Phase 12 evaluated whether a real shared-A/Ai design is credible enough to
+prototype after the C32 slab and QS-early shortcut rejections.
+
+Cost-model doc:
+
+- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
+
+Metadata artifact:
+
+- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
+
+Model dimensions:
+
+| Model | GDN layers | H | S_v | Metadata basis |
+|-------|------------|---|-----|----------------|
+| MoE | 30 inferred | 32 inferred | 128 | `ssm.inner_size=4096`, `ssm.state_size=128` |
+| Dense | 48 inferred | 48 inferred | 128 | `ssm.inner_size=6144`, `ssm.state_size=128` |
+
+Dynamic-smem result for `S_v=128`:
+
+| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
+|-------|-------|-----|-------------------------|
+| C16 full-width | 93,376 | 91.19 | yes |
+| C32 full-width | 127,360 | 124.38 | no |
+| C32 slab64 + U staging | 94,592 | 92.38 | yes |
+
+Ai scratch result at `npp=2048,npl=32,BT=32,f32`:
+
+| Model | Ai scratch MiB | 3x Ai traffic MiB |
+|-------|----------------|-------------------|
+| MoE | 256.0 | 768.0 |
+| Dense | 384.0 | 1152.0 |
+
+Decision:
+
+- GO for a default-off Phase 13 global-Ai32 prototype.
+- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`.
+- The prototype must be rejected if it is flat or slower; do not iterate into
+  f16/BF16 Ai unless f32 proves the schedule can win.
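The dynamic-smem figures in the Phase 12 table can be re-derived from the formulas recorded in `GDN_SHARED_AI_COST_MODEL.md`. The following is a minimal sketch, not code from the fork. The helper names are hypothetical, `sizeof(float)` is assumed to be 4, and the cap is assumed to be the 99 KB figure quoted in `PARITY_HANDOFF.md`, read here as 99 * 1024 bytes.

```cpp
#include <cstddef>

static_assert(sizeof(float) == 4, "the byte counts below assume 4-byte floats");

// Full-width CTA: the whole S_v x S_v state plus per-chunk tiles (formula from the cost-model doc).
constexpr std::size_t gdn_smem_full_width_bytes(std::size_t S_v, std::size_t C) {
    return (S_v * S_v + 2 * C * S_v + S_v * C + C * C + 3 * C + 2 * C * C) * sizeof(float);
}

// Slab CTA: only dv_tile value columns of state are resident, plus a dv_tile x C U-staging tile.
constexpr std::size_t gdn_smem_slab_bytes(std::size_t S_v, std::size_t C, std::size_t dv_tile) {
    return (S_v * dv_tile + 2 * C * S_v + dv_tile * C + C * C + 3 * C + 2 * C * C + dv_tile * C)
           * sizeof(float);
}

// Assumption: the "99 KB" dynamic-smem cap is interpreted as 99 KiB.
constexpr std::size_t kAssumedSmemCapBytes = 99 * 1024;

constexpr bool gdn_fits_smem(std::size_t bytes) {
    return bytes <= kAssumedSmemCapBytes;
}

// Compile-time check against the table rows for S_v=128.
static_assert(gdn_smem_full_width_bytes(128, 16) == 93376, "C16 full-width row");
static_assert(!gdn_fits_smem(gdn_smem_full_width_bytes(128, 32)), "C32 full-width does not fit");
```

The block compiles as a standalone translation unit, and the `static_assert` lines are the check. Under the assumed cap, only the C32 full-width shape exceeds the budget, which matches the "Fits GB10 dynamic smem?" column.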
--- a/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md
@@ -0,0 +1,142 @@
+# GDN Shared-A/Ai Cost Model
+
+Phase 12 decides whether the next GDN prefill attempt should implement a
+shared-A/Ai global-scratch prototype or stop GDN kernel work on GB10.
+
+## Reference Points
+
+llama.cpp:
+
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
+  - `gated_delta_net_chunked_cuda`
+  - `launch_gdn_chunked`
+  - `launch_gated_delta_net`
+  - `ggml_cuda_op_gated_delta_net`
+
+vLLM/FLA:
+
+- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py`
+  - `chunk_gated_delta_rule_fwd`
+- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py`
+  - `solve_tril`
+  - `solve_tril_16x16_kernel`
+  - `merge_16x16_to_32x32_inverse_kernel`
+  - `merge_16x16_to_64x64_inverse_kernel`
+- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py`
+  - `recompute_w_u_fwd`
+
+## Metadata
+
+DGX metadata artifact:
+
+- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
+
+GGUF metadata:
+
+| Model | Arch | Blocks | Full-attn interval | GDN layers | SSM inner | SSM state | GDN heads |
+|-------|------|--------|--------------------|------------|-----------|-----------|-----------|
+| MoE | `qwen35moe` | 41 | 4 | 30 inferred | 4096 | 128 | 32 inferred |
+| Dense | `qwen35` | 64 | 4 | 48 inferred | 6144 | 128 | 48 inferred |
+
+Notes:
+
+- `GDN heads = ssm.inner_size / ssm.state_size`.
+- MoE has one `nextn` layer; the serving/prefill stack uses the 40 normal
+  layers, with 30 GDN layers at interval 4.
+- Dense has 64 layers, 48 GDN layers at interval 4.
+
+## Dynamic Shared Memory
+
+Formula:
+
+```text
+C16 full-width current M5 (C=16):
+  floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
+
+C32 full-width (same formula, C=32):
+  floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
+
+C32 slab64 with U staging (C=32, dv_tile=64):
+  floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C
+```
+
+For `S_v=128`:
+
+| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
+|-------|-------|-----|-------------------------|
+| C16 full-width | 93,376 | 91.19 | yes |
+| C32 full-width | 127,360 | 124.38 | no |
+| C32 slab64 + U staging | 94,592 | 92.38 | yes |
+
+Implication:
+
+- C32 full-width cannot be a single current-style CTA on GB10.
+- C32 only fits by splitting value columns or by changing state residency.
+- Splitting value columns must share A/Ai or it repeats the Phase 10 failure.
+
+## Ai Scratch Size
+
+Formula:
+
+```text
+Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)
+```
+
+Benchmark shape: `npl=32`, `S_v=128`.
+
+| Model | H | npp | BT | Ai dtype | Chunks | Ai scratch MiB | 3x Ai traffic MiB |
+|-------|---|-----|----|----------|--------|----------------|-------------------|
+| MoE | 32 | 512 | 32 | f32 | 16 | 64.0 | 192.0 |
+| MoE | 32 | 512 | 32 | f16 | 16 | 32.0 | 96.0 |
+| MoE | 32 | 512 | 64 | f32 | 8 | 128.0 | 384.0 |
+| MoE | 32 | 512 | 64 | f16 | 8 | 64.0 | 192.0 |
+| MoE | 32 | 2048 | 32 | f32 | 64 | 256.0 | 768.0 |
+| MoE | 32 | 2048 | 32 | f16 | 64 | 128.0 | 384.0 |
+| MoE | 32 | 2048 | 64 | f32 | 32 | 512.0 | 1536.0 |
+| MoE | 32 | 2048 | 64 | f16 | 32 | 256.0 | 768.0 |
+| Dense | 48 | 512 | 32 | f32 | 16 | 96.0 | 288.0 |
+| Dense | 48 | 512 | 32 | f16 | 16 | 48.0 | 144.0 |
+| Dense | 48 | 512 | 64 | f32 | 8 | 192.0 | 576.0 |
+| Dense | 48 | 512 | 64 | f16 | 8 | 96.0 | 288.0 |
+| Dense | 48 | 2048 | 32 | f32 | 64 | 384.0 | 1152.0 |
+| Dense | 48 | 2048 | 32 | f16 | 64 | 192.0 | 576.0 |
+| Dense | 48 | 2048 | 64 | f32 | 32 | 768.0 | 2304.0 |
+| Dense | 48 | 2048 | 64 | f16 | 32 | 384.0 | 1152.0 |
+
+`3x Ai traffic` counts one Ai write plus one Ai read by each of the two value slabs.
+
+## Interpretation
+
+The f32 `BT=32` scratch path is large but plausible:
+
+- Peak scratch is 256 MiB for MoE and 384 MiB for dense at `npp=2048,npl=32`.
+- Ai traffic is 768 MiB for MoE and 1152 MiB (1.125 GiB) for dense per GDN layer call.
+- This is not free on LPDDR5x, but it is not automatically worse than
+  recomputing A/Ai in every value slab.
+
+The f16/BF16 Ai path halves traffic but should not be first because Phase 10 and
+Phase 11 showed correctness must be established before performance. The first
+prototype should store Ai in f32, stay default-off, and use md5/KL gates before
+trying a lossy Ai dtype.
+
+## Decision
+
+GO: Phase 13 should implement a default-off global-Ai scratch prototype.
+
+Rationale:
+
+- The only remaining C32 path that addresses Phase 10's failure is sharing A/Ai
+  across value slabs.
+- `BT=32` f32 scratch has acceptable peak memory for the existing GB10
+  benchmark shapes.
+- The implementation can be default-off and rejected cleanly if global scratch
+  traffic or extra launch boundaries dominate.
+
+Phase 13 constraints:
+
+- Prototype only `BT=32`, f32 Ai, two `dv_tile=64` value slabs.
+- Keep decode out via `GDN_CHUNK_MIN > 1`.
+- Gate with the `GATED_DELTA_NET` op test, canonical MoE/dense md5, and same-session A/B.
+- If md5 changes, run KL before benchmarking.
+- If the prototype is flat or slower, reject it and stop GDN kernel work on
+  GB10; do not iterate into f16 Ai until f32 proves the schedule can win.
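The Ai scratch and traffic rows in the cost-model table follow directly from the scratch formula and the "one write plus one read per slab" traffic definition. The sketch below is illustrative only: the function names are hypothetical, and the traffic helper assumes each value slab reads the whole Ai scratch exactly once.

```cpp
#include <cstdint>

// Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype).
constexpr std::uint64_t gdn_ai_scratch_bytes(std::uint64_t npl, std::uint64_t H,
                                             std::uint64_t npp, std::uint64_t BT,
                                             std::uint64_t dtype_bytes) {
    return npl * H * ((npp + BT - 1) / BT) * BT * BT * dtype_bytes;
}

// Assumed traffic model: one write of the scratch plus one full read per value slab.
constexpr std::uint64_t gdn_ai_traffic_bytes(std::uint64_t scratch_bytes, std::uint64_t slabs) {
    return (1 + slabs) * scratch_bytes;
}

constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Compile-time check against the MoE npp=2048, BT=32, f32 row.
static_assert(gdn_ai_scratch_bytes(32, 32, 2048, 32, 4) == 256 * kMiB, "MoE scratch row");
static_assert(gdn_ai_traffic_bytes(256 * kMiB, 2) == 768 * kMiB, "MoE 3x traffic row");
```

Doubling `BT` halves the chunk count but quadruples the per-chunk tile, which is why the `BT=64` rows are twice the size of the matching `BT=32` rows.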
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -175,9 +175,14 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
 | bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
 | Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
 | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 |
+| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic |

 Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation.

+Phase 12 caveat: this is not a shipped win. It authorizes only a default-off
+`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work
+on GB10 instead of iterating into f16 Ai or more local reorders.
+
 ### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady)
 | Lever | What | Verdict | Key number |
 |---|---|---|---|
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -174,6 +174,7 @@ products through tensor cores. The series chased that headroom.
 | bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
 | Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
 | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab |
+| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model |

 **Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
 intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
@@ -185,6 +186,12 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly
 that and is the shipped winner; it does not fully close the 2.62x because vLLM's
 mature FLA blocked-solve is a more complete tensor-core implementation.

+Post-record caveat: Phase 12 does not change the shipped verdict. It permits one
+default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large
+but not automatically disqualifying. If that prototype is flat or slower, GDN
+kernel work on GB10 should stop rather than moving to f16 Ai or additional
+local reorders.
+
 ### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact)

 **Methodology correction - why every earlier decode decomposition was wrong.**
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -521,6 +521,34 @@ Artifacts:
 - `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/`
 - `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff`

+### Phase 12 GDN shared-A/Ai cost-model update
+
+Phase 12 scoped the next non-shortcut GDN path: compute f32 Ai once per
+`(sequence, head, chunk)` and reuse it across two `dv_tile=64` value slabs.
+
+Cost model:
+
+- C16 full-width M5 uses `93,376 B` dynamic smem.
+- C32 full-width would need `127,360 B`, which does not fit GB10.
+- C32 slab64 fits at `94,592 B`, but Phase 10 showed it loses when A/T is
+  recomputed per slab.
+- For `BT=32`, f32 Ai scratch at `npp=2048,npl=32` is:
+  - MoE H=32: `256 MiB`, with `768 MiB` total Ai write/read traffic.
+  - Dense H=48: `384 MiB`, with `1152 MiB` total Ai write/read traffic.
+
+Decision:
+
+- **GO** to a default-off Phase 13 prototype, not a shipped patch.
+- Scope: `GDN_GLOBAL_AI32=1`, `BT=32`, f32 Ai, two `dv_tile=64` slabs.
+- Reject if same-session A/B is flat/slower. If rejected, stop GDN kernel work
+  on GB10 rather than iterating into f16 Ai or more local reorders.
+
+Docs:
+
+- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
+- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`
+- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`
+
 ---

 # PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)
--- a/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md
+++ b/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md
@@ -0,0 +1,398 @@
+# GDN Global-Ai Prototype Phase 13 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Implement and test a default-off C32 GDN prefill prototype that computes f32 Ai once per chunk/head and reuses it across two value slabs.
+
+**Architecture:** The prototype adds one Ai precompute kernel plus one Ai-consuming chunked kernel in `gated_delta_net.cu`. Scratch is allocated from the existing ggml CUDA pool in `ggml_cuda_op_gated_delta_net`, scoped to the op, and only used when `GDN_GLOBAL_AI32=1`.
+
+**Tech Stack:** llama.cpp CUDA, ggml CUDA pool allocator, GB10 DGX benchmark harness, Qwen3.6 NVFP4 GGUF gates.
+
+---
+
+## Guardrails
+
+- Default path remains current C16 M5.
+- Candidate engages only with `GDN_GLOBAL_AI32=1`.
+- Prototype only supports `S_v=128`, `C=32`, `DV_TILE=64`, f32 Ai.
+- Keep `GDN_CHUNK_MIN > 1`; decode must never use this path.
+- Do not add f16/BF16 Ai until f32 Ai wins.
+- Do not generate a LocalAI patch unless the fork implementation passes gates
+  and improves S_PP.
+
+## Task 1: Preflight
+
+**Files:**
+- Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
+- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/`
+
+- [ ] **Step 1: Check DGX is free**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -e
+echo docker=$(docker ps -q | wc -l)
+echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true)
+echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l)
+if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi'
+```
+
+Expected:
+
+```text
+docker=0
+local_ai_worker=0
+compute=0
+FREE...
+```
+
+- [ ] **Step 2: Record provenance**
+
+Run:
+
+```bash
+git -C /home/mudler/_git/llama.cpp status --short
+git -C /home/mudler/_git/llama.cpp rev-parse HEAD
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git rev-parse HEAD'
+```
+
+Expected: both llama.cpp trees are clean.
+
+- [ ] **Step 3: Create artifacts**
+
+Run:
+
+```bash
+ssh dgx.casa 'mkdir -p /home/mudler/bench/phase13_gdn_global_ai32/{gates,ab,rejected}'
+```
+
+Expected: command exits 0.
+
+## Task 2: Add Ai Scratch Plumbing
+
+**Files:**
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
+
+- [ ] **Step 1: Add env selector in `ggml_cuda_op_gated_delta_net`**
+
+Add after `keep_rs` is computed:
+
+```cpp
+static const bool gdn_global_ai32 = []{
+    const char * e = getenv("GDN_GLOBAL_AI32");
+    return e && atoi(e) != 0;
+}();
+```
+
+- [ ] **Step 2: Allocate Ai scratch only for supported calls**
+
+Add:
+
+```cpp
+float * ai32_d = nullptr;
+int64_t ai32_chunks = 0;
+ggml_cuda_pool_alloc<float> ai32_scratch(ctx.pool());
+if (gdn_global_ai32 && !kda && !keep_rs && S_v == 128 && n_tokens > 1) {
+    ai32_chunks = (n_tokens + 31) / 32;
+    ai32_d = ai32_scratch.alloc((size_t) n_seqs * H * ai32_chunks * 32 * 32);
+}
+```
+
+Pass `ai32_d` and `ai32_chunks` into the non-KDA/non-keep launch call only.
+Other launch calls pass `nullptr, 0`.
+
+- [ ] **Step 3: Extend `launch_gated_delta_net` signature**
+
+Change the signature to include:
+
+```cpp
+float * ai32_d, int64_t ai32_chunks,
+```
+
+before `float scale`. Thread these through all four call sites.
+
+## Task 3: Add Ai Precompute Kernel
+
+**Files:**
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
+
+- [ ] **Step 1: Add `gdn_ai32_cuda`**
+
+Add a kernel near `gated_delta_net_chunked_cuda`:
+
+```cpp
+template <int S_v, int C>
+__global__ void gdn_ai32_cuda(
+        const float * __restrict__ k,
+        const float * __restrict__ g,
+        const float * __restrict__ beta,
+        float * __restrict__ ai,
+        int64_t H, int64_t n_tokens, int64_t n_seqs,
+        int64_t sq1, int64_t sq2, int64_t sq3,
+        int64_t sb1, int64_t sb2, int64_t sb3,
+        uint3 neqk1_magic, uint3 rq3_magic) {
+    // CTA: blockIdx.x=head, blockIdx.y=seq, blockIdx.z=chunk.
+    // Shared: Kc[C*S_v], A[C*C], csh[C], gam[C], bet[C], KKsh[C*C].
+    // Compute Kc, prefix csh/gam, KK, A, then exact f32 inverse into ai.
+}
+```
+
+The inverse algorithm must match the existing M5 f32 inverse:
+
+```cpp
+if (j < C) {
+    if (j < Cc) {
+        float x[C];
+        for (int r = 0; r < C; r++) x[r] = 0.0f;
+        x[j] = 1.0f;
+        for (int r = j + 1; r < Cc; r++) {
+            float acc = 0.0f;
+            for (int m = j; m < r; m++) acc += A[r * C + m] * x[m];
+            x[r] = -acc;
+        }
+        for (int r = 0; r < C; r++) ai[ai_base + r * C + j] = x[r];
+    } else {
+        for (int r = 0; r < C; r++) ai[ai_base + r * C + j] = 0.0f;
+    }
+}
+```
+
+Use fixed stride `C` in scratch, zeroing out-of-range tail rows/columns.
+
+- [ ] **Step 2: Add launcher**
+
+Add:
+
+```cpp
+template <int S_v, int C>
+static void launch_gdn_ai32(..., float * ai32_d, int64_t ai32_chunks, cudaStream_t stream)
+```
+
+Launch grid:
+
+```cpp
+dim3 grid_dims(H, n_seqs, ai32_chunks);
+dim3 block_dims(S_v, 1, 1);
+```
+
+Dynamic smem:
+
+```cpp
+((size_t) C * S_v + (size_t) C * C + (size_t) 3 * C + (size_t) C * C) * sizeof(float)
+```
+
+## Task 4: Add Ai-Consuming C32 Slab Kernel
+
+**Files:**
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
+
+- [ ] **Step 1: Add `gated_delta_net_chunked_ai32_cuda`**
+
+Add a separate kernel rather than overloading the shipped M5 body:
+
+```cpp
+template <int S_v, int C, int DV_TILE>
+__global__ void gated_delta_net_chunked_ai32_cuda(
+        const float * __restrict__ q,
+        const float * __restrict__ k,
+        const float * __restrict__ v,
+        const float * __restrict__ g,
+        const float * __restrict__ beta,
+        const float * __restrict__ curr_state,
+        float * __restrict__ dst,
+        const float * __restrict__ ai,
+        int64_t H, int64_t n_tokens, int64_t n_seqs,
+        int64_t sq1, int64_t sq2, int64_t sq3,
+        int64_t sv1, int64_t sv2, int64_t sv3,
+        int64_t sb1, int64_t sb2, int64_t sb3,
+        uint3 neqk1_magic, uint3 rq3_magic,
+        float scale, float * __restrict__ state_dst,
+        const int32_t * __restrict__ ids, int rs_head) {
+    // CTA: blockIdx.x=head, blockIdx.y=seq, blockIdx.z=value slab.
+    // C=32, DV_TILE=64.
+    // Load the full source state stride S_v*S_v but own only columns [slab*DV_TILE, +DV_TILE).
+    // For every chunk, load Kc/Qc/csh/gam/bet, build RHS, load Ai, apply U = Ai*RHS,
+    // build P from QK, compute O, update owned state columns, write owned state columns.
+}
+```
+
+Use the Phase 10 tail-row fix:
+
+```cpp
+Ud[j * C + t] = (t < Cc) ? staged_value : 0.0f;
+```
+
+and use full state stride for reads/writes:
+
+```cpp
+(int64_t) seq * H * S_v * S_v + (int64_t) h_idx * S_v * S_v
+```
+
+- [ ] **Step 2: Add launcher**
+
+Add:
+
+```cpp
+template <int S_v, int C, int DV_TILE>
+static void launch_gdn_chunked_ai32(..., const float * ai32_d, int64_t ai32_chunks, ...)
+```
+
+Launch grid:
+
+```cpp
+dim3 grid_dims(H, n_seqs, S_v / DV_TILE);
+dim3 block_dims(DV_TILE, 1, 1);
+```
+
+The smem formula must stay under the C32 slab Phase 10 budget:
+
+```cpp
+((size_t) S_v * DV_TILE + (size_t) 2 * C * S_v + (size_t) DV_TILE * C
+ + (size_t) C * C + (size_t) 3 * C + (size_t) C * C
+ + (size_t) DV_TILE * C) * sizeof(float)
+```
+
+## Task 5: Route Candidate
+
+**Files:**
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
+
+- [ ] **Step 1: Add route in `launch_gated_delta_net`**
+
+Before the existing `GDN_CHUNKED_LAUNCH` switch:
+
+```cpp
+if (ai32_d != nullptr && ai32_chunks > 0 && S_v == 128 && n_tokens >= gdn_chunk_min) {
+    launch_gdn_ai32<128, 32>(...);
+    launch_gdn_chunked_ai32<128, 32, 64>(...);
+    return;
+}
+```
+
+The route must require `!KDA && !keep_rs_t` via the existing template branch and
+must not trigger for decode-sized calls.
+
+- [ ] **Step 2: Keep default path unchanged**
+
+Run:
+
+```bash
+git diff -- ggml/src/ggml-cuda/gated_delta_net.cu
+```
+
+Check that default `GDN_TC=5` still launches `launch_gdn_chunked<128, 16, 4>`.
+
+## Task 6: Build and Correctness Gates
+
+**Files:**
+- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/gates/`
+
+- [ ] **Step 1: Mirror and build**
+
+Run:
+
+```bash
+rsync -a /home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu \
+  dgx.casa:/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target test-backend-ops llama-completion llama-batched-bench -j 8'
+```
+
+Expected: build exits 0.
+
+- [ ] **Step 2: Run op gates**
+
+Run:
+
+```bash
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
+ART=$HOME/bench/phase13_gdn_global_ai32/gates
+./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_default.txt" 2>&1
+GDN_GLOBAL_AI32=1 GDN_TC=5 GDN_CHUNK_MIN=2 ./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_global_ai32.txt" 2>&1'
+```
+
+Expected: both logs show CUDA0 OK for all cases.
+
+- [ ] **Step 3: Run canonical md5 gates**
+
+Run default and candidate MoE/dense completion gates. Expected:
+
+```text
+MoE   8cb0ce23777bf55f92f63d0292c756b0
+Dense 5951a5b4d624ce891e22ab5fca9bc439
+```
+
+If candidate md5 differs, run the KL gate before benchmarking.
+
+## Task 7: Performance A/B
+
+**Files:**
+- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
+
+- [ ] **Step 1: Run same-session A/B**
+
+Run MoE and dense:
+
+```bash
+LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1"
+LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_GLOBAL_AI32=1 GGML_NO_BACKTRACE=1"
+```
+
+Use:
+
+```bash
+./llama-batched-bench -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32
+```
+
+Expected: candidate improves S_PP without dense regression.
+
+- [ ] **Step 2: Decide**
+
+Accept only if:
+
+- op gate passes,
+- md5 is canonical or KL-benign,
+- MoE S_PP improves,
+- dense S_PP does not regress outside noise.
+
+Reject if flat or slower.
+
+## Task 8: Mirror or Reject
+
+**Files:**
+- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+
+- [ ] **Step 1: If accepted, commit fork patch and generate LocalAI patch**
+
+Run:
+
+```bash
+git -C /home/mudler/_git/llama.cpp add ggml/src/ggml-cuda/gated_delta_net.cu
+git -C /home/mudler/_git/llama.cpp commit -m "feat(cuda): add GDN global Ai32 prefill prototype"
+git -C /home/mudler/_git/llama.cpp format-patch -1 HEAD --stdout \
+  > backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-GDN-global-Ai32-prefill-prototype.patch
+```
+
+- [ ] **Step 2: If rejected, save diff and restore**
+
+Run:
+
+```bash
+git -C /home/mudler/_git/llama.cpp diff -- ggml/src/ggml-cuda/gated_delta_net.cu \
+  > /home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff
+git -C /home/mudler/_git/llama.cpp checkout -- ggml/src/ggml-cuda/gated_delta_net.cu
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git checkout -- ggml/src/ggml-cuda/gated_delta_net.cu'
+```
+
+- [ ] **Step 3: Commit LocalAI docs**
+
+Commit accepted patch/docs or rejected docs with:
+
+```bash
+git commit -m "docs(paged): record GDN global Ai32 result" \
+  -m "Assisted-by: Codex:gpt-5"
+```
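For the correctness gates, a small host-side reference for the Ai tile may be useful when comparing scratch contents. The sketch below is a hypothetical CPU helper, not code from `gated_delta_net.cu`. It restates the inverse loop from Task 3 and assumes that loop is forward substitution for the inverse of `I + L`, where `L` is the strictly lower-triangular part of `A`. That reading is inferred from the snippet, which only reads `A[r * C + m]` for `m < r`. `AiTile`, `gdn_ai_reference`, and `cc` (the valid row count of a tail chunk) are illustrative names.

```cpp
#include <cstddef>

// Row-major C x C tile with fixed stride C, mirroring the scratch layout in the plan.
template <int C>
struct AiTile {
    float v[C * C];
};

// Column-by-column forward substitution over the leading cc x cc block.
// Rows and columns at index >= cc stay zero, matching the tail-zeroing rule.
template <int C>
constexpr AiTile<C> gdn_ai_reference(const AiTile<C> & A, int cc) {
    AiTile<C> out{};  // value-initialized: every entry starts at zero
    for (int j = 0; j < cc; j++) {
        float x[C] = {};
        x[j] = 1.0f;
        for (int r = j + 1; r < cc; r++) {
            float acc = 0.0f;
            for (int m = j; m < r; m++) acc += A.v[r * C + m] * x[m];
            x[r] = -acc;
        }
        for (int r = 0; r < C; r++) out.v[r * C + j] = x[r];
    }
    return out;
}

// 2x2 demo: inverse of [[1, 0], [3, 1]] is [[1, 0], [-3, 1]].
static constexpr AiTile<2> kDemoA{{0.0f, 0.0f, 3.0f, 0.0f}};
static_assert(gdn_ai_reference<2>(kDemoA, 2).v[2] == -3.0f, "off-diagonal entry of the inverse");
```

Because the helper is `constexpr`, small cases can be checked at compile time. A runtime comparison against the device scratch would need a tolerance policy, which the plan leaves to the md5/KL gates.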
--- a/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md
+++ b/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md
@@ -0,0 +1,332 @@
+# GDN Shared-A/Ai Cost Model Phase 12 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Decide whether a shared-A/Ai C32 GDN design is worth implementing on GB10 before touching llama.cpp source.
+
+**Architecture:** Phase 12 is analysis-first and docs-only unless the cost model proves a credible win. It extracts model dimensions, computes dynamic-smem and global-scratch pressure, estimates traffic saved versus traffic added, and writes a go/no-go decision for a possible Phase 13 global-scratch prototype.
+
+**Tech Stack:** llama.cpp CUDA GDN kernel geometry, vLLM/FLA chunked GDN references, DGX GB10 benchmark artifacts, LocalAI parity docs.
+
+---
+
+## Guardrails
+
+- Do not edit llama.cpp source in this phase.
+- Do not generate a LocalAI patch file in this phase.
+- Treat Phase 10 and Phase 11 as rejected; do not reopen C32 slab or QS-early.
+- Use actual model metadata where available; if a dimension is inferred, mark it
+  as inferred.
+- The output is a go/no-go decision, not an implementation patch.
+
+## Task 1: Gather Current Evidence
+
+**Files:**
+- Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
+- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py`
+- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py`
+- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py`
+- Read: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Artifact: `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/`
+
+- [x] **Step 1: Check tree state**
+
+Run:
+
+```bash
+git -C /home/mudler/_git/llama.cpp status --short
+git -C /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention status --short
+```
+
+Expected:
+
+- llama.cpp fork is clean.
+- LocalAI worktree only has this Phase 12 docs work and untracked `.claude/`.
+
+- [x] **Step 2: Create artifact directory**
+
+Run:
+
+```bash
+ssh dgx.casa 'mkdir -p /home/mudler/bench/phase12_gdn_shared_ai_cost_model'
+```
+
+Expected: command exits 0.
+
+- [x] **Step 3: Record reference function map**
+
+Record these llama.cpp insertion points in the result doc:
+
+```text
+/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu
+  gated_delta_net_chunked_cuda
+  launch_gdn_chunked
+  launch_gated_delta_net
+  ggml_cuda_op_gated_delta_net
+```
+
+Record these vLLM reference functions:
+
+```text
+/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py
+  chunk_gated_delta_rule_fwd
+/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py
+  solve_tril
+  solve_tril_16x16_kernel
+  merge_16x16_to_32x32_inverse_kernel
+  merge_16x16_to_64x64_inverse_kernel
+/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py
+  recompute_w_u_fwd
+```
+
+Result: recorded in
+`backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`.
+
+## Task 2: Extract Model Dimensions
+
+**Files:**
+- Artifact: `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
+
+- [x] **Step 1: Extract GGUF metadata**
+
+Run on DGX:
+
+```bash
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
+{
+  echo "=== MoE ==="
+  ./llama-show-info -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf 2>/dev/null || ./llama-cli --show-info -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -n 0 2>/dev/null || true
+  echo "=== Dense ==="
+  ./llama-show-info -m /home/mudler/bench/q36-27b-nvfp4.gguf 2>/dev/null || ./llama-cli --show-info -m /home/mudler/bench/q36-27b-nvfp4.gguf -n 0 2>/dev/null || true
+} > /home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt'
+```
+
+Expected: metadata file contains head count, layer count, and head dimension
+or enough tensor metadata to infer them.
+
+Result:
+
+- Metadata artifact:
+  `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`.
+- `llama-show-info` was not present in the DGX build, so a minimal read-only
+  GGUF metadata parser was used.
+
+- [x] **Step 2: Summarize GDN dimensions**
+
+Write a short table in the result doc:
+
+```text
+Model | GDN layers | H | S_v | benchmark npl | npp | chunks at BT=32 | chunks at BT=64
+```
+
+Use benchmark shapes:
+
+- `npl=32`
+- `npp=512,2048`
+- `S_v=128`
+
+If H cannot be read directly from metadata, infer it from source/model docs and
+mark the row as inferred.
+
+Result:
+
+| Model | GDN layers | H | S_v | benchmark npl | npp | chunks at BT=32 | chunks at BT=64 |
+|-------|------------|---|-----|---------------|-----|-----------------|-----------------|
+| MoE | 30 inferred | 32 inferred | 128 | 32 | 512 | 16 | 8 |
+| MoE | 30 inferred | 32 inferred | 128 | 32 | 2048 | 64 | 32 |
+| Dense | 48 inferred | 48 inferred | 128 | 32 | 512 | 16 | 8 |
+| Dense | 48 inferred | 48 inferred | 128 | 32 | 2048 | 64 | 32 |
+
+`H = ssm.inner_size / ssm.state_size`.
+
+## Task 3: Compute Smem and Scratch Costs
+
+**Files:**
+- Create: `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
+
+- [x] **Step 1: Record dynamic-smem formulas**
+
+Use:
+
+```text
+C16 full-width current M5 (C=16):
+  floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
+
+C32 full-width (same formula, C=32):
+  floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
+
+C32 slab64 with U staging (C=32, dv_tile=64):
+  floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C
+```
+
+Expected values for `S_v=128`:
+
+```text
+C16 full-width:  93,376 B / 91.19 KiB
+C32 full-width: 127,360 B / 124.38 KiB
+C32 slab64:      94,592 B / 92.38 KiB
+```
+
+- [x] **Step 2: Record Ai scratch formulas**
+
+Use:
+
+```text
+Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)
+```
+
+Compute for:
+
+- `BT=32`, f32 and f16/bf16 Ai.
+- `BT=64`, f32 and f16/bf16 Ai.
+- `npp=512` and `npp=2048`.
+
+- [x] **Step 3: Estimate extra global traffic**
+
+For a two-slab C32 design, estimate:
+
+```text
+Ai write once = npl * H * nchunks * BT * BT * sizeof(Ai)
+Ai reads, both slabs = 2 * Ai write once
+total Ai traffic = 3 * Ai write once
+```
+
+Record the estimate in MiB for every benchmark shape.
+
+- [x] **Step 4: Estimate work saved**
+
+Record that shared Ai saves duplicated A/T construction per second slab:
+
+```text
+saved per chunk/head = one KK/QK-derived A/T solve/apply setup currently duplicated by C32 slab
+not saved = KS, QS, U, P*U, state update, state traffic
+```
+
+Do not claim a speedup from this estimate alone. The result doc must say whether
+the saved work is large enough to justify the scratch traffic and kernel
+boundary risk.
+
+Result: recorded in
+`backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`.
+The f32 `BT=32` scratch path costs 256 MiB (MoE) and 384 MiB (dense) at
+`npp=2048,npl=32`, with 768 MiB and 1152 MiB (1.125 GiB) of Ai traffic respectively.
+
+## Task 4: Go/No-Go Decision
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+
+- [x] **Step 1: Write the decision**
+
+Use one of these exact decisions:
+
+```text
+GO: Phase 13 should implement a default-off global-Ai scratch prototype.
+```
+
+or:
+
+```text
+NO-GO: shared-A/Ai scratch is not credible on GB10; stop GDN kernel work here.
+```
+
+The decision must cite the scratch size and Ai traffic estimates.
+
+Decision:
+
+```text
+GO: Phase 13 should implement a default-off global-Ai scratch prototype.
+```
+
+Rationale: the scratch/traffic cost is high enough to require strict gates, but
+not high enough to reject without a default-off prototype.
+
+- [x] **Step 2: If GO, write Phase 13 scope**
+
+If GO, create:
+
+```text
+docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md
+docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md
+```
+
+The Phase 13 plan must include:
+
+- default-off env selector,
+- scratch allocation strategy,
+- op gate,
+- canonical MoE/dense md5 gates,
+- same-session A/B,
+- rejection path.
+
+Result:
+
+- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`.
+- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`.
+
+- [x] **Step 3: If NO-GO, update final records**
+
+If NO-GO, update:
+
+- `VLLM_PARITY_FINAL.md`
+- `PARITY_HANDOFF.md`
+
+Record that GDN kernel work on GB10 is exhausted by evidence, not assumption.
+
+Result: not applicable because Phase 12 is GO. The final/handoff records are
+not changed to close GDN work.
+
+## Task 5: Verification and Commit
+
+**Files:**
+- Modify/create the files from Task 4.
+
+- [x] **Step 1: Verify docs**
+
+Run:
+
+```bash
+git diff --check
+git status --short
+```
+
+Expected:
+
+- no whitespace errors,
+- only intended docs are modified plus untracked `.claude/`.
+
+Result:
+
+- `git diff --check` exited 0.
+- `/home/mudler/_git/llama.cpp` was clean.
+- DGX metadata artifact existed and contained MoE/dense GGUF metadata.
+
+- [ ] **Step 2: Commit docs**
+
+For GO:
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+git add -f docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md \
+  docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md \
+  docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md
+git commit -m "docs(paged): scope GDN shared-Ai prototype" \
+  -m "Assisted-by: Codex:gpt-5"
+```
+
+For NO-GO:
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+  backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git add -f docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md
+git commit -m "docs(paged): close GDN shared-Ai cost model" \
+  -m "Assisted-by: Codex:gpt-5"
+```
--- a/docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md
+++ b/docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md
@@ -0,0 +1,97 @@
+# GDN Global-Ai Prototype Design
+
+## Goal
+
+Prototype the only remaining plausible C32 GDN prefill path on GB10: compute
+the per-chunk triangular inverse once into global f32 Ai scratch, then reuse it
+from two `dv_tile=64` value-slab CTAs.
+
+## Scope
+
+The prototype is default-off and intentionally narrow:
+
+- `S_v=128`
+- `BT=32`
+- f32 Ai scratch
+- two `dv_tile=64` value slabs
+- non-KDA, final-state-only path matching the existing chunked M5 conditions
+- no decode routing; `GDN_CHUNK_MIN` remains greater than 1
+
+## Architecture
+
+The prototype splits current M5 work into two CUDA stages:
+
+1. `gdn_ai32_cuda`: one CTA per `(sequence, head, chunk)` computes the C32
+   chunk-local triangular inverse `Ai = A^-1` and writes `[BT, BT]` f32 scratch.
+2. `gdn_chunked_ai32_cuda`: one CTA per `(sequence, head, value slab)` loads Ai
+   for each chunk and performs the value-dependent work for its 64 output
+   columns.
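
The CTA counts implied by the two stages can be sketched as follows. This is illustrative Python rather than the CUDA launch code. `plan_launch` and `LaunchPlan` are hypothetical names, and the sketch assumes `S_v` is an exact multiple of `dv_tile`.

```python
import math
from typing import NamedTuple


class LaunchPlan(NamedTuple):
    ai_ctas: int        # stage 1: one CTA per (sequence, head, chunk)
    slab_ctas: int      # stage 2: one CTA per (sequence, head, value slab)
    ai_tile_bytes: int  # one [BT, BT] f32 Ai tile


def plan_launch(n_seqs, h, n_tokens, s_v=128, bt=32, dv_tile=64):
    """Count CTAs per stage; assumes s_v is a multiple of dv_tile."""
    n_chunks = math.ceil(n_tokens / bt)
    n_slabs = s_v // dv_tile
    return LaunchPlan(
        ai_ctas=n_seqs * h * n_chunks,
        slab_ctas=n_seqs * h * n_slabs,
        ai_tile_bytes=bt * bt * 4,
    )


if __name__ == "__main__":
    # MoE benchmark shape: npl=32 sequences, H=32, npp=2048 tokens.
    print(plan_launch(32, 32, 2048))
```

Multiplying `ai_ctas` by `ai_tile_bytes` reproduces the scratch size, since each stage-1 CTA writes exactly one tile.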
+
+This mirrors the portable scheduling idea from vLLM/FLA without importing
+CuteDSL, TMA, or BF16 storage. It directly tests whether sharing A/Ai across
+slabs removes the duplicated per-slab work that caused the Phase 10 rejection.
+
+## Scratch
+
+Ai scratch is sized:
+
+```text
+n_seqs * H * ceil(n_tokens / 32) * 32 * 32 * sizeof(float)
+```
+
+At `npp=2048,npl=32`, this is:
+
+- MoE H=32: 256 MiB.
+- Dense H=48: 384 MiB.
+
+Scratch allocation must use the existing ggml CUDA pool, be scoped to the op,
+and be default-off behind an explicit env selector.
+
+## Selector
+
+Use:
+
+```text
+GDN_GLOBAL_AI32=1
+```
+
+The default path remains current C16 M5. The candidate only engages when:
+
+- `S_v == 128`
+- `n_tokens >= GDN_CHUNK_MIN`
+- `!KDA && !keep_rs_t`
+- `GDN_GLOBAL_AI32=1`
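
The four conditions combine into a single predicate. The sketch below is Python for illustration only; the real gate would live in the ggml CUDA op dispatch. It assumes the literal value `1` enables the path, and the function and parameter names are hypothetical.

```python
import os


def use_global_ai32(s_v, n_tokens, chunk_min, kda, keep_rs_t, env=None):
    """Return True only when every engagement condition holds.

    Assumption: the selector is enabled by the exact string "1".
    """
    env = os.environ if env is None else env
    return (
        env.get("GDN_GLOBAL_AI32") == "1"
        and s_v == 128
        and n_tokens >= chunk_min
        and not kda
        and not keep_rs_t
    )


if __name__ == "__main__":
    on = {"GDN_GLOBAL_AI32": "1"}
    print(use_global_ai32(128, 2048, 64, False, False, env=on))
    print(use_global_ai32(128, 2048, 64, False, False, env={}))
```

With the variable unset the predicate is false, so the default C16 M5 path is unaffected.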
+
+## Correctness
+
+The first implementation uses f32 Ai to maximize the chance that md5 stays
+unchanged. It must pass:
+
+- `test-backend-ops -b CUDA0 -o GATED_DELTA_NET`
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+- Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+
+If either md5 changes, stop and run a KL check before making any performance
+claim.
+
+## Performance
+
+Compare same-session against current M5:
+
+```text
+LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64
+```
+
+versus:
+
+```text
+LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_GLOBAL_AI32=1
+```
+
+Run MoE and dense at `npp=512,2048`, `ntg=4`, `npl=32`.
+
+## Decision Rule
+
+Accept only if the prototype is correctness-safe and improves end-to-end S_PP.
+Reject if it is flat or slower. If rejected, save the diff under
+`/home/mudler/bench/phase13_gdn_global_ai32/rejected/` and do not add a LocalAI
+patch.
--- a/docs/superpowers/specs/2026-07-01-gdn-shared-ai-cost-model-design.md
+++ b/docs/superpowers/specs/2026-07-01-gdn-shared-ai-cost-model-design.md
@@ -0,0 +1,108 @@
+# GDN Shared-A/Ai Cost Model Design
+
+## Context
+
+The last two GDN experiments closed the low-conflict shortcut space:
+
+- Phase 10 C32 slab M5 was md5-clean after tail-row zeroing but slower because
+  each value slab recomputed the per-chunk triangular work.
+- Phase 11 QS-early M5 was md5-clean but still slower because moving `QS` did
+  not remove a tensor-core pass.
+
+The remaining algorithmic gap to vLLM/FLA is not another local reorder. vLLM
+builds the per-chunk triangular object once, solves/inverts it once, and reuses
+that result across the WY transform. llama.cpp's current C=16 M5 already
+computes A/T once for the full value width inside one CTA. A wider chunk only
+fits on GB10 if value columns are split into slabs, and slabs lose unless A/T
+is shared across them.
+
+## Current Geometry
+
+For `S_v = 128` and f32 state:
+
+| Shape | Dynamic smem |
+|-------|--------------|
+| C16 full value width | 93,376 B / 91.19 KiB |
+| C32 full value width | 127,360 B / 124.38 KiB |
+| C32 with `dv_tile=64` plus U staging | 94,592 B / 92.38 KiB |
+
+GB10's available dynamic smem leaves enough room for C16 full-width and C32
+half-width, but not for C32 full-width. Since Phase 10 showed that half-width
+slabs lose when each one recomputes the triangular work, a shared-A/Ai design is
+the only plausible C32 path.
+
+## Candidate Approaches
+
+### A. Global A/Ai Scratch Precompute
+
+Add a first kernel that computes `A` and `Ai` once per `(sequence, head, chunk)`
+and materializes `Ai` in global scratch. A second kernel consumes `Ai` across
+value slabs.
+
+Pros:
+
+- Directly targets the Phase 10 failure mode.
+- Mirrors the portable part of vLLM/FLA's schedule.
+- Keeps each value-slab CTA within the GB10 smem limit.
+
+Cons:
+
+- Adds at least one extra kernel boundary.
+- Requires scratch allocation and lifetime management in ggml CUDA.
+- Scratch is large at real batch sizes. At `npl=32`, `BT=32`, f32 Ai costs:
+  - H=40, T=2048: 320 MiB.
+  - H=48, T=2048: 384 MiB.
+  - H=64, T=2048: 512 MiB.
+- Needs careful profiling because global scratch traffic can erase the saved
+  triangular recomputation.
+
+### B. Shared A/Ai Inside One CTA With Reduced State Residency
+
+Keep C32 in one CTA by moving some state or value scratch out of shared memory.
+
+Pros:
+
+- Avoids global Ai scratch and cross-kernel synchronization.
+- Could keep the current single-kernel structure.
+
+Cons:
+
+- The f32 state alone is 64 KiB. Removing enough shared memory for C32 full
+  width likely means reading state from global during MMA tiles or reducing
+  state residency, which undercuts the main strength of the current M5 kernel.
+- Higher risk of lowering achieved bandwidth and breaking md5 via new ordering.
+
+### C. Stay C16 and Stop GDN Kernel Work on GB10
+
+Accept C16 M5 as the local GB10 ceiling and redirect parity work to another
+bucket or different hardware.
+
+Pros:
+
+- Avoids high-risk scratch and synchronization work.
+- Matches Phase 10/11 evidence that shortcuts are now exhausted.
+
+Cons:
+
+- Leaves the GDN prefill gap open.
+- Does not move toward vLLM prefill parity on GB10.
+
+## Recommended Phase 12
+
+Run a cost-model and dry-design phase before any source patch. The phase should
+produce a go/no-go decision for Approach A:
+
+1. Extract actual GDN head counts and chunk counts for the MoE and dense GGUFs.
+2. Compute scratch sizes for `BT=32` and `BT=64` at the benchmark shapes.
+3. Estimate extra global traffic: Ai write + Ai read per value slab.
+4. Compare that traffic against the triangular recomputation saved by sharing
+   A/Ai across slabs.
+5. Only if the model is plausible, write a Phase 13 implementation plan for a
+   default-off global-scratch prototype.
+
+## Decision Rule
+
+Proceed to implementation only if the model shows a credible net win at
+`npp=2048, npl=32` without unreasonable memory growth. If the estimated scratch
+traffic or kernel-boundary overhead is close to the saved work, record a no-go
+and stop GDN kernel work on GB10 rather than adding a large patch that is likely
+to be rejected.