docs(paged): scope GDN global Ai32 prototype

Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.

Assisted-by: Codex:gpt-5
This commit is contained in:
Ettore Di Giacinto
2026-07-01 01:38:51 +00:00
parent 1b5ae227eb
commit adabd11919
9 changed files with 1159 additions and 0 deletions

View File

@@ -983,3 +983,45 @@ Conclusion:
- The next GDN attempt should skip local scheduling-only changes and scope a
true shared-A/Ai blocked-solve or global-scratch design, with an explicit
scratch/synchronization cost model before coding.
## Phase 12 GDN Shared-A/Ai Cost Model
Phase 12 evaluated whether a real shared-A/Ai design is credible enough to
prototype after the C32 slab and QS-early shortcut rejections.
Cost-model doc:
- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
Metadata artifact:
- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
Model dimensions:
| Model | GDN layers | H | S_v | Metadata basis |
|-------|------------|---|-----|----------------|
| MoE | 30 inferred | 32 inferred | 128 | `ssm.inner_size=4096`, `ssm.state_size=128` |
| Dense | 48 inferred | 48 inferred | 128 | `ssm.inner_size=6144`, `ssm.state_size=128` |
Dynamic-smem result for `S_v=128`:
| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
|-------|-------|-----|-------------------------|
| C16 full-width | 93,376 | 91.19 | yes |
| C32 full-width | 127,360 | 124.38 | no |
| C32 slab64 + U staging | 94,592 | 92.38 | yes |
Ai scratch result at `npp=2048,npl=32,BT=32,f32`:
| Model | Ai scratch MiB | 3x Ai traffic MiB |
|-------|----------------|-------------------|
| MoE | 256.0 | 768.0 |
| Dense | 384.0 | 1152.0 |
Decision:
- GO for a default-off Phase 13 global-Ai32 prototype.
- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`.
- The prototype must be rejected if it is flat or slower; do not iterate into
f16/BF16 Ai unless f32 proves the schedule can win.

View File

@@ -0,0 +1,142 @@
# GDN Shared-A/Ai Cost Model
Phase 12 decides whether the next GDN prefill attempt should implement a
shared-A/Ai global-scratch prototype or stop GDN kernel work on GB10.
## Reference Points
llama.cpp:
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
- `gated_delta_net_chunked_cuda`
- `launch_gdn_chunked`
- `launch_gated_delta_net`
- `ggml_cuda_op_gated_delta_net`
vLLM/FLA:
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py`
- `chunk_gated_delta_rule_fwd`
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py`
- `solve_tril`
- `solve_tril_16x16_kernel`
- `merge_16x16_to_32x32_inverse_kernel`
- `merge_16x16_to_64x64_inverse_kernel`
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py`
- `recompute_w_u_fwd`
## Metadata
DGX metadata artifact:
- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
GGUF metadata:
| Model | Arch | Blocks | Full-attn interval | GDN layers | SSM inner | SSM state | GDN heads |
|-------|------|--------|--------------------|------------|-----------|-----------|-----------|
| MoE | `qwen35moe` | 41 | 4 | 30 inferred | 4096 | 128 | 32 inferred |
| Dense | `qwen35` | 64 | 4 | 48 inferred | 6144 | 128 | 48 inferred |
Notes:
- `GDN heads = ssm.inner_size / ssm.state_size`.
- MoE has one `nextn` layer; the serving/prefill stack uses the 40 normal
layers, with 30 GDN layers at interval 4.
- Dense has 64 layers, 48 GDN layers at interval 4.
## Dynamic Shared Memory
Formula:
```text
C16 full-width current M5:
floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
C32 full-width:
floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
C32 slab64 with U staging:
floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C
```
For `S_v=128`:
| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
|-------|-------|-----|-------------------------|
| C16 full-width | 93,376 | 91.19 | yes |
| C32 full-width | 127,360 | 124.38 | no |
| C32 slab64 + U staging | 94,592 | 92.38 | yes |
Implication:
- C32 full-width cannot be a single current-style CTA on GB10.
- C32 only fits by splitting value columns or by changing state residency.
- Splitting value columns must share A/Ai or it repeats the Phase 10 failure.
## Ai Scratch Size
Formula:
```text
Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)
```
Benchmark shape: `npl=32`, `S_v=128`.
| Model | H | npp | BT | Ai dtype | Chunks | Ai scratch MiB | 3x Ai traffic MiB |
|-------|---|-----|----|----------|--------|----------------|-------------------|
| MoE | 32 | 512 | 32 | f32 | 16 | 64.0 | 192.0 |
| MoE | 32 | 512 | 32 | f16 | 16 | 32.0 | 96.0 |
| MoE | 32 | 512 | 64 | f32 | 8 | 128.0 | 384.0 |
| MoE | 32 | 512 | 64 | f16 | 8 | 64.0 | 192.0 |
| MoE | 32 | 2048 | 32 | f32 | 64 | 256.0 | 768.0 |
| MoE | 32 | 2048 | 32 | f16 | 64 | 128.0 | 384.0 |
| MoE | 32 | 2048 | 64 | f32 | 32 | 512.0 | 1536.0 |
| MoE | 32 | 2048 | 64 | f16 | 32 | 256.0 | 768.0 |
| Dense | 48 | 512 | 32 | f32 | 16 | 96.0 | 288.0 |
| Dense | 48 | 512 | 32 | f16 | 16 | 48.0 | 144.0 |
| Dense | 48 | 512 | 64 | f32 | 8 | 192.0 | 576.0 |
| Dense | 48 | 512 | 64 | f16 | 8 | 96.0 | 288.0 |
| Dense | 48 | 2048 | 32 | f32 | 64 | 384.0 | 1152.0 |
| Dense | 48 | 2048 | 32 | f16 | 64 | 192.0 | 576.0 |
| Dense | 48 | 2048 | 64 | f32 | 32 | 768.0 | 2304.0 |
| Dense | 48 | 2048 | 64 | f16 | 32 | 384.0 | 1152.0 |
`3x Ai traffic` means one Ai write plus two Ai reads for two value slabs.
## Interpretation
The f32 `BT=32` scratch path is large but plausible:
- Peak scratch is 256 MiB for MoE and 384 MiB for dense at `npp=2048,npl=32`.
- Ai traffic is 768 MiB for MoE and 1.125 GiB for dense per GDN layer call.
- This is not free on LPDDR5x, but it is not automatically worse than
recomputing A/Ai in every value slab.
The f16/BF16 Ai path halves traffic but should not be first because Phase 10 and
Phase 11 showed correctness must be established before performance. The first
prototype should store Ai in f32, stay default-off, and use md5/KL gates before
trying a lossy Ai dtype.
## Decision
GO: Phase 13 should implement a default-off global-Ai scratch prototype.
Rationale:
- The only remaining C32 path that addresses Phase 10's failure is sharing A/Ai
across value slabs.
- `BT=32` f32 scratch has acceptable peak memory for the existing GB10
benchmark shapes.
- The implementation can be default-off and rejected cleanly if global scratch
traffic or extra launch boundaries dominate.
Phase 13 constraints:
- Prototype only `BT=32`, f32 Ai, two `dv_tile=64` value slabs.
- Keep decode out via `GDN_CHUNK_MIN > 1`.
- Gate with `GATED_DELTA_NET`, canonical MoE/dense md5, and same-session A/B.
- If md5 changes, run KL before benchmarking.
- If the prototype is flat or slower, reject it and stop GDN kernel work on
GB10; do not iterate into f16 Ai until f32 proves the schedule can win.

View File

@@ -175,9 +175,14 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
| bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 |
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic |
Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation.
Phase 12 caveat: this is not a shipped win. It authorizes only a default-off
`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work
on GB10 instead of iterating into f16 Ai or more local reorders.
### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady)
| Lever | What | Verdict | Key number |
|---|---|---|---|

View File

@@ -174,6 +174,7 @@ products through tensor cores. The series chased that headroom.
| bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab |
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model |
**Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
@@ -185,6 +186,12 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly
that and is the shipped winner; it does not fully close the 2.62x because vLLM's
mature FLA blocked-solve is a more complete tensor-core implementation.
Post-record caveat: Phase 12 does not change the shipped verdict. It permits one
default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large
but not automatically disqualifying. If that prototype is flat or slower, GDN
kernel work on GB10 should stop rather than moving to f16 Ai or additional
local reorders.
### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact)
**Methodology correction - why every earlier decode decomposition was wrong.**

View File

@@ -521,6 +521,34 @@ Artifacts:
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff`
### Phase 12 GDN shared-A/Ai cost-model update
Phase 12 scoped the next non-shortcut GDN path: compute f32 Ai once per
`(sequence, head, chunk)` and reuse it across two `dv_tile=64` value slabs.
Cost model:
- C16 full-width M5 uses `93,376 B` dynamic smem.
- C32 full-width would need `127,360 B`, which does not fit GB10.
- C32 slab64 fits at `94,592 B`, but Phase 10 showed it loses when A/T is
recomputed per slab.
- For `BT=32`, f32 Ai scratch at `npp=2048,npl=32` is:
- MoE H=32: `256 MiB`, with `768 MiB` total Ai write/read traffic.
- Dense H=48: `384 MiB`, with `1152 MiB` total Ai write/read traffic.
Decision:
- **GO** to a default-off Phase 13 prototype, not a shipped patch.
- Scope: `GDN_GLOBAL_AI32=1`, `BT=32`, f32 Ai, two `dv_tile=64` slabs.
- Reject if same-session A/B is flat/slower. If rejected, stop GDN kernel work
on GB10 rather than iterating into f16 Ai or more local reorders.
Docs:
- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`
- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`
---
# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)

View File

@@ -0,0 +1,398 @@
# GDN Global-Ai Prototype Phase 13 Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Implement and test a default-off C32 GDN prefill prototype that computes f32 Ai once per chunk/head and reuses it across two value slabs.
**Architecture:** The prototype adds one Ai precompute kernel plus one Ai-consuming chunked kernel in `gated_delta_net.cu`. Scratch is allocated from the existing ggml CUDA pool in `ggml_cuda_op_gated_delta_net`, scoped to the op, and only used when `GDN_GLOBAL_AI32=1`.
**Tech Stack:** llama.cpp CUDA, ggml CUDA pool allocator, GB10 DGX benchmark harness, Qwen3.6 NVFP4 GGUF gates.
---
## Guardrails
- Default path remains current C16 M5.
- Candidate engages only with `GDN_GLOBAL_AI32=1`.
- Prototype only supports `S_v=128`, `C=32`, `DV_TILE=64`, f32 Ai.
- Keep `GDN_CHUNK_MIN > 1`; decode must never use this path.
- Do not add f16/BF16 Ai until f32 Ai wins.
- Do not generate a LocalAI patch unless the fork implementation passes gates
and improves S_PP.
## Task 1: Preflight
**Files:**
- Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/`
- [ ] **Step 1: Check DGX is free**
Run:
```bash
ssh dgx.casa 'set -e
echo docker=$(docker ps -q | wc -l)
echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true)
echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l)
if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi'
```
Expected:
```text
docker=0
local_ai_worker=0
compute=0
FREE...
```
- [ ] **Step 2: Record provenance**
Run:
```bash
git -C /home/mudler/_git/llama.cpp status --short
git -C /home/mudler/_git/llama.cpp rev-parse HEAD
ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git rev-parse HEAD'
```
Expected: both llama.cpp trees are clean.
- [ ] **Step 3: Create artifacts**
Run:
```bash
ssh dgx.casa 'mkdir -p /home/mudler/bench/phase13_gdn_global_ai32/{gates,ab,rejected}'
```
Expected: command exits 0.
## Task 2: Add Ai Scratch Plumbing
**Files:**
- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
- [ ] **Step 1: Add env selector in `ggml_cuda_op_gated_delta_net`**
Add after `keep_rs` is computed:
```cpp
static const bool gdn_global_ai32 = []{
const char * e = getenv("GDN_GLOBAL_AI32");
return e && atoi(e) != 0;
}();
```
- [ ] **Step 2: Allocate Ai scratch only for supported calls**
Add:
```cpp
float * ai32_d = nullptr;
int64_t ai32_chunks = 0;
ggml_cuda_pool_alloc<float> ai32_scratch(ctx.pool());
if (gdn_global_ai32 && !kda && !keep_rs && S_v == 128 && n_tokens > 1) {
ai32_chunks = (n_tokens + 31) / 32;
ai32_d = ai32_scratch.alloc((size_t) n_seqs * H * ai32_chunks * 32 * 32);
}
```
Pass `ai32_d` and `ai32_chunks` into the non-KDA/non-keep launch call only.
Other launch calls pass `nullptr, 0`.
- [ ] **Step 3: Extend `launch_gated_delta_net` signature**
Change the signature to include:
```cpp
float * ai32_d, int64_t ai32_chunks,
```
before `float scale`. Thread these through all four call sites.
## Task 3: Add Ai Precompute Kernel
**Files:**
- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
- [ ] **Step 1: Add `gdn_ai32_cuda`**
Add a kernel near `gated_delta_net_chunked_cuda`:
```cpp
template <int S_v, int C>
__global__ void gdn_ai32_cuda(
const float * __restrict__ k,
const float * __restrict__ g,
const float * __restrict__ beta,
float * __restrict__ ai,
int64_t H, int64_t n_tokens, int64_t n_seqs,
int64_t sq1, int64_t sq2, int64_t sq3,
int64_t sb1, int64_t sb2, int64_t sb3,
uint3 neqk1_magic, uint3 rq3_magic) {
// CTA: blockIdx.x=head, blockIdx.y=seq, blockIdx.z=chunk.
// Shared: Kc[C*S_v], A[C*C], csh[C], gam[C], bet[C], KKsh[C*C].
// Compute Kc, prefix csh/gam, KK, A, then exact f32 inverse into ai.
}
```
The inverse algorithm must match the existing M5 f32 inverse:
```cpp
if (j < C) {
if (j < Cc) {
float x[C];
for (int r = 0; r < C; r++) x[r] = 0.0f;
x[j] = 1.0f;
for (int r = j + 1; r < Cc; r++) {
float acc = 0.0f;
for (int m = j; m < r; m++) acc += A[r * C + m] * x[m];
x[r] = -acc;
}
for (int r = 0; r < C; r++) ai[ai_base + r * C + j] = x[r];
} else {
for (int r = 0; r < C; r++) ai[ai_base + r * C + j] = 0.0f;
}
}
```
Use fixed stride `C` in scratch, zeroing out-of-range tail rows/columns.
- [ ] **Step 2: Add launcher**
Add:
```cpp
template <int S_v, int C>
static void launch_gdn_ai32(..., float * ai32_d, int64_t ai32_chunks, cudaStream_t stream)
```
Launch grid:
```cpp
dim3 grid_dims(H, n_seqs, ai32_chunks);
dim3 block_dims(S_v, 1, 1);
```
Dynamic smem:
```cpp
((size_t) C * S_v + (size_t) C * C + (size_t) 3 * C + (size_t) C * C) * sizeof(float)
```
## Task 4: Add Ai-Consuming C32 Slab Kernel
**Files:**
- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
- [ ] **Step 1: Add `gated_delta_net_chunked_ai32_cuda`**
Add a separate kernel rather than overloading the shipped M5 body:
```cpp
template <int S_v, int C, int DV_TILE>
__global__ void gated_delta_net_chunked_ai32_cuda(
const float * __restrict__ q,
const float * __restrict__ k,
const float * __restrict__ v,
const float * __restrict__ g,
const float * __restrict__ beta,
const float * __restrict__ curr_state,
float * __restrict__ dst,
const float * __restrict__ ai,
int64_t H, int64_t n_tokens, int64_t n_seqs,
int64_t sq1, int64_t sq2, int64_t sq3,
int64_t sv1, int64_t sv2, int64_t sv3,
int64_t sb1, int64_t sb2, int64_t sb3,
uint3 neqk1_magic, uint3 rq3_magic,
float scale, float * __restrict__ state_dst,
const int32_t * __restrict__ ids, int rs_head) {
// CTA: blockIdx.x=head, blockIdx.y=seq, blockIdx.z=value slab.
// C=32, DV_TILE=64.
// Load the full source state stride S_v*S_v but own only columns [slab*DV_TILE, +DV_TILE).
// For every chunk, load Kc/Qc/csh/gam/bet, build RHS, load Ai, apply U = Ai*RHS,
// build P from QK, compute O, update owned state columns, write owned state columns.
}
```
Use the Phase 10 tail-row fix:
```cpp
Ud[j * C + t] = (t < Cc) ? staged_value : 0.0f;
```
and use full state stride for reads/writes:
```cpp
(int64_t) seq * H * S_v * S_v + (int64_t) h_idx * S_v * S_v
```
- [ ] **Step 2: Add launcher**
Add:
```cpp
template <int S_v, int C, int DV_TILE>
static void launch_gdn_chunked_ai32(..., const float * ai32_d, int64_t ai32_chunks, ...)
```
Launch grid:
```cpp
dim3 grid_dims(H, n_seqs, S_v / DV_TILE);
dim3 block_dims(DV_TILE, 1, 1);
```
The smem formula must stay under the C32 slab Phase 10 budget:
```cpp
((size_t) S_v * DV_TILE + (size_t) 2 * C * S_v + (size_t) DV_TILE * C
+ (size_t) C * C + (size_t) 3 * C + (size_t) C * C
+ (size_t) DV_TILE * C) * sizeof(float)
```
## Task 5: Route Candidate
**Files:**
- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
- [ ] **Step 1: Add route in `launch_gated_delta_net`**
Before the existing `GDN_CHUNKED_LAUNCH` switch:
```cpp
if (ai32_d != nullptr && ai32_chunks > 0 && S_v == 128 && n_tokens >= gdn_chunk_min) {
launch_gdn_ai32<128, 32>(...);
launch_gdn_chunked_ai32<128, 32, 64>(...);
return;
}
```
The route must require `!KDA && !keep_rs_t` via the existing template branch and
must not trigger for decode-sized calls.
- [ ] **Step 2: Keep default path unchanged**
Run:
```bash
git diff -- ggml/src/ggml-cuda/gated_delta_net.cu
```
Check that default `GDN_TC=5` still launches `launch_gdn_chunked<128, 16, 4>`.
## Task 6: Build and Correctness Gates
**Files:**
- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/gates/`
- [ ] **Step 1: Mirror and build**
Run:
```bash
rsync -a /home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu \
dgx.casa:/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target test-backend-ops llama-completion llama-batched-bench -j 8'
```
Expected: build exits 0.
- [ ] **Step 2: Run op gates**
Run:
```bash
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
ART=$HOME/bench/phase13_gdn_global_ai32/gates
./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_default.txt" 2>&1
GDN_GLOBAL_AI32=1 GDN_TC=5 GDN_CHUNK_MIN=2 ./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_global_ai32.txt" 2>&1'
```
Expected: both logs show CUDA0 OK for all cases.
- [ ] **Step 3: Run canonical md5 gates**
Run default and candidate MoE/dense completion gates. Expected:
```text
MoE 8cb0ce23777bf55f92f63d0292c756b0
Dense 5951a5b4d624ce891e22ab5fca9bc439
```
If candidate md5 differs, run the KL gate before benchmarking.
## Task 7: Performance A/B
**Files:**
- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
- [ ] **Step 1: Run same-session A/B**
Run MoE and dense:
```bash
LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1"
LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_GLOBAL_AI32=1 GGML_NO_BACKTRACE=1"
```
Use:
```bash
./llama-batched-bench -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32
```
Expected: candidate improves S_PP without dense regression.
- [ ] **Step 2: Decide**
Accept only if:
- op gate passes,
- md5 is canonical or KL-benign,
- MoE S_PP improves,
- dense S_PP does not regress outside noise.
Reject if flat or slower.
## Task 8: Mirror or Reject
**Files:**
- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- [ ] **Step 1: If accepted, commit fork patch and generate LocalAI patch**
Run:
```bash
git -C /home/mudler/_git/llama.cpp add ggml/src/ggml-cuda/gated_delta_net.cu
git -C /home/mudler/_git/llama.cpp commit -m "feat(cuda): add GDN global Ai32 prefill prototype"
git -C /home/mudler/_git/llama.cpp format-patch -1 HEAD --stdout \
> backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-GDN-global-Ai32-prefill-prototype.patch
```
- [ ] **Step 2: If rejected, save diff and restore**
Run:
```bash
git -C /home/mudler/_git/llama.cpp diff -- ggml/src/ggml-cuda/gated_delta_net.cu \
> /home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff
git -C /home/mudler/_git/llama.cpp checkout -- ggml/src/ggml-cuda/gated_delta_net.cu
ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git checkout -- ggml/src/ggml-cuda/gated_delta_net.cu'
```
- [ ] **Step 3: Commit LocalAI docs**
Commit accepted patch/docs or rejected docs with:
```bash
git commit -m "docs(paged): record GDN global Ai32 result" \
-m "Assisted-by: Codex:gpt-5"
```

View File

@@ -0,0 +1,332 @@
# GDN Shared-A/Ai Cost Model Phase 12 Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Decide whether a shared-A/Ai C32 GDN design is worth implementing on GB10 before touching llama.cpp source.
**Architecture:** Phase 12 is analysis-first and docs-only unless the cost model proves a credible win. It extracts model dimensions, computes dynamic-smem and global-scratch pressure, estimates traffic saved versus traffic added, and writes a go/no-go decision for a possible Phase 13 global-scratch prototype.
**Tech Stack:** llama.cpp CUDA GDN kernel geometry, vLLM/FLA chunked GDN references, DGX GB10 benchmark artifacts, LocalAI parity docs.
---
## Guardrails
- Do not edit llama.cpp source in this phase.
- Do not generate a LocalAI patch file in this phase.
- Treat Phase 10 and Phase 11 as rejected; do not reopen C32 slab or QS-early.
- Use actual model metadata where available; if a dimension is inferred, mark it
as inferred.
- The output is a go/no-go decision, not an implementation patch.
## Task 1: Gather Current Evidence
**Files:**
- Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py`
- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py`
- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py`
- Read: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Artifact: `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/`
- [x] **Step 1: Check tree state**
Run:
```bash
git -C /home/mudler/_git/llama.cpp status --short
git -C /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention status --short
```
Expected:
- llama.cpp fork is clean.
- LocalAI worktree only has this Phase 12 docs work and untracked `.claude/`.
- [x] **Step 2: Create artifact directory**
Run:
```bash
ssh dgx.casa 'mkdir -p /home/mudler/bench/phase12_gdn_shared_ai_cost_model'
```
Expected: command exits 0.
- [x] **Step 3: Record reference function map**
Record these llama.cpp insertion points in the result doc:
```text
/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu
gated_delta_net_chunked_cuda
launch_gdn_chunked
launch_gated_delta_net
ggml_cuda_op_gated_delta_net
```
Record these vLLM reference functions:
```text
/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py
chunk_gated_delta_rule_fwd
/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py
solve_tril
solve_tril_16x16_kernel
merge_16x16_to_32x32_inverse_kernel
merge_16x16_to_64x64_inverse_kernel
/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py
recompute_w_u_fwd
```
Result: recorded in
`backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`.
## Task 2: Extract Model Dimensions
**Files:**
- Artifact: `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
- [x] **Step 1: Extract GGUF metadata**
Run on DGX:
```bash
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
{
echo "=== MoE ==="
./llama-show-info -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf 2>/dev/null || ./llama-cli --show-info -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -n 0 2>/dev/null || true
echo "=== Dense ==="
./llama-show-info -m /home/mudler/bench/q36-27b-nvfp4.gguf 2>/dev/null || ./llama-cli --show-info -m /home/mudler/bench/q36-27b-nvfp4.gguf -n 0 2>/dev/null || true
} > /home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt'
```
Expected: metadata file contains head count, layer count, and head dimension
or enough tensor metadata to infer them.
Result:
- Metadata artifact:
`/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`.
- `llama-show-info` was not present in the DGX build, so a minimal read-only
GGUF metadata parser was used.
- [x] **Step 2: Summarize GDN dimensions**
Write a short table in the result doc:
```text
Model | GDN layers | H | S_v | benchmark npl | npp | chunks at BT=32 | chunks at BT=64
```
Use benchmark shapes:
- `npl=32`
- `npp=512,2048`
- `S_v=128`
If H cannot be read directly from metadata, infer it from source/model docs and
mark the row as inferred.
Result:
| Model | GDN layers | H | S_v | benchmark npl | npp | chunks at BT=32 | chunks at BT=64 |
|-------|------------|---|-----|---------------|-----|-----------------|-----------------|
| MoE | 30 inferred | 32 inferred | 128 | 32 | 512 | 16 | 8 |
| MoE | 30 inferred | 32 inferred | 128 | 32 | 2048 | 64 | 32 |
| Dense | 48 inferred | 48 inferred | 128 | 32 | 512 | 16 | 8 |
| Dense | 48 inferred | 48 inferred | 128 | 32 | 2048 | 64 | 32 |
`H = ssm.inner_size / ssm.state_size`.
## Task 3: Compute Smem and Scratch Costs
**Files:**
- Create: `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
- [x] **Step 1: Record dynamic-smem formulas**
Use:
```text
C16 full-width current M5:
floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
C32 full-width:
floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
C32 slab64 with U staging:
floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C
```
Expected values for `S_v=128`:
```text
C16 full-width: 93,376 B / 91.19 KiB
C32 full-width: 127,360 B / 124.38 KiB
C32 slab64: 94,592 B / 92.38 KiB
```
- [x] **Step 2: Record Ai scratch formulas**
Use:
```text
Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)
```
Compute for:
- `BT=32`, f32 and f16/bf16 Ai.
- `BT=64`, f32 and f16/bf16 Ai.
- `npp=512` and `npp=2048`.
- [x] **Step 3: Estimate extra global traffic**
For a two-slab C32 design, estimate:
```text
Ai write once = npl * H * nchunks * BT * BT * sizeof(Ai)
Ai read per slab = 2 * Ai write once
total Ai traffic = 3 * Ai write once
```
Record the estimate in MiB for every benchmark shape.
- [x] **Step 4: Estimate work saved**
Record that shared Ai saves duplicated A/T construction per second slab:
```text
saved per chunk/head = one KK/QK-derived A/T solve/apply setup currently duplicated by C32 slab
not saved = KS, QS, U, P*U, state update, state traffic
```
Do not claim a speedup from this estimate alone. The result doc must say whether
the saved work is large enough to justify the scratch traffic and kernel
boundary risk.
Result: recorded in
`backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`.
The f32 `BT=32` scratch path costs 256 MiB (MoE) and 384 MiB (dense) at
`npp=2048,npl=32`, with 768 MiB and 1.125 GiB of Ai traffic respectively.
## Task 4: Go/No-Go Decision
**Files:**
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- [x] **Step 1: Write the decision**
Use one of these exact decisions:
```text
GO: Phase 13 should implement a default-off global-Ai scratch prototype.
```
or:
```text
NO-GO: shared-A/Ai scratch is not credible on GB10; stop GDN kernel work here.
```
The decision must cite the scratch size and Ai traffic estimates.
Decision:
```text
GO: Phase 13 should implement a default-off global-Ai scratch prototype.
```
Rationale: the scratch/traffic cost is high enough to require strict gates, but
not high enough to reject without a default-off prototype.
- [x] **Step 2: If GO, write Phase 13 scope**
If GO, create:
```text
docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md
docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md
```
The Phase 13 plan must include:
- default-off env selector,
- scratch allocation strategy,
- op gate,
- canonical MoE/dense md5 gates,
- same-session A/B,
- rejection path.
Result:
- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`.
- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`.
- [x] **Step 3: If NO-GO, update final records**
If NO-GO, update:
- `VLLM_PARITY_FINAL.md`
- `PARITY_HANDOFF.md`
Record that GDN kernel work on GB10 is exhausted by evidence, not assumption.
Result: not applicable because Phase 12 is GO. The final/handoff records are
not changed to close GDN work.
## Task 5: Verification and Commit
**Files:**
- Modify/create the files from Task 4.
- [x] **Step 1: Verify docs**
Run:
```bash
git diff --check
git status --short
```
Expected:
- no whitespace errors,
- only intended docs are modified plus untracked `.claude/`.
Result:
- `git diff --check` exited 0.
- `/home/mudler/_git/llama.cpp` was clean.
- DGX metadata artifact existed and contained MoE/dense GGUF metadata.
- [ ] **Step 2: Commit docs**
For GO:
```bash
git add backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
git add -f docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md \
docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md \
docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md
git commit -m "docs(paged): scope GDN shared-Ai prototype" \
-m "Assisted-by: Codex:gpt-5"
```
For NO-GO:
```bash
git add backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
git add -f docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md
git commit -m "docs(paged): close GDN shared-Ai cost model" \
-m "Assisted-by: Codex:gpt-5"
```

View File

@@ -0,0 +1,97 @@
# GDN Global-Ai Prototype Design
## Goal
Prototype the only remaining plausible C32 GDN prefill path on GB10: compute
the per-chunk triangular inverse once into global f32 Ai scratch, then reuse it
from two `dv_tile=64` value-slab CTAs.
## Scope
The prototype is default-off and intentionally narrow:
- `S_v=128`
- `BT=32`
- f32 Ai scratch
- two `dv_tile=64` value slabs
- non-KDA, final-state-only path matching the existing chunked M5 conditions
- no decode routing; `GDN_CHUNK_MIN` remains greater than 1
## Architecture
The prototype splits current M5 work into two CUDA stages:
1. `gdn_ai32_cuda`: one CTA per `(sequence, head, chunk)` computes the C32
chunk-local triangular inverse `Ai = A^-1` and writes `[BT, BT]` f32 scratch.
2. `gdn_chunked_ai32_cuda`: one CTA per `(sequence, head, value slab)` loads Ai
for each chunk and performs the value-dependent work for its 64 output
columns.
This mirrors the portable scheduling idea from vLLM/FLA without importing
CuteDSL, TMA, or BF16 storage. It directly tests whether sharing A/Ai across
slabs can beat the duplicated work that rejected Phase 10.
## Scratch
Ai scratch is sized:
```text
n_seqs * H * ceil(n_tokens / 32) * 32 * 32 * sizeof(float)
```
At `npp=2048,npl=32`, this is:
- MoE H=32: 256 MiB.
- Dense H=48: 384 MiB.
Scratch allocation must use the existing ggml CUDA pool, be scoped to the op,
and be default-off behind an explicit env selector.
## Selector
Use:
```text
GDN_GLOBAL_AI32=1
```
The default path remains current C16 M5. The candidate only engages when:
- `S_v == 128`
- `n_tokens >= GDN_CHUNK_MIN`
- `!KDA && !keep_rs_t`
- `GDN_GLOBAL_AI32=1`
## Correctness
The first implementation uses f32 Ai to maximize chances of md5 stability. It
must pass:
- `test-backend-ops -b CUDA0 -o GATED_DELTA_NET`
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
- Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
If md5 changes, the prototype must stop for KL before any performance claim.
## Performance
Compare same-session against current M5:
```text
LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64
```
versus:
```text
LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_GLOBAL_AI32=1
```
Run MoE and dense at `npp=512,2048`, `ntg=4`, `npl=32`.
## Decision Rule
Accept only if the prototype is correctness-safe and improves end-to-end S_PP.
Reject if it is flat or slower. If rejected, save the diff under
`/home/mudler/bench/phase13_gdn_global_ai32/rejected/` and do not add a LocalAI
patch.

View File

@@ -0,0 +1,108 @@
# GDN Shared-A/Ai Cost Model Design
## Context
The last two GDN experiments closed the low-conflict shortcut space:
- Phase 10 C32 slab M5 was md5-clean after tail-row zeroing but slower because
each value slab recomputed the per-chunk triangular work.
- Phase 11 QS-early M5 was md5-clean but still slower because moving `QS` did
not remove a tensor-core pass.
The remaining algorithmic gap to vLLM/FLA is not another local reorder. vLLM
builds the per-chunk triangular object once, solves/inverts it once, and reuses
that result across the WY transform. llama.cpp's current C=16 M5 already
computes A/T once for the full value width inside one CTA. A wider chunk only
fits on GB10 if value columns are split into slabs, and slabs lose unless A/T
is shared across them.
## Current Geometry
For `S_v = 128` and f32 state:
| Shape | Dynamic smem |
|-------|--------------|
| C16 full value width | 93,376 B / 91.19 KiB |
| C32 full value width | 127,360 B / 124.38 KiB |
| C32 with `dv_tile=64` plus U staging | 94,592 B / 92.38 KiB |
GB10's available dynamic smem leaves enough room for C16 full-width and C32
half-width, but not for C32 full-width. That makes a shared-A/Ai design the only
plausible C32 path.
## Candidate Approaches
### A. Global A/Ai Scratch Precompute
Add a first kernel that computes `A` and `Ai` once per `(sequence, head, chunk)`
and materializes `Ai` in global scratch. A second kernel consumes `Ai` across
value slabs.
Pros:
- Directly targets the Phase 10 failure mode.
- Mirrors the portable part of vLLM/FLA's schedule.
- Keeps each value-slab CTA within the GB10 smem limit.
Cons:
- Adds at least one extra kernel boundary.
- Requires scratch allocation and lifetime management in ggml CUDA.
- Scratch is large at real batch sizes. At `npl=32`, `BT=32`, f32 Ai costs:
- H=40, T=2048: 320 MiB.
- H=48, T=2048: 384 MiB.
- H=64, T=2048: 512 MiB.
- Needs careful profiling because global scratch traffic can erase the saved
triangular recomputation.
### B. Shared A/Ai Inside One CTA With Reduced State Residency
Keep C32 in one CTA by moving some state or value scratch out of shared memory.
Pros:
- Avoids global Ai scratch and cross-kernel synchronization.
- Could keep the current single-kernel structure.
Cons:
- The f32 state alone is 64 KiB. Removing enough shared memory for C32 full
width likely means reading state from global during MMA tiles or reducing
state residency, which attacks the current M5 strength.
- Higher risk of lowering achieved bandwidth and breaking md5 via new ordering.
### C. Stay C16 and Stop GDN Kernel Work on GB10
Accept C16 M5 as the local GB10 ceiling and redirect parity work to another
bucket or different hardware.
Pros:
- Avoids high-risk scratch and synchronization work.
- Matches Phase 10/11 evidence that shortcuts are now exhausted.
Cons:
- Leaves the GDN prefill gap open.
- Does not move toward vLLM prefill parity on GB10.
## Recommended Phase 12
Run a cost-model and dry-design phase before any source patch. The phase should
produce a go/no-go decision for Approach A:
1. Extract actual GDN head counts and chunk counts for the MoE and dense GGUFs.
2. Compute scratch sizes for `BT=32` and `BT=64` at the benchmark shapes.
3. Estimate extra global traffic: Ai write + Ai read per value slab.
4. Compare that traffic against the triangular recomputation saved by sharing
A/Ai across slabs.
5. Only if the model is plausible, write a Phase 13 implementation plan for a
default-off global-scratch prototype.
## Decision Rule
Proceed to implementation only if the model shows a credible net win at
`npp=2048, npl=32` without unreasonable memory growth. If the estimated scratch
traffic or kernel-boundary overhead is close to the saved work, record a no-go
and stop GDN kernel work on GB10 rather than adding a large patch that is likely
to be rejected.