mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): scope GDN global Ai32 prototype
Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan. Assisted-by: Codex:gpt-5
This commit is contained in:
@@ -983,3 +983,45 @@ Conclusion:
|
||||
- The next GDN attempt should skip local scheduling-only changes and scope a
|
||||
true shared-A/Ai blocked-solve or global-scratch design, with an explicit
|
||||
scratch/synchronization cost model before coding.
|
||||
|
||||
## Phase 12 GDN Shared-A/Ai Cost Model
|
||||
|
||||
Phase 12 evaluated whether a real shared-A/Ai design is credible enough to
|
||||
prototype after the C32 slab and QS-early shortcut rejections.
|
||||
|
||||
Cost-model doc:
|
||||
|
||||
- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
|
||||
|
||||
Metadata artifact:
|
||||
|
||||
- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
|
||||
|
||||
Model dimensions:
|
||||
|
||||
| Model | GDN layers | H | S_v | Metadata basis |
|
||||
|-------|------------|---|-----|----------------|
|
||||
| MoE | 30 inferred | 32 inferred | 128 | `ssm.inner_size=4096`, `ssm.state_size=128` |
|
||||
| Dense | 48 inferred | 48 inferred | 128 | `ssm.inner_size=6144`, `ssm.state_size=128` |
|
||||
|
||||
Dynamic-smem result for `S_v=128`:
|
||||
|
||||
| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
|
||||
|-------|-------|-----|-------------------------|
|
||||
| C16 full-width | 93,376 | 91.19 | yes |
|
||||
| C32 full-width | 127,360 | 124.38 | no |
|
||||
| C32 slab64 + U staging | 94,592 | 92.38 | yes |
|
||||
|
||||
Ai scratch result at `npp=2048,npl=32,BT=32,f32`:
|
||||
|
||||
| Model | Ai scratch MiB | 3x Ai traffic MiB |
|
||||
|-------|----------------|-------------------|
|
||||
| MoE | 256.0 | 768.0 |
|
||||
| Dense | 384.0 | 1152.0 |
|
||||
|
||||
Decision:
|
||||
|
||||
- GO for a default-off Phase 13 global-Ai32 prototype.
|
||||
- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`.
|
||||
- The prototype must be rejected if it is flat or slower; do not iterate into
|
||||
f16/BF16 Ai unless f32 proves the schedule can win.
|
||||
|
||||
@@ -0,0 +1,142 @@
|
||||
# GDN Shared-A/Ai Cost Model
|
||||
|
||||
Phase 12 decides whether the next GDN prefill attempt should implement a
|
||||
shared-A/Ai global-scratch prototype or stop GDN kernel work on GB10.
|
||||
|
||||
## Reference Points
|
||||
|
||||
llama.cpp:
|
||||
|
||||
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
|
||||
- `gated_delta_net_chunked_cuda`
|
||||
- `launch_gdn_chunked`
|
||||
- `launch_gated_delta_net`
|
||||
- `ggml_cuda_op_gated_delta_net`
|
||||
|
||||
vLLM/FLA:
|
||||
|
||||
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py`
|
||||
- `chunk_gated_delta_rule_fwd`
|
||||
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py`
|
||||
- `solve_tril`
|
||||
- `solve_tril_16x16_kernel`
|
||||
- `merge_16x16_to_32x32_inverse_kernel`
|
||||
- `merge_16x16_to_64x64_inverse_kernel`
|
||||
- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py`
|
||||
- `recompute_w_u_fwd`
|
||||
|
||||
## Metadata
|
||||
|
||||
DGX metadata artifact:
|
||||
|
||||
- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
|
||||
|
||||
GGUF metadata:
|
||||
|
||||
| Model | Arch | Blocks | Full-attn interval | GDN layers | SSM inner | SSM state | GDN heads |
|
||||
|-------|------|--------|--------------------|------------|-----------|-----------|-----------|
|
||||
| MoE | `qwen35moe` | 41 | 4 | 30 inferred | 4096 | 128 | 32 inferred |
|
||||
| Dense | `qwen35` | 64 | 4 | 48 inferred | 6144 | 128 | 48 inferred |
|
||||
|
||||
Notes:
|
||||
|
||||
- `GDN heads = ssm.inner_size / ssm.state_size`.
|
||||
- MoE has one `nextn` layer; the serving/prefill stack uses the 40 normal
|
||||
layers, with 30 GDN layers at interval 4.
|
||||
- Dense has 64 layers, 48 GDN layers at interval 4.
|
||||
|
||||
## Dynamic Shared Memory
|
||||
|
||||
Formula:
|
||||
|
||||
```text
|
||||
C16 full-width current M5:
|
||||
floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
|
||||
|
||||
C32 full-width:
|
||||
floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
|
||||
|
||||
C32 slab64 with U staging:
|
||||
floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C
|
||||
```
|
||||
|
||||
For `S_v=128`:
|
||||
|
||||
| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
|
||||
|-------|-------|-----|-------------------------|
|
||||
| C16 full-width | 93,376 | 91.19 | yes |
|
||||
| C32 full-width | 127,360 | 124.38 | no |
|
||||
| C32 slab64 + U staging | 94,592 | 92.38 | yes |
|
||||
|
||||
Implication:
|
||||
|
||||
- C32 full-width cannot be a single current-style CTA on GB10.
|
||||
- C32 only fits by splitting value columns or by changing state residency.
|
||||
- Splitting value columns must share A/Ai or it repeats the Phase 10 failure.
|
||||
|
||||
## Ai Scratch Size
|
||||
|
||||
Formula:
|
||||
|
||||
```text
|
||||
Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)
|
||||
```
|
||||
|
||||
Benchmark shape: `npl=32`, `S_v=128`.
|
||||
|
||||
| Model | H | npp | BT | Ai dtype | Chunks | Ai scratch MiB | 3x Ai traffic MiB |
|
||||
|-------|---|-----|----|----------|--------|----------------|-------------------|
|
||||
| MoE | 32 | 512 | 32 | f32 | 16 | 64.0 | 192.0 |
|
||||
| MoE | 32 | 512 | 32 | f16 | 16 | 32.0 | 96.0 |
|
||||
| MoE | 32 | 512 | 64 | f32 | 8 | 128.0 | 384.0 |
|
||||
| MoE | 32 | 512 | 64 | f16 | 8 | 64.0 | 192.0 |
|
||||
| MoE | 32 | 2048 | 32 | f32 | 64 | 256.0 | 768.0 |
|
||||
| MoE | 32 | 2048 | 32 | f16 | 64 | 128.0 | 384.0 |
|
||||
| MoE | 32 | 2048 | 64 | f32 | 32 | 512.0 | 1536.0 |
|
||||
| MoE | 32 | 2048 | 64 | f16 | 32 | 256.0 | 768.0 |
|
||||
| Dense | 48 | 512 | 32 | f32 | 16 | 96.0 | 288.0 |
|
||||
| Dense | 48 | 512 | 32 | f16 | 16 | 48.0 | 144.0 |
|
||||
| Dense | 48 | 512 | 64 | f32 | 8 | 192.0 | 576.0 |
|
||||
| Dense | 48 | 512 | 64 | f16 | 8 | 96.0 | 288.0 |
|
||||
| Dense | 48 | 2048 | 32 | f32 | 64 | 384.0 | 1152.0 |
|
||||
| Dense | 48 | 2048 | 32 | f16 | 64 | 192.0 | 576.0 |
|
||||
| Dense | 48 | 2048 | 64 | f32 | 32 | 768.0 | 2304.0 |
|
||||
| Dense | 48 | 2048 | 64 | f16 | 32 | 384.0 | 1152.0 |
|
||||
|
||||
`3x Ai traffic` means one Ai write plus two Ai reads for two value slabs.
|
||||
|
||||
## Interpretation
|
||||
|
||||
The f32 `BT=32` scratch path is large but plausible:
|
||||
|
||||
- Peak scratch is 256 MiB for MoE and 384 MiB for dense at `npp=2048,npl=32`.
|
||||
- Ai traffic is 768 MiB for MoE and 1.125 GiB for dense per GDN layer call.
|
||||
- This is not free on LPDDR5x, but it is not automatically worse than
|
||||
recomputing A/Ai in every value slab.
|
||||
|
||||
The f16/BF16 Ai path halves traffic but should not be first because Phase 10 and
|
||||
Phase 11 showed correctness must be established before performance. The first
|
||||
prototype should store Ai in f32, stay default-off, and use md5/KL gates before
|
||||
trying a lossy Ai dtype.
|
||||
|
||||
## Decision
|
||||
|
||||
GO: Phase 13 should implement a default-off global-Ai scratch prototype.
|
||||
|
||||
Rationale:
|
||||
|
||||
- The only remaining C32 path that addresses Phase 10's failure is sharing A/Ai
|
||||
across value slabs.
|
||||
- `BT=32` f32 scratch has acceptable peak memory for the existing GB10
|
||||
benchmark shapes.
|
||||
- The implementation can be default-off and rejected cleanly if global scratch
|
||||
traffic or extra launch boundaries dominate.
|
||||
|
||||
Phase 13 constraints:
|
||||
|
||||
- Prototype only `BT=32`, f32 Ai, two `dv_tile=64` value slabs.
|
||||
- Keep decode out via `GDN_CHUNK_MIN > 1`.
|
||||
- Gate with `GATED_DELTA_NET`, canonical MoE/dense md5, and same-session A/B.
|
||||
- If md5 changes, run KL before benchmarking.
|
||||
- If the prototype is flat or slower, reject it and stop GDN kernel work on
|
||||
GB10; do not iterate into f16 Ai until f32 proves the schedule can win.
|
||||
@@ -175,9 +175,14 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
|
||||
| bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
|
||||
| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
|
||||
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 |
|
||||
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic |
|
||||
|
||||
Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation.
|
||||
|
||||
Phase 12 caveat: this is not a shipped win. It authorizes only a default-off
|
||||
`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work
|
||||
on GB10 instead of iterating into f16 Ai or more local reorders.
|
||||
|
||||
### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady)
|
||||
| Lever | What | Verdict | Key number |
|
||||
|---|---|---|---|
|
||||
|
||||
@@ -174,6 +174,7 @@ products through tensor cores. The series chased that headroom.
|
||||
| bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
|
||||
| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
|
||||
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab |
|
||||
| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model |
|
||||
|
||||
**Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
|
||||
intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
|
||||
@@ -185,6 +186,12 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly
|
||||
that and is the shipped winner; it does not fully close the 2.62x because vLLM's
|
||||
mature FLA blocked-solve is a more complete tensor-core implementation.
|
||||
|
||||
Post-record caveat: Phase 12 does not change the shipped verdict. It permits one
|
||||
default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large
|
||||
but not automatically disqualifying. If that prototype is flat or slower, GDN
|
||||
kernel work on GB10 should stop rather than moving to f16 Ai or additional
|
||||
local reorders.
|
||||
|
||||
### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact)
|
||||
|
||||
**Methodology correction - why every earlier decode decomposition was wrong.**
|
||||
|
||||
@@ -521,6 +521,34 @@ Artifacts:
|
||||
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/`
|
||||
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff`
|
||||
|
||||
### Phase 12 GDN shared-A/Ai cost-model update
|
||||
|
||||
Phase 12 scoped the next non-shortcut GDN path: compute f32 Ai once per
|
||||
`(sequence, head, chunk)` and reuse it across two `dv_tile=64` value slabs.
|
||||
|
||||
Cost model:
|
||||
|
||||
- C16 full-width M5 uses `93,376 B` dynamic smem.
|
||||
- C32 full-width would need `127,360 B`, which does not fit GB10.
|
||||
- C32 slab64 fits at `94,592 B`, but Phase 10 showed it loses when A/T is
|
||||
recomputed per slab.
|
||||
- For `BT=32`, f32 Ai scratch at `npp=2048,npl=32` is:
|
||||
- MoE H=32: `256 MiB`, with `768 MiB` total Ai write/read traffic.
|
||||
- Dense H=48: `384 MiB`, with `1152 MiB` total Ai write/read traffic.
|
||||
|
||||
Decision:
|
||||
|
||||
- **GO** to a default-off Phase 13 prototype, not a shipped patch.
|
||||
- Scope: `GDN_GLOBAL_AI32=1`, `BT=32`, f32 Ai, two `dv_tile=64` slabs.
|
||||
- Reject if same-session A/B is flat/slower. If rejected, stop GDN kernel work
|
||||
on GB10 rather than iterating into f16 Ai or more local reorders.
|
||||
|
||||
Docs:
|
||||
|
||||
- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
|
||||
- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`
|
||||
- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`
|
||||
|
||||
---
|
||||
|
||||
# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)
|
||||
|
||||
@@ -0,0 +1,398 @@
|
||||
# GDN Global-Ai Prototype Phase 13 Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Implement and test a default-off C32 GDN prefill prototype that computes f32 Ai once per chunk/head and reuses it across two value slabs.
|
||||
|
||||
**Architecture:** The prototype adds one Ai precompute kernel plus one Ai-consuming chunked kernel in `gated_delta_net.cu`. Scratch is allocated from the existing ggml CUDA pool in `ggml_cuda_op_gated_delta_net`, scoped to the op, and only used when `GDN_GLOBAL_AI32=1`.
|
||||
|
||||
**Tech Stack:** llama.cpp CUDA, ggml CUDA pool allocator, GB10 DGX benchmark harness, Qwen3.6 NVFP4 GGUF gates.
|
||||
|
||||
---
|
||||
|
||||
## Guardrails
|
||||
|
||||
- Default path remains current C16 M5.
|
||||
- Candidate engages only with `GDN_GLOBAL_AI32=1`.
|
||||
- Prototype only supports `S_v=128`, `C=32`, `DV_TILE=64`, f32 Ai.
|
||||
- Keep `GDN_CHUNK_MIN > 1`; decode must never use this path.
|
||||
- Do not add f16/BF16 Ai until f32 Ai wins.
|
||||
- Do not generate a LocalAI patch unless the fork implementation passes gates
|
||||
and improves S_PP.
|
||||
|
||||
## Task 1: Preflight
|
||||
|
||||
**Files:**
|
||||
- Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
|
||||
- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/`
|
||||
|
||||
- [ ] **Step 1: Check DGX is free**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
ssh dgx.casa 'set -e
|
||||
echo docker=$(docker ps -q | wc -l)
|
||||
echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true)
|
||||
echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l)
|
||||
if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi'
|
||||
```
|
||||
|
||||
Expected:
|
||||
|
||||
```text
|
||||
docker=0
|
||||
local_ai_worker=0
|
||||
compute=0
|
||||
FREE...
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Record provenance**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
git -C /home/mudler/_git/llama.cpp status --short
|
||||
git -C /home/mudler/_git/llama.cpp rev-parse HEAD
|
||||
ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git rev-parse HEAD'
|
||||
```
|
||||
|
||||
Expected: both llama.cpp trees are clean.
|
||||
|
||||
- [ ] **Step 3: Create artifacts**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
ssh dgx.casa 'mkdir -p /home/mudler/bench/phase13_gdn_global_ai32/{gates,ab,rejected}'
|
||||
```
|
||||
|
||||
Expected: command exits 0.
|
||||
|
||||
## Task 2: Add Ai Scratch Plumbing
|
||||
|
||||
**Files:**
|
||||
- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
|
||||
|
||||
- [ ] **Step 1: Add env selector in `ggml_cuda_op_gated_delta_net`**
|
||||
|
||||
Add after `keep_rs` is computed:
|
||||
|
||||
```cpp
|
||||
static const bool gdn_global_ai32 = []{
|
||||
const char * e = getenv("GDN_GLOBAL_AI32");
|
||||
return e && atoi(e) != 0;
|
||||
}();
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Allocate Ai scratch only for supported calls**
|
||||
|
||||
Add:
|
||||
|
||||
```cpp
|
||||
float * ai32_d = nullptr;
|
||||
int64_t ai32_chunks = 0;
|
||||
ggml_cuda_pool_alloc<float> ai32_scratch(ctx.pool());
|
||||
if (gdn_global_ai32 && !kda && !keep_rs && S_v == 128 && n_tokens > 1) {
|
||||
ai32_chunks = (n_tokens + 31) / 32;
|
||||
ai32_d = ai32_scratch.alloc((size_t) n_seqs * H * ai32_chunks * 32 * 32);
|
||||
}
|
||||
```
|
||||
|
||||
Pass `ai32_d` and `ai32_chunks` into the non-KDA/non-keep launch call only.
|
||||
Other launch calls pass `nullptr, 0`.
|
||||
|
||||
- [ ] **Step 3: Extend `launch_gated_delta_net` signature**
|
||||
|
||||
Change the signature to include:
|
||||
|
||||
```cpp
|
||||
float * ai32_d, int64_t ai32_chunks,
|
||||
```
|
||||
|
||||
before `float scale`. Thread these through all four call sites.
|
||||
|
||||
## Task 3: Add Ai Precompute Kernel
|
||||
|
||||
**Files:**
|
||||
- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
|
||||
|
||||
- [ ] **Step 1: Add `gdn_ai32_cuda`**
|
||||
|
||||
Add a kernel near `gated_delta_net_chunked_cuda`:
|
||||
|
||||
```cpp
|
||||
template <int S_v, int C>
|
||||
__global__ void gdn_ai32_cuda(
|
||||
const float * __restrict__ k,
|
||||
const float * __restrict__ g,
|
||||
const float * __restrict__ beta,
|
||||
float * __restrict__ ai,
|
||||
int64_t H, int64_t n_tokens, int64_t n_seqs,
|
||||
int64_t sq1, int64_t sq2, int64_t sq3,
|
||||
int64_t sb1, int64_t sb2, int64_t sb3,
|
||||
uint3 neqk1_magic, uint3 rq3_magic) {
|
||||
// CTA: blockIdx.x=head, blockIdx.y=seq, blockIdx.z=chunk.
|
||||
// Shared: Kc[C*S_v], A[C*C], csh[C], gam[C], bet[C], KKsh[C*C].
|
||||
// Compute Kc, prefix csh/gam, KK, A, then exact f32 inverse into ai.
|
||||
}
|
||||
```
|
||||
|
||||
The inverse algorithm must match the existing M5 f32 inverse:
|
||||
|
||||
```cpp
|
||||
if (j < C) {
|
||||
if (j < Cc) {
|
||||
float x[C];
|
||||
for (int r = 0; r < C; r++) x[r] = 0.0f;
|
||||
x[j] = 1.0f;
|
||||
for (int r = j + 1; r < Cc; r++) {
|
||||
float acc = 0.0f;
|
||||
for (int m = j; m < r; m++) acc += A[r * C + m] * x[m];
|
||||
x[r] = -acc;
|
||||
}
|
||||
for (int r = 0; r < C; r++) ai[ai_base + r * C + j] = x[r];
|
||||
} else {
|
||||
for (int r = 0; r < C; r++) ai[ai_base + r * C + j] = 0.0f;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Use fixed stride `C` in scratch, zeroing out-of-range tail rows/columns.
|
||||
|
||||
- [ ] **Step 2: Add launcher**
|
||||
|
||||
Add:
|
||||
|
||||
```cpp
|
||||
template <int S_v, int C>
|
||||
static void launch_gdn_ai32(..., float * ai32_d, int64_t ai32_chunks, cudaStream_t stream)
|
||||
```
|
||||
|
||||
Launch grid:
|
||||
|
||||
```cpp
|
||||
dim3 grid_dims(H, n_seqs, ai32_chunks);
|
||||
dim3 block_dims(S_v, 1, 1);
|
||||
```
|
||||
|
||||
Dynamic smem:
|
||||
|
||||
```cpp
|
||||
((size_t) C * S_v + (size_t) C * C + (size_t) 3 * C + (size_t) C * C) * sizeof(float)
|
||||
```
|
||||
|
||||
## Task 4: Add Ai-Consuming C32 Slab Kernel
|
||||
|
||||
**Files:**
|
||||
- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
|
||||
|
||||
- [ ] **Step 1: Add `gated_delta_net_chunked_ai32_cuda`**
|
||||
|
||||
Add a separate kernel rather than overloading the shipped M5 body:
|
||||
|
||||
```cpp
|
||||
template <int S_v, int C, int DV_TILE>
|
||||
__global__ void gated_delta_net_chunked_ai32_cuda(
|
||||
const float * __restrict__ q,
|
||||
const float * __restrict__ k,
|
||||
const float * __restrict__ v,
|
||||
const float * __restrict__ g,
|
||||
const float * __restrict__ beta,
|
||||
const float * __restrict__ curr_state,
|
||||
float * __restrict__ dst,
|
||||
const float * __restrict__ ai,
|
||||
int64_t H, int64_t n_tokens, int64_t n_seqs,
|
||||
int64_t sq1, int64_t sq2, int64_t sq3,
|
||||
int64_t sv1, int64_t sv2, int64_t sv3,
|
||||
int64_t sb1, int64_t sb2, int64_t sb3,
|
||||
uint3 neqk1_magic, uint3 rq3_magic,
|
||||
float scale, float * __restrict__ state_dst,
|
||||
const int32_t * __restrict__ ids, int rs_head) {
|
||||
// CTA: blockIdx.x=head, blockIdx.y=seq, blockIdx.z=value slab.
|
||||
// C=32, DV_TILE=64.
|
||||
// Load the full source state stride S_v*S_v but own only columns [slab*DV_TILE, +DV_TILE).
|
||||
// For every chunk, load Kc/Qc/csh/gam/bet, build RHS, load Ai, apply U = Ai*RHS,
|
||||
// build P from QK, compute O, update owned state columns, write owned state columns.
|
||||
}
|
||||
```
|
||||
|
||||
Use the Phase 10 tail-row fix:
|
||||
|
||||
```cpp
|
||||
Ud[j * C + t] = (t < Cc) ? staged_value : 0.0f;
|
||||
```
|
||||
|
||||
and use full state stride for reads/writes:
|
||||
|
||||
```cpp
|
||||
(int64_t) seq * H * S_v * S_v + (int64_t) h_idx * S_v * S_v
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Add launcher**
|
||||
|
||||
Add:
|
||||
|
||||
```cpp
|
||||
template <int S_v, int C, int DV_TILE>
|
||||
static void launch_gdn_chunked_ai32(..., const float * ai32_d, int64_t ai32_chunks, ...)
|
||||
```
|
||||
|
||||
Launch grid:
|
||||
|
||||
```cpp
|
||||
dim3 grid_dims(H, n_seqs, S_v / DV_TILE);
|
||||
dim3 block_dims(DV_TILE, 1, 1);
|
||||
```
|
||||
|
||||
The smem formula must stay under the C32 slab Phase 10 budget:
|
||||
|
||||
```cpp
|
||||
((size_t) S_v * DV_TILE + (size_t) 2 * C * S_v + (size_t) DV_TILE * C
|
||||
+ (size_t) C * C + (size_t) 3 * C + (size_t) C * C
|
||||
+ (size_t) DV_TILE * C) * sizeof(float)
|
||||
```
|
||||
|
||||
## Task 5: Route Candidate
|
||||
|
||||
**Files:**
|
||||
- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
|
||||
|
||||
- [ ] **Step 1: Add route in `launch_gated_delta_net`**
|
||||
|
||||
Before the existing `GDN_CHUNKED_LAUNCH` switch:
|
||||
|
||||
```cpp
|
||||
if (ai32_d != nullptr && ai32_chunks > 0 && S_v == 128 && n_tokens >= gdn_chunk_min) {
|
||||
launch_gdn_ai32<128, 32>(...);
|
||||
launch_gdn_chunked_ai32<128, 32, 64>(...);
|
||||
return;
|
||||
}
|
||||
```
|
||||
|
||||
The route must require `!KDA && !keep_rs_t` via the existing template branch and
|
||||
must not trigger for decode-sized calls.
|
||||
|
||||
- [ ] **Step 2: Keep default path unchanged**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
git diff -- ggml/src/ggml-cuda/gated_delta_net.cu
|
||||
```
|
||||
|
||||
Check that default `GDN_TC=5` still launches `launch_gdn_chunked<128, 16, 4>`.
|
||||
|
||||
## Task 6: Build and Correctness Gates
|
||||
|
||||
**Files:**
|
||||
- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/gates/`
|
||||
|
||||
- [ ] **Step 1: Mirror and build**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
rsync -a /home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu \
|
||||
dgx.casa:/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu
|
||||
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target test-backend-ops llama-completion llama-batched-bench -j 8'
|
||||
```
|
||||
|
||||
Expected: build exits 0.
|
||||
|
||||
- [ ] **Step 2: Run op gates**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
|
||||
ART=$HOME/bench/phase13_gdn_global_ai32/gates
|
||||
./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_default.txt" 2>&1
|
||||
GDN_GLOBAL_AI32=1 GDN_TC=5 GDN_CHUNK_MIN=2 ./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_global_ai32.txt" 2>&1'
|
||||
```
|
||||
|
||||
Expected: both logs show CUDA0 OK for all cases.
|
||||
|
||||
- [ ] **Step 3: Run canonical md5 gates**
|
||||
|
||||
Run default and candidate MoE/dense completion gates. Expected:
|
||||
|
||||
```text
|
||||
MoE 8cb0ce23777bf55f92f63d0292c756b0
|
||||
Dense 5951a5b4d624ce891e22ab5fca9bc439
|
||||
```
|
||||
|
||||
If candidate md5 differs, run the KL gate before benchmarking.
|
||||
|
||||
## Task 7: Performance A/B
|
||||
|
||||
**Files:**
|
||||
- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
|
||||
|
||||
- [ ] **Step 1: Run same-session A/B**
|
||||
|
||||
Run MoE and dense:
|
||||
|
||||
```bash
|
||||
LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1"
|
||||
LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_GLOBAL_AI32=1 GGML_NO_BACKTRACE=1"
|
||||
```
|
||||
|
||||
Use:
|
||||
|
||||
```bash
|
||||
./llama-batched-bench -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32
|
||||
```
|
||||
|
||||
Expected: candidate improves S_PP without dense regression.
|
||||
|
||||
- [ ] **Step 2: Decide**
|
||||
|
||||
Accept only if:
|
||||
|
||||
- op gate passes,
|
||||
- md5 is canonical or KL-benign,
|
||||
- MoE S_PP improves,
|
||||
- dense S_PP does not regress outside noise.
|
||||
|
||||
Reject if flat or slower.
|
||||
|
||||
## Task 8: Mirror or Reject
|
||||
|
||||
**Files:**
|
||||
- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch`
|
||||
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
|
||||
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
|
||||
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md`
|
||||
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
|
||||
|
||||
- [ ] **Step 1: If accepted, commit fork patch and generate LocalAI patch**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
git -C /home/mudler/_git/llama.cpp add ggml/src/ggml-cuda/gated_delta_net.cu
|
||||
git -C /home/mudler/_git/llama.cpp commit -m "feat(cuda): add GDN global Ai32 prefill prototype"
|
||||
git -C /home/mudler/_git/llama.cpp format-patch -1 HEAD --stdout \
|
||||
> backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-GDN-global-Ai32-prefill-prototype.patch
|
||||
```
|
||||
|
||||
- [ ] **Step 2: If rejected, save diff and restore**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
git -C /home/mudler/_git/llama.cpp diff -- ggml/src/ggml-cuda/gated_delta_net.cu \
|
||||
> /home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff
|
||||
git -C /home/mudler/_git/llama.cpp checkout -- ggml/src/ggml-cuda/gated_delta_net.cu
|
||||
ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git checkout -- ggml/src/ggml-cuda/gated_delta_net.cu'
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Commit LocalAI docs**
|
||||
|
||||
Commit accepted patch/docs or rejected docs with:
|
||||
|
||||
```bash
|
||||
git commit -m "docs(paged): record GDN global Ai32 result" \
|
||||
-m "Assisted-by: Codex:gpt-5"
|
||||
```
|
||||
@@ -0,0 +1,332 @@
|
||||
# GDN Shared-A/Ai Cost Model Phase 12 Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Decide whether a shared-A/Ai C32 GDN design is worth implementing on GB10 before touching llama.cpp source.
|
||||
|
||||
**Architecture:** Phase 12 is analysis-first and docs-only unless the cost model proves a credible win. It extracts model dimensions, computes dynamic-smem and global-scratch pressure, estimates traffic saved versus traffic added, and writes a go/no-go decision for a possible Phase 13 global-scratch prototype.
|
||||
|
||||
**Tech Stack:** llama.cpp CUDA GDN kernel geometry, vLLM/FLA chunked GDN references, DGX GB10 benchmark artifacts, LocalAI parity docs.
|
||||
|
||||
---
|
||||
|
||||
## Guardrails
|
||||
|
||||
- Do not edit llama.cpp source in this phase.
|
||||
- Do not generate a LocalAI patch file in this phase.
|
||||
- Treat Phase 10 and Phase 11 as rejected; do not reopen C32 slab or QS-early.
|
||||
- Use actual model metadata where available; if a dimension is inferred, mark it
|
||||
as inferred.
|
||||
- The output is a go/no-go decision, not an implementation patch.
|
||||
|
||||
## Task 1: Gather Current Evidence
|
||||
|
||||
**Files:**
|
||||
- Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
|
||||
- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py`
|
||||
- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py`
|
||||
- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py`
|
||||
- Read: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
|
||||
- Artifact: `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/`
|
||||
|
||||
- [x] **Step 1: Check tree state**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
git -C /home/mudler/_git/llama.cpp status --short
|
||||
git -C /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention status --short
|
||||
```
|
||||
|
||||
Expected:
|
||||
|
||||
- llama.cpp fork is clean.
|
||||
- LocalAI worktree only has this Phase 12 docs work and untracked `.claude/`.
|
||||
|
||||
- [x] **Step 2: Create artifact directory**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
ssh dgx.casa 'mkdir -p /home/mudler/bench/phase12_gdn_shared_ai_cost_model'
|
||||
```
|
||||
|
||||
Expected: command exits 0.
|
||||
|
||||
- [x] **Step 3: Record reference function map**
|
||||
|
||||
Record these llama.cpp insertion points in the result doc:
|
||||
|
||||
```text
|
||||
/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu
|
||||
gated_delta_net_chunked_cuda
|
||||
launch_gdn_chunked
|
||||
launch_gated_delta_net
|
||||
ggml_cuda_op_gated_delta_net
|
||||
```
|
||||
|
||||
Record these vLLM reference functions:
|
||||
|
||||
```text
|
||||
/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py
|
||||
chunk_gated_delta_rule_fwd
|
||||
/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py
|
||||
solve_tril
|
||||
solve_tril_16x16_kernel
|
||||
merge_16x16_to_32x32_inverse_kernel
|
||||
merge_16x16_to_64x64_inverse_kernel
|
||||
/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py
|
||||
recompute_w_u_fwd
|
||||
```
|
||||
|
||||
Result: recorded in
|
||||
`backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`.
|
||||
|
||||
## Task 2: Extract Model Dimensions
|
||||
|
||||
**Files:**
|
||||
- Artifact: `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`
|
||||
|
||||
- [x] **Step 1: Extract GGUF metadata**
|
||||
|
||||
Run on DGX:
|
||||
|
||||
```bash
|
||||
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin
|
||||
{
|
||||
echo "=== MoE ==="
|
||||
./llama-show-info -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf 2>/dev/null || ./llama-cli --show-info -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -n 0 2>/dev/null || true
|
||||
echo "=== Dense ==="
|
||||
./llama-show-info -m /home/mudler/bench/q36-27b-nvfp4.gguf 2>/dev/null || ./llama-cli --show-info -m /home/mudler/bench/q36-27b-nvfp4.gguf -n 0 2>/dev/null || true
|
||||
} > /home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt'
|
||||
```
|
||||
|
||||
Expected: metadata file contains head count, layer count, and head dimension
|
||||
or enough tensor metadata to infer them.
|
||||
|
||||
Result:
|
||||
|
||||
- Metadata artifact:
|
||||
`/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`.
|
||||
- `llama-show-info` was not present in the DGX build, so a minimal read-only
|
||||
GGUF metadata parser was used.
|
||||
|
||||
- [x] **Step 2: Summarize GDN dimensions**
|
||||
|
||||
Write a short table in the result doc:
|
||||
|
||||
```text
|
||||
Model | GDN layers | H | S_v | benchmark npl | npp | chunks at BT=32 | chunks at BT=64
|
||||
```
|
||||
|
||||
Use benchmark shapes:
|
||||
|
||||
- `npl=32`
|
||||
- `npp=512,2048`
|
||||
- `S_v=128`
|
||||
|
||||
If H cannot be read directly from metadata, infer it from source/model docs and
|
||||
mark the row as inferred.
|
||||
|
||||
Result:
|
||||
|
||||
| Model | GDN layers | H | S_v | benchmark npl | npp | chunks at BT=32 | chunks at BT=64 |
|
||||
|-------|------------|---|-----|---------------|-----|-----------------|-----------------|
|
||||
| MoE | 30 inferred | 32 inferred | 128 | 32 | 512 | 16 | 8 |
|
||||
| MoE | 30 inferred | 32 inferred | 128 | 32 | 2048 | 64 | 32 |
|
||||
| Dense | 48 inferred | 48 inferred | 128 | 32 | 512 | 16 | 8 |
|
||||
| Dense | 48 inferred | 48 inferred | 128 | 32 | 2048 | 64 | 32 |
|
||||
|
||||
`H = ssm.inner_size / ssm.state_size`.
|
||||
|
||||
## Task 3: Compute Smem and Scratch Costs
|
||||
|
||||
**Files:**
|
||||
- Create: `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
|
||||
|
||||
- [x] **Step 1: Record dynamic-smem formulas**
|
||||
|
||||
Use:
|
||||
|
||||
```text
|
||||
C16 full-width current M5:
|
||||
floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
|
||||
|
||||
C32 full-width:
|
||||
floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
|
||||
|
||||
C32 slab64 with U staging:
|
||||
floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C
|
||||
```
|
||||
|
||||
Expected values for `S_v=128`:
|
||||
|
||||
```text
|
||||
C16 full-width: 93,376 B / 91.19 KiB
|
||||
C32 full-width: 127,360 B / 124.38 KiB
|
||||
C32 slab64: 94,592 B / 92.38 KiB
|
||||
```
|
||||
|
||||
- [x] **Step 2: Record Ai scratch formulas**
|
||||
|
||||
Use:
|
||||
|
||||
```text
|
||||
Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)
|
||||
```
|
||||
|
||||
Compute for:
|
||||
|
||||
- `BT=32`, f32 and f16/bf16 Ai.
|
||||
- `BT=64`, f32 and f16/bf16 Ai.
|
||||
- `npp=512` and `npp=2048`.
|
||||
|
||||
- [x] **Step 3: Estimate extra global traffic**
|
||||
|
||||
For a two-slab C32 design, estimate:
|
||||
|
||||
```text
|
||||
Ai write once = npl * H * nchunks * BT * BT * sizeof(Ai)
|
||||
Ai read per slab = 2 * Ai write once
|
||||
total Ai traffic = 3 * Ai write once
|
||||
```
|
||||
|
||||
Record the estimate in MiB for every benchmark shape.
|
||||
|
||||
- [x] **Step 4: Estimate work saved**
|
||||
|
||||
Record that shared Ai saves duplicated A/T construction per second slab:
|
||||
|
||||
```text
|
||||
saved per chunk/head = one KK/QK-derived A/T solve/apply setup currently duplicated by C32 slab
|
||||
not saved = KS, QS, U, P*U, state update, state traffic
|
||||
```
|
||||
|
||||
Do not claim a speedup from this estimate alone. The result doc must say whether
|
||||
the saved work is large enough to justify the scratch traffic and kernel
|
||||
boundary risk.
|
||||
|
||||
Result: recorded in
|
||||
`backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`.
|
||||
The f32 `BT=32` scratch path costs 256 MiB (MoE) and 384 MiB (dense) at
|
||||
`npp=2048,npl=32`, with 768 MiB and 1.125 GiB of Ai traffic respectively.
|
||||
|
||||
## Task 4: Go/No-Go Decision
|
||||
|
||||
**Files:**
|
||||
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`
|
||||
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
|
||||
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
|
||||
|
||||
- [x] **Step 1: Write the decision**
|
||||
|
||||
Use one of these exact decisions:
|
||||
|
||||
```text
|
||||
GO: Phase 13 should implement a default-off global-Ai scratch prototype.
|
||||
```
|
||||
|
||||
or:
|
||||
|
||||
```text
|
||||
NO-GO: shared-A/Ai scratch is not credible on GB10; stop GDN kernel work here.
|
||||
```
|
||||
|
||||
The decision must cite the scratch size and Ai traffic estimates.
|
||||
|
||||
Decision:
|
||||
|
||||
```text
|
||||
GO: Phase 13 should implement a default-off global-Ai scratch prototype.
|
||||
```
|
||||
|
||||
Rationale: the scratch/traffic cost is high enough to require strict gates, but
|
||||
not high enough to reject without a default-off prototype.
|
||||
|
||||
- [x] **Step 2: If GO, write Phase 13 scope**
|
||||
|
||||
If GO, create:
|
||||
|
||||
```text
|
||||
docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md
|
||||
docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md
|
||||
```
|
||||
|
||||
The Phase 13 plan must include:
|
||||
|
||||
- default-off env selector,
|
||||
- scratch allocation strategy,
|
||||
- op gate,
|
||||
- canonical MoE/dense md5 gates,
|
||||
- same-session A/B,
|
||||
- rejection path.
|
||||
|
||||
Result:
|
||||
|
||||
- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`.
|
||||
- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`.
|
||||
|
||||
- [x] **Step 3: If NO-GO, update final records**
|
||||
|
||||
If NO-GO, update:
|
||||
|
||||
- `VLLM_PARITY_FINAL.md`
|
||||
- `PARITY_HANDOFF.md`
|
||||
|
||||
Record that GDN kernel work on GB10 is exhausted by evidence, not assumption.
|
||||
|
||||
Result: not applicable because Phase 12 is GO. The final/handoff records are
|
||||
not changed to close GDN work.
|
||||
|
||||
## Task 5: Verification and Commit
|
||||
|
||||
**Files:**
|
||||
- Modify/create the files from Task 4.
|
||||
|
||||
- [x] **Step 1: Verify docs**
|
||||
|
||||
Run:
|
||||
|
||||
```bash
|
||||
git diff --check
|
||||
git status --short
|
||||
```
|
||||
|
||||
Expected:
|
||||
|
||||
- no whitespace errors,
|
||||
- only intended docs are modified plus untracked `.claude/`.
|
||||
|
||||
Result:
|
||||
|
||||
- `git diff --check` exited 0.
|
||||
- `/home/mudler/_git/llama.cpp` was clean.
|
||||
- DGX metadata artifact existed and contained MoE/dense GGUF metadata.
|
||||
|
||||
- [ ] **Step 2: Commit docs**
|
||||
|
||||
For GO:
|
||||
|
||||
```bash
|
||||
git add backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md \
|
||||
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
|
||||
backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
|
||||
git add -f docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md \
|
||||
docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md \
|
||||
docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md
|
||||
git commit -m "docs(paged): scope GDN shared-Ai prototype" \
|
||||
-m "Assisted-by: Codex:gpt-5"
|
||||
```
|
||||
|
||||
For NO-GO:
|
||||
|
||||
```bash
|
||||
git add backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md \
|
||||
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
|
||||
backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
|
||||
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md \
|
||||
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
|
||||
git add -f docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md
|
||||
git commit -m "docs(paged): close GDN shared-Ai cost model" \
|
||||
-m "Assisted-by: Codex:gpt-5"
|
||||
```
|
||||
@@ -0,0 +1,97 @@
|
||||
# GDN Global-Ai Prototype Design
|
||||
|
||||
## Goal
|
||||
|
||||
Prototype the only remaining plausible C32 GDN prefill path on GB10: compute
|
||||
the per-chunk triangular inverse once into global f32 Ai scratch, then reuse it
|
||||
from two `dv_tile=64` value-slab CTAs.
|
||||
|
||||
## Scope
|
||||
|
||||
The prototype is default-off and intentionally narrow:
|
||||
|
||||
- `S_v=128`
|
||||
- `BT=32`
|
||||
- f32 Ai scratch
|
||||
- two `dv_tile=64` value slabs
|
||||
- non-KDA, final-state-only path matching the existing chunked M5 conditions
|
||||
- no decode routing; `GDN_CHUNK_MIN` remains greater than 1
|
||||
|
||||
## Architecture
|
||||
|
||||
The prototype splits current M5 work into two CUDA stages:
|
||||
|
||||
1. `gdn_ai32_cuda`: one CTA per `(sequence, head, chunk)` computes the C32
|
||||
chunk-local triangular inverse `Ai = A^-1` and writes `[BT, BT]` f32 scratch.
|
||||
2. `gdn_chunked_ai32_cuda`: one CTA per `(sequence, head, value slab)` loads Ai
|
||||
for each chunk and performs the value-dependent work for its 64 output
|
||||
columns.
|
||||
|
||||
This mirrors the portable scheduling idea from vLLM/FLA without importing
|
||||
CuteDSL, TMA, or BF16 storage. It directly tests whether sharing A/Ai across
|
||||
slabs can beat the duplicated work that rejected Phase 10.
|
||||
|
||||
## Scratch
|
||||
|
||||
Ai scratch is sized:
|
||||
|
||||
```text
|
||||
n_seqs * H * ceil(n_tokens / 32) * 32 * 32 * sizeof(float)
|
||||
```
|
||||
|
||||
At `npp=2048,npl=32`, this is:
|
||||
|
||||
- MoE H=32: 256 MiB.
|
||||
- Dense H=48: 384 MiB.
|
||||
|
||||
Scratch allocation must use the existing ggml CUDA pool, be scoped to the op,
|
||||
and be default-off behind an explicit env selector.
|
||||
|
||||
## Selector
|
||||
|
||||
Use:
|
||||
|
||||
```text
|
||||
GDN_GLOBAL_AI32=1
|
||||
```
|
||||
|
||||
The default path remains current C16 M5. The candidate only engages when:
|
||||
|
||||
- `S_v == 128`
|
||||
- `n_tokens >= GDN_CHUNK_MIN`
|
||||
- `!KDA && !keep_rs_t`
|
||||
- `GDN_GLOBAL_AI32=1`
|
||||
|
||||
## Correctness
|
||||
|
||||
The first implementation uses f32 Ai to maximize chances of md5 stability. It
|
||||
must pass:
|
||||
|
||||
- `test-backend-ops -b CUDA0 -o GATED_DELTA_NET`
|
||||
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
|
||||
- Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
|
||||
|
||||
If md5 changes, the prototype must stop for KL before any performance claim.
|
||||
|
||||
## Performance
|
||||
|
||||
Compare same-session against current M5:
|
||||
|
||||
```text
|
||||
LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64
|
||||
```
|
||||
|
||||
versus:
|
||||
|
||||
```text
|
||||
LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_GLOBAL_AI32=1
|
||||
```
|
||||
|
||||
Run MoE and dense at `npp=512,2048`, `ntg=4`, `npl=32`.
|
||||
|
||||
## Decision Rule
|
||||
|
||||
Accept only if the prototype is correctness-safe and improves end-to-end S_PP.
|
||||
Reject if it is flat or slower. If rejected, save the diff under
|
||||
`/home/mudler/bench/phase13_gdn_global_ai32/rejected/` and do not add a LocalAI
|
||||
patch.
|
||||
@@ -0,0 +1,108 @@
|
||||
# GDN Shared-A/Ai Cost Model Design
|
||||
|
||||
## Context
|
||||
|
||||
The last two GDN experiments closed the low-conflict shortcut space:
|
||||
|
||||
- Phase 10 C32 slab M5 was md5-clean after tail-row zeroing but slower because
|
||||
each value slab recomputed the per-chunk triangular work.
|
||||
- Phase 11 QS-early M5 was md5-clean but still slower because moving `QS` did
|
||||
not remove a tensor-core pass.
|
||||
|
||||
The remaining algorithmic gap to vLLM/FLA is not another local reorder. vLLM
|
||||
builds the per-chunk triangular object once, solves/inverts it once, and reuses
|
||||
that result across the WY transform. llama.cpp's current C=16 M5 already
|
||||
computes A/T once for the full value width inside one CTA. A wider chunk only
|
||||
fits on GB10 if value columns are split into slabs, and slabs lose unless A/T
|
||||
is shared across them.
|
||||
|
||||
## Current Geometry
|
||||
|
||||
For `S_v = 128` and f32 state:
|
||||
|
||||
| Shape | Dynamic smem |
|
||||
|-------|--------------|
|
||||
| C16 full value width | 93,376 B / 91.19 KiB |
|
||||
| C32 full value width | 127,360 B / 124.38 KiB |
|
||||
| C32 with `dv_tile=64` plus U staging | 94,592 B / 92.38 KiB |
|
||||
|
||||
GB10's available dynamic smem leaves enough room for C16 full-width and C32
|
||||
half-width, but not for C32 full-width. That makes a shared-A/Ai design the only
|
||||
plausible C32 path.
|
||||
|
||||
## Candidate Approaches
|
||||
|
||||
### A. Global A/Ai Scratch Precompute
|
||||
|
||||
Add a first kernel that computes `A` and `Ai` once per `(sequence, head, chunk)`
|
||||
and materializes `Ai` in global scratch. A second kernel consumes `Ai` across
|
||||
value slabs.
|
||||
|
||||
Pros:
|
||||
|
||||
- Directly targets the Phase 10 failure mode.
|
||||
- Mirrors the portable part of vLLM/FLA's schedule.
|
||||
- Keeps each value-slab CTA within the GB10 smem limit.
|
||||
|
||||
Cons:
|
||||
|
||||
- Adds at least one extra kernel boundary.
|
||||
- Requires scratch allocation and lifetime management in ggml CUDA.
|
||||
- Scratch is large at real batch sizes. At `npl=32`, `BT=32`, f32 Ai costs:
|
||||
- H=40, T=2048: 320 MiB.
|
||||
- H=48, T=2048: 384 MiB.
|
||||
- H=64, T=2048: 512 MiB.
|
||||
- Needs careful profiling because global scratch traffic can erase the saved
|
||||
triangular recomputation.
|
||||
|
||||
### B. Shared A/Ai Inside One CTA With Reduced State Residency
|
||||
|
||||
Keep C32 in one CTA by moving some state or value scratch out of shared memory.
|
||||
|
||||
Pros:
|
||||
|
||||
- Avoids global Ai scratch and cross-kernel synchronization.
|
||||
- Could keep the current single-kernel structure.
|
||||
|
||||
Cons:
|
||||
|
||||
- The f32 state alone is 64 KiB. Removing enough shared memory for C32 full
|
||||
width likely means reading state from global during MMA tiles or reducing
|
||||
state residency, which attacks the current M5 strength.
|
||||
- Higher risk of lowering achieved bandwidth and breaking md5 via new ordering.
|
||||
|
||||
### C. Stay C16 and Stop GDN Kernel Work on GB10
|
||||
|
||||
Accept C16 M5 as the local GB10 ceiling and redirect parity work to another
|
||||
bucket or different hardware.
|
||||
|
||||
Pros:
|
||||
|
||||
- Avoids high-risk scratch and synchronization work.
|
||||
- Matches Phase 10/11 evidence that shortcuts are now exhausted.
|
||||
|
||||
Cons:
|
||||
|
||||
- Leaves the GDN prefill gap open.
|
||||
- Does not move toward vLLM prefill parity on GB10.
|
||||
|
||||
## Recommended Phase 12
|
||||
|
||||
Run a cost-model and dry-design phase before any source patch. The phase should
|
||||
produce a go/no-go decision for Approach A:
|
||||
|
||||
1. Extract actual GDN head counts and chunk counts for the MoE and dense GGUFs.
|
||||
2. Compute scratch sizes for `BT=32` and `BT=64` at the benchmark shapes.
|
||||
3. Estimate extra global traffic: Ai write + Ai read per value slab.
|
||||
4. Compare that traffic against the triangular recomputation saved by sharing
|
||||
A/Ai across slabs.
|
||||
5. Only if the model is plausible, write a Phase 13 implementation plan for a
|
||||
default-off global-scratch prototype.
|
||||
|
||||
## Decision Rule
|
||||
|
||||
Proceed to implementation only if the model shows a credible net win at
|
||||
`npp=2048, npl=32` without unreasonable memory growth. If the estimated scratch
|
||||
traffic or kernel-boundary overhead is close to the saved work, record a no-go
|
||||
and stop GDN kernel work on GB10 rather than adding a large patch that is likely
|
||||
to be rejected.
|
||||
Reference in New Issue
Block a user