docs(paged): reject GDN C32 slab phase

Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 01:15:00 +00:00
parent ff3ad84191
commit 3da3b169fb
5 changed files with 170 additions and 15 deletions


@@ -874,6 +874,57 @@ Source check:
Decision:
- Do not ship a Phase 10 source patch yet.
- Keep the baseline and source check as the entry gate for the next C32 slab
implementation task.
- A default-off C32 slab candidate was implemented and rejected by the
performance gate.
- The candidate was correctness-clean only after fixing a tail-chunk staging
bug: rows `t >= Cc` in the staged `U=T*RHS` copy-back must be zeroed before
state/output math. Before that fix, the dense gate produced a degenerate
transcript even though the focused op gate passed.
- After the tail fix, both default and forced-C32 modes matched the canonical
md5 gates exactly:
- MoE: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
- KL was not needed because md5 stayed stable after the tail fix.
Correctness artifacts:
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_default_after_tailfix.md5`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_default_after_tailfix.md5`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_c32_after_tailfix.md5`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_c32_after_tailfix.md5`
Performance A/B artifacts:
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt`
Performance A/B:
| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|-------|------|----|----|---|----------|----------|-------|
| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 |
| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 |
| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 |
| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 |
| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 |
| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 |
| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 |
| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 |
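The size of the regression can be read off the table by computing the relative S_PP change per row pair. The following is a minimal sketch, not part of the benchmark tooling: the `RESULTS` dictionary and the `pct_change` helper are hypothetical names, and the numbers are copied from the table above.

```python
# Hypothetical helper: relative S_PP change of the C32 slab candidate vs. M5 base.
# Values are (base S_PP, candidate S_PP) in tokens/s, copied from the A/B table.
RESULTS = {
    ("MoE", 512): (2323.48, 2069.12),
    ("MoE", 2048): (2430.32, 2054.86),
    ("Dense", 512): (975.10, 866.29),
    ("Dense", 2048): (1019.25, 903.73),
}


def pct_change(base: float, candidate: float) -> float:
    """Return the candidate's change relative to base, in percent."""
    return (candidate - base) / base * 100.0


for (model, pp), (base, cand) in RESULTS.items():
    print(f"{model} PP={pp}: {pct_change(base, cand):+.2f}% S_PP")
```

Every pair comes out negative, on the order of 11 to 15 percent, which is well outside the roughly 1% noise band quoted for earlier A/B arms.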
Rejected diff:
- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`
Conclusion:
- Do not ship Phase 10 C32 slab as implemented.
- C32 slab is not a viable shortcut toward parity: the likely cause is that
duplicated `A/T` recomputation per value slab outweighs the intended
state-traffic reduction.
- A future GDN prefill attempt should either share the `A/T` work across value
slabs or switch to a different FLA-style chunk design; it should not repeat
this env-gated two-slab M5 variant.


@@ -173,6 +173,7 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
| bf16-C16 | bf16 Gram at C=16 | REJECTED | no win; bf16 mantissa unsafe on state-coupled products |
| BV block-occupancy A/B (tf32) | raise blocks/SM | REJECTED (occupancy NOT the bound) | 1844 vs 1814 S_PP (-1.04%, within noise) |
| bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation.
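The shared-memory arithmetic behind that constraint can be sketched as follows. This is an illustration under stated assumptions: the cap is taken as 99 * 1024 bytes, the state tile is modelled as `s_k x dv_tile` f32 values, and `state_bytes` is a hypothetical helper. It does not account for the other buffers the kernel keeps in shared memory.

```python
# Assumption: 99 KB dynamic-smem cap expressed as 99 * 1024 bytes.
SMEM_CAP_BYTES = 99 * 1024
F32_BYTES = 4


def state_bytes(s_k: int, dv_tile: int) -> int:
    """Bytes needed for an s_k x dv_tile f32 state tile (hypothetical model)."""
    return s_k * dv_tile * F32_BYTES


full_state = state_bytes(128, 128)  # the 128x128 f32 state from the text
slab_state = state_bytes(128, 64)   # one dv_tile=64 value slab

print(f"full state: {full_state // 1024} KB, "
      f"left for chunk buffers: {(SMEM_CAP_BYTES - full_state) // 1024} KB")
print(f"slab state: {slab_state // 1024} KB, "
      f"left for chunk buffers: {(SMEM_CAP_BYTES - slab_state) // 1024} KB")
```

Under this model, halving the value tile frees about 32 KB for chunk-sized buffers, which is the headroom a larger `C` would need.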


@@ -172,6 +172,7 @@ products through tensor cores. The series chased that headroom.
| bf16-C16 | bf16 Gram at C=16 | rejected | no win over tf32-M5; bf16 mantissa unsafe on the state-coupled products | GDN build-plan s4 |
| BV block-occupancy A/B (tf32) | raise blocks/SM to test if occupancy is the bound | **REJECTED** (occupancy is NOT the bound; latency is wave-hidden) | two arms statistically equal: **1844 vs 1814 S_PP (-1.04%, within noise)** | GDNAB armA/armB |
| bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
**Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not


@@ -459,6 +459,40 @@ scope until a serving phase proves target-verification cost and rollback safety.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update
Phase 10 tested the tempting low-conflict shortcut for #101: keep the current
M5 tensor-core GDN form, raise the chunk to `C=32`, and split the value
dimension into two `dv_tile=64` slabs to stay within shared memory.
Result:
- The shortcut cannot be a launcher-only change. C32 requires staging
`U=T*RHS` because the existing M5 apply path relies on one 16-row tile being
held in registers before overwriting `Ud`.
- A default-off `GDN_C32_SLAB=1` candidate was built and md5-gated.
- The first candidate exposed a dense-only transcript failure on tail chunks;
root cause was copying uninitialized staged rows for `t >= Cc` back into
`Ud`. Zeroing those rows restored both canonical md5 gates:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`.
- Performance regressed after correctness was fixed:
MoE 2048 S_PP `2430.32 -> 2054.86`; dense 2048 S_PP `1019.25 -> 903.73`.
Decision:
- **REJECT** the two-slab C32 M5 variant.
- Do not add it to the LocalAI patch stack.
- The likely blocker is duplicated A/T recomputation per value slab; future GDN
work must share that work across slabs or move to a different FLA-style
chunked design rather than repeating this env-gated shortcut.
Artifacts:
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/`
- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`
---
# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)


@@ -121,7 +121,7 @@ Implication:
- Because the candidate changes the solve/apply mechanics, it requires a
focused `GATED_DELTA_NET` op gate before any prefill A/B.
- [ ] **Step 1: Add an explicit env selector**
- [x] **Step 1: Add an explicit env selector**
Use an env var such as:
@@ -131,7 +131,7 @@ GDN_C32_SLAB=1
The default path must stay current M5.
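The default-off behaviour can be modelled in a few lines. This is only a sketch of the selection logic: the actual selector lives in the llama.cpp fork and is not shown here, and the function names `c32_slab_enabled` and `select_gdn_path` are hypothetical.

```python
import os


def c32_slab_enabled(env=None) -> bool:
    """Return True only when GDN_C32_SLAB is explicitly set to "1"."""
    env = os.environ if env is None else env
    return env.get("GDN_C32_SLAB", "0") == "1"


def select_gdn_path(env=None) -> str:
    """Pick the candidate path only on explicit opt-in; otherwise stay on M5."""
    return "c32_slab" if c32_slab_enabled(env) else "m5"


print(select_gdn_path({}))                     # unset: current M5 path
print(select_gdn_path({"GDN_C32_SLAB": "1"}))  # explicit opt-in
```

Treating any value other than `"1"` as off keeps a stray or empty variable from silently enabling the experimental path.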
- [ ] **Step 2: Introduce a C=32, dv_tile=64 launch**
- [x] **Step 2: Introduce a C=32, dv_tile=64 launch**
Target shape:
@@ -147,7 +147,7 @@ Initial prototype rules:
- no decode routing,
- no D2H synchronization.
- [ ] **Step 3: Build on DGX**
- [x] **Step 3: Build on DGX**
Run:
@@ -157,12 +157,25 @@ ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build .
Expected: build succeeds.
Result:
- Candidate implemented in the llama.cpp fork as a default-off
`GDN_C32_SLAB=1` path.
- Kernel generalized to `DV_TILE=64`, with two value slabs for `S_v=128`.
- C32 `U=T*RHS` writes were staged through a slab-local `Utmp` buffer to avoid
read/write aliasing against the RHS in `Ud`.
- Initial md5 failed on dense because tail chunks copied uninitialized staged
rows back into `Ud`; the root-cause fix zeroed `t >= Cc` rows during the
staged copy-back.
- DGX build succeeded after the tail fix:
`cmake --build . --target test-backend-ops llama-completion -j 8`.
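The tail-row fix can be illustrated with a host-side model of the staged copy-back. This is a sketch, not the CUDA kernel: it assumes `Cc` is the number of valid rows in the current (possibly partial) chunk, uses a toy row width of 4 instead of the real slab width, and the function name `copy_back_staged` is hypothetical.

```python
def copy_back_staged(ud, utmp, cc):
    """Copy staged rows from utmp into ud, zeroing tail rows t >= cc.

    Rows at or beyond cc were never written by the solve, so they hold
    whatever was left in the staging buffer and must not reach ud.
    """
    for t in range(len(ud)):
        if t < cc:
            ud[t] = list(utmp[t])
        else:
            ud[t] = [0.0] * len(ud[t])
    return ud


C = 32      # chunk rows
WIDTH = 4   # toy row width for illustration
CC = 5      # valid rows in this tail chunk

# Valid rows carry data; tail rows model uninitialized staging memory as NaN.
utmp = [[float(t + 1)] * WIDTH if t < CC else [float("nan")] * WIDTH
        for t in range(C)]
ud = [[-1.0] * WIDTH for _ in range(C)]

copy_back_staged(ud, utmp, CC)
print(ud[CC - 1], ud[CC])
```

Without the `else` branch, the NaN rows would be copied into `ud` and feed the later state/output math, which matches the failure described above where only tail chunks were affected.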
## Task 3: Correctness Gates
**Files:**
- Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/gates/`
- [ ] **Step 1: Run `GATED_DELTA_NET` op gate**
- [x] **Step 1: Run `GATED_DELTA_NET` op gate**
Run default and forced C32 slab modes:
@@ -180,7 +193,7 @@ Required coverage to inspect in logs:
- permuted layout,
- adversarial decay.
- [ ] **Step 2: Run canonical md5 gates**
- [x] **Step 2: Run canonical md5 gates**
Run MoE and dense greedy gates with and without `GDN_C32_SLAB=1`.
@@ -191,18 +204,34 @@ MoE 8cb0ce23777bf55f92f63d0292c756b0
Dense 5951a5b4d624ce891e22ab5fca9bc439
```
- [ ] **Step 3: Run KL gate if md5 changes**
- [x] **Step 3: Run KL gate if md5 changes**
If the C32 slab path changes reduction order and therefore md5, stop and run the
existing KL procedure from `PAGED_BITEXACT_NOTE.md`. Keep the patch only if the
new path is KL-benign and no worse than current M5.
Result:
- Default op gate:
`/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt`
- Forced C32 op gate:
`/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt`
- Both `GATED_DELTA_NET` CUDA0 gates passed.
- Canonical default md5 after tail fix:
- MoE: `8cb0ce23777bf55f92f63d0292c756b0`
- Dense: `5951a5b4d624ce891e22ab5fca9bc439`
- Forced C32 md5 after tail fix:
- MoE: `8cb0ce23777bf55f92f63d0292c756b0`
- Dense: `5951a5b4d624ce891e22ab5fca9bc439`
- KL gate was not needed because the md5 gates matched the canonical outputs
exactly after the tail-row fix.
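The md5-then-KL decision used here can be sketched as follows. This is a hedged illustration rather than the project's gate script: `gate_passes` and `needs_kl_gate` are hypothetical names, and the transcripts would come from the greedy gate runs rather than the placeholder bytes shown.

```python
import hashlib

# Canonical hashes as recorded in this plan.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_hex(data: bytes) -> str:
    """Hex md5 digest of a transcript."""
    return hashlib.md5(data).hexdigest()


def gate_passes(model: str, transcript: bytes) -> bool:
    """True when the transcript hashes to the canonical value for the model."""
    return md5_hex(transcript) == CANONICAL_MD5[model]


def needs_kl_gate(results: dict) -> bool:
    """KL is only required when at least one md5 gate did not match."""
    return not all(results.values())


# Placeholder outcome mirroring the recorded result: both gates matched.
recorded = {"moe": True, "dense": True}
print("KL gate needed:", needs_kl_gate(recorded))
```

An exact md5 match is a stronger condition than a KL bound, so the KL procedure is only a fallback for candidates that legitimately change reduction order.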
## Task 4: Performance A/B
**Files:**
- Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/ab/`
- [ ] **Step 1: Run C32 slab prefill at `npp=512`**
- [x] **Step 1: Run C32 slab prefill at `npp=512`**
Compare:
@@ -213,30 +242,69 @@ candidate: GDN_TC=5 GDN_CHUNK_MIN=64 GDN_C32_SLAB=1
Pass: candidate beats current M5 S_PP outside noise.
- [ ] **Step 2: Run C32 slab prefill at `npp=2048`**
- [x] **Step 2: Run C32 slab prefill at `npp=2048`**
Use the same A/B. Pass requires a net S_PP improvement or a clear GDN bucket
reduction without a larger regression elsewhere.
- [ ] **Step 3: Reject if duplicated A/T work cancels the state-traffic win**
- [x] **Step 3: Reject if duplicated A/T work cancels the state-traffic win**
If the candidate only shifts time between A/T recomputation and state traffic
without a net win, save the diff as a rejected artifact and update this plan.
Result:
Artifacts:
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt`
| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|-------|------|----|----|---|----------|----------|-------|
| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 |
| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 |
| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 |
| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 |
| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 |
| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 |
| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 |
| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 |
Decision:
- Reject the C32 slab source patch.
- The candidate is correctness-clean after tail-row zeroing, but it regresses
S_PP in both model families.
- The likely mechanism is that recomputing `A/T` once per value slab cancels
the intended state-traffic win; optimizing this would require a broader
shared-work design rather than a small, low-conflict shortcut patch.
- Rejected diff saved at:
`/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`.
## Task 5: Mirror or Reject
**Files:**
- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- [ ] **Step 1: Commit accepted fork patch**
- [x] **Step 1: Commit accepted fork patch**
Commit only after correctness and performance gates pass.
- [ ] **Step 2: Generate LocalAI patch**
- [x] **Step 2: Generate LocalAI patch**
Use `git format-patch`; do not hand-edit the generated patch.
- [ ] **Step 3: Update docs**
- [x] **Step 3: Update docs**
Record exact artifacts, md5/KL results, and performance decision.
Result:
- No fork commit and no LocalAI patch were generated because Phase 10 was
rejected by the performance gate.
- The llama.cpp fork and DGX mirror were restored to the prior accepted state.
- This plan and the parity docs record the rejected source candidate so it is
not repeated as an accidental "obvious" follow-up.