From 3da3b169fb8eb88d3848c29c09609380ba153989 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 01:15:00 +0000
Subject: [PATCH] docs(paged): reject GDN C32 slab phase

Record the default-off C32 slab experiment, its md5 gates, the dense
tail-row fix, and the performance regression that rejects the source
patch.

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md       | 57 +++++++++++-
 .../docs/PARITY_HANDOFF.md                   |  1 +
 .../docs/VLLM_PARITY_FINAL.md                |  1 +
 .../docs/VLLM_PARITY_LEVER_MAP.md            | 34 +++++++
 .../plans/2026-07-01-gdn-c32-slab-phase10.md | 92 ++++++++++++++++---
 5 files changed, 170 insertions(+), 15 deletions(-)

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 83e6c7115..1d978c691 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -874,6 +874,57 @@ Source check:
 
 Decision:
 
-- Do not ship a Phase 10 source patch yet.
-- Keep the baseline and source check as the entry gate for the next C32 slab
-  implementation task.
+- A default-off C32 slab candidate was implemented and rejected by the
+  performance gate.
+- The candidate was correctness-clean only after fixing a tail-chunk staging
+  bug: rows `t >= Cc` in the staged `U=T*RHS` copy-back must be zeroed before
+  state/output math. Before that fix, the dense gate produced a degenerate
+  transcript even though the focused op gate passed.
+- After the tail fix, both default and forced-C32 modes matched the canonical
+  md5 gates exactly:
+  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
+- KL was not needed because md5 stayed stable after the tail fix.
+
+Correctness artifacts:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_default_after_tailfix.md5`
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_default_after_tailfix.md5`
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_c32_after_tailfix.md5`
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_c32_after_tailfix.md5`
+
+Performance A/B artifacts:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt`
+
+Performance A/B:
+
+| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
+|-------|------|----|----|---|----------|----------|-------|
+| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 |
+| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 |
+| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 |
+| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 |
+| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 |
+| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 |
+| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 |
+| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 |
+
+Rejected diff:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`
+
+Conclusion:
+
+- Do not ship Phase 10 C32 slab as implemented.
+- C32 slab is not a maintainable shortcut toward parity because duplicated
+  A/T recomputation per value slab outweighs the intended state-traffic
+  reduction.
+- A future GDN prefill attempt should either share the `A/T` work across value
+  slabs or switch to a different FLA-style chunk design; it should not repeat
+  this env-gated two-slab M5 variant.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 92c65028b..897f3bc46 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -173,6 +173,7 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
 | bf16-C16 | bf16 Gram at C=16 | REJECTED | no win; bf16 mantissa unsafe on state-coupled products |
 | BV block-occupancy A/B (tf32) | raise blocks/SM | REJECTED (occupancy NOT the bound) | 1844 vs 1814 S_PP (-1.04%, within noise) |
 | bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
+| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
 
 Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation.
 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
index b299e8183..438855fd8 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -172,6 +172,7 @@ products through tensor cores. The series chased that headroom.
 | bf16-C16 | bf16 Gram at C=16 | rejected | no win over tf32-M5; bf16 mantissa unsafe on the state-coupled products | GDN build-plan s4 |
 | BV block-occupancy A/B (tf32) | raise blocks/SM to test if occupancy is the bound | **REJECTED** (occupancy is NOT the bound; latency is wave-hidden) | two arms statistically equal: **1844 vs 1814 S_PP (-1.04%, within noise)** | GDNAB armA/armB |
 | bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
+| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
 
 **Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2) intra-chunk
 triangular solve + the serial inter-chunk recurrence dependency**, not
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 9c882ae03..fe6e96d2c 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -459,6 +459,40 @@ scope until a serving phase proves target-verification cost and rollback safety.
 
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
+### Phase 10 GDN C32 slab update
+
+Phase 10 tested the tempting low-conflict shortcut for #101: keep the current
+M5 tensor-core GDN form, raise the chunk to `C=32`, and split the value
+dimension into two `dv_tile=64` slabs to stay within shared memory.
+
+Result:
+
+- The shortcut cannot be a launcher-only change. C32 requires staging
+  `U=T*RHS` because the existing M5 apply path relies on one 16-row tile being
+  held in registers before overwriting `Ud`.
+- A default-off `GDN_C32_SLAB=1` candidate was built and md5-gated.
+- The first candidate exposed a dense-only transcript failure on tail chunks;
+  root cause was copying uninitialized staged rows for `t >= Cc` back into
+  `Ud`. Zeroing those rows restored both canonical md5 gates:
+  MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+  `5951a5b4d624ce891e22ab5fca9bc439`.
+- Performance regressed after correctness was fixed:
+  MoE 2048 S_PP `2430.32 -> 2054.86`; dense 2048 S_PP `1019.25 -> 903.73`.
+
+Decision:
+
+- **REJECT** the two-slab C32 M5 variant.
+- Do not add it to the LocalAI patch stack.
+- The likely blocker is duplicated A/T recomputation per value slab; future GDN
+  work must share that work across slabs or move to a different FLA-style
+  chunked design rather than repeating this env-gated shortcut.
+
+Artifacts:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/gates/`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/`
+- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`
+
 ---
 
 # PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)
diff --git a/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md b/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md
index 3c53d1202..7adf3c85e 100644
--- a/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md
+++ b/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md
@@ -121,7 +121,7 @@ Implication:
 - Because the candidate changes the solve/apply mechanics, it requires a focused
   `GATED_DELTA_NET` op gate before any prefill A/B.
 
-- [ ] **Step 1: Add an explicit env selector**
+- [x] **Step 1: Add an explicit env selector**
 
 Use an env var such as:
 
@@ -131,7 +131,7 @@ GDN_C32_SLAB=1
 
 The default path must stay current M5.
 
-- [ ] **Step 2: Introduce a C=32, dv_tile=64 launch**
+- [x] **Step 2: Introduce a C=32, dv_tile=64 launch**
 
 Target shape:
 
@@ -147,7 +147,7 @@ Initial prototype rules:
 - no decode routing,
 - no D2H synchronization.
 
-- [ ] **Step 3: Build on DGX**
+- [x] **Step 3: Build on DGX**
 
 Run:
 
@@ -157,12 +157,25 @@ ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build .
 
 Expected: build succeeds.
 
+Result:
+
+- Candidate implemented in the llama.cpp fork as a default-off
+  `GDN_C32_SLAB=1` path.
+- Kernel generalized to `DV_TILE=64`, with two value slabs for `S_v=128`.
+- C32 `U=T*RHS` writes were staged through a slab-local `Utmp` buffer to avoid
+  read/write aliasing against the RHS in `Ud`.
+- Initial md5 failed on dense because tail chunks copied uninitialized staged
+  rows back into `Ud`; the root-cause fix zeroed `t >= Cc` rows during the
+  staged copy-back.
+- DGX build succeeded after the tail fix:
+  `cmake --build . --target test-backend-ops llama-completion -j 8`.
+
 ## Task 3: Correctness Gates
 
 **Files:**
 - Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/gates/`
 
-- [ ] **Step 1: Run `GATED_DELTA_NET` op gate**
+- [x] **Step 1: Run `GATED_DELTA_NET` op gate**
 
 Run default and forced C32 slab modes:
 
@@ -180,7 +193,7 @@ Required coverage to inspect in logs:
 - permuted layout,
 - adversarial decay.
 
-- [ ] **Step 2: Run canonical md5 gates**
+- [x] **Step 2: Run canonical md5 gates**
 
 Run MoE and dense greedy gates with and without `GDN_C32_SLAB=1`.
 
@@ -191,18 +204,34 @@ MoE 8cb0ce23777bf55f92f63d0292c756b0
 Dense 5951a5b4d624ce891e22ab5fca9bc439
 ```
 
-- [ ] **Step 3: Run KL gate if md5 changes**
+- [x] **Step 3: Run KL gate if md5 changes**
 
 If the C32 slab path changes reduction order and therefore md5, stop and run the
 existing KL procedure from `PAGED_BITEXACT_NOTE.md`. Keep the patch only if the
 new path is KL-benign and no worse than current M5.
 
+Result:
+
+- Default op gate:
+  `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt`
+- Forced C32 op gate:
+  `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt`
+- Both `GATED_DELTA_NET` CUDA0 gates passed.
+- Canonical default md5 after tail fix:
+  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`
+- Forced C32 md5 after tail fix:
+  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`
+- KL gate was not needed because the md5 gates matched the canonical outputs
+  exactly after the tail-row fix.
+
 ## Task 4: Performance A/B
 
 **Files:**
 - Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/ab/`
 
-- [ ] **Step 1: Run C32 slab prefill at `npp=512`**
+- [x] **Step 1: Run C32 slab prefill at `npp=512`**
 
 Compare:
 
@@ -213,30 +242,69 @@ candidate: GDN_TC=5 GDN_CHUNK_MIN=64 GDN_C32_SLAB=1
 
 Pass: candidate beats current M5 S_PP outside noise.
 
-- [ ] **Step 2: Run C32 slab prefill at `npp=2048`**
+- [x] **Step 2: Run C32 slab prefill at `npp=2048`**
 
 Use the same A/B. Pass requires a net S_PP improvement or a clear GDN bucket
 reduction without a larger regression elsewhere.
 
-- [ ] **Step 3: Reject if duplicated A/T work cancels the state-traffic win**
+- [x] **Step 3: Reject if duplicated A/T work cancels the state-traffic win**
 
 If the candidate only shifts time between A/T recomputation and state traffic
 without a net win, save the diff as a rejected artifact and update this plan.
 
+Result:
+
+Artifacts:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt`
+
+| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
+|-------|------|----|----|---|----------|----------|-------|
+| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 |
+| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 |
+| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 |
+| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 |
+| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 |
+| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 |
+| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 |
+| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 |
+
+Decision:
+
+- Reject the C32 slab source patch.
+- The candidate is correctness-clean after tail-row zeroing, but it regresses
+  S_PP in both model families.
+- The likely mechanism is that recomputing `A/T` once per value slab cancels
+  the intended state-traffic win; optimizing this would require a broader
+  shared-work design rather than a small, low-conflict shortcut patch.
+- Rejected diff saved at:
+  `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`.
+
 ## Task 5: Mirror or Reject
 
 **Files:**
 - Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch`
 - Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
 
-- [ ] **Step 1: Commit accepted fork patch**
+- [x] **Step 1: Commit accepted fork patch**
 
 Commit only after correctness and performance gates pass.
 
-- [ ] **Step 2: Generate LocalAI patch**
+- [x] **Step 2: Generate LocalAI patch**
 
 Use `git format-patch`; do not hand-edit the generated patch.
 
-- [ ] **Step 3: Update docs**
+- [x] **Step 3: Update docs**
 
 Record exact artifacts, md5/KL results, and performance decision.
+
+Result:
+
+- No fork commit and no LocalAI patch were generated because Phase 10 was
+  rejected by the performance gate.
+- The llama.cpp fork and DGX mirror were restored to the prior accepted state.
+- This plan and the parity docs record the rejected source candidate so it is
+  not repeated as an accidental "obvious" follow-up.
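
Reviewer note (not part of the patch): the S_PP regression recorded in the A/B tables above can be restated as relative changes. The sketch below is only an illustration. The throughput figures are copied from the tables in this patch, while the names `S_PP`, `relative_change`, and `spp_deltas` are hypothetical helpers and are not part of the benchmark tooling.

```python
# S_PP throughput (tokens/s) copied from the Performance A/B table in this patch.
# Keys are (model, prompt length PP); values are (M5 base, C32 slab).
S_PP = {
    ("MoE", 512): (2323.48, 2069.12),
    ("MoE", 2048): (2430.32, 2054.86),
    ("Dense", 512): (975.10, 866.29),
    ("Dense", 2048): (1019.25, 903.73),
}


def relative_change(base: float, candidate: float) -> float:
    """Return the candidate's change relative to base, in percent."""
    return (candidate - base) / base * 100.0


def spp_deltas(table=S_PP):
    """Map each (model, PP) pair to the C32 slab S_PP change in percent."""
    return {key: relative_change(base, cand) for key, (base, cand) in table.items()}


if __name__ == "__main__":
    for (model, pp), delta in spp_deltas().items():
        print(f"{model:5s} PP={pp:4d}  S_PP change: {delta:+.2f}%")
```

Under these numbers every pair comes out negative, roughly 11% to 15% below the M5 base, which is consistent with the rejection recorded in the plan.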