From 5b8b33a302b1d1b9f20b8bcbb71c13e7c8bdf416 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Thu, 2 Jul 2026 21:04:37 +0000 Subject: [PATCH] docs(paged): record P5 FLA GDN NO-GO - GDN prefill bucket is a confirmed shared-hardware floor P5 ported the full six-kernel vLLM-FLA chunk_gated_delta_rule_fwd pipeline (cumsum, chunk_scaled_dot_kkt, blocked merge_16x16_to_64x64 solve_tril, recompute_w_u, register/smem-resident chunk_gated_delta_rule_fwd_h with the chunk loop in-kernel, chunk_fwd_o) to CUDA tf32 mma, per-kernel validated vs host fp64 (o NMSE 2.2e-7, final-state 1.2e-7), integrated behind LLAMA_GDN_FLA_CHUNK=1 (default-off), and A/B'd in-backend vs the shipped M5 chunked scan. P0 perf kill-gate FAILED decisively: nsys --cuda-graph-trace=node, MoE q36-35b-a3b, npp2048 M5 56.31 vs FLA 119.46 us/tok (FLA 2.12x SLOWER, gdn_delta_pct -112.1); npp512 2.29x slower; end-to-end S_PP -13.33%/-13.12%. GO required FLA >10% faster at npp2048. Novel decomposition: the blocked solve_tril is only ~2.8% of the FLA bucket (55.6 ms); fwd_h 46.2% (903 ms) + fwd_o 31.5% (617 ms) dominate. The cost is the state-recurrence GEMMs plus per-chunk h-state materialization to global LPDDR5x that FLA's split-kernel structure forces; the fused M5 single kernel keeps the 128x128 state resident in smem and never materializes per-chunk h, so it is 2.1x faster on GB10's low-bandwidth memory. So the GDN prefill bucket (+59.2, the single largest prefill lever) is a confirmed shared-hardware / memory-bandwidth floor, NOT recoverable by the blocked-solve algorithm. Extends Phase74 (standalone blocked-inverse 0.59x) and bf16-C64 (-18.75%). Gates: SMEM PASS (max 96KB < 99KB cap). KL band GREEN (FLA KLD 0.137028 vs control 0.136563, delta +0.000465 < 0.01; same-top-p 84.61%). DEFAULT path untouched (canonical md5 GREEN both models default-off AND FLA-on; MoE 8cb0ce23, dense 5951a5b4; GATED_DELTA_NET DEFAULT 46/46). Nothing landed; series stays at 46 patches. WIP on DGX fork branch p5-fla-gdn (2d64c37f0), NOT pushed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- .../docs/EXECUTION_REARCH_SCOPE.md | 106 +++++++++++++++++- .../docs/PARITY_HANDOFF.md | 84 ++++++++++++++ 2 files changed, 184 insertions(+), 6 deletions(-) diff --git a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md index 993c68268..f6d64e0be 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md @@ -714,6 +714,96 @@ expected result and any throughput payoff lives on non-GB10 silicon (out of scop (6 patches) and upstream may add its own GDN paths; keep the new pipeline in a separate `.cu` and gate the dispatch narrowly. +#### P5 RESULT (NO-GO at the P0 perf kill-gate, recorded 2026-07-02, `LLAMA_GDN_FLA_CHUNK`, default-off) - the GDN prefill bucket is now a CONFIRMED SHARED-HARDWARE FLOOR + +The full six-kernel vLLM-FLA `chunk_gated_delta_rule_fwd` pipeline was **ported to +CUDA tf32 mma, per-kernel validated against a host fp64 reference, integrated behind +`LLAMA_GDN_FLA_CHUNK=1` (default-off), and A/B'd in-backend** against the shipped M5 +f32 chunked scan. It **lost decisively** and by the wrong sign, so `go=false` was the +kill-gate default, **nothing was built beyond P0, and nothing landed.** This is the +**scope-anticipated "expected null"** (the P5 section framed this as the program's +strictest kill-gate given Phase74's standalone blocked-inverse 0.59x), but the phase +delivered the one thing the prior evidence lacked: **the whole FLA pipeline run +in-backend with the register/smem-resident inter-chunk state and the chunk loop +in-kernel** - the exact form that "was never actually tested in-backend." It was tested +here, and the result **settles the GDN prefill bucket (bucket 1, +59.2, the single +largest prefill lever) as a shared-hardware / memory-bandwidth floor on GB10.** + +- **PERF GO GATE FAILED DECISIVELY (the decisive result).** GO required the in-pipeline + blocked `solve_tril` to beat the M5 f32 chunked scan by **> 10% at npp2048**. + Measured (nsys `--cuda-graph-trace=node`, MoE `q36-35b-a3b-nvfp4`, per distinct token + over the 30 GDN layers): **npp2048 M5 56.31 vs FLA 119.46 us/tok = FLA 2.12x SLOWER** + (`gdn_delta_pct_2048 = -112.1`); **npp512 M5 51.23 vs FLA 117.35 = 2.29x slower**. + End-to-end **S_PP regressed MoE -13.33% @npp2048 / -13.12% @npp512** (3-rep medians; + clears `max(2%, 3 sigma)` by a wide margin, and it is the wrong sign, so there is no + 3-sigma question). The shipped M5 remains `gdn_core` at **56.31 us/tok = 64.82% of + vLLM's FLA chunk-64 36.5 us/tok on this GB10**; the rejected FLA port was only **30.55% + of vLLM** (36.5/119.46) - a regression, not a recovery. This reproduces Phase74's + standalone blocked-inverse 0.59x and extends bf16-C64 (-18.75%), now **confirmed + in-backend** with the register-resident state + in-kernel chunk loop. +- **WHERE THE TIME WENT (the novel, valuable decomposition - the reason this NO-GO + matters beyond a rejection).** Per-kernel nsys share of the FLA bucket: the **blocked + `solve_tril` is only ~2.8% (55.6 ms)** - the algorithm the whole phase was about is + *cheap*. The bucket is dominated by **`chunk_gated_delta_rule_fwd_h` 46.2% (903 ms) + + `chunk_fwd_o` 31.5% (617 ms)**: the inter-chunk state-recurrence GEMMs plus the + **per-chunk h-state materialization to global LPDDR5x** that FLA's split-kernel + structure forces (`fwd_h` writes `h_pre` per chunk, `fwd_o` re-reads it). The fused M5 + single kernel keeps the 128x128 state **resident in smem and never materializes + per-chunk h**, so it is **2.1x faster on GB10's low-bandwidth memory.** So the novel + finding vs all prior evidence: **the blocked solve itself is not the floor - the floor + is the state-GEMM + h-materialization region, which the FLA structure makes WORSE than + M5, not better.** This is exactly the "materialize-everything tax" the scope warns of. + The binding silicon property is **memory bandwidth** (per-chunk h round-trips to + LPDDR5x), compounded by the **99 KB smem cap** that forces the FLA split (`fwd_h` and + `fwd_o` cannot co-reside), not the mma shapes or wave count. +- **SMEM GATE PASSES (all six kernels under the 99 KB opt-in cap at C=64; + `cudaOccupancyMaxActiveBlocksPerMultiprocessor`):** `k_kkt` 48 KB / 2 blk, `k_solve` + 38 KB / 2 blk, `k_wu` 48 KB / 2 blk, `k_fwdh` 80 KB / 1 blk, `k_fwdo` 96 KB / 1 blk - + **max 96 KB < 99 KB.** The kernels fit; they are simply bandwidth-floored above M5. +- **KL BAND GREEN / IN-BAND (model numerics sound):** FLA `KLD 0.137028` vs control + `0.136563` = **delta +0.000465 < 0.01**; same-top-p **84.61% vs 83.73%** control + (>= 84% baseline; FLA marginally better). Per-kernel bring-up validation vs host fp64 + on synthetic shapes: **o NMSE 2.2e-7, final-state 1.2e-7** (done BEFORE integration, + per the "do not debug six kernels blind" rule). +- **DEFAULT PATH UNTOUCHED (canonical md5 GREEN with the code present):** paged-MoE + `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, both + **default-off AND `LLAMA_GDN_FLA_CHUNK`-on** (the small-M greedy path bails to M5). + `test-backend-ops GATED_DELTA_NET` **DEFAULT 46/46 OK.** Decode untouched + (`GDN_CHUNK_MIN` untouched; decode stays on the sequential recurrence). +- **`test-backend-ops` env-on = 43-44/46 (`gdn_op_tests_env_on_green=false`; explicit + tolerance judgment).** The FLA-engaged `head_size=128, n_seq_tokens>=64` cases + marginally exceed the test's `1e-7` threshold (**ERR 1.03-1.06e-7**, fluctuating + across the boundary run-to-run) because this port uses **plain tf32** where the shipped + M5 uses **3xtf32 (CUTLASS fp32-emulation)** for the decay-coupled compounding state + products; M5-chunked (`LLAMA_KV_PAGED=1`, no FLA) passes the SAME cases at `< 1e-7`. + Judgment: a marginal tf32-vs-3xtf32 accuracy gap, **benign at the model level (KL + green)**; tightening the port to 3xtf32 would only add mma count and **deepen** the + perf NO-GO, so it was not pursued. +- **Engagement PROVEN:** `LLAMA_GDN_FLA_TRACE` fired `[gdn-fla] engage H=32 n_seqs=N + n_tokens=128 NT=2` in `batched-bench`; nsys shows all six `gdn_fla::` kernels executing + under `LLAMA_GDN_FLA_CHUNK=1` and none under default. Protocols honored: GPU lock held + throughout and released; `LLAMA_MAX_BATCH_TOKENS` unset; sm_121a; nsys + `--cuda-graph-trace=node`; 3+ iter S_PP medians; no external contention. +- **Provenance.** WIP on the DGX fork topic branch `p5-fla-gdn` at + `2d64c37f08ad323038a44a89ab32189527c6ba29` (base `localai-paged` `653bb2f3d`, **NOT + pushed, NOT landed**): new `ggml/src/ggml-cuda/gdn-blocked-solve.cu` + narrow dispatch + in `gated_delta_net.cu` / `gated_delta_net.cuh`. Fork `localai-paged` HEAD **untouched + at `653bb2f3d`**; the LocalAI series **stays at 46 patches (`0001-0055`)**; topic + branches `p1-bf16-stream` / `p2-moe-region` / `p4-cbv2` left intact. Artifacts on the + DGX `~/bench/p5_fla_gdn/`: `killgate_20260702_204225/` (RESULTS.md, spp_control.txt, + spp_fla.txt, `nsys_{ctrl,fla}{2048,512}.{nsys-rep,kern.csv}`, GATES.txt, + `kl_moe_{ctrl,fla}.log`, occupancy.txt, gdn-blocked-solve.cu, p5_fla_test.cu) and + `standalone_20260702_203434/` (RESULTS.txt + p5_fla_test.cu, p5_m5_time.cu, + m5_kernel_body.cuh). +- **Honest delta vs the +59.2 expectation.** The scope's conservative expected recovery + was **~0-10 of the +59.2, "likely a shared-hardware floor."** Delivered: **0 recovery, + a -63 us/tok regression on the FLA arm**; the floor is **confirmed**. The shipped M5 + fused smem-resident chunked scan (56.31 us/tok) is the winner and is **at or near the + GB10 memory-bandwidth floor for this op.** This closes the last speculative prefill + lever in the program. What binds is silicon (LPDDR5x bandwidth on the per-chunk h + round-trip + the 99 KB smem cap forcing the split), not the algorithm; it lifts only on + datacenter Blackwell (HBM + larger smem + TMEM), consistent with section 4's framing. + ### P6: FP8 KV cache + smaller dtype/bandwidth items - **Goal:** halve decode-time KV cache traffic (K/V stored fp8-e4m3 with a scale) and @@ -749,7 +839,7 @@ Conservative, showing the math. Baselines from section 2. | 3 dtype boundary tax | +36.6 | P1 | ~30 | | 4 norms/glue (part) | +37.2 | P1 (norms) + P6 (FP8 proj) | ~18 | | 2 GEMM tiling | +56.5 | P2 + P3 | ~40 | -| 1 GDN scan | +59.2 | P5 (speculative) | ~0-10 | +| 1 GDN scan | +59.2 | P5 (NO-GO, CONFIRMED FLOOR) | 0 | | 5 dispatch | +5.9 | P2/P4 | ~3 | Recovered ~91-101 us/tok of 198.9. New paged wall ~295-305 us/tok. **Prefill S_PP goes @@ -819,11 +909,15 @@ honest ceiling on GB10 is roughly **prefill ~55-65%, serving-agg ~80%, decode-GP ## 6. Risks and open questions -- **P5 is likely a shared-hardware floor.** Phase74's standalone blocked-inverse ran at - 0.59x the direct solve, and the 99 KB smem cap forces C=16. Open question: does the - full FLA pipeline *in-backend* (register-resident inter-chunk state, chunk loop - in-kernel) behave differently from the standalone bench? If not, P5 recovers ~0 and is - recorded as a confirmed floor. Rank it last-but-one and gate it hardest. +- **P5 is a shared-hardware floor - RESOLVED / CONFIRMED (2026-07-02, see the P5 RESULT + above).** Phase74's standalone blocked-inverse ran at 0.59x the direct solve. The open + question was whether the full FLA pipeline *in-backend* (register-resident inter-chunk + state, chunk loop in-kernel) behaves differently from the standalone bench. **Answer: + no - it is 2.12x SLOWER than M5 at npp2048 (119.46 vs 56.31 us/tok), S_PP -13.3%.** The + per-kernel decomposition showed the blocked solve is only 2.8% of the bucket; the floor + is the state-GEMM + per-chunk h-materialization to LPDDR5x that FLA's split-kernel + structure forces (and the 99 KB smem cap that forces that split). P5 recovers 0 and is a + **confirmed shared-hardware / memory-bandwidth floor.** - **P1 segment-boundary converts.** Option A keeps f32 at segment edges; if the q36 residual stream has many short segments, the boundary converts could eat the win. Open: how many bf16 segments survive across a q36 layer, and does the shared-expert diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 4abaf8583..fd112ea95 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -2564,3 +2564,87 @@ determinism tsvs), `det2_20260702_193123/` + `det3_20260702_193649/` + `det4_20260702_194040/` (determinism diff-matrix), `perf_20260702_194359/` (raw_*.json + auto-written RESULTS.md). Environment: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, GPU lock held. + +## P5 FLA-faithful GDN prefill scan (blocked solve_tril port) - NO-GO at the perf kill-gate; the GDN prefill bucket is a CONFIRMED SHARED-HARDWARE FLOOR (recorded 2026-07-02) + +Fifth phase of the `EXECUTION_REARCH_SCOPE.md` additive program, and its **strictest +kill-gate**. The full six-kernel vLLM-FLA `chunk_gated_delta_rule_fwd` pipeline was +**ported to CUDA tf32 mma, per-kernel validated vs a host fp64 reference, integrated +behind `LLAMA_GDN_FLA_CHUNK=1` (default-off), and A/B'd in-backend** against the shipped +M5 f32 chunked scan. It lost decisively and by the wrong sign, so `go=false` was the +kill-gate default, nothing was built beyond P0, and nothing landed. See the "P5 RESULT" +subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; this closes the last +speculative prefill lever in the program. Summary and provenance: + +- **Verdict: NO-GO / DO-NOT-SHIP at the perf gate (the scope-anticipated "expected + null").** The P5 section framed this as the strictest kill-gate given Phase74's + standalone blocked-inverse 0.59x. P5 delivered the one thing prior evidence lacked: + **the whole FLA pipeline in-backend with the register/smem-resident inter-chunk state + and the chunk loop in-kernel** - the form that "was never actually tested in-backend." + It was tested here and **settles the GDN prefill bucket (bucket 1, +59.2, the single + largest prefill lever) as a shared-hardware / memory-bandwidth floor on GB10.** +- **PERF GO GATE FAILED DECISIVELY.** GO required the in-pipeline blocked `solve_tril` + to beat M5 by **> 10% at npp2048**. Measured (nsys `--cuda-graph-trace=node`, MoE + `q36-35b-a3b-nvfp4`, per distinct token over 30 GDN layers): **npp2048 M5 56.31 vs FLA + 119.46 us/tok = FLA 2.12x SLOWER** (`gdn_delta_pct_2048 = -112.1`); **npp512 M5 51.23 + vs FLA 117.35 = 2.29x slower.** End-to-end **S_PP regressed MoE -13.33% @npp2048 / + -13.12% @npp512** (3-rep medians; wrong sign, no 3-sigma question). The shipped M5 + stays `gdn_core` at **56.31 us/tok = 64.82% of vLLM's FLA chunk-64 36.5 us/tok**; the + rejected FLA port was only **30.55% of vLLM** (36.5/119.46). Reproduces Phase74's + standalone 0.59x and extends bf16-C64 (-18.75%), now **confirmed in-backend.** +- **WHERE THE TIME WENT (the novel, valuable decomposition - why this NO-GO matters).** + Per-kernel nsys share of the FLA bucket: **blocked `solve_tril` only ~2.8% (55.6 ms)** - + the algorithm the phase was about is *cheap*. Dominated by **`fwd_h` 46.2% (903 ms) + + `fwd_o` 31.5% (617 ms)**: the inter-chunk state-recurrence GEMMs + the **per-chunk + h-state materialization to global LPDDR5x** that FLA's split-kernel structure forces + (`fwd_h` writes `h_pre` per chunk, `fwd_o` re-reads it). The fused M5 single kernel + keeps the 128x128 state **resident in smem, never materializes per-chunk h**, so it is + **2.1x faster on GB10's low-bandwidth memory.** Novel finding vs all prior evidence: + **the blocked solve is not the floor - the floor is the state-GEMM + h-materialization + region, which the FLA structure makes WORSE than M5.** The binding silicon property is + **LPDDR5x memory bandwidth** (per-chunk h round-trip), compounded by the **99 KB smem + cap** that forces the `fwd_h`/`fwd_o` split - not mma shapes or wave count. +- **Correctness / cap gates (recorded, not decisive):** + - **SMEM GATE PASSES** (all six kernels under the 99 KB opt-in cap at C=64; + `cudaOccupancyMaxActiveBlocksPerMultiprocessor`): `k_kkt` 48 KB / 2 blk, `k_solve` + 38 KB / 2 blk, `k_wu` 48 KB / 2 blk, `k_fwdh` 80 KB / 1 blk, `k_fwdo` 96 KB / 1 blk - + **max 96 KB < 99 KB.** The kernels fit; they are bandwidth-floored above M5. + - **KL BAND GREEN / in-band:** FLA `KLD 0.137028` vs control `0.136563` = **delta + +0.000465 < 0.01**; same-top-p **84.61% vs 83.73%** control (>= 84% baseline). + Per-kernel bring-up vs host fp64: **o NMSE 2.2e-7, final-state 1.2e-7** (before + integration, per the "do not debug six kernels blind" rule). + - **DEFAULT PATH UNTOUCHED:** canonical md5 GREEN both models, **default-off AND + `LLAMA_GDN_FLA_CHUNK`-on** (paged-MoE `8cb0ce23`, dense `5951a5b4`; small-M greedy + bails to M5). `test-backend-ops GATED_DELTA_NET` **DEFAULT 46/46 OK.** Decode + untouched (`GDN_CHUNK_MIN` untouched; decode stays on the sequential recurrence). + - **`test-backend-ops` env-on = 43-44/46** (`gdn_op_tests_env_on_green=false`): the + FLA-engaged `head_size=128, n_seq_tokens>=64` cases marginally exceed `1e-7` + (**ERR 1.03-1.06e-7**, fluctuating across the boundary) because the port uses plain + **tf32** where M5 uses **3xtf32 (CUTLASS fp32-emulation)** for the decay-coupled + compounding state products; M5-chunked passes the SAME cases at `< 1e-7`. Judgment: + marginal tf32-vs-3xtf32 gap, benign at model level (KL green); 3xtf32 would add mma + count and deepen the perf NO-GO, so not pursued. + - **Engagement PROVEN:** `LLAMA_GDN_FLA_TRACE` fired `[gdn-fla] engage H=32 ...` in + `batched-bench`; nsys shows all six `gdn_fla::` kernels under + `LLAMA_GDN_FLA_CHUNK=1` and none under default. +- **Honest delta vs the +59.2 expectation.** Scope expected `~0-10 of the +59.2, "likely + a shared-hardware floor."` Delivered: **0 recovery, a -63 us/tok regression on the FLA + arm; the floor is confirmed.** M5's fused smem-resident chunked scan (56.31 us/tok) is + the winner and is at/near the GB10 memory-bandwidth floor for this op. What binds is + silicon (LPDDR5x bandwidth on the per-chunk h round-trip + the 99 KB smem cap forcing + the split), not the algorithm; it lifts only on datacenter Blackwell (HBM + larger + smem + TMEM), consistent with the scope's section-4 framing. + +Protocols honored: GPU lock held throughout and released; `LLAMA_MAX_BATCH_TOKENS` +unset; sm_121a; nsys `--cuda-graph-trace=node`; 3+ iter S_PP medians; no external +contention. WIP on the DGX fork topic branch `p5-fla-gdn` at +`2d64c37f08ad323038a44a89ab32189527c6ba29` (base `localai-paged` `653bb2f3d`, **NOT +pushed, NOT landed**): new `ggml/src/ggml-cuda/gdn-blocked-solve.cu` + narrow dispatch in +`gated_delta_net.cu` / `gated_delta_net.cuh`. Fork `localai-paged` HEAD **untouched at +`653bb2f3d`**; the LocalAI series **stays at 46 patches (`0001-0055`)**; topic branches +`p1-bf16-stream` / `p2-moe-region` / `p4-cbv2` left intact. Artifacts on the DGX +`~/bench/p5_fla_gdn/`: `killgate_20260702_204225/` (RESULTS.md, spp_control.txt, +spp_fla.txt, `nsys_{ctrl,fla}{2048,512}.{nsys-rep,kern.csv}`, GATES.txt, +`kl_moe_{ctrl,fla}.log`, occupancy.txt, gdn-blocked-solve.cu, p5_fla_test.cu) and +`standalone_20260702_203434/` (RESULTS.txt + p5_fla_test.cu, p5_m5_time.cu, +m5_kernel_body.cuh).