From e5c5746c0a4bd1d8be2a63f3d7c046d2b3c2afb4 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 14:05:20 +0000
Subject: [PATCH] docs(paged): record GDN tensor-core revalidation phase

Assisted-by: Codex:gpt-5
---
 .../llama-cpp-localai-paged/docs/BENCHMARK.md | 56 +++++++++++++++++--
 .../docs/PARITY_HANDOFF.md                    | 43 ++++++++++++++
 .../docs/TENSORCORE_GDN_BUILD_PLAN.md         | 25 +++++++--
 .../docs/VLLM_PARITY_LEVER_MAP.md             | 21 +++++++
 4 files changed, 136 insertions(+), 9 deletions(-)

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
index 8a8baf3cd..66baaf6d7 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -10,10 +10,13 @@ with artifact path, gates, benchmark rows, and decision.
 - Current decision model: MoE `q36-35b-a3b-nvfp4`.
 - Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
-- Current tested source: DGX mirror `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
-- Latest attempt: Phase70.
-- Latest decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. It is
-  correctness-clean but not serving-safe enough to default on.
+- Current tested source: DGX mirror
+  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Latest attempt: Phase71.
+- Latest decision: keep shipped GDN M5 default as-is. It still beats
+  sequential-disabled and serial-chunked GDN, and forced `GDN_TC=4` is within
+  noise of the current default. Do not reopen smaller GDN kernel reorders on
+  GB10.

 ## Current Serving Record

@@ -52,6 +55,51 @@ Decision:

 ## Attempt Log

+### Phase71: GDN Tensor-Core Revalidation
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`.
+- Artifact:
+  `/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`.
+- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Shape: MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`.
+
+Canonical gates:
+
+| gate | env | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` |
+|------|-----|---------|-----------|-------------------|-----------|--------------|
+| default | none | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | `1146/1146` | `806/806` |
+| sequential-disabled | `GDN_CHUNK_MIN=2147483647` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
+| serial-chunked | `GDN_TC=0 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
+| forced M5 | `GDN_TC=4 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
+
+MoE prefill:
+
+| arm | npp | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
+|-----|----:|---------:|-------:|---------:|------------:|
+| default | `512` | `2313.57` | `7.082` | `401.82` | `2231.28` |
+| sequential-disabled | `512` | `2198.28` | `7.453` | `392.50` | `2122.58` |
+| serial-chunked | `512` | `1787.49` | `9.166` | `396.23` | `1740.12` |
+| forced M5 | `512` | `2323.18` | `7.052` | `393.62` | `2238.13` |
+| default | `2048` | `2422.88` | `27.049` | `389.91` | `2398.50` |
+| sequential-disabled | `2048` | `2361.22` | `27.755` | `386.08` | `2337.91` |
+| serial-chunked | `2048` | `1699.77` | `38.556` | `389.48` | `1688.69` |
+| forced M5 | `2048` | `2420.52` | `27.075` | `388.72` | `2396.11` |
+
+Ratios:
+
+| npp | default/sequential S_PP | default/serial S_PP | forced/default S_PP |
+|-----|------------------------:|---------------------:|--------------------:|
+| `512` | `1.0524` | `1.2943` | `1.0042` |
+| `2048` | `1.0261` | `1.4254` | `0.9990` |
+
+Decision:
+
+- Keep shipped GDN M5 default behavior.
+- Do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
+- The stale "two-Gram PoC before M5 exists" framing is superseded by the
+  existing `0047` M5 implementation and this revalidation.
+
 ### Phase70: BF16 F32 Output Broader Serving

 - Date: 2026-07-01.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 9854db763..25cce03d6 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1107,3 +1107,46 @@ Decision: reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`. The shortcut is
 correctness-clean, but it materially regressed low-concurrency serving and
 slightly widened the vLLM decode gap at `n=32` and `n=128`. Keep it default-off
 only and move the next parity effort to a different lever.
+
+## 16. PHASE71 RESULT: GDN TENSOR-CORE REVALIDATION
+
+Phase71 challenged the stale GDN planning docs before starting more source work.
+Plan:
+`docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`.
+Benchmark ledger:
+`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
+DGX artifact:
+`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`.
+
+Source under test stayed at DGX mirror commit
+`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. No llama.cpp source was
+changed.
+
+Canonical gates matched for all four GDN modes: MoE md5 `8cb0ce23`, dense md5
+`5951a5b4`, and `GATED_DELTA_NET 46/46`. Default also passed `MUL_MAT
+1146/1146` and `MUL_MAT_ID 806/806`.
+
+MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`:
+
+| arm | npp512 S_PP | npp2048 S_PP |
+|-----|------------:|-------------:|
+| default | `2313.57` | `2422.88` |
+| sequential-disabled (`GDN_CHUNK_MIN=2147483647`) | `2198.28` | `2361.22` |
+| serial-chunked (`GDN_TC=0 GDN_CHUNK_MIN=64`) | `1787.49` | `1699.77` |
+| forced M5 (`GDN_TC=4 GDN_CHUNK_MIN=64`) | `2323.18` | `2420.52` |
+
+Decision: keep shipped GDN M5 default behavior. It still beats
+sequential-disabled by `+5.24%`/`+2.61%`, beats serial-chunked by
+`+29.43%`/`+42.54%`, and forced M5 is within noise of the current default. Do
+not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
+
+Post-Phase71 do-not-reopen list for GB10:
+
+- Smaller W4A16/MoE GEMM body, metadata, direct-activation, or quant/gather
+  shortcuts.
+- GDN C32 slab, QS-early, Global-Ai32, or another low-conflict M5 reorder.
+- BF16 cuBLAS F32 output as a default-on policy.
+
+The only GDN work that should be reconsidered is a larger FLA/CuteDSL-class
+blocked-solve implementation or a hardware pivot where the GB10 constraints no
+longer apply.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_BUILD_PLAN.md b/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_BUILD_PLAN.md
index dc0d24af7..2331691ef 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_BUILD_PLAN.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_BUILD_PLAN.md
@@ -512,7 +512,20 @@ Files: design lives in `backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_

 ## 5. Synthesized build plan + milestones + gate

-All anchors verified. 0031's kernel body, the 7-step structure, the `GDN_CHUNK_MIN`/`GDN_CHUNK_OFF` gating at the `if constexpr (!KDA && !keep_rs_t)` site, the `launch_gdn_chunked<128,16>` template, the smem formula, and the test-backend-ops shapes are all confirmed. The scope doc's KL gate, 3xtf32 ladder, risk register, and Phase 0-3 plan are confirmed. Here is the build-ready synthesis.
+Historical note: this plan predates the shipped f32-only M5 tensor-core GDN
+path in patch `0047`. Current code parses `GDN_CHUNK_MIN` and `GDN_TC`; the
+older `GDN_CHUNK_OFF` and `GDN_CHUNK_TC` names in this section are obsolete.
+Phase71 revalidated the current default against sequential-disabled and
+serial-chunked modes on DGX and kept M5 as shipped. Use this document as
+background for any larger FLA/CuteDSL-class redesign, not as the active next
+patch queue.
+
+All anchors were verified at the time of writing. 0031's kernel body, the
+7-step structure, the `GDN_CHUNK_MIN` gating at the `if constexpr (!KDA &&
+!keep_rs_t)` site, the `launch_gdn_chunked<128,16>` template, the smem formula,
+and the test-backend-ops shapes were confirmed. The scope doc's KL gate,
+3xtf32 ladder, risk register, and Phase 0-3 plan were confirmed. Here is the
+historical build-ready synthesis.

 ---

@@ -589,12 +602,14 @@ Each milestone is a **separate patch** stacked on 0031, **green on `test-backend

 ---

-## (4) Slot into 0031's existing framework (opt-in, default-OFF)
+## (4) Slot into 0031's existing framework (historical, superseded by 0047)

 Same dispatch site - the `if constexpr (!KDA && !keep_rs_t)` block inside `launch_gated_delta_net` (0031 patch, after `init_fastdiv_values`). Extend, don't replace:

-- Keep `GDN_CHUNK_MIN` (token threshold, default `INT_MAX` = off) and `GDN_CHUNK_OFF` (kill switch).
-- Add **`GDN_CHUNK_TC`** selector: `0` = 0031 serial-solve chunked (fallback, retained), `1` = tensor-core. Add **`GDN_CHUNK_C` ∈ {16,32,64}** and **`GDN_DV_TILE` ∈ {32,64,128}** for A/B; defaults `C=32, DV_TILE=64` (CONFIG C) for serving, `DV_TILE=32` saturator for n_seqs=1.
+- Current code keeps `GDN_CHUNK_MIN` as the token threshold and uses `GDN_TC`
+  as the tensor-core level selector. It does not parse `GDN_CHUNK_OFF` or
+  `GDN_CHUNK_TC`.
+- Historical plan: add **`GDN_CHUNK_TC`** selector: `0` = 0031 serial-solve chunked (fallback, retained), `1` = tensor-core. Add **`GDN_CHUNK_C` ∈ {16,32,64}** and **`GDN_DV_TILE` ∈ {32,64,128}** for A/B; defaults `C=32, DV_TILE=64` (CONFIG C) for serving, `DV_TILE=32` saturator for n_seqs=1.
 - New launcher `launch_gdn_chunked_tc<128, C, DV_TILE>` mirrors `launch_gdn_chunked`: `cudaFuncSetAttribute(...MaxDynamicSharedMemorySize...)` **return-checked** (0031 precedent), `grid = dim3(H, n_seqs, n_slabs)`, `block = dim3(256,1,1)`. Per-slab the kernel recomputes A/A⁻¹/gates (dv-independent), dv-slices S/Ud/O.
 - **Default OFF** (`gdn_chunk_min=INT_MAX`) exactly as 0031 ships. Flip the default to on **only when** the M8 A/B shows an S_PP win over the tuned sequential recurrence at the serving regime (n_seqs≥2) **and** the KL gate + adversarial op case hold - recorded in README s5 (dev notes / rejected-flat levers) and `PAGED_BITEXACT_NOTE.md`. Until then it ships like 0031: opt-in, regression-free default.
 - Extend the test-backend-ops block 0031 added (the `S_v==128` shapes at lines after :9398) so the tc path is exercised at C=64 and C=32 in CI.
@@ -610,4 +625,4 @@ Same dispatch site - the `if constexpr (!KDA && !keep_rs_t)` block inside `launc

 **Risk 3 - Grid starvation at n_seqs=1 is structural (H=32 < the ~96 blocks needed for 2 blk/SM × 48 SM).** Only `DV_TILE=32` (4 slabs) fully saturates a single stream, and it pays ~1.5-2× redundant-A flops (A/A⁻¹/gates recomputed per slab) plus the per-chunk restage. **Kill criterion:** if the M8 bench shows single-stream (n_seqs=1) S_PP is slower than sequential even at full saturation (dv32×4) due to redundant-A + restage overhead, **and** the batched regime (n_seqs≥2) gain also fails to materialize → the lever only helps a regime the target workload doesn't hit → keep default-OFF, ship as opt-in experiment only, record. (If n_seqs≥2 *does* win, ship enabled for the serving regime and gate single-stream back to sequential via `GDN_CHUNK_MIN` + an n_seqs check - a partial, honest win.)

-**Overarching kill gate:** the disposition is the bench, not the theory. The kernel flips to default-on only when it beats the tuned sequential recurrence at the serving regime AND clears the KL + adversarial gates. Any milestone that regresses test-backend-ops or md5-stability halts the stack until fixed; M1 and M0 are the cheap fail-fast exits before the expensive kernel work.
\ No newline at end of file
+**Overarching kill gate:** the disposition is the bench, not the theory. The kernel flips to default-on only when it beats the tuned sequential recurrence at the serving regime AND clears the KL + adversarial gates. Any milestone that regresses test-backend-ops or md5-stability halts the stack until fixed; M1 and M0 are the cheap fail-fast exits before the expensive kernel work.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index aeff3ab02..860919cf4 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -140,6 +140,17 @@ At `N=32` and `N=128`, opt-in slightly widened the vLLM decode gap
 (`0.6864x` vs `0.6882x`, and `0.6839x` vs `0.6921x`). Keep
 `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off only and move to another lever.

+Phase71 revalidated the current shipped GDN tensor-core default before adding
+more GDN source work. Artifact:
+`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`. Canonical
+MoE/dense md5 gates matched for default, sequential-disabled, serial-chunked,
+and forced M5 modes; `GATED_DELTA_NET` passed `46/46` for each mode, and
+default passed `MUL_MAT 1146/1146` plus `MUL_MAT_ID 806/806`. Current default
+beat sequential-disabled by `+5.24%`/`+2.61%` S_PP at `npp=512/2048`, beat
+serial-chunked by `+29.43%`/`+42.54%`, and forced `GDN_TC=4 GDN_CHUNK_MIN=64`
+was within noise of default (`+0.42%`/`-0.10%`). Decision: keep shipped M5 and
+do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
+
 Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

 ## 2. Decode-serving compute hypotheses (ranked)
@@ -1836,6 +1847,16 @@ D2/D3/D4 for DECODE are all REJECTED by the methodology's "a faster kernel off t
 Honest scope on D1's payoff: at HIGH-concurrency serving the paged GPU is already 83.5% busy because overlapping request streams hide the host stalls, so D1's win concentrates at LOW-concurrency / latency / batch-1 (GPU 4-16% busy), where it is large. The complementary serving-throughput lever is FIXING PREFILL (GDN #101 + MoE GEMM D2/#105): paged's 2x-slower prefill steals serving cycles under continuous batching (~25-55% of the serving step is prefill work) - so the prefill levers ARE also serving-decode levers. GATE: separately-built in-backend A/B (compiled-in, so a runtime flag does NOT isolate it) showing higher static/low-concurrency decode t/s with no high-concurrency-serving regression; bit-exact greedy md5 (graph replay re-issues identical kernels).

 ### next_3_levers
+
+Post-Phase71 supersession: this ranked list is historical. `0047` already
+ships the M5 tensor-core GDN path default-on under paged KV, Phase71
+revalidated it against sequential-disabled and serial-chunked baselines, and
+Phase10/11/13 rejected the smaller follow-up GDN reorders. Phase41/43 closed
+D1 on the current GB10 path unless a fresh route trace proves a host-sync
+fallback returned. Phase60/61/66 rejected another small W4A16/direct-A or
+quant/gather pass. Treat the list below as pre-Phase60 planning context, not an
+active queue.
+
 Ranked, each with its pass-gate:
 1) #101 TENSOR-CORE mma CHUNKED GDN PREFILL KERNEL (prefill, GO). #1 prefill-gap contributor (+59 us/tok, ~30%), ~3/4 math (tensor cores help) with 2.62x measured headroom on identical silicon, 1/4 layout folds in; also helps serving decode. GATE: Phase-0 regime already satisfied by this profile; Phase-1 two-Gram-product PoC must move S_PP in a SEPARATELY-BUILT in-backend A/B vs sequential (flat => NO-GO the multi-week build); then KL-gate (tf32/3xtf32) + greedy md5 + adversarial-decay op test; ship opt-in default-off until A/B beats sequential.
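
Reviewer note: the Phase71 ratios and percentage deltas recorded in this patch can be cross-checked against the S_PP rows of the MoE prefill table. The sketch below is only an illustration; the dictionary layout and the helper names `ratio` and `ratios_row` are assumptions made for this example and are not part of the benchmark tooling. The throughput values are copied from the BENCHMARK.md hunk above.

```python
# Sketch: recompute the Phase71 S_PP ratios from the MoE prefill table.
# S_PP values (tokens/s) are copied from the BENCHMARK.md rows in this patch.
# The structure and helper names are hypothetical, chosen for illustration.

S_PP = {
    512: {
        "default": 2313.57,
        "sequential": 2198.28,
        "serial": 1787.49,
        "forced": 2323.18,
    },
    2048: {
        "default": 2422.88,
        "sequential": 2361.22,
        "serial": 1699.77,
        "forced": 2420.52,
    },
}


def ratio(npp, numerator, denominator):
    """Return S_PP[numerator] / S_PP[denominator] for one prompt length."""
    return S_PP[npp][numerator] / S_PP[npp][denominator]


def ratios_row(npp):
    """Return (default/sequential, default/serial, forced/default), 4 decimals."""
    return (
        round(ratio(npp, "default", "sequential"), 4),
        round(ratio(npp, "default", "serial"), 4),
        round(ratio(npp, "forced", "default"), 4),
    )


def percent_delta(npp, numerator, denominator):
    """Return the relative S_PP difference in percent, 2 decimals."""
    return round((ratio(npp, numerator, denominator) - 1.0) * 100.0, 2)


if __name__ == "__main__":
    for npp in (512, 2048):
        print(npp, ratios_row(npp))
```

Under these assumptions the recomputed rows agree with the Ratios table, and `percent_delta` gives the `+5.24%`/`+2.61%` and `+29.43%`/`+42.54%` figures quoted in PARITY_HANDOFF.md and VLLM_PARITY_LEVER_MAP.md.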