mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record GDN tensor-core revalidation phase
Assisted-by: Codex:gpt-5
@@ -10,10 +10,13 @@ with artifact path, gates, benchmark rows, and decision.
- Current decision model: MoE `q36-35b-a3b-nvfp4`.
- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase70.
- Latest decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. It is
  correctness-clean but not serving-safe enough to default on.
- Current tested source: DGX mirror
  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase71.
- Latest decision: keep shipped GDN M5 default as-is. It still beats
  sequential-disabled and serial-chunked GDN, and forced `GDN_TC=4` is within
  noise of the current default. Do not reopen smaller GDN kernel reorders on
  GB10.

## Current Serving Record

@@ -52,6 +55,51 @@ Decision:

## Attempt Log

### Phase71: GDN Tensor-Core Revalidation

- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`.
- Artifact:
  `/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`.
- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Shape: MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`.

Canonical gates:

| gate | env | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` |
|------|-----|---------|-----------|-------------------|-----------|--------------|
| default | none | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | `1146/1146` | `806/806` |
| sequential-disabled | `GDN_CHUNK_MIN=2147483647` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
| serial-chunked | `GDN_TC=0 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
| forced M5 | `GDN_TC=4 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |

MoE prefill:

| arm | npp | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
|-----|----:|---------:|-------:|---------:|------------:|
| default | `512` | `2313.57` | `7.082` | `401.82` | `2231.28` |
| sequential-disabled | `512` | `2198.28` | `7.453` | `392.50` | `2122.58` |
| serial-chunked | `512` | `1787.49` | `9.166` | `396.23` | `1740.12` |
| forced M5 | `512` | `2323.18` | `7.052` | `393.62` | `2238.13` |
| default | `2048` | `2422.88` | `27.049` | `389.91` | `2398.50` |
| sequential-disabled | `2048` | `2361.22` | `27.755` | `386.08` | `2337.91` |
| serial-chunked | `2048` | `1699.77` | `38.556` | `389.48` | `1688.69` |
| forced M5 | `2048` | `2420.52` | `27.075` | `388.72` | `2396.11` |

Ratios:

| npp | default/sequential S_PP | default/serial S_PP | forced/default S_PP |
|-----|------------------------:|---------------------:|--------------------:|
| `512` | `1.0524` | `1.2943` | `1.0042` |
| `2048` | `1.0261` | `1.4254` | `0.9990` |

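The ratio columns follow directly from the S_PP values in the MoE prefill table. A minimal Python sketch that recomputes them is shown here; the dictionary layout and the `ratios` helper are illustrative names chosen for this example and are not part of the benchmark tooling.

```python
# S_PP throughput (tokens/s) copied from the Phase71 MoE prefill table.
S_PP = {
    512: {
        "default": 2313.57,
        "sequential-disabled": 2198.28,
        "serial-chunked": 1787.49,
        "forced M5": 2323.18,
    },
    2048: {
        "default": 2422.88,
        "sequential-disabled": 2361.22,
        "serial-chunked": 1699.77,
        "forced M5": 2420.52,
    },
}


def ratios(npp):
    """Return the three S_PP ratios for one prompt length, rounded to 4 places."""
    row = S_PP[npp]
    return {
        "default/sequential": round(row["default"] / row["sequential-disabled"], 4),
        "default/serial": round(row["default"] / row["serial-chunked"], 4),
        "forced/default": round(row["forced M5"] / row["default"], 4),
    }


for npp in sorted(S_PP):
    print(npp, ratios(npp))
```

Subtracting 1 from each ratio and scaling by 100 gives the percentage deltas quoted in the Phase71 decision text (for example `1.0524` corresponds to `+5.24%`).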
Decision:

- Keep shipped GDN M5 default behavior.
- Do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
- The stale "two-Gram PoC before M5 exists" framing is superseded by the
  existing `0047` M5 implementation and this revalidation.

### Phase70: BF16 F32 Output Broader Serving

- Date: 2026-07-01.

@@ -1107,3 +1107,46 @@ Decision: reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`. The shortcut is
correctness-clean, but it materially regressed low-concurrency serving and
slightly widened the vLLM decode gap at `n=32` and `n=128`. Keep it
default-off only and move the next parity effort to a different lever.

## 16. PHASE71 RESULT: GDN TENSOR-CORE REVALIDATION

Phase71 challenged the stale GDN planning docs before starting more source work.
Plan:
`docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`.
Benchmark ledger:
`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
DGX artifact:
`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`.

Source under test stayed at DGX mirror commit
`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. No llama.cpp source was
changed.

Canonical gates matched for all four GDN modes: MoE md5 `8cb0ce23`, dense md5
`5951a5b4`, and `GATED_DELTA_NET 46/46`. Default also passed `MUL_MAT
1146/1146` and `MUL_MAT_ID 806/806`.

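A gate of this kind reduces to hashing each mode's output and comparing the digest with the canonical value. The sketch below is a minimal illustration, assuming the greedy output of each run has been captured as bytes; `md5_hex`, `md5_gate`, and the placeholder payload are hypothetical, and the real gate harness is not shown in this commit.

```python
import hashlib


def md5_hex(payload: bytes) -> str:
    """Hex md5 digest of a captured output."""
    return hashlib.md5(payload).hexdigest()


def md5_gate(outputs: dict, canonical: str) -> dict:
    """Map each mode name to True when its output digest equals `canonical`."""
    return {mode: md5_hex(data) == canonical for mode, data in outputs.items()}


# Placeholder payload standing in for real model output (assumption).
sample = b"greedy output placeholder"
canonical = md5_hex(sample)

modes = ["default", "sequential-disabled", "serial-chunked", "forced M5"]
results = md5_gate({mode: sample for mode in modes}, canonical)
print(all(results.values()))
```

In a real run the canonical digests would be the recorded values (`8cb0ce23777bf55f92f63d0292c756b0` for MoE, `5951a5b4d624ce891e22ab5fca9bc439` for dense) rather than a digest derived from the sample itself.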
MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`:

| arm | npp512 S_PP | npp2048 S_PP |
|-----|------------:|-------------:|
| default | `2313.57` | `2422.88` |
| sequential-disabled (`GDN_CHUNK_MIN=2147483647`) | `2198.28` | `2361.22` |
| serial-chunked (`GDN_TC=0 GDN_CHUNK_MIN=64`) | `1787.49` | `1699.77` |
| forced M5 (`GDN_TC=4 GDN_CHUNK_MIN=64`) | `2323.18` | `2420.52` |

Decision: keep shipped GDN M5 default behavior. It still beats
sequential-disabled by `+5.24%`/`+2.61%`, beats serial-chunked by
`+29.43%`/`+42.54%`, and forced M5 is within noise of the current default. Do
not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.

Post-Phase71 do-not-reopen list for GB10:

- Smaller W4A16/MoE GEMM body, metadata, direct-activation, or quant/gather
  shortcuts.
- GDN C32 slab, QS-early, Global-Ai32, or another low-conflict M5 reorder.
- BF16 cuBLAS F32 output as a default-on policy.

The only GDN work that should be reconsidered is a larger FLA/CuteDSL-class
blocked-solve implementation or a hardware pivot where the GB10 constraints no
longer apply.

@@ -512,7 +512,20 @@ Files: design lives in `backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_

## 5. Synthesized build plan + milestones + gate

All anchors verified. 0031's kernel body, the 7-step structure, the `GDN_CHUNK_MIN`/`GDN_CHUNK_OFF` gating at the `if constexpr (!KDA && !keep_rs_t)` site, the `launch_gdn_chunked<128,16>` template, the smem formula, and the test-backend-ops shapes are all confirmed. The scope doc's KL gate, 3xtf32 ladder, risk register, and Phase 0-3 plan are confirmed. Here is the build-ready synthesis.
Historical note: this plan predates the shipped f32-only M5 tensor-core GDN
path in patch `0047`. Current code parses `GDN_CHUNK_MIN` and `GDN_TC`; the
older `GDN_CHUNK_OFF` and `GDN_CHUNK_TC` names in this section are obsolete.
Phase71 revalidated the current default against sequential-disabled and
serial-chunked modes on DGX and kept M5 as shipped. Use this document as
background for any larger FLA/CuteDSL-class redesign, not as the active next
patch queue.

All anchors were verified at the time of writing. 0031's kernel body, the
7-step structure, the `GDN_CHUNK_MIN` gating at the `if constexpr (!KDA &&
!keep_rs_t)` site, the `launch_gdn_chunked<128,16>` template, the smem formula,
and the test-backend-ops shapes were confirmed. The scope doc's KL gate,
3xtf32 ladder, risk register, and Phase 0-3 plan were confirmed. Here is the
historical build-ready synthesis.

---

@@ -589,12 +602,14 @@ Each milestone is a **separate patch** stacked on 0031, **green on `test-backend

---

## (4) Slot into 0031's existing framework (opt-in, default-OFF)
## (4) Slot into 0031's existing framework (historical, superseded by 0047)

Same dispatch site - the `if constexpr (!KDA && !keep_rs_t)` block inside `launch_gated_delta_net` (0031 patch, after `init_fastdiv_values`). Extend, don't replace:

- Keep `GDN_CHUNK_MIN` (token threshold, default `INT_MAX` = off) and `GDN_CHUNK_OFF` (kill switch).
- Add **`GDN_CHUNK_TC`** selector: `0` = 0031 serial-solve chunked (fallback, retained), `1` = tensor-core. Add **`GDN_CHUNK_C` ∈ {16,32,64}** and **`GDN_DV_TILE` ∈ {32,64,128}** for A/B; defaults `C=32, DV_TILE=64` (CONFIG C) for serving, `DV_TILE=32` saturator for n_seqs=1.
- Current code keeps `GDN_CHUNK_MIN` as the token threshold and uses `GDN_TC`
  as the tensor-core level selector. It does not parse `GDN_CHUNK_OFF` or
  `GDN_CHUNK_TC`.
- Historical plan: add **`GDN_CHUNK_TC`** selector: `0` = 0031 serial-solve chunked (fallback, retained), `1` = tensor-core. Add **`GDN_CHUNK_C` ∈ {16,32,64}** and **`GDN_DV_TILE` ∈ {32,64,128}** for A/B; defaults `C=32, DV_TILE=64` (CONFIG C) for serving, `DV_TILE=32` saturator for n_seqs=1.
- New launcher `launch_gdn_chunked_tc<128, C, DV_TILE>` mirrors `launch_gdn_chunked`: `cudaFuncSetAttribute(...MaxDynamicSharedMemorySize...)` **return-checked** (0031 precedent), `grid = dim3(H, n_seqs, n_slabs)`, `block = dim3(256,1,1)`. Per-slab the kernel recomputes A/A⁻¹/gates (dv-independent), dv-slices S/Ud/O.
- **Default OFF** (`gdn_chunk_min=INT_MAX`) exactly as 0031 ships. Flip the default to on **only when** the M8 A/B shows an S_PP win over the tuned sequential recurrence at the serving regime (n_seqs≥2) **and** the KL gate + adversarial op case hold - recorded in README s5 (dev notes / rejected-flat levers) and `PAGED_BITEXACT_NOTE.md`. Until then it ships like 0031: opt-in, regression-free default.
- Extend the test-backend-ops block 0031 added (the `S_v==128` shapes at lines after :9398) so the tc path is exercised at C=64 and C=32 in CI.

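The environment settings exercised by the Phase71 arms can be summarised as a small classifier. This is a Python sketch for illustration only: the actual parsing lives in the CUDA backend (patch `0047`) and is not reproduced in this commit, and the `classify_gdn_mode` helper, including how it treats unlisted combinations, is an assumption. Only the four combinations below come from the Phase71 gate table.

```python
INT_MAX = 2147483647  # value used by the sequential-disabled arm


def classify_gdn_mode(env):
    """Name the Phase71 arm matching a GDN environment (hypothetical helper).

    `env` maps variable names to string values, as os.environ would.
    """
    chunk_min = env.get("GDN_CHUNK_MIN")
    tc = env.get("GDN_TC")
    if chunk_min is None and tc is None:
        return "default"
    if chunk_min == str(INT_MAX) and tc is None:
        return "sequential-disabled"
    if chunk_min == "64" and tc == "0":
        return "serial-chunked"
    if chunk_min == "64" and tc == "4":
        return "forced M5"
    # Anything else was not part of the Phase71 matrix.
    return "unlisted"


print(classify_gdn_mode({"GDN_TC": "4", "GDN_CHUNK_MIN": "64"}))
```

The obsolete `GDN_CHUNK_OFF` and `GDN_CHUNK_TC` names are deliberately absent, matching the note that current code does not parse them.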
@@ -610,4 +625,4 @@ Same dispatch site - the `if constexpr (!KDA && !keep_rs_t)` block inside `launc

**Risk 3 - Grid starvation at n_seqs=1 is structural (H=32 < the ~96 blocks needed for 2 blk/SM × 48 SM).** Only `DV_TILE=32` (4 slabs) fully saturates a single stream, and it pays ~1.5-2× redundant-A flops (A/A⁻¹/gates recomputed per slab) plus the per-chunk restage. **Kill criterion:** if the M8 bench shows single-stream (n_seqs=1) S_PP is slower than sequential even at full saturation (dv32×4) due to redundant-A + restage overhead, **and** the batched regime (n_seqs≥2) gain also fails to materialize → the lever only helps a regime the target workload doesn't hit → keep default-OFF, ship as opt-in experiment only, record. (If n_seqs≥2 *does* win, ship enabled for the serving regime and gate single-stream back to sequential via `GDN_CHUNK_MIN` + an n_seqs check - a partial, honest win.)

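The starvation arithmetic in Risk 3 can be checked in a few lines. This sketch assumes the slab count is `S_v / DV_TILE` with `S_v = 128` (consistent with the "dv32×4" wording) and the `dim3(H, n_seqs, n_slabs)` grid from the launcher description; the constant and function names are illustrative.

```python
H = 32             # heads, from "H=32"
NUM_SM = 48        # SM count quoted in Risk 3
BLOCKS_PER_SM = 2  # target residency quoted in Risk 3
S_V = 128          # assumed value-head width (the S_v==128 test shapes)


def grid_blocks(dv_tile, n_seqs=1):
    """Total thread blocks for grid = (H, n_seqs, n_slabs)."""
    n_slabs = S_V // dv_tile
    return H * n_seqs * n_slabs


def saturates(dv_tile, n_seqs=1):
    """True when the grid supplies at least BLOCKS_PER_SM blocks per SM."""
    return grid_blocks(dv_tile, n_seqs) >= NUM_SM * BLOCKS_PER_SM


for dv_tile in (128, 64, 32):
    print(dv_tile, grid_blocks(dv_tile), saturates(dv_tile))
```

Under these assumptions only `DV_TILE=32` reaches the 96-block target for a single stream (128 blocks), while `DV_TILE=64` gets there once `n_seqs` is at least 2.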
**Overarching kill gate:** the disposition is the bench, not the theory. The kernel flips to default-on only when it beats the tuned sequential recurrence at the serving regime AND clears the KL + adversarial gates. Any milestone that regresses test-backend-ops or md5-stability halts the stack until fixed; M1 and M0 are the cheap fail-fast exits before the expensive kernel work.
**Overarching kill gate:** the disposition is the bench, not the theory. The kernel flips to default-on only when it beats the tuned sequential recurrence at the serving regime AND clears the KL + adversarial gates. Any milestone that regresses test-backend-ops or md5-stability halts the stack until fixed; M1 and M0 are the cheap fail-fast exits before the expensive kernel work.

@@ -140,6 +140,17 @@ At `N=32` and `N=128`, opt-in slightly widened the vLLM decode gap
(`0.6864x` vs `0.6882x`, and `0.6839x` vs `0.6921x`). Keep
`LLAMA_BF16_CUBLAS_F32_OUT=1` default-off only and move to another lever.

Phase71 revalidated the current shipped GDN tensor-core default before adding
more GDN source work. Artifact:
`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`. Canonical
MoE/dense md5 gates matched for default, sequential-disabled, serial-chunked,
and forced M5 modes; `GATED_DELTA_NET` passed `46/46` for each mode, and
default passed `MUL_MAT 1146/1146` plus `MUL_MAT_ID 806/806`. Current default
beat sequential-disabled by `+5.24%`/`+2.61%` S_PP at `npp=512/2048`, beat
serial-chunked by `+29.43%`/`+42.54%`, and forced `GDN_TC=4 GDN_CHUNK_MIN=64`
was within noise of default (`+0.42%`/`-0.10%`). Decision: keep shipped M5 and
do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.

Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

## 2. Decode-serving compute hypotheses (ranked)

@@ -1836,6 +1847,16 @@ D2/D3/D4 for DECODE are all REJECTED by the methodology's "a faster kernel off t
Honest scope on D1's payoff: at HIGH-concurrency serving the paged GPU is already 83.5% busy because overlapping request streams hide the host stalls, so D1's win concentrates at LOW-concurrency / latency / batch-1 (GPU 4-16% busy), where it is large. The complementary serving-throughput lever is FIXING PREFILL (GDN #101 + MoE GEMM D2/#105): paged's 2x-slower prefill steals serving cycles under continuous batching (~25-55% of the serving step is prefill work) - so the prefill levers ARE also serving-decode levers. GATE: separately-built in-backend A/B (compiled-in, so a runtime flag does NOT isolate it) showing higher static/low-concurrency decode t/s with no high-concurrency-serving regression; bit-exact greedy md5 (graph replay re-issues identical kernels).

### next_3_levers

Post-Phase71 supersession: this ranked list is historical. `0047` already
ships the M5 tensor-core GDN path default-on under paged KV, Phase71
revalidated it against sequential-disabled and serial-chunked baselines, and
Phase10/11/13 rejected the smaller follow-up GDN reorders. Phase41/43 closed
D1 on the current GB10 path unless a fresh route trace proves a host-sync
fallback returned. Phase60/61/66 rejected another small W4A16/direct-A or
quant/gather pass. Treat the list below as pre-Phase60 planning context, not an
active queue.

Ranked, each with its pass-gate:

1) #101 TENSOR-CORE mma CHUNKED GDN PREFILL KERNEL (prefill, GO). #1 prefill-gap contributor (+59 us/tok, ~30%), ~3/4 math (tensor cores help) with 2.62x measured headroom on identical silicon, 1/4 layout folds in; also helps serving decode. GATE: Phase-0 regime already satisfied by this profile; Phase-1 two-Gram-product PoC must move S_PP in a SEPARATELY-BUILT in-backend A/B vs sequential (flat => NO-GO the multi-week build); then KL-gate (tf32/3xtf32) + greedy md5 + adversarial-decay op test; ship opt-in default-off until A/B beats sequential.
