diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index e65686307..83e6c7115 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -839,3 +839,41 @@ Decision:
 - Do not enable MTP by default in LocalAI or llama-server.
 - Do not benchmark MTP as a parity win until a serving/API phase adds rollback
   gates for hybrid SSM/KV state and measures target verification throughput.
+
+## Phase 10 GDN C32 Slab Baseline and Source Check
+
+Phase 10 starts a separate GDN prefill path; it does not reopen the rejected
+decode `GDN_NW/GDN_CPW` grid.
+
+Current M5 baseline artifacts:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_moe_prefill.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_dense_prefill.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/summary_rows.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/provenance.txt`
+
+Current M5 baseline:
+
+| Model | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
+|-------|----|----|---|----------|----------|-------|
+| MoE | 512 | 4 | 32 | 2314.18 | 359.16 | 2220.48 |
+| MoE | 2048 | 4 | 32 | 2439.95 | 389.43 | 2415.16 |
+| Dense | 512 | 4 | 32 | 978.97 | 143.56 | 936.71 |
+| Dense | 2048 | 4 | 32 | 1023.61 | 184.09 | 1014.59 |
+
+Source check:
+
+- A C32 M5 candidate cannot be implemented as a launcher-only shortcut.
+- The current M5 form-T apply path stores one 16-row tile of `U=T*RHS` in
+  registers, syncs, then overwrites `Ud`. That is safe for `C=16`.
+- For `C=32`, a naive two-row-tile loop would overwrite RHS rows before all
+  output rows are computed, and the current apply call only covers rowbase `0`.
+- A correct C32 slab candidate must add a separate staging strategy for all
+  `C*DV_TILE` U values, then run focused `GATED_DELTA_NET` op gates before any
+  S_PP comparison.
+
+Decision:
+
+- Do not ship a Phase 10 source patch yet.
+- Keep the baseline and source check as the entry gate for the next C32 slab
+  implementation task.
diff --git a/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md b/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md
index 0b39883dd..3c53d1202 100644
--- a/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md
+++ b/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md
@@ -26,7 +26,7 @@
 - Read-only: `/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu`
 - Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/`

-- [ ] **Step 1: Check DGX is free**
+- [x] **Step 1: Check DGX is free**

 Run the standard DGX preflight:

@@ -47,7 +47,7 @@ compute=0 FREE...
 ```

-- [ ] **Step 2: Record current source provenance**
+- [x] **Step 2: Record current source provenance**

 Run:

@@ -57,7 +57,7 @@ ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git r

 Expected: clean or only the current phase commit.
-- [ ] **Step 3: Run current M5 prefill baseline**
+- [x] **Step 3: Run current M5 prefill baseline**

 Run MoE and dense prefill at `npp=512` and `npp=2048` with:

@@ -71,12 +71,56 @@ Record S_PP, kernel bucket summaries, and artifacts under:
 /home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/
 ```

+Result:
+
+| Model | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
+|-------|----|----|---|----------|----------|-------|
+| MoE | 512 | 4 | 32 | 2314.18 | 359.16 | 2220.48 |
+| MoE | 2048 | 4 | 32 | 2439.95 | 389.43 | 2415.16 |
+| Dense | 512 | 4 | 32 | 978.97 | 143.56 | 936.71 |
+| Dense | 2048 | 4 | 32 | 1023.61 | 184.09 | 1014.59 |
+
+Artifacts:
+
+- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_moe_prefill.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_dense_prefill.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/summary_rows.txt`
+- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/provenance.txt`
+
 ## Task 2: Add Default-Off C32 Slab Candidate

 **Files:**
 - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu`
 - Mirror: `/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu`

+### Source Inspection Result
+
+- [x] **Step 0: Check whether C32 can reuse the current M5 body**
+
+Result: no safe launcher-only shortcut exists for C32 M5.
+
+The current M5 code path is structurally specialized to `C<=16` in the form-T
+solve/apply stage:
+
+- `gated_delta_net_chunked_cuda` stores the full `U=T*RHS`
+  output in registers before overwriting `Ud`, avoiding read/write aliasing.
+- For `C=16`, one `m16` row tile covers all chunk rows.
+- For `C=32`, there are two row tiles. Writing the first tile to `Ud` before
+  computing the second would corrupt the RHS reads for the second tile.
+- The current code also calls the apply helper with rowbase `0` only in the M5
+  solve path, so a naive `launch_gdn_chunked<128, 32, TC=4>` would be
+  incomplete even if dynamic shared memory fit.
+
+Implication:
+
+- Do not add `GDN_C32_SLAB=1` by only changing launch dimensions.
+- A correct C32 slab patch must first add a separate `U=T*RHS` staging strategy:
+  either a slab-local temporary buffer for all `C*DV_TILE` U values, or a
+  two-pass apply that preserves the original RHS until all row tiles are
+  computed.
+- Because the candidate changes the solve/apply mechanics, it requires a
+  focused `GATED_DELTA_NET` op gate before any prefill A/B.
+
 - [ ] **Step 1: Add an explicit env selector**

 Use an env var such as: