From 2074b4fb5b31f0efd7d42255eccdb274b9e8f73a Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 01:51:53 +0000 Subject: [PATCH] docs(paged): reject GDN global Ai32 prototype Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10. Assisted-by: Codex:gpt-5 --- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 60 ++++++++++ .../docs/GDN_SHARED_AI_COST_MODEL.md | 30 +++++ .../docs/PARITY_HANDOFF.md | 7 +- .../docs/VLLM_PARITY_FINAL.md | 10 +- .../docs/VLLM_PARITY_LEVER_MAP.md | 30 +++++ ...6-07-01-gdn-global-ai-prototype-phase13.md | 106 ++++++++++++++---- ...-07-01-gdn-shared-ai-cost-model-phase12.md | 2 +- 7 files changed, 215 insertions(+), 30 deletions(-) diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 916c4f463..28e0a5af1 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -1025,3 +1025,63 @@ Decision: - Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`. - The prototype must be rejected if it is flat or slower; do not iterate into f16/BF16 Ai unless f32 proves the schedule can win. + +## Phase 13 GDN Global-Ai32 Prototype Rejection + +Phase 13 implemented the Phase 12 design in the llama.cpp fork as a default-off +prototype behind `GDN_GLOBAL_AI32=1`. + +Implementation summary: + +- Added a f32 Ai precompute kernel. +- Added C32, `dv_tile=64` slab consumption through the chunked GDN path. +- Allocated Ai scratch from the ggml CUDA pool only for supported calls. +- Kept the default C16 M5 path unchanged. + +Correctness artifacts: + +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_default.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_global_ai32.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_default.md5` +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_default.md5` +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_global_ai32.md5` +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_global_ai32.md5` + +Correctness result: + +- Default and Global-Ai32 paths matched canonical md5 exactly: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- KL was not needed. + +Performance artifacts: + +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_base.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_global_ai32.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_base.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_global_ai32.txt` + +Performance A/B: + +| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|------|----|----|---|----------|----------|-------| +| MoE | M5 base | 512 | 4 | 32 | 2325.86 | 396.05 | 2241.21 | +| MoE | Global Ai32 | 512 | 4 | 32 | 2106.50 | 398.55 | 2038.78 | +| MoE | M5 base | 2048 | 4 | 32 | 2425.10 | 389.63 | 2400.66 | +| MoE | Global Ai32 | 2048 | 4 | 32 | 2097.76 | 388.40 | 2079.92 | +| Dense | M5 base | 512 | 4 | 32 | 970.62 | 149.89 | 931.10 | +| Dense | Global Ai32 | 512 | 4 | 32 | 876.51 | 149.29 | 844.62 | +| Dense | M5 base | 2048 | 4 | 32 | 1016.14 | 182.16 | 1007.15 | +| Dense | Global Ai32 | 2048 | 4 | 32 | 918.19 | 183.00 | 911.05 | + +Rejected diff: + +- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff` + +Conclusion: + +- Do not ship Phase 13 Global-Ai32 as implemented. +- The global scratch split is correctness-safe but slower than shipped C16 M5. +- Per the Phase 12/13 decision rule, stop GDN kernel work on GB10. The remaining + vLLM GDN advantage requires a fuller FLA-style blocked solve or hardware + assumptions that do not fit this GB10 patch stack without a regression. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md b/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md index 27094f5c8..404f36abc 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md @@ -140,3 +140,33 @@ Phase 13 constraints: - If md5 changes, run KL before benchmarking. - If the prototype is flat or slower, reject it and stop GDN kernel work on GB10; do not iterate into f16 Ai until f32 proves the schedule can win. + +## Phase 13 Result + +Phase 13 implemented the f32 Global-Ai32 prototype and rejected it. + +Correctness: + +- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. + +Performance: + +| Model | Mode | PP | S_PP t/s | +|-------|------|----|----------| +| MoE | M5 base | 2048 | 2425.10 | +| MoE | Global Ai32 | 2048 | 2097.76 | +| Dense | M5 base | 2048 | 1016.14 | +| Dense | Global Ai32 | 2048 | 918.19 | + +Artifacts: + +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/` +- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff` + +Final decision: + +- Reject Global-Ai32. +- Stop GDN kernel work on GB10. The remaining vLLM GDN advantage is not + reachable through the low-conflict C16/C32 patch shapes tested here. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 750e7295e..ece552a5e 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -176,12 +176,13 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga | Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 | | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 | | Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | +| Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | REJECTED | md5-clean, but slower: MoE 2048 2425.10 -> 2097.76; dense 2048 1016.14 -> 918.19 | Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation. -Phase 12 caveat: this is not a shipped win. It authorizes only a default-off -`GDN_GLOBAL_AI32=1` prototype. If Phase 13 is flat/slower, stop GDN kernel work -on GB10 instead of iterating into f16 Ai or more local reorders. +Phase 13 closes the caveat: the default-off `GDN_GLOBAL_AI32=1` prototype was +correctness-clean but slower. Stop GDN kernel work on GB10 instead of iterating +into f16 Ai or more local reorders. ### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady) | Lever | What | Verdict | Key number | diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md index f8f955fde..ffd86a5cf 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md @@ -175,6 +175,7 @@ products through tensor cores. The series chased that headroom. | Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab | | Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab | | Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model | +| Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | **REJECTED** | md5-clean, but S_PP regressed: MoE 2048 **2425.10 -> 2097.76**, dense 2048 **1016.14 -> 918.19** | phase13 gates/ab | **Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2) intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not @@ -186,11 +187,10 @@ intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly that and is the shipped winner; it does not fully close the 2.62x because vLLM's mature FLA blocked-solve is a more complete tensor-core implementation. -Post-record caveat: Phase 12 does not change the shipped verdict. It permits one -default-off `GDN_GLOBAL_AI32=1` prototype because global f32 Ai scratch is large -but not automatically disqualifying. If that prototype is flat or slower, GDN -kernel work on GB10 should stop rather than moving to f16 Ai or additional -local reorders. +Post-record caveat closed: Phase 13 tested the one permitted +`GDN_GLOBAL_AI32=1` prototype. It was correctness-clean but slower, so GDN kernel +work on GB10 should stop rather than moving to f16 Ai or additional local +reorders. ### 2c. DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact) diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index 59d410d9b..5f838c726 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -549,6 +549,36 @@ Docs: - `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md` - `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md` +### Phase 13 GDN Global-Ai32 update + +Phase 13 implemented the Phase 12 prototype behind `GDN_GLOBAL_AI32=1`: +precompute f32 Ai once per chunk/head, then consume it from two C32 +`dv_tile=64` value slabs. + +Result: + +- Correctness passed: + MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`. +- Performance regressed: + - MoE 2048 S_PP `2425.10 -> 2097.76`. + - Dense 2048 S_PP `1016.14 -> 918.19`. + +Decision: + +- **REJECT** Global-Ai32. +- Do not add `0055`. +- Stop GDN kernel work on GB10. The shortcut space is exhausted by Phase 10, + Phase 11, and Phase 13 evidence; further GDN parity work needs a different + hardware regime or a larger FLA/CuteDSL-class implementation outside this + low-conflict LocalAI patch stack. + +Artifacts: + +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/` +- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff` + --- # PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026) diff --git a/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md b/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md index e7f99e9fc..448cae3bc 100644 --- a/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md +++ b/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md @@ -26,7 +26,7 @@ - Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` - Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/` -- [ ] **Step 1: Check DGX is free** +- [x] **Step 1: Check DGX is free** Run: @@ -47,7 +47,7 @@ compute=0 FREE... ``` -- [ ] **Step 2: Record provenance** +- [x] **Step 2: Record provenance** Run: @@ -59,7 +59,7 @@ ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git r Expected: both llama.cpp trees are clean. -- [ ] **Step 3: Create artifacts** +- [x] **Step 3: Create artifacts** Run: @@ -74,7 +74,7 @@ Expected: command exits 0. **Files:** - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` -- [ ] **Step 1: Add env selector in `ggml_cuda_op_gated_delta_net`** +- [x] **Step 1: Add env selector in `ggml_cuda_op_gated_delta_net`** Add after `keep_rs` is computed: @@ -85,7 +85,7 @@ static const bool gdn_global_ai32 = []{ }(); ``` -- [ ] **Step 2: Allocate Ai scratch only for supported calls** +- [x] **Step 2: Allocate Ai scratch only for supported calls** Add: @@ -102,7 +102,7 @@ if (gdn_global_ai32 && !kda && !keep_rs && S_v == 128 && n_tokens > 1) { Pass `ai32_d` and `ai32_chunks` into the non-KDA/non-keep launch call only. Other launch calls pass `nullptr, 0`. -- [ ] **Step 3: Extend `launch_gated_delta_net` signature** +- [x] **Step 3: Extend `launch_gated_delta_net` signature** Change the signature to include: @@ -117,7 +117,7 @@ before `float scale`. Thread these through all four call sites. **Files:** - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` -- [ ] **Step 1: Add `gdn_ai32_cuda`** +- [x] **Step 1: Add `gdn_ai32_cuda`** Add a kernel near `gated_delta_net_chunked_cuda`: @@ -160,7 +160,7 @@ if (j < C) { Use fixed stride `C` in scratch, zeroing out-of-range tail rows/columns. -- [ ] **Step 2: Add launcher** +- [x] **Step 2: Add launcher** Add: @@ -187,7 +187,7 @@ Dynamic smem: **Files:** - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` -- [ ] **Step 1: Add `gated_delta_net_chunked_ai32_cuda`** +- [x] **Step 1: Add `gated_delta_net_chunked_ai32_cuda`** Add a separate kernel rather than overloading the shipped M5 body: @@ -217,6 +217,14 @@ __global__ void gated_delta_net_chunked_ai32_cuda( } ``` +Result: + +- Implemented as a template extension of `gated_delta_net_chunked_cuda` instead + of a separately named kernel, to keep the patch smaller. +- Candidate was selected with `GDN_GLOBAL_AI32=1`. +- The implementation used C32, two `DV_TILE=64` slabs, f32 Ai scratch, and the + Phase 10 tail-row zeroing fix. + Use the Phase 10 tail-row fix: ```cpp @@ -229,7 +237,7 @@ and use full state stride for reads/writes: (int64_t) seq * H * S_v * S_v + (int64_t) h_idx * S_v * S_v ``` -- [ ] **Step 2: Add launcher** +- [x] **Step 2: Add launcher** Add: @@ -250,15 +258,19 @@ The smem formula must stay under the C32 slab Phase 10 budget: ```cpp ((size_t) S_v * DV_TILE + (size_t) 2 * C * S_v + (size_t) DV_TILE * C + (size_t) C * C + (size_t) 3 * C + (size_t) C * C - + (size_t) DV_TILE * C) * sizeof(float) ++ (size_t) DV_TILE * C) * sizeof(float) ``` +Result: + +- DGX build confirmed the smem shape compiled. + ## Task 5: Route Candidate **Files:** - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` -- [ ] **Step 1: Add route in `launch_gated_delta_net`** +- [x] **Step 1: Add route in `launch_gated_delta_net`** Before the existing `GDN_CHUNKED_LAUNCH` switch: @@ -273,7 +285,7 @@ if (ai32_d != nullptr && ai32_chunks > 0 && S_v == 128 && n_tokens >= gdn_chunk_ The route must require `!KDA && !keep_rs_t` via the existing template branch and must not trigger for decode-sized calls. -- [ ] **Step 2: Keep default path unchanged** +- [x] **Step 2: Keep default path unchanged** Run: @@ -283,12 +295,17 @@ git diff -- ggml/src/ggml-cuda/gated_delta_net.cu Check that default `GDN_TC=5` still launches `launch_gdn_chunked<128, 16, 4>`. +Result: + +- Default route stayed current M5. +- Candidate route required non-null Ai scratch from `GDN_GLOBAL_AI32=1`. + ## Task 6: Build and Correctness Gates **Files:** - Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/gates/` -- [ ] **Step 1: Mirror and build** +- [x] **Step 1: Mirror and build** Run: @@ -300,7 +317,7 @@ ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . Expected: build exits 0. -- [ ] **Step 2: Run op gates** +- [x] **Step 2: Run op gates** Run: @@ -313,7 +330,7 @@ GDN_GLOBAL_AI32=1 GDN_TC=5 GDN_CHUNK_MIN=2 ./test-backend-ops test -b CUDA0 -o G Expected: both logs show CUDA0 OK for all cases. -- [ ] **Step 3: Run canonical md5 gates** +- [x] **Step 3: Run canonical md5 gates** Run default and candidate MoE/dense completion gates. Expected: @@ -324,12 +341,26 @@ Dense 5951a5b4d624ce891e22ab5fca9bc439 If candidate md5 differs, run the KL gate before benchmarking. +Result: + +- Build passed for `test-backend-ops`, `llama-completion`, and + `llama-batched-bench`. +- Default and forced `GDN_GLOBAL_AI32=1` op gates both reported the same OK + count. +- Default md5: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- Global-Ai32 md5: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- KL was not needed. + ## Task 7: Performance A/B **Files:** - Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/ab/` -- [ ] **Step 1: Run same-session A/B** +- [x] **Step 1: Run same-session A/B** Run MoE and dense: @@ -346,7 +377,7 @@ Use: Expected: candidate improves S_PP without dense regression. -- [ ] **Step 2: Decide** +- [x] **Step 2: Decide** Accept only if: @@ -357,6 +388,32 @@ Accept only if: Reject if flat or slower. +Result: + +Artifacts: + +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_base.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_global_ai32.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_base.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_global_ai32.txt` + +| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|------|----|----|---|----------|----------|-------| +| MoE | M5 base | 512 | 4 | 32 | 2325.86 | 396.05 | 2241.21 | +| MoE | Global Ai32 | 512 | 4 | 32 | 2106.50 | 398.55 | 2038.78 | +| MoE | M5 base | 2048 | 4 | 32 | 2425.10 | 389.63 | 2400.66 | +| MoE | Global Ai32 | 2048 | 4 | 32 | 2097.76 | 388.40 | 2079.92 | +| Dense | M5 base | 512 | 4 | 32 | 970.62 | 149.89 | 931.10 | +| Dense | Global Ai32 | 512 | 4 | 32 | 876.51 | 149.29 | 844.62 | +| Dense | M5 base | 2048 | 4 | 32 | 1016.14 | 182.16 | 1007.15 | +| Dense | Global Ai32 | 2048 | 4 | 32 | 918.19 | 183.00 | 911.05 | + +Decision: + +- Reject the global-Ai32 source patch. +- The candidate is correctness-clean but slower in both model families. +- The global scratch/Ai split is not enough to beat the shipped C16 M5 on GB10. + ## Task 8: Mirror or Reject **Files:** @@ -366,7 +423,7 @@ Reject if flat or slower. - Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md` - Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` -- [ ] **Step 1: If accepted, commit fork patch and generate LocalAI patch** +- [x] **Step 1: If accepted, commit fork patch and generate LocalAI patch** Run: @@ -377,7 +434,7 @@ git -C /home/mudler/_git/llama.cpp format-patch -1 HEAD --stdout \ > backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-GDN-global-Ai32-prefill-prototype.patch ``` -- [ ] **Step 2: If rejected, save diff and restore** +- [x] **Step 2: If rejected, save diff and restore** Run: @@ -388,7 +445,7 @@ git -C /home/mudler/_git/llama.cpp checkout -- ggml/src/ggml-cuda/gated_delta_ne ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git checkout -- ggml/src/ggml-cuda/gated_delta_net.cu' ``` -- [ ] **Step 3: Commit LocalAI docs** +- [x] **Step 3: Commit LocalAI docs** Commit accepted patch/docs or rejected docs with: @@ -396,3 +453,10 @@ Commit accepted patch/docs or rejected docs with: git commit -m "docs(paged): record GDN global Ai32 result" \ -m "Assisted-by: Codex:gpt-5" ``` + +Result: + +- No fork commit and no `0055` LocalAI patch were generated. +- Rejected diff saved at: + `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`. +- llama.cpp fork and DGX mirror were restored to the prior accepted state. diff --git a/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md b/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md index 5ecc56fd6..a32f6dc67 100644 --- a/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md +++ b/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md @@ -303,7 +303,7 @@ Result: - `/home/mudler/_git/llama.cpp` was clean. - DGX metadata artifact existed and contained MoE/dense GGUF metadata. -- [ ] **Step 2: Commit docs** +- [x] **Step 2: Commit docs** For GO: