From 85c88320ef0b8340ca2a1fa99293f73e5b13cada Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Tue, 30 Jun 2026 22:15:21 +0000 Subject: [PATCH] patches(paged): pad W4A16 A shared tile stride Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification. Assisted-by: Codex:gpt-5 --- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 51 ++++++++++++++-- .../docs/PARITY_HANDOFF.md | 8 +-- ...paged-pad-W4A16-A-shared-tile-stride.patch | 56 ++++++++++++++++++ .../2026-06-30-w4a16-shmem-pad-phase4.md | 58 +++++++++++++++++++ 4 files changed, 164 insertions(+), 9 deletions(-) create mode 100644 backend/cpp/llama-cpp-localai-paged/patches/paged/0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch create mode 100644 docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index b98eb1f64..b060caa49 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -270,6 +270,47 @@ Result: - Do not retry this exact scale-broadcast approach; on GB10 the shuffle and/or scheduling cost exceeds the saved duplicate scale conversion. +## W4A16 Shared-Memory Padding Phase 4 + +Goal: reduce bank pressure in `w4a16_grouped_kernel` by padding the A operand +shared-memory row stride while preserving math order and launch shape. + +Fork commit: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` +(`feat(paged): pad W4A16 A shared tile stride`). + +LocalAI patch mirror: `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`. + +Artifacts: + +- Build: `~/llama-w4a16-phase4` +- Logs: `~/bench/w4a16_phase4` + +Gates: + +- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Forced W4A16 `bm32` and old `base` shape md5s matched each other: + `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0. + +Performance: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline | +| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | selected | +| Phase 2 `base` | 1310.13 | 1336.02 | baseline | +| Phase 4 A-pad `base` | 1337.88 | 1364.98 | positive diagnostic | + +Result: + +- Kept. Default W4A16 `bm32` improves another `+1.7%` at `npp=512` and + `+1.6%` at `npp=2048` versus Phase 2. +- Applying all 41 LocalAI `patches/paged/*.patch` files to base pin + `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD + `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`. +- Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087`. + ## Clean Build First clean build attempt: @@ -311,16 +352,16 @@ Second clean build attempt: - Local llama.cpp fork: `/home/mudler/_git/llama.cpp` - Branch: `localai-paged` -- Working tree: clean after fork commit `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` +- Working tree: clean after fork commit `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` - Phase 0 HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc` -- Current HEAD: `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` +- Current HEAD: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` - Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd` - Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd` -- LocalAI patch count: `38` at Phase 0; current mirror count is `40` after - patch `0049`. +- LocalAI patch count: `38` at Phase 0; current mirror count is `41` after + patch `0050`. - LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork HEAD. -- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9` +- Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087` ## Existing Artifact Gap Review diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index b219852fa..c47d36865 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -58,7 +58,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It - **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export. - **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source. - The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated. -- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `7dfa0e175` is mirrored by worktree patch `0049-feat-paged-tune-W4A16-grouped-tile-shape.patch`; W4A16 packed metadata is worktree patch `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.) +- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `d9b9be0be` is mirrored by worktree patch `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`; W4A16 grouped tile shape is worktree patch `0049`, packed metadata is `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.) ### 2.6 Bench hygiene gates - **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS"). @@ -253,14 +253,14 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual ## 7. KEY FILE / ARTIFACT INDEX ### Fork (canonical source of truth) -- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` ("tune W4A16 grouped tile shape", patch `0049`). The DGX checkout itself may still be dirty and must not be treated as canonical. +- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` ("pad W4A16 A shared tile stride", patch `0050`). The DGX checkout itself may still be dirty and must not be treated as canonical. - `dgx:~/llama-paged-dev` (experimental dev/build tree), branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). **Dirty** + many untracked profiling artifacts. This tree's `build-cuda/bin/` produced the benchmarked binaries; `COMBINED_DEFINITIVE` recorded `GIT_HEAD=a7d439e` (the M8 bf16 dev config), NOT the fork HEAD. The dev tree carries bf16/hybrid M6/M7/M8 machinery deliberately EXCLUDED from the shipped f32-only series. ### LocalAI worktree - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (199 ahead, 25 behind origin/master; the ahead count grows with each new commit). - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`). - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`. -- `patches/paged/`: **40** `.patch` files spanning 0001-0049 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default. +- `patches/paged/`: **41** `.patch` files spanning 0001-0050 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default; W4A16 A shared-tile padding is 0050. ### Bench artifacts (DGX) - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - the definitive same-session both-engine run. @@ -276,7 +276,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels) 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building. -2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`7dfa0e175`, patch 0049) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers. +2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`d9b9be0be`, patch 0050) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers. 3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`. 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch). 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign. diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch new file mode 100644 index 000000000..b69a895be --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch @@ -0,0 +1,56 @@ +From d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Tue, 30 Jun 2026 22:13:09 +0000 +Subject: [PATCH] feat(paged): pad W4A16 A shared tile stride + +Pad the grouped W4A16 A operand shared-memory row stride to reduce bank pressure on GB10 while preserving the selected bm32/default and base launch shapes. + +Assisted-by: Codex:gpt-5 +--- + ggml/src/ggml-cuda/w4a16-gemm.cu | 9 +++++---- + 1 file changed, 5 insertions(+), 4 deletions(-) + +diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu +index ca8864292..b17e022b0 100644 +--- a/ggml/src/ggml-cuda/w4a16-gemm.cu ++++ b/ggml/src/ggml-cuda/w4a16-gemm.cu +@@ -111,7 +111,8 @@ static __global__ void w4a16_grouped_kernel( + constexpr int MF = WM/16, NF = WN/8; + + constexpr int AN = BK/2; // bf16 pairs per A smem row (nv_bfloat162) +- constexpr int SZ_A = BM*AN; // nv_bfloat162 per stage ++ constexpr int ASTR = AN + 4; // padded A smem row stride, in nv_bfloat162 ++ constexpr int SZ_A = BM*ASTR; // nv_bfloat162 per stage + constexpr int SZ_WQ = BN*8; // u32 per stage (32 qs bytes/row) + constexpr int SZ_WD = BN; // u32 per stage (4 scale bytes/row) + +@@ -155,7 +156,7 @@ static __global__ void w4a16_grouped_kernel( + const int c = idx % (BK/8); // 16B chunk in the row + const int r = idx / (BK/8); // row in tile + const nv_bfloat16 * src = Abf + (int64_t)(row0 + r)*K + (int64_t)kt*BK + c*8; +- w4a16_cp_async16(((char *) sA[st]) + (r*AN + c*4)*sizeof(uint32_t), src); ++ w4a16_cp_async16(((char *) sA[st]) + (r*ASTR + c*4)*sizeof(uint32_t), src); + } + // W qs: BN rows x 32 bytes = BN x 8 u32 (each block's qs at byte offset 4) + #pragma unroll 1 +@@ -198,7 +199,7 @@ static __global__ void w4a16_grouped_kernel( + #pragma unroll + for (int mi = 0; mi < MF; mi++) { + const int rb = wrow*WM + mi*16; +- load_ldmatrix(A_frag[mi], sAcur + rb*AN + kk*8, AN); ++ load_ldmatrix(A_frag[mi], sAcur + rb*ASTR + kk*8, ASTR); + } + // B fragments: in-register FP4->bf16 dequant (correct-by-construction via tile get_i/get_j) + tile_B B_frag[NF]; +@@ -368,7 +369,7 @@ static void ggml_cuda_mul_mat_id_w4a16_grouped_impl( + const int64_t expert_stride_blocks = (int64_t) (src0->nb[2] / sizeof(block_nvfp4)); + + auto kern = w4a16_grouped_kernel; +- constexpr int STAGE_U32 = BM*(64/2) + BN*8 + BN; ++ constexpr int STAGE_U32 = BM*((64/2) + 4) + BN*8 + BN; + const int smem_bytes = STAGES * STAGE_U32 * (int) sizeof(uint32_t); + CUDA_CHECK(cudaFuncSetAttribute(kern, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes)); + +-- +2.43.0 + diff --git a/docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md b/docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md new file mode 100644 index 000000000..94e75370c --- /dev/null +++ b/docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md @@ -0,0 +1,58 @@ +# W4A16 Shared-Memory Padding Phase 4 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing. + +**Goal:** Test whether padding the grouped W4A16 A tile in shared memory reduces bank conflicts after Phase 2 selected `bm32`. + +**Scope:** Fork-first experiment only. Keep the patch small, preserve math order, and ship no patch unless it passes md5/op gates and improves prefill. + +## Task 1: Implement A-Tile Padding + +- [x] Add a small shared-memory row-stride constant for `sA`. +- [x] Pad `sA` rows by 4 `uint32_t` slots while keeping 16-byte chunk alignment. +- [x] Update only A-copy and `ldmatrix` indexing; do not change W loads, dequant, MMA order, metadata, or launch shape. + +## Task 2: Gates + +- [x] Build `llama-batched-bench`, `llama-completion`, and `test-backend-ops` on DGX. +- [x] Run canonical default-off paged MoE and dense greedy md5 gates. +- [x] Run forced W4A16 `bm32` vs `base` md5 gates. +- [x] Run forced W4A16 `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1`. +- [x] Run W4A16 default `bm32` A/B against Phase 2 at `npp=512,2048`. + +## Task 3: Disposition + +- [x] Keep only if it improves W4A16 prefill by at least 1% at either `npp=512` or `npp=2048` without regressing the other by more than 1%. +- [x] If kept, commit fork-first with `Assisted-by: Codex:gpt-5`, generate patch `0050`, verify mirror tree hash, update docs, and commit LocalAI. +- [ ] If rejected, revert the fork experiment and record the result without adding a patch. + +Result: kept as fork commit `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` and LocalAI patch `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`. + +Artifacts: + +- Build: `~/llama-w4a16-phase4` +- Logs: `~/bench/w4a16_phase4` + +Gates: + +- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Forced W4A16 `bm32` md5: `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `base` md5: `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0. + +Performance: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline | +| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | selected | +| Phase 2 `base` | 1310.13 | 1336.02 | baseline | +| Phase 4 A-pad `base` | 1337.88 | 1364.98 | positive diagnostic | + +Mirror verification: + +- Applying all 41 `patches/paged/*.patch` files to base pin + `0ed235ea2c17a19fc8238668653946721ed136fd` reproduces fork HEAD + `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` by tree hash: + `8fcb151e0620fd0fc82b80c04318e5c34320b087`.