patches(paged): pad W4A16 A shared tile stride

Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification. Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-06-30 22:15:21 +00:00
parent 8b413d1cbd
commit 85c88320ef
4 changed files with 164 additions and 9 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -270,6 +270,47 @@ Result:
 - Do not retry this exact scale-broadcast approach; on GB10 the shuffle and/or
  scheduling cost exceeds the saved duplicate scale conversion.

+## W4A16 Shared-Memory Padding Phase 4
+
+Goal: reduce bank pressure in `w4a16_grouped_kernel` by padding the A operand
+shared-memory row stride while preserving math order and launch shape.
+
+Fork commit: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
+(`feat(paged): pad W4A16 A shared tile stride`).
+
+LocalAI patch mirror: `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`.
+
+Artifacts:
+
+- Build: `~/llama-w4a16-phase4`
+- Logs: `~/bench/w4a16_phase4`
+
+Gates:
+
+- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Forced W4A16 `bm32` and old `base` shape md5s matched each other:
+  `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Performance:
+
+| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
+|-------|--------------|---------------|----------|
+| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline |
+| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | selected |
+| Phase 2 `base` | 1310.13 | 1336.02 | baseline |
+| Phase 4 A-pad `base` | 1337.88 | 1364.98 | positive diagnostic |
+
+Result:
+
+- Kept. Default W4A16 `bm32` improves another `+1.7%` at `npp=512` and
+  `+1.6%` at `npp=2048` versus Phase 2.
+- Applying all 41 LocalAI `patches/paged/*.patch` files to base pin
+  `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD
+  `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`.
+- Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087`.
+
 ## Clean Build

 First clean build attempt:
@@ -311,16 +352,16 @@ Second clean build attempt:

 - Local llama.cpp fork: `/home/mudler/_git/llama.cpp`
 - Branch: `localai-paged`
- Working tree: clean after fork commit `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`
+- Working tree: clean after fork commit `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
 - Phase 0 HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
- Current HEAD: `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`
+- Current HEAD: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
 - Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
 - Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
- LocalAI patch count: `38` at Phase 0; current mirror count is `40` after
-  patch `0049`.
+- LocalAI patch count: `38` at Phase 0; current mirror count is `41` after
+  patch `0050`.
 - LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork
  HEAD.
- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`
+- Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087`

 ## Existing Artifact Gap Review

--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -58,7 +58,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
 - **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
 - **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
 - The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `7dfa0e175` is mirrored by worktree patch `0049-feat-paged-tune-W4A16-grouped-tile-shape.patch`; W4A16 packed metadata is worktree patch `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.)
+- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `d9b9be0be` is mirrored by worktree patch `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`; W4A16 grouped tile shape is worktree patch `0049`, packed metadata is `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.)

 ### 2.6 Bench hygiene gates
 - **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -253,14 +253,14 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 ## 7. KEY FILE / ARTIFACT INDEX

 ### Fork (canonical source of truth)
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` ("tune W4A16 grouped tile shape", patch `0049`). The DGX checkout itself may still be dirty and must not be treated as canonical.
+- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` ("pad W4A16 A shared tile stride", patch `0050`). The DGX checkout itself may still be dirty and must not be treated as canonical.
 - `dgx:~/llama-paged-dev` (experimental dev/build tree), branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). **Dirty** + many untracked profiling artifacts. This tree's `build-cuda/bin/` produced the benchmarked binaries; `COMBINED_DEFINITIVE` recorded `GIT_HEAD=a7d439e` (the M8 bf16 dev config), NOT the fork HEAD. The dev tree carries bf16/hybrid M6/M7/M8 machinery deliberately EXCLUDED from the shipped f32-only series.

 ### LocalAI worktree
 - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (199 ahead, 25 behind origin/master; the ahead count grows with each new commit).
 - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
 - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
- `patches/paged/`: **40** `.patch` files spanning 0001-0049 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default.
+- `patches/paged/`: **41** `.patch` files spanning 0001-0050 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default; W4A16 A shared-tile padding is 0050.

 ### Bench artifacts (DGX)
 - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - the definitive same-session both-engine run.
@@ -276,7 +276,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual

 ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
-2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`7dfa0e175`, patch 0049) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
+2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`d9b9be0be`, patch 0050) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
 3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`.
 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch
@@ -0,0 +1,56 @@
+From d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 30 Jun 2026 22:13:09 +0000
+Subject: [PATCH] feat(paged): pad W4A16 A shared tile stride
+
+Pad the grouped W4A16 A operand shared-memory row stride to reduce bank pressure on GB10 while preserving the selected bm32/default and base launch shapes.
+
+Assisted-by: Codex:gpt-5
+---
+ ggml/src/ggml-cuda/w4a16-gemm.cu | 9 +++++----
+ 1 file changed, 5 insertions(+), 4 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu
+index ca8864292..b17e022b0 100644
+--- a/ggml/src/ggml-cuda/w4a16-gemm.cu
+++ b/ggml/src/ggml-cuda/w4a16-gemm.cu
+@@ -111,7 +111,8 @@ static __global__ void w4a16_grouped_kernel(
+     constexpr int MF = WM/16, NF = WN/8;
+ 
+     constexpr int AN  = BK/2;                 // bf16 pairs per A smem row (nv_bfloat162)
+-    constexpr int SZ_A   = BM*AN;             // nv_bfloat162 per stage
+    constexpr int ASTR = AN + 4;              // padded A smem row stride, in nv_bfloat162
+    constexpr int SZ_A   = BM*ASTR;           // nv_bfloat162 per stage
+     constexpr int SZ_WQ  = BN*8;              // u32 per stage (32 qs bytes/row)
+     constexpr int SZ_WD  = BN;                // u32 per stage (4 scale bytes/row)
+ 
+@@ -155,7 +156,7 @@ static __global__ void w4a16_grouped_kernel(
+             const int c = idx % (BK/8);          // 16B chunk in the row
+             const int r = idx / (BK/8);          // row in tile
+             const nv_bfloat16 * src = Abf + (int64_t)(row0 + r)*K + (int64_t)kt*BK + c*8;
+-            w4a16_cp_async16(((char *) sA[st]) + (r*AN + c*4)*sizeof(uint32_t), src);
+            w4a16_cp_async16(((char *) sA[st]) + (r*ASTR + c*4)*sizeof(uint32_t), src);
+         }
+         // W qs: BN rows x 32 bytes = BN x 8 u32 (each block's qs at byte offset 4)
+ #pragma unroll 1
+@@ -198,7 +199,7 @@ static __global__ void w4a16_grouped_kernel(
+ #pragma unroll
+             for (int mi = 0; mi < MF; mi++) {
+                 const int rb = wrow*WM + mi*16;
+-                load_ldmatrix(A_frag[mi], sAcur + rb*AN + kk*8, AN);
+                load_ldmatrix(A_frag[mi], sAcur + rb*ASTR + kk*8, ASTR);
+             }
+             // B fragments: in-register FP4->bf16 dequant (correct-by-construction via tile get_i/get_j)
+             tile_B B_frag[NF];
+@@ -368,7 +369,7 @@ static void ggml_cuda_mul_mat_id_w4a16_grouped_impl(
+     const int64_t expert_stride_blocks = (int64_t) (src0->nb[2] / sizeof(block_nvfp4));
+ 
+     auto kern = w4a16_grouped_kernel<BM, BN, WARPS_M, WARPS_N, STAGES>;
+-    constexpr int STAGE_U32 = BM*(64/2) + BN*8 + BN;
+    constexpr int STAGE_U32 = BM*((64/2) + 4) + BN*8 + BN;
+     const int smem_bytes = STAGES * STAGE_U32 * (int) sizeof(uint32_t);
+     CUDA_CHECK(cudaFuncSetAttribute(kern, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes));
+ 
+-- 
+2.43.0
+