From 85c88320ef0b8340ca2a1fa99293f73e5b13cada Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 30 Jun 2026 22:15:21 +0000
Subject: [PATCH] patches(paged): pad W4A16 A shared tile stride

Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification.

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        | 51 ++++++++++++++--
 .../docs/PARITY_HANDOFF.md                    |  8 +--
 ...paged-pad-W4A16-A-shared-tile-stride.patch | 56 ++++++++++++++++++
 .../2026-06-30-w4a16-shmem-pad-phase4.md      | 58 +++++++++++++++++++
 4 files changed, 164 insertions(+), 9 deletions(-)
 create mode 100644 backend/cpp/llama-cpp-localai-paged/patches/paged/0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch
 create mode 100644 docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index b98eb1f64..b060caa49 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -270,6 +270,47 @@ Result:
 - Do not retry this exact scale-broadcast approach; on GB10 the shuffle and/or
   scheduling cost exceeds the saved duplicate scale conversion.
 
+## W4A16 Shared-Memory Padding Phase 4
+
+Goal: reduce bank pressure in `w4a16_grouped_kernel` by padding the A operand
+shared-memory row stride while preserving math order and launch shape.
+
+Fork commit: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
+(`feat(paged): pad W4A16 A shared tile stride`).
+
+LocalAI patch mirror: `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`.
+
+Artifacts:
+
+- Build: `~/llama-w4a16-phase4`
+- Logs: `~/bench/w4a16_phase4`
+
+Gates:
+
+- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Forced W4A16 `bm32` and old `base` shape md5s matched each other:
+  `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Performance:
+
+| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
+|-------|--------------|---------------|----------|
+| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline |
+| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | selected |
+| Phase 2 `base` | 1310.13 | 1336.02 | baseline |
+| Phase 4 A-pad `base` | 1337.88 | 1364.98 | positive diagnostic |
+
+Result:
+
+- Kept. Default W4A16 `bm32` improves another `+1.7%` at `npp=512` and
+  `+1.6%` at `npp=2048` versus Phase 2.
+- Applying all 41 LocalAI `patches/paged/*.patch` files to base pin
+  `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD
+  `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`.
+- Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087`.
+
 ## Clean Build
 
 First clean build attempt:
@@ -311,16 +352,16 @@ Second clean build attempt:
 
 - Local llama.cpp fork: `/home/mudler/_git/llama.cpp`
 - Branch: `localai-paged`
-- Working tree: clean after fork commit `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`
+- Working tree: clean after fork commit `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
 - Phase 0 HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
-- Current HEAD: `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`
+- Current HEAD: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
 - Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
 - Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
-- LocalAI patch count: `38` at Phase 0; current mirror count is `40` after
-  patch `0049`.
+- LocalAI patch count: `38` at Phase 0; current mirror count is `41` after
+  patch `0050`.
 - LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork
   HEAD.
-- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`
+- Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087`
 
 ## Existing Artifact Gap Review
 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index b219852fa..c47d36865 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -58,7 +58,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
 - **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
 - **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
 - The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
-- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `7dfa0e175` is mirrored by worktree patch `0049-feat-paged-tune-W4A16-grouped-tile-shape.patch`; W4A16 packed metadata is worktree patch `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.)
+- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `d9b9be0be` is mirrored by worktree patch `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`; W4A16 grouped tile shape is worktree patch `0049`, packed metadata is `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.)
 
 ### 2.6 Bench hygiene gates
 - **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -253,14 +253,14 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 ## 7. KEY FILE / ARTIFACT INDEX
 
 ### Fork (canonical source of truth)
-- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` ("tune W4A16 grouped tile shape", patch `0049`). The DGX checkout itself may still be dirty and must not be treated as canonical.
+- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` ("pad W4A16 A shared tile stride", patch `0050`). The DGX checkout itself may still be dirty and must not be treated as canonical.
 - `dgx:~/llama-paged-dev` (experimental dev/build tree), branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). **Dirty** + many untracked profiling artifacts. This tree's `build-cuda/bin/` produced the benchmarked binaries; `COMBINED_DEFINITIVE` recorded `GIT_HEAD=a7d439e` (the M8 bf16 dev config), NOT the fork HEAD. The dev tree carries bf16/hybrid M6/M7/M8 machinery deliberately EXCLUDED from the shipped f32-only series.
 
 ### LocalAI worktree
 - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (199 ahead, 25 behind origin/master; the ahead count grows with each new commit).
 - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
 - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
-- `patches/paged/`: **40** `.patch` files spanning 0001-0049 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default.
+- `patches/paged/`: **41** `.patch` files spanning 0001-0050 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default; W4A16 A shared-tile padding is 0050.
 
 ### Bench artifacts (DGX)
 - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - the definitive same-session both-engine run.
@@ -276,7 +276,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 
 ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
-2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`7dfa0e175`, patch 0049) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
+2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`d9b9be0be`, patch 0050) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
 3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`.
 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch
new file mode 100644
index 000000000..b69a895be
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch
@@ -0,0 +1,56 @@
+From d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 30 Jun 2026 22:13:09 +0000
+Subject: [PATCH] feat(paged): pad W4A16 A shared tile stride
+
+Pad the grouped W4A16 A operand shared-memory row stride to reduce bank pressure on GB10 while preserving the selected bm32/default and base launch shapes.
+
+Assisted-by: Codex:gpt-5
+---
+ ggml/src/ggml-cuda/w4a16-gemm.cu | 9 +++++----
+ 1 file changed, 5 insertions(+), 4 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu
+index ca8864292..b17e022b0 100644
+--- a/ggml/src/ggml-cuda/w4a16-gemm.cu
++++ b/ggml/src/ggml-cuda/w4a16-gemm.cu
+@@ -111,7 +111,8 @@ static __global__ void w4a16_grouped_kernel(
+     constexpr int MF = WM/16, NF = WN/8;
+ 
+     constexpr int AN  = BK/2;                 // bf16 pairs per A smem row (nv_bfloat162)
+-    constexpr int SZ_A   = BM*AN;             // nv_bfloat162 per stage
++    constexpr int ASTR = AN + 4;              // padded A smem row stride, in nv_bfloat162
++    constexpr int SZ_A   = BM*ASTR;           // nv_bfloat162 per stage
+     constexpr int SZ_WQ  = BN*8;              // u32 per stage (32 qs bytes/row)
+     constexpr int SZ_WD  = BN;                // u32 per stage (4 scale bytes/row)
+ 
+@@ -155,7 +156,7 @@ static __global__ void w4a16_grouped_kernel(
+             const int c = idx % (BK/8);          // 16B chunk in the row
+             const int r = idx / (BK/8);          // row in tile
+             const nv_bfloat16 * src = Abf + (int64_t)(row0 + r)*K + (int64_t)kt*BK + c*8;
+-            w4a16_cp_async16(((char *) sA[st]) + (r*AN + c*4)*sizeof(uint32_t), src);
++            w4a16_cp_async16(((char *) sA[st]) + (r*ASTR + c*4)*sizeof(uint32_t), src);
+         }
+         // W qs: BN rows x 32 bytes = BN x 8 u32 (each block's qs at byte offset 4)
+ #pragma unroll 1
+@@ -198,7 +199,7 @@ static __global__ void w4a16_grouped_kernel(
+ #pragma unroll
+             for (int mi = 0; mi < MF; mi++) {
+                 const int rb = wrow*WM + mi*16;
+-                load_ldmatrix(A_frag[mi], sAcur + rb*AN + kk*8, AN);
++                load_ldmatrix(A_frag[mi], sAcur + rb*ASTR + kk*8, ASTR);
+             }
+             // B fragments: in-register FP4->bf16 dequant (correct-by-construction via tile get_i/get_j)
+             tile_B B_frag[NF];
+@@ -368,7 +369,7 @@ static void ggml_cuda_mul_mat_id_w4a16_grouped_impl(
+     const int64_t expert_stride_blocks = (int64_t) (src0->nb[2] / sizeof(block_nvfp4));
+ 
+     auto kern = w4a16_grouped_kernel<BM, BN, WARPS_M, WARPS_N, STAGES>;
+-    constexpr int STAGE_U32 = BM*(64/2) + BN*8 + BN;
++    constexpr int STAGE_U32 = BM*((64/2) + 4) + BN*8 + BN;
+     const int smem_bytes = STAGES * STAGE_U32 * (int) sizeof(uint32_t);
+     CUDA_CHECK(cudaFuncSetAttribute(kern, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes));
+ 
+-- 
+2.43.0
+
diff --git a/docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md b/docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md
new file mode 100644
index 000000000..94e75370c
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md
@@ -0,0 +1,58 @@
+# W4A16 Shared-Memory Padding Phase 4 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing.
+
+**Goal:** Test whether padding the grouped W4A16 A tile in shared memory reduces bank conflicts after Phase 2 selected `bm32`.
+
+**Scope:** Fork-first experiment only. Keep the patch small, preserve math order, and ship no patch unless it passes md5/op gates and improves prefill.
+
+## Task 1: Implement A-Tile Padding
+
+- [x] Add a small shared-memory row-stride constant for `sA`.
+- [x] Pad `sA` rows by 4 `uint32_t` slots while keeping 16-byte chunk alignment.
+- [x] Update only A-copy and `ldmatrix` indexing; do not change W loads, dequant, MMA order, metadata, or launch shape.
+
+## Task 2: Gates
+
+- [x] Build `llama-batched-bench`, `llama-completion`, and `test-backend-ops` on DGX.
+- [x] Run canonical default-off paged MoE and dense greedy md5 gates.
+- [x] Run forced W4A16 `bm32` vs `base` md5 gates.
+- [x] Run forced W4A16 `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1`.
+- [x] Run W4A16 default `bm32` A/B against Phase 2 at `npp=512,2048`.
+
+## Task 3: Disposition
+
+- [x] Keep only if it improves W4A16 prefill by at least 1% at either `npp=512` or `npp=2048` without regressing the other by more than 1%.
+- [x] If kept, commit fork-first with `Assisted-by: Codex:gpt-5`, generate patch `0050`, verify mirror tree hash, update docs, and commit LocalAI.
+- [ ] If rejected, revert the fork experiment and record the result without adding a patch.
+
+Result: kept as fork commit `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` and LocalAI patch `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`.
+
+Artifacts:
+
+- Build: `~/llama-w4a16-phase4`
+- Logs: `~/bench/w4a16_phase4`
+
+Gates:
+
+- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Forced W4A16 `bm32` md5: `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `base` md5: `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Performance:
+
+| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
+|-------|--------------|---------------|----------|
+| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline |
+| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | selected |
+| Phase 2 `base` | 1310.13 | 1336.02 | baseline |
+| Phase 4 A-pad `base` | 1337.88 | 1364.98 | positive diagnostic |
+
+Mirror verification:
+
+- Applying all 41 `patches/paged/*.patch` files to base pin
+  `0ed235ea2c17a19fc8238668653946721ed136fd` reproduces fork HEAD
+  `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` by tree hash:
+  `8fcb151e0620fd0fc82b80c04318e5c34320b087`.