patches(paged): tune W4A16 grouped tile shape

Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification. Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-06-30 21:57:42 +00:00
parent d8edc615e7
commit c5f2545cdd
4 changed files with 272 additions and 9 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -171,6 +171,70 @@ Result:
 - Decision: keep patch `0048` as a small simplification, but pivot the next
  W4A16 iteration to the activation cast or MMA/dequant tile body.

+## W4A16 Kernel Shape Phase 2
+
+Profile-guided target:
+
+- Phase 1 forced W4A16 profile at `npp=512`: `w4a16_grouped_kernel` dominated
+  at `5231.667 ms` (`47.8%`) while `w4a16_cast_act_f32_bf16` was `517.195 ms`
+  (`4.7%`).
+- Phase 2 therefore targeted grouped-kernel tile shape/body before activation
+  cast fusion.
+
+Shape sweep artifacts:
+
+- Build: `~/llama-w4a16-phase2`
+- Benchmarks: `~/bench/w4a16_phase2/shape_*.txt`
+- Winning profile: `~/bench/w4a16_phase2/profile/w4a16_bm32_npp512.*`
+
+Shape A/B:
+
+| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
+|-------|--------------|---------------|----------|
+| `base` / `64x128` | 1308.02 | 1339.46 | old baseline |
+| `bn256` | 1286.99 | 1311.56 | rejected |
+| `bm32` / `32x128` | 1442.99 | 1475.65 | selected |
+| `bn64` | 1334.80 | 1362.55 | diagnostic only |
+| `stages3` | 1271.01 | 1295.96 | rejected |
+| `bn256x16` | 1084.66 | 1100.95 | rejected |
+
+Only `bm32` and the old `base` selector are shipped in patch `0049`. The other
+candidate shapes were benchmarked in the Phase 2 build and then deliberately
+left out to keep the upstream conflict surface small.
+
+Default-verification after selecting `bm32`:
+
+| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
+|----|----|---|------|--------|----------|--------|----------|-----|-------|
+| 512 | 4 | 32 | 16512 | 11.360 | 1442.28 | 0.321 | 397.00 | 11.682 | 1413.43 |
+| 2048 | 4 | 32 | 65664 | 44.529 | 1471.77 | 0.331 | 386.06 | 44.860 | 1463.75 |
+
+Result:
+
+- `bm32` improves forced W4A16 by about `+10.4%` at `npp=512` and `+10.2%`
+  at `npp=2048` versus the old `64x128` shape in the same sweep.
+- The profiled `bm32` grouped kernel dropped to `4107.355 ms` (`41.7%`) at
+  `npp=512`, from Phase 1's `5231.667 ms` (`47.8%`).
+- Canonical post-change gates matched: MoE
+  `8cb0ce23777bf55f92f63d0292c756b0`, dense
+  `5951a5b4d624ce891e22ab5fca9bc439`.
+- Forced W4A16 shape gates matched each other: `LLAMA_W4A16_PREFILL_M=1`
+  default `bm32` and `LLAMA_W4A16_SHAPE=base` both produced
+  `07db32c2bcb78d17a43ed18bc22705cd` on the canonical gate prompt.
+- Forced W4A16 `MUL_MAT_ID` op checks passed for both shapes:
+  `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` reported `806/806`
+  for default `bm32` and `806/806` for `base`.
+- Decision: make `bm32` the W4A16 default shape while keeping
+  `LLAMA_W4A16_SHAPE=base` for old-shape A/B and leaving other candidates as
+  diagnostics.
+
+Mirror invariant after patch `0049`:
+
+- Applying all 40 LocalAI `patches/paged/*.patch` files to base pin
+  `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD
+  `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`.
+- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`.
+
 ## Clean Build

 First clean build attempt:
@@ -212,15 +276,16 @@ Second clean build attempt:

 - Local llama.cpp fork: `/home/mudler/_git/llama.cpp`
 - Branch: `localai-paged`
- Working tree: clean
- HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
+- Working tree: clean after fork commit `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`
+- Phase 0 HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
+- Current HEAD: `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`
 - Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
 - Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
- LocalAI patch count: `38` at Phase 0; current mirror count is `39` after
-  patch `0048`.
+- LocalAI patch count: `38` at Phase 0; current mirror count is `40` after
+  patch `0049`.
 - LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork
  HEAD.
- Tree hash after patch application: `a73d759350277532a14e853e1fe78f08bbb74ce8`
+- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`

 ## Existing Artifact Gap Review

--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -58,7 +58,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
 - **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
 - **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
 - The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `4b0cc1163` is mirrored by the new worktree `0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch`; the f32-only M5 tensor-core scan is worktree patch `0047`.)
+- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `7dfa0e175` is mirrored by worktree patch `0049-feat-paged-tune-W4A16-grouped-tile-shape.patch`; W4A16 packed metadata is worktree patch `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.)

 ### 2.6 Bench hygiene gates
 - **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -253,14 +253,14 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 ## 7. KEY FILE / ARTIFACT INDEX

 ### Fork (canonical source of truth)
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17` ("pack W4A16 grouped tile metadata", patch `0048`). The DGX checkout itself may still be dirty and must not be treated as canonical.
+- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` ("tune W4A16 grouped tile shape", patch `0049`). The DGX checkout itself may still be dirty and must not be treated as canonical.
 - `dgx:~/llama-paged-dev` (experimental dev/build tree), branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). **Dirty** + many untracked profiling artifacts. This tree's `build-cuda/bin/` produced the benchmarked binaries; `COMBINED_DEFINITIVE` recorded `GIT_HEAD=a7d439e` (the M8 bf16 dev config), NOT the fork HEAD. The dev tree carries bf16/hybrid M6/M7/M8 machinery deliberately EXCLUDED from the shipped f32-only series.

 ### LocalAI worktree
 - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (199 ahead, 25 behind origin/master; the ahead count grows with each new commit).
 - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
 - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
- `patches/paged/`: **39** `.patch` files spanning 0001-0048 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048.
+- `patches/paged/`: **40** `.patch` files spanning 0001-0049 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default.

 ### Bench artifacts (DGX)
 - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - the definitive same-session both-engine run.
@@ -276,7 +276,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual

 ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
-2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`4b0cc1163`, patch 0048) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
+2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`7dfa0e175`, patch 0049) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
 3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`.
 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0049-feat-paged-tune-W4A16-grouped-tile-shape.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0049-feat-paged-tune-W4A16-grouped-tile-shape.patch
@@ -0,0 +1,93 @@
+From 7dfa0e17548c5f04f83d2cc2a057b0a9941b599a Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 30 Jun 2026 21:44:54 +0000
+Subject: [PATCH] feat(paged): tune W4A16 grouped tile shape
+
+Select a BM=32 grouped W4A16 MoE prefill shape by default after the GB10 shape sweep, while keeping the prior 64x128 shape and additional diagnostics selectable through LLAMA_W4A16_SHAPE.
+
+Assisted-by: Codex:gpt-5
+---
+ ggml/src/ggml-cuda/w4a16-gemm.cu | 42 +++++++++++++++++++++++++++++---
+ 1 file changed, 38 insertions(+), 4 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu
+index 899e1a23f..ca8864292 100644
+--- a/ggml/src/ggml-cuda/w4a16-gemm.cu
+++ b/ggml/src/ggml-cuda/w4a16-gemm.cu
+@@ -4,6 +4,7 @@
+ #include <algorithm>
+ #include <cstdint>
+ #include <cstdlib>
+#include <cstring>
+ #include <vector>
+ 
+ // ===========================================================================
+@@ -281,7 +282,23 @@ bool ggml_cuda_w4a16_moe_grouped_should_engage(
+     return true;
+ }
+ 
+-void ggml_cuda_mul_mat_id_w4a16_grouped(
+static bool ggml_cuda_w4a16_use_base_shape() {
+    static const bool use_base = [] {
+        const char * e = getenv("LLAMA_W4A16_SHAPE");
+        if (e == nullptr || e[0] == '\0' || strcmp(e, "default") == 0 || strcmp(e, "bm32") == 0 || strcmp(e, "32x128") == 0) {
+            return false;
+        }
+        if (strcmp(e, "base") == 0 || strcmp(e, "64x128") == 0) {
+            return true;
+        }
+        fprintf(stderr, "[w4a16] unknown LLAMA_W4A16_SHAPE=%s, using default bm32\n", e);
+        return false;
+    }();
+    return use_base;
+}
+
+template<int BM, int BN, int WARPS_M, int WARPS_N, int STAGES>
+static void ggml_cuda_mul_mat_id_w4a16_grouped_impl(
+         ggml_backend_cuda_context & ctx,
+         const ggml_tensor * src0,
+         const float * src1_sorted,
+@@ -290,9 +307,7 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
+         int64_t n_experts, int64_t K, int64_t N,
+         cudaStream_t stream) {
+     GGML_ASSERT(src0->type == GGML_TYPE_NVFP4);
+-    GGML_ASSERT(N % 128 == 0 && K % 64 == 0);
+-
+-    constexpr int BM = 64, BN = 128, WARPS_M = 2, WARPS_N = 4, STAGES = 2;
+    GGML_ASSERT(N % BN == 0 && K % 64 == 0);
+ 
+     // host: build the per-M-tile expert map (ragged, no tile crosses an expert boundary)
+     int64_t total_rows = 0;
+@@ -327,6 +342,8 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
+         }
+         fprintf(stderr, "[w4a16] engaged: total_rows=%lld n_experts=%lld K=%lld N=%lld n_tiles=%d max_tpe=%d multi_tile_experts=%d\n",
+                 (long long) total_rows, (long long) n_experts, (long long) K, (long long) N, n_tiles, max_tpe, multi);
+        fprintf(stderr, "[w4a16] shape: BM=%d BN=%d WARPS_M=%d WARPS_N=%d STAGES=%d\n",
+                BM, BN, WARPS_M, WARPS_N, STAGES);
+     }
+ 
+     // device: packed tile map; one pageable H2D copy instead of three tiny copies
+@@ -363,3 +380,20 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
+         (int) N, (int) K, expert_stride_blocks);
+     CUDA_CHECK(cudaGetLastError());
+ }
+
+void ggml_cuda_mul_mat_id_w4a16_grouped(
+        ggml_backend_cuda_context & ctx,
+        const ggml_tensor * src0,
+        const float * src1_sorted,
+        float * dst_sorted,
+        const int * tokens_per_expert,
+        int64_t n_experts, int64_t K, int64_t N,
+        cudaStream_t stream) {
+    if (ggml_cuda_w4a16_use_base_shape()) {
+        ggml_cuda_mul_mat_id_w4a16_grouped_impl<64, 128, 2, 4, 2>(
+                ctx, src0, src1_sorted, dst_sorted, tokens_per_expert, n_experts, K, N, stream);
+    } else {
+        ggml_cuda_mul_mat_id_w4a16_grouped_impl<32, 128, 1, 4, 2>(
+                ctx, src0, src1_sorted, dst_sorted, tokens_per_expert, n_experts, K, N, stream);
+    }
+}
+-- 
+2.43.0
+
--- a/docs/superpowers/plans/2026-06-30-w4a16-kernel-shape-phase2.md
+++ b/docs/superpowers/plans/2026-06-30-w4a16-kernel-shape-phase2.md
@@ -0,0 +1,105 @@
+# W4A16 Kernel Shape Phase 2 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing.
+
+**Goal:** Attack the remaining W4A16 prefill gap at the grouped kernel body, not metadata.
+
+**Scope:** Fork-first in `/home/mudler/_git/llama.cpp`; LocalAI patch series is regenerated only after the fork commit is validated. Keep W4A16 default-off unless `LLAMA_W4A16_PREFILL_M > 0`.
+
+## Task 1: Profile-Guided Target Selection
+
+- [x] Run `nsys` for default FP4-MMQ and forced W4A16 at `npp=512`.
+- [x] Compare kernel attribution for metadata/cast/body costs.
+- [x] Decide next implementation target from measured cost, not speculation.
+
+Result: `w4a16_grouped_kernel` is the dominant forced-W4A16 cost (`5231.667 ms`, `47.8%` of profiled GPU kernel time). `w4a16_cast_act_f32_bf16` is visible but much smaller (`517.195 ms`, `4.7%`). Phase 2 targets grouped-kernel tile shape/body first.
+
+## Task 2: Runtime Shape Selector
+
+**Files:**
+- Modify fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu`
+
+- [x] Add a small runtime selector for W4A16 grouped-kernel shape experiments.
+- [x] Preserve the current `64x128` shape as the default path.
+- [x] Add multiple candidate specializations behind an environment selector: a vLLM-inspired wider-`N` candidate, a ragged-M candidate, an occupancy candidate, and a deeper pipeline candidate.
+- [x] Keep launch and shared-memory calculations template-safe for each specialization.
+
+## Task 3: DGX Validation And Kill Gate
+
+- [x] Build the fork on DGX from the updated source snapshot.
+- [x] Run canonical paged MoE and dense greedy md5 gates after the final code change.
+- [x] Confirm gate hashes match the established inferencing references before committing.
+- [x] Run forced W4A16 A/B for default shape and candidate shape at `npp=512,2048`.
+- [x] Run forced W4A16 `MUL_MAT_ID` op checks for selected `bm32` and old `base`.
+- [x] Profile the winning candidate if it improves enough to understand the new bottleneck.
+- [x] Record whether the candidate improves, regresses, or is neutral.
+
+Initial candidates:
+
+- `default` / `64x128`: current Phase 1 shape.
+- `bn256`: wider N reuse, inspired by vLLM large-batch Marlin config.
+- `bm32`: smaller M tiles for ragged MoE expert tails.
+- `bn64`: smaller N tiles to test occupancy/latency limits.
+- `stages3`: current tile shape with deeper `cp.async` pipeline.
+
+Kill gate: keep a shape candidate as the new default only if it improves forced W4A16 prefill throughput by at least 3% at either `npp=512` or `npp=2048` without regressing the other by more than 1%. Otherwise revert or leave it as an off-by-env diagnostic only if it is useful for future sweeps.
+
+## Task 4: Mirror And Document
+
+- [x] Commit the accepted fork-first result with `Assisted-by: Codex:gpt-5`.
+- [x] Regenerate only the new LocalAI patch mirror entry.
+- [x] Verify the full LocalAI patch mirror applies to the base pin and matches fork HEAD.
+- [x] Update `PARITY_HANDOFF.md` and phase results with artifact paths and decision.
+- [x] Commit the LocalAI mirror/docs result with `Assisted-by: Codex:gpt-5`.
+
+Artifacts:
+
+- Profile directory: `~/bench/w4a16_phase1/profile`
+- Candidate build directory: `~/llama-w4a16-phase2`
+- Candidate benchmark directory: `~/bench/w4a16_phase2`
+
+Result:
+
+| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
+|-------|--------------|---------------|----------|
+| `base` / `64x128` | 1308.02 | 1339.46 | old baseline |
+| `bn256` | 1286.99 | 1311.56 | rejected |
+| `bm32` / `32x128` | 1442.99 | 1475.65 | selected |
+| `bn64` | 1334.80 | 1362.55 | diagnostic only |
+| `stages3` | 1271.01 | 1295.96 | rejected |
+| `bn256x16` | 1084.66 | 1100.95 | rejected |
+
+Only `bm32` and the old `base` selector are shipped in patch `0049`. The other
+candidate shapes were benchmarked in the Phase 2 build and then deliberately
+left out to keep the upstream conflict surface small.
+
+Follow-up default verification with `LLAMA_W4A16_SHAPE` unset:
+
+| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
+|----|----|---|------|--------|----------|--------|----------|-----|-------|
+| 512 | 4 | 32 | 16512 | 11.360 | 1442.28 | 0.321 | 397.00 | 11.682 | 1413.43 |
+| 2048 | 4 | 32 | 65664 | 44.529 | 1471.77 | 0.331 | 386.06 | 44.860 | 1463.75 |
+
+Profile:
+
+- `bm32` `w4a16_grouped_kernel`: `4107.355 ms` (`41.7%`) at profiled `npp=512`.
+- Phase 1 `64x128` `w4a16_grouped_kernel`: `5231.667 ms` (`47.8%`) at profiled `npp=512`.
+
+Canonical post-change gates:
+
+- MoE command: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null | md5sum`
+- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0` (matched canonical paged MoE reference).
+- Dense command: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 ./llama-completion -m /home/mudler/bench/q36-27b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null | md5sum`
+- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439` (matched canonical dense reference).
+- Forced W4A16 `bm32` md5 with `LLAMA_W4A16_PREFILL_M=1`: `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `base` md5 with `LLAMA_W4A16_PREFILL_M=1 LLAMA_W4A16_SHAPE=base`: `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 shape md5 status: PASS, selected `bm32` is byte-identical to old `base` on the gate prompt.
+- Forced W4A16 `MUL_MAT_ID` op check: `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` passed `806/806` for both `bm32` and `base`.
+- Inference gate status: PASS before fork commit and LocalAI mirror commit.
+
+Mirror verification:
+
+- Applying all 40 `patches/paged/*.patch` files to base pin
+  `0ed235ea2c17a19fc8238668653946721ed136fd` reproduces fork HEAD
+  `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` by tree hash:
+  `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`.