patches(paged): mirror W4A16 packed metadata

Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off. Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-06-30 21:21:53 +00:00
parent 1c0709b700
commit d8edc615e7
4 changed files with 208 additions and 5 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -128,6 +128,49 @@ First implementation target:
  many small ragged tile maps. The first fork-first experiment should remove or
  amortize that host-built tile-map path before retuning MMA tile shapes.

+## W4A16 Metadata Phase 1
+
+Fork commit: `4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17`
+(`feat(paged): pack W4A16 grouped tile metadata`).
+
+LocalAI patch mirror: `0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch`.
+
+Mirror invariant: applying the full LocalAI `patches/paged/*.patch` series to
+base pin `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD
+`4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17`.
+
+Artifacts:
+
+- Diff: `~/bench/w4a16_phase1/packed_desc.diff`
+- Build mtimes: `~/bench/w4a16_phase1/build_binary_mtimes.txt`
+- MoE gate: `~/bench/w4a16_phase1/gate_moe.md5`
+- Dense gate: `~/bench/w4a16_phase1/gate_dense.md5`
+- Default FP4-MMQ: `~/bench/w4a16_phase1/w4a16_off.txt`
+- Packed W4A16: `~/bench/w4a16_phase1/w4a16_on_thr64.txt`
+
+Canonical gates:
+
+- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0` (matched expected)
+- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439` (matched expected)
+
+Packed descriptor A/B:
+
+| Path | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
+|------|----|----|---|------|--------|----------|--------|----------|-----|-------|
+| FP4-MMQ | 512 | 4 | 32 | 16512 | 7.114 | 2303.07 | 0.323 | 396.55 | 7.437 | 2220.32 |
+| FP4-MMQ | 2048 | 4 | 32 | 65664 | 27.045 | 2423.23 | 0.331 | 387.14 | 27.376 | 2398.64 |
+| W4A16 packed | 512 | 4 | 32 | 16512 | 12.468 | 1314.08 | 0.322 | 397.97 | 12.790 | 1291.04 |
+| W4A16 packed | 2048 | 4 | 32 | 65664 | 48.930 | 1339.39 | 0.330 | 387.44 | 49.260 | 1333.00 |
+
+Result:
+
+- Packed descriptors improved forced W4A16 by `+0.39%` at `npp=512` and
+  `+0.48%` at `npp=2048` versus the Phase 0 no-debug W4A16 baseline.
+- W4A16 remains `-42.9%` at `npp=512` and `-44.7%` at `npp=2048` versus
+  same-run default FP4-MMQ.
+- Decision: keep patch `0048` as a small simplification, but pivot the next
+  W4A16 iteration to the activation cast or MMA/dequant tile body.
+
 ## Clean Build

 First clean build attempt:
@@ -173,7 +216,8 @@ Second clean build attempt:
 - HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
 - Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
 - Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
- LocalAI patch count: `38`
+- LocalAI patch count: `38` at Phase 0; current mirror count is `39` after
+  patch `0048`.
 - LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork
  HEAD.
 - Tree hash after patch application: `a73d759350277532a14e853e1fe78f08bbb74ce8`
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -58,7 +58,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
 - **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
 - **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
 - The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `51168c5ee` "patch 0044" is byte-identical to worktree `0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the f32-only M5 tensor-core scan is worktree patch `0047`.)
+- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `4b0cc1163` is mirrored by the new worktree `0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch`; the f32-only M5 tensor-core scan is worktree patch `0047`.)

 ### 2.6 Bench hygiene gates
 - **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -253,14 +253,14 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 ## 7. KEY FILE / ARTIFACT INDEX

 ### Fork (canonical source of truth)
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, HEAD `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc` ("fused gated RMSNorm + SiLU gate-mul CUDA op (patch 0044)"). **Currently dirty** (uncommitted `M ggml/src/ggml-cuda/gated_delta_net.cu`).
+- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17` ("pack W4A16 grouped tile metadata", patch `0048`). The DGX checkout itself may still be dirty and must not be treated as canonical.
 - `dgx:~/llama-paged-dev` (experimental dev/build tree), branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). **Dirty** + many untracked profiling artifacts. This tree's `build-cuda/bin/` produced the benchmarked binaries; `COMBINED_DEFINITIVE` recorded `GIT_HEAD=a7d439e` (the M8 bf16 dev config), NOT the fork HEAD. The dev tree carries bf16/hybrid M6/M7/M8 machinery deliberately EXCLUDED from the shipped f32-only series.

 ### LocalAI worktree
 - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (199 ahead, 25 behind origin/master; the ahead count grows with each new commit).
 - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
 - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
- `patches/paged/`: **38** `.patch` files spanning 0001-0047 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV).
+- `patches/paged/`: **39** `.patch` files spanning 0001-0048 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048.

 ### Bench artifacts (DGX)
 - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - the definitive same-session both-engine run.
@@ -276,7 +276,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual

 ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
-2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the fork HEAD (`51168c5ee`, patch 0044) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
+2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`4b0cc1163`, patch 0048) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
 3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`.
 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch
@@ -0,0 +1,121 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 30 Jun 2026 21:19:07 +0000
+Subject: [PATCH] feat(paged): pack W4A16 grouped tile metadata
+
+Collapse the grouped W4A16 tile metadata from three host vectors and three device copies into one aligned descriptor array and one H2D copy. This keeps the path default-off while reducing launch-side metadata overhead for the experimental W4A16 MoE prefill path.
+
+Assisted-by: Codex:gpt-5
+---
+ ggml/src/ggml-cuda/w4a16-gemm.cu | 46 ++++++++++++++++++--------------
+ 1 file changed, 26 insertions(+), 20 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu
+index f348f31ad..899e1a23f 100644
+--- a/ggml/src/ggml-cuda/w4a16-gemm.cu
+++ b/ggml/src/ggml-cuda/w4a16-gemm.cu
+@@ -18,6 +18,14 @@ typedef tile<16, 8, nv_bfloat162> tile_A; // A operand: M=16, K=16
+ typedef tile< 8, 8, nv_bfloat162> tile_B; // B operand: N=8,  K=16
+ typedef tile<16, 8, float>        tile_C; // accumulator: M=16, N=8
+ 
+struct alignas(16) w4a16_tile_desc {
+    int expert;
+    int row0;
+    int rows;
+    int pad;
+};
+static_assert(sizeof(w4a16_tile_desc) == 16, "w4a16 tile descriptors must stay 16 bytes");
+
+ #ifndef LLAMA_W4A16_PREFILL_M
+ #define LLAMA_W4A16_PREFILL_M 0
+ #endif // LLAMA_W4A16_PREFILL_M
+@@ -92,9 +100,7 @@ static __global__ void w4a16_grouped_kernel(
+         const nv_bfloat16 * __restrict__ Abf,            // [pad_rows, K] bf16
+         const block_nvfp4 * __restrict__ W0,             // src0 base (expert 0)
+         float * __restrict__ C,                          // [total_rows, N] f32
+-        const int * __restrict__ g_tile_expert,
+-        const int * __restrict__ g_tile_row0,
+-        const int * __restrict__ g_tile_rows,
+        const w4a16_tile_desc * __restrict__ g_tiles,
+         int N, int K, int64_t expert_stride_blocks) {
+ #if defined(AMPERE_MMA_AVAILABLE) && defined(CP_ASYNC_AVAILABLE)
+     constexpr int BK   = 64;                 // one nvfp4 block
+@@ -129,9 +135,10 @@ static __global__ void w4a16_grouped_kernel(
+     const int tid  = warp*32 + lane;         // linear id for the cp.async strided copies
+     const int wrow = warp / WARPS_N, wcol = warp % WARPS_N;
+ 
+-    const int    e        = g_tile_expert[blockIdx.y];
+-    const int    row0     = g_tile_row0[blockIdx.y];
+-    const int    rcount   = g_tile_rows[blockIdx.y];
+    const w4a16_tile_desc tile_desc = g_tiles[blockIdx.y];
+    const int    e        = tile_desc.expert;
+    const int    row0     = tile_desc.row0;
+    const int    rcount   = tile_desc.rows;
+     const int    blockCol = blockIdx.x*BN;
+     const int    Kb       = K/64;
+     const block_nvfp4 * We = W0 + (int64_t) e*expert_stride_blocks; // expert e weight base
+@@ -238,7 +245,7 @@ static __global__ void w4a16_grouped_kernel(
+         }
+ #else
+     GGML_UNUSED(Abf); GGML_UNUSED(W0); GGML_UNUSED(C);
+-    GGML_UNUSED(g_tile_expert); GGML_UNUSED(g_tile_row0); GGML_UNUSED(g_tile_rows);
+    GGML_UNUSED(g_tiles);
+     GGML_UNUSED(N); GGML_UNUSED(K); GGML_UNUSED(expert_stride_blocks);
+     NO_DEVICE_CODE;
+ #endif // AMPERE_MMA_AVAILABLE && CP_ASYNC_AVAILABLE
+@@ -296,18 +303,21 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
+         return;
+     }
+ 
+-    std::vector<int32_t> h_tile_expert, h_tile_row0, h_tile_rows;
+    std::vector<w4a16_tile_desc> h_tiles;
+     int64_t row = 0;
+     for (int64_t e = 0; e < n_experts; e++) {
+         const int t = tokens_per_expert[e];
+         for (int off = 0; off < t; off += BM) {
+-            h_tile_expert.push_back((int32_t) e);
+-            h_tile_row0.push_back((int32_t) (row + off));
+-            h_tile_rows.push_back((int32_t) std::min(BM, t - off));
+            h_tiles.push_back({
+                (int) e,
+                (int) (row + off),
+                (int) std::min(BM, t - off),
+                0,
+            });
+         }
+         row += t;
+     }
+-    const int n_tiles = (int) h_tile_expert.size();
+    const int n_tiles = (int) h_tiles.size();
+ 
+     if (getenv("LLAMA_W4A16_DEBUG")) {
+         int max_tpe = 0, multi = 0;
+@@ -319,13 +329,9 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
+                 (long long) total_rows, (long long) n_experts, (long long) K, (long long) N, n_tiles, max_tpe, multi);
+     }
+ 
+-    // device: tile map
+-    ggml_cuda_pool_alloc<int32_t> d_tile_expert(ctx.pool(), n_tiles);
+-    ggml_cuda_pool_alloc<int32_t> d_tile_row0  (ctx.pool(), n_tiles);
+-    ggml_cuda_pool_alloc<int32_t> d_tile_rows  (ctx.pool(), n_tiles);
+-    CUDA_CHECK(cudaMemcpyAsync(d_tile_expert.ptr, h_tile_expert.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
+-    CUDA_CHECK(cudaMemcpyAsync(d_tile_row0.ptr,   h_tile_row0.data(),   n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
+-    CUDA_CHECK(cudaMemcpyAsync(d_tile_rows.ptr,   h_tile_rows.data(),   n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
+    // device: packed tile map; one pageable H2D copy instead of three tiny copies
+    ggml_cuda_pool_alloc<w4a16_tile_desc> d_tiles(ctx.pool(), n_tiles);
+    CUDA_CHECK(cudaMemcpyAsync(d_tiles.ptr, h_tiles.data(), n_tiles*sizeof(w4a16_tile_desc), cudaMemcpyHostToDevice, stream));
+ 
+     // activations: f32 -> bf16 (cheap cast, NO act-quant), zero-padded so every tile's BM-row read
+     // stays in-bounds. A tile's row0 is generally NOT BM-aligned (experts start mid-buffer), and a
+@@ -353,7 +359,7 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
+     dim3 block(32, WARPS_M*WARPS_N);   // 2D: threadIdx.x = warp lane, threadIdx.y = warp
+     kern<<<grid, block, smem_bytes, stream>>>(
+         Abf.get(), (const block_nvfp4 *) src0->data, dst_sorted,
+-        d_tile_expert.ptr, d_tile_row0.ptr, d_tile_rows.ptr,
+        d_tiles.ptr,
+         (int) N, (int) K, expert_stride_blocks);
+     CUDA_CHECK(cudaGetLastError());
+ }
+-- 
+2.43.0
+