patches(paged): mirror W4A16 packed metadata

Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off.

Assisted-by: Codex:gpt-5
This commit is contained in:
Ettore Di Giacinto
2026-06-30 21:21:53 +00:00
parent 1c0709b700
commit d8edc615e7
4 changed files with 208 additions and 5 deletions

View File

@@ -128,6 +128,49 @@ First implementation target:
many small ragged tile maps. The first fork-first experiment should remove or
amortize that host-built tile-map path before retuning MMA tile shapes.
## W4A16 Metadata Phase 1
Fork commit: `4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17`
(`feat(paged): pack W4A16 grouped tile metadata`).
LocalAI patch mirror: `0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch`.
Mirror invariant: applying the full LocalAI `patches/paged/*.patch` series to
base pin `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD
`4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17`.
Artifacts:
- Diff: `~/bench/w4a16_phase1/packed_desc.diff`
- Build mtimes: `~/bench/w4a16_phase1/build_binary_mtimes.txt`
- MoE gate: `~/bench/w4a16_phase1/gate_moe.md5`
- Dense gate: `~/bench/w4a16_phase1/gate_dense.md5`
- Default FP4-MMQ: `~/bench/w4a16_phase1/w4a16_off.txt`
- Packed W4A16: `~/bench/w4a16_phase1/w4a16_on_thr64.txt`
Canonical gates:
- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0` (matched expected)
- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439` (matched expected)
Packed descriptor A/B:
| Path | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|------|----|----|---|------|--------|----------|--------|----------|-----|-------|
| FP4-MMQ | 512 | 4 | 32 | 16512 | 7.114 | 2303.07 | 0.323 | 396.55 | 7.437 | 2220.32 |
| FP4-MMQ | 2048 | 4 | 32 | 65664 | 27.045 | 2423.23 | 0.331 | 387.14 | 27.376 | 2398.64 |
| W4A16 packed | 512 | 4 | 32 | 16512 | 12.468 | 1314.08 | 0.322 | 397.97 | 12.790 | 1291.04 |
| W4A16 packed | 2048 | 4 | 32 | 65664 | 48.930 | 1339.39 | 0.330 | 387.44 | 49.260 | 1333.00 |
Result:
- Packed descriptors improved forced W4A16 by `+0.39%` at `npp=512` and
`+0.48%` at `npp=2048` versus the Phase 0 no-debug W4A16 baseline.
- W4A16 remains `-42.9%` at `npp=512` and `-44.7%` at `npp=2048` versus
same-run default FP4-MMQ.
- Decision: keep patch `0048` as a small simplification, but pivot the next
W4A16 iteration to the activation cast or MMA/dequant tile body.
## Clean Build
First clean build attempt:
@@ -173,7 +216,8 @@ Second clean build attempt:
- HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
- Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
- Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
- LocalAI patch count: `38`
- LocalAI patch count: `38` at Phase 0; current mirror count is `39` after
patch `0048`.
- LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork
HEAD.
- Tree hash after patch application: `a73d759350277532a14e853e1fe78f08bbb74ce8`

View File

@@ -58,7 +58,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
- **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
- **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
- The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `51168c5ee` "patch 0044" is byte-identical to worktree `0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the f32-only M5 tensor-core scan is worktree patch `0047`.)
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `4b0cc1163` is mirrored by the new worktree `0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch`; the f32-only M5 tensor-core scan is worktree patch `0047`.)
### 2.6 Bench hygiene gates
- **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -253,14 +253,14 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
## 7. KEY FILE / ARTIFACT INDEX
### Fork (canonical source of truth)
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, HEAD `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc` ("fused gated RMSNorm + SiLU gate-mul CUDA op (patch 0044)"). **Currently dirty** (uncommitted `M ggml/src/ggml-cuda/gated_delta_net.cu`).
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17` ("pack W4A16 grouped tile metadata", patch `0048`). The DGX checkout itself may still be dirty and must not be treated as canonical.
- `dgx:~/llama-paged-dev` (experimental dev/build tree), branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). **Dirty** + many untracked profiling artifacts. This tree's `build-cuda/bin/` produced the benchmarked binaries; `COMBINED_DEFINITIVE` recorded `GIT_HEAD=a7d439e` (the M8 bf16 dev config), NOT the fork HEAD. The dev tree carries bf16/hybrid M6/M7/M8 machinery deliberately EXCLUDED from the shipped f32-only series.
### LocalAI worktree
- Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (199 ahead, 25 behind origin/master; the ahead count grows with each new commit).
- Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
- `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
- `patches/paged/`: **38** `.patch` files spanning 0001-0047 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV).
- `patches/paged/`: **39** `.patch` files spanning 0001-0048 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048.
### Bench artifacts (DGX)
- `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - the definitive same-session both-engine run.
@@ -276,7 +276,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the fork HEAD (`51168c5ee`, patch 0044) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`4b0cc1163`, patch 0048) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`.
4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.

View File

@@ -0,0 +1,121 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 30 Jun 2026 21:19:07 +0000
Subject: [PATCH] feat(paged): pack W4A16 grouped tile metadata
Collapse the grouped W4A16 tile metadata from three host vectors and three device copies into one aligned descriptor array and one H2D copy. This keeps the path default-off while reducing launch-side metadata overhead for the experimental W4A16 MoE prefill path.
Assisted-by: Codex:gpt-5
---
ggml/src/ggml-cuda/w4a16-gemm.cu | 46 ++++++++++++++++++--------------
1 file changed, 26 insertions(+), 20 deletions(-)
diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu
index f348f31ad..899e1a23f 100644
--- a/ggml/src/ggml-cuda/w4a16-gemm.cu
+++ b/ggml/src/ggml-cuda/w4a16-gemm.cu
@@ -18,6 +18,14 @@ typedef tile<16, 8, nv_bfloat162> tile_A; // A operand: M=16, K=16
typedef tile< 8, 8, nv_bfloat162> tile_B; // B operand: N=8, K=16
typedef tile<16, 8, float> tile_C; // accumulator: M=16, N=8
+struct alignas(16) w4a16_tile_desc {
+ int expert;
+ int row0;
+ int rows;
+ int pad;
+};
+static_assert(sizeof(w4a16_tile_desc) == 16, "w4a16 tile descriptors must stay 16 bytes");
+
#ifndef LLAMA_W4A16_PREFILL_M
#define LLAMA_W4A16_PREFILL_M 0
#endif // LLAMA_W4A16_PREFILL_M
@@ -92,9 +100,7 @@ static __global__ void w4a16_grouped_kernel(
const nv_bfloat16 * __restrict__ Abf, // [pad_rows, K] bf16
const block_nvfp4 * __restrict__ W0, // src0 base (expert 0)
float * __restrict__ C, // [total_rows, N] f32
- const int * __restrict__ g_tile_expert,
- const int * __restrict__ g_tile_row0,
- const int * __restrict__ g_tile_rows,
+ const w4a16_tile_desc * __restrict__ g_tiles,
int N, int K, int64_t expert_stride_blocks) {
#if defined(AMPERE_MMA_AVAILABLE) && defined(CP_ASYNC_AVAILABLE)
constexpr int BK = 64; // one nvfp4 block
@@ -129,9 +135,10 @@ static __global__ void w4a16_grouped_kernel(
const int tid = warp*32 + lane; // linear id for the cp.async strided copies
const int wrow = warp / WARPS_N, wcol = warp % WARPS_N;
- const int e = g_tile_expert[blockIdx.y];
- const int row0 = g_tile_row0[blockIdx.y];
- const int rcount = g_tile_rows[blockIdx.y];
+ const w4a16_tile_desc tile_desc = g_tiles[blockIdx.y];
+ const int e = tile_desc.expert;
+ const int row0 = tile_desc.row0;
+ const int rcount = tile_desc.rows;
const int blockCol = blockIdx.x*BN;
const int Kb = K/64;
const block_nvfp4 * We = W0 + (int64_t) e*expert_stride_blocks; // expert e weight base
@@ -238,7 +245,7 @@ static __global__ void w4a16_grouped_kernel(
}
#else
GGML_UNUSED(Abf); GGML_UNUSED(W0); GGML_UNUSED(C);
- GGML_UNUSED(g_tile_expert); GGML_UNUSED(g_tile_row0); GGML_UNUSED(g_tile_rows);
+ GGML_UNUSED(g_tiles);
GGML_UNUSED(N); GGML_UNUSED(K); GGML_UNUSED(expert_stride_blocks);
NO_DEVICE_CODE;
#endif // AMPERE_MMA_AVAILABLE && CP_ASYNC_AVAILABLE
@@ -296,18 +303,21 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
return;
}
- std::vector<int32_t> h_tile_expert, h_tile_row0, h_tile_rows;
+ std::vector<w4a16_tile_desc> h_tiles;
int64_t row = 0;
for (int64_t e = 0; e < n_experts; e++) {
const int t = tokens_per_expert[e];
for (int off = 0; off < t; off += BM) {
- h_tile_expert.push_back((int32_t) e);
- h_tile_row0.push_back((int32_t) (row + off));
- h_tile_rows.push_back((int32_t) std::min(BM, t - off));
+ h_tiles.push_back({
+ (int) e,
+ (int) (row + off),
+ (int) std::min(BM, t - off),
+ 0,
+ });
}
row += t;
}
- const int n_tiles = (int) h_tile_expert.size();
+ const int n_tiles = (int) h_tiles.size();
if (getenv("LLAMA_W4A16_DEBUG")) {
int max_tpe = 0, multi = 0;
@@ -319,13 +329,9 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
(long long) total_rows, (long long) n_experts, (long long) K, (long long) N, n_tiles, max_tpe, multi);
}
- // device: tile map
- ggml_cuda_pool_alloc<int32_t> d_tile_expert(ctx.pool(), n_tiles);
- ggml_cuda_pool_alloc<int32_t> d_tile_row0 (ctx.pool(), n_tiles);
- ggml_cuda_pool_alloc<int32_t> d_tile_rows (ctx.pool(), n_tiles);
- CUDA_CHECK(cudaMemcpyAsync(d_tile_expert.ptr, h_tile_expert.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
- CUDA_CHECK(cudaMemcpyAsync(d_tile_row0.ptr, h_tile_row0.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
- CUDA_CHECK(cudaMemcpyAsync(d_tile_rows.ptr, h_tile_rows.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
+ // device: packed tile map; one pageable H2D copy instead of three tiny copies
+ ggml_cuda_pool_alloc<w4a16_tile_desc> d_tiles(ctx.pool(), n_tiles);
+ CUDA_CHECK(cudaMemcpyAsync(d_tiles.ptr, h_tiles.data(), n_tiles*sizeof(w4a16_tile_desc), cudaMemcpyHostToDevice, stream));
// activations: f32 -> bf16 (cheap cast, NO act-quant), zero-padded so every tile's BM-row read
// stays in-bounds. A tile's row0 is generally NOT BM-aligned (experts start mid-buffer), and a
@@ -353,7 +359,7 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
dim3 block(32, WARPS_M*WARPS_N); // 2D: threadIdx.x = warp lane, threadIdx.y = warp
kern<<<grid, block, smem_bytes, stream>>>(
Abf.get(), (const block_nvfp4 *) src0->data, dst_sorted,
- d_tile_expert.ptr, d_tile_row0.ptr, d_tile_rows.ptr,
+ d_tiles.ptr,
(int) N, (int) K, expert_stride_blocks);
CUDA_CHECK(cudaGetLastError());
}
--
2.43.0