mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
patches(paged): pad W4A16 A shared tile stride
Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification. Assisted-by: Codex:gpt-5
This commit is contained in:
@@ -270,6 +270,47 @@ Result:
|
||||
- Do not retry this exact scale-broadcast approach; on GB10 the shuffle and/or
|
||||
scheduling cost exceeds the saved duplicate scale conversion.
|
||||
|
||||
## W4A16 Shared-Memory Padding Phase 4
|
||||
|
||||
Goal: reduce bank pressure in `w4a16_grouped_kernel` by padding the A operand
|
||||
shared-memory row stride while preserving math order and launch shape.
|
||||
|
||||
Fork commit: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
|
||||
(`feat(paged): pad W4A16 A shared tile stride`).
|
||||
|
||||
LocalAI patch mirror: `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`.
|
||||
|
||||
Artifacts:
|
||||
|
||||
- Build: `~/llama-w4a16-phase4`
|
||||
- Logs: `~/bench/w4a16_phase4`
|
||||
|
||||
Gates:
|
||||
|
||||
- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
|
||||
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
|
||||
- Forced W4A16 `bm32` and old `base` shape md5s matched each other:
|
||||
`07db32c2bcb78d17a43ed18bc22705cd`.
|
||||
- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
|
||||
|
||||
Performance:
|
||||
|
||||
| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
|
||||
|-------|--------------|---------------|----------|
|
||||
| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline |
|
||||
| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | selected |
|
||||
| Phase 2 `base` | 1310.13 | 1336.02 | baseline |
|
||||
| Phase 4 A-pad `base` | 1337.88 | 1364.98 | positive diagnostic |
|
||||
|
||||
Result:
|
||||
|
||||
- Kept. Default W4A16 `bm32` improves another `+1.7%` at `npp=512` and
|
||||
`+1.6%` at `npp=2048` versus Phase 2.
|
||||
- Applying all 41 LocalAI `patches/paged/*.patch` files to base pin
|
||||
`0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD
|
||||
`d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`.
|
||||
- Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087`.
|
||||
|
||||
## Clean Build
|
||||
|
||||
First clean build attempt:
|
||||
@@ -311,16 +352,16 @@ Second clean build attempt:
|
||||
|
||||
- Local llama.cpp fork: `/home/mudler/_git/llama.cpp`
|
||||
- Branch: `localai-paged`
|
||||
- Working tree: clean after fork commit `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`
|
||||
- Working tree: clean after fork commit `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
|
||||
- Phase 0 HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
|
||||
- Current HEAD: `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`
|
||||
- Current HEAD: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
|
||||
- Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
|
||||
- Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
|
||||
- LocalAI patch count: `38` at Phase 0; current mirror count is `40` after
|
||||
patch `0049`.
|
||||
- LocalAI patch count: `38` at Phase 0; current mirror count is `41` after
|
||||
patch `0050`.
|
||||
- LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork
|
||||
HEAD.
|
||||
- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`
|
||||
- Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087`
|
||||
|
||||
## Existing Artifact Gap Review
|
||||
|
||||
|
||||
@@ -58,7 +58,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
|
||||
- **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
|
||||
- **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
|
||||
- The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
|
||||
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `7dfa0e175` is mirrored by worktree patch `0049-feat-paged-tune-W4A16-grouped-tile-shape.patch`; W4A16 packed metadata is worktree patch `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.)
|
||||
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `d9b9be0be` is mirrored by worktree patch `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`; W4A16 grouped tile shape is worktree patch `0049`, packed metadata is `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.)
|
||||
|
||||
### 2.6 Bench hygiene gates
|
||||
- **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
|
||||
@@ -253,14 +253,14 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
|
||||
## 7. KEY FILE / ARTIFACT INDEX
|
||||
|
||||
### Fork (canonical source of truth)
|
||||
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` ("tune W4A16 grouped tile shape", patch `0049`). The DGX checkout itself may still be dirty and must not be treated as canonical.
|
||||
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` ("pad W4A16 A shared tile stride", patch `0050`). The DGX checkout itself may still be dirty and must not be treated as canonical.
|
||||
- `dgx:~/llama-paged-dev` (experimental dev/build tree), branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). **Dirty** + many untracked profiling artifacts. This tree's `build-cuda/bin/` produced the benchmarked binaries; `COMBINED_DEFINITIVE` recorded `GIT_HEAD=a7d439e` (the M8 bf16 dev config), NOT the fork HEAD. The dev tree carries bf16/hybrid M6/M7/M8 machinery deliberately EXCLUDED from the shipped f32-only series.
|
||||
|
||||
### LocalAI worktree
|
||||
- Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (199 ahead, 25 behind origin/master; the ahead count grows with each new commit).
|
||||
- Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
|
||||
- `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
|
||||
- `patches/paged/`: **40** `.patch` files spanning 0001-0049 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default.
|
||||
- `patches/paged/`: **41** `.patch` files spanning 0001-0050 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default; W4A16 A shared-tile padding is 0050.
|
||||
|
||||
### Bench artifacts (DGX)
|
||||
- `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - the definitive same-session both-engine run.
|
||||
@@ -276,7 +276,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
|
||||
|
||||
### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
|
||||
1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
|
||||
2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`7dfa0e175`, patch 0049) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
|
||||
2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`d9b9be0be`, patch 0050) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
|
||||
3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`.
|
||||
4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
|
||||
5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
|
||||
|
||||
@@ -0,0 +1,56 @@
|
||||
From d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3 Mon Sep 17 00:00:00 2001
|
||||
From: Ettore Di Giacinto <mudler@localai.io>
|
||||
Date: Tue, 30 Jun 2026 22:13:09 +0000
|
||||
Subject: [PATCH] feat(paged): pad W4A16 A shared tile stride
|
||||
|
||||
Pad the grouped W4A16 A operand shared-memory row stride to reduce bank pressure on GB10 while preserving the selected bm32/default and base launch shapes.
|
||||
|
||||
Assisted-by: Codex:gpt-5
|
||||
---
|
||||
ggml/src/ggml-cuda/w4a16-gemm.cu | 9 +++++----
|
||||
1 file changed, 5 insertions(+), 4 deletions(-)
|
||||
|
||||
diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu
|
||||
index ca8864292..b17e022b0 100644
|
||||
--- a/ggml/src/ggml-cuda/w4a16-gemm.cu
|
||||
+++ b/ggml/src/ggml-cuda/w4a16-gemm.cu
|
||||
@@ -111,7 +111,8 @@ static __global__ void w4a16_grouped_kernel(
|
||||
constexpr int MF = WM/16, NF = WN/8;
|
||||
|
||||
constexpr int AN = BK/2; // bf16 pairs per A smem row (nv_bfloat162)
|
||||
- constexpr int SZ_A = BM*AN; // nv_bfloat162 per stage
|
||||
+ constexpr int ASTR = AN + 4; // padded A smem row stride, in nv_bfloat162
|
||||
+ constexpr int SZ_A = BM*ASTR; // nv_bfloat162 per stage
|
||||
constexpr int SZ_WQ = BN*8; // u32 per stage (32 qs bytes/row)
|
||||
constexpr int SZ_WD = BN; // u32 per stage (4 scale bytes/row)
|
||||
|
||||
@@ -155,7 +156,7 @@ static __global__ void w4a16_grouped_kernel(
|
||||
const int c = idx % (BK/8); // 16B chunk in the row
|
||||
const int r = idx / (BK/8); // row in tile
|
||||
const nv_bfloat16 * src = Abf + (int64_t)(row0 + r)*K + (int64_t)kt*BK + c*8;
|
||||
- w4a16_cp_async16(((char *) sA[st]) + (r*AN + c*4)*sizeof(uint32_t), src);
|
||||
+ w4a16_cp_async16(((char *) sA[st]) + (r*ASTR + c*4)*sizeof(uint32_t), src);
|
||||
}
|
||||
// W qs: BN rows x 32 bytes = BN x 8 u32 (each block's qs at byte offset 4)
|
||||
#pragma unroll 1
|
||||
@@ -198,7 +199,7 @@ static __global__ void w4a16_grouped_kernel(
|
||||
#pragma unroll
|
||||
for (int mi = 0; mi < MF; mi++) {
|
||||
const int rb = wrow*WM + mi*16;
|
||||
- load_ldmatrix(A_frag[mi], sAcur + rb*AN + kk*8, AN);
|
||||
+ load_ldmatrix(A_frag[mi], sAcur + rb*ASTR + kk*8, ASTR);
|
||||
}
|
||||
// B fragments: in-register FP4->bf16 dequant (correct-by-construction via tile get_i/get_j)
|
||||
tile_B B_frag[NF];
|
||||
@@ -368,7 +369,7 @@ static void ggml_cuda_mul_mat_id_w4a16_grouped_impl(
|
||||
const int64_t expert_stride_blocks = (int64_t) (src0->nb[2] / sizeof(block_nvfp4));
|
||||
|
||||
auto kern = w4a16_grouped_kernel<BM, BN, WARPS_M, WARPS_N, STAGES>;
|
||||
- constexpr int STAGE_U32 = BM*(64/2) + BN*8 + BN;
|
||||
+ constexpr int STAGE_U32 = BM*((64/2) + 4) + BN*8 + BN;
|
||||
const int smem_bytes = STAGES * STAGE_U32 * (int) sizeof(uint32_t);
|
||||
CUDA_CHECK(cudaFuncSetAttribute(kern, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes));
|
||||
|
||||
--
|
||||
2.43.0
|
||||
|
||||
Reference in New Issue
Block a user