mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
patches(paged): tune W4A16 grouped tile shape
Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification. Assisted-by: Codex:gpt-5
This commit is contained in:
@@ -171,6 +171,70 @@ Result:
|
||||
- Decision: keep patch `0048` as a small simplification, but pivot the next
|
||||
W4A16 iteration to the activation cast or MMA/dequant tile body.
|
||||
|
||||
## W4A16 Kernel Shape Phase 2
|
||||
|
||||
Profile-guided target:
|
||||
|
||||
- Phase 1 forced W4A16 profile at `npp=512`: `w4a16_grouped_kernel` dominated
|
||||
at `5231.667 ms` (`47.8%`) while `w4a16_cast_act_f32_bf16` was `517.195 ms`
|
||||
(`4.7%`).
|
||||
- Phase 2 therefore targeted grouped-kernel tile shape/body before activation
|
||||
cast fusion.
|
||||
|
||||
Shape sweep artifacts:
|
||||
|
||||
- Build: `~/llama-w4a16-phase2`
|
||||
- Benchmarks: `~/bench/w4a16_phase2/shape_*.txt`
|
||||
- Winning profile: `~/bench/w4a16_phase2/profile/w4a16_bm32_npp512.*`
|
||||
|
||||
Shape A/B:
|
||||
|
||||
| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
|
||||
|-------|--------------|---------------|----------|
|
||||
| `base` / `64x128` | 1308.02 | 1339.46 | old baseline |
|
||||
| `bn256` | 1286.99 | 1311.56 | rejected |
|
||||
| `bm32` / `32x128` | 1442.99 | 1475.65 | selected |
|
||||
| `bn64` | 1334.80 | 1362.55 | diagnostic only |
|
||||
| `stages3` | 1271.01 | 1295.96 | rejected |
|
||||
| `bn256x16` | 1084.66 | 1100.95 | rejected |
|
||||
|
||||
Only `bm32` and the old `base` selector are shipped in patch `0049`. The other
|
||||
candidate shapes were benchmarked in the Phase 2 build and then deliberately
|
||||
left out to keep the upstream conflict surface small.
|
||||
|
||||
Default-verification after selecting `bm32`:
|
||||
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|----|----|---|------|--------|----------|--------|----------|-----|-------|
|
||||
| 512 | 4 | 32 | 16512 | 11.360 | 1442.28 | 0.321 | 397.00 | 11.682 | 1413.43 |
|
||||
| 2048 | 4 | 32 | 65664 | 44.529 | 1471.77 | 0.331 | 386.06 | 44.860 | 1463.75 |
|
||||
|
||||
Result:
|
||||
|
||||
- `bm32` improves forced W4A16 by about `+10.4%` at `npp=512` and `+10.2%`
|
||||
at `npp=2048` versus the old `64x128` shape in the same sweep.
|
||||
- The profiled `bm32` grouped kernel dropped to `4107.355 ms` (`41.7%`) at
|
||||
`npp=512`, from Phase 1's `5231.667 ms` (`47.8%`).
|
||||
- Canonical post-change gates matched: MoE
|
||||
`8cb0ce23777bf55f92f63d0292c756b0`, dense
|
||||
`5951a5b4d624ce891e22ab5fca9bc439`.
|
||||
- Forced W4A16 shape gates matched each other: `LLAMA_W4A16_PREFILL_M=1`
|
||||
default `bm32` and `LLAMA_W4A16_SHAPE=base` both produced
|
||||
`07db32c2bcb78d17a43ed18bc22705cd` on the canonical gate prompt.
|
||||
- Forced W4A16 `MUL_MAT_ID` op checks passed for both shapes:
|
||||
`test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` reported `806/806`
|
||||
for default `bm32` and `806/806` for `base`.
|
||||
- Decision: make `bm32` the W4A16 default shape while keeping
|
||||
`LLAMA_W4A16_SHAPE=base` for old-shape A/B and leaving other candidates as
|
||||
diagnostics.
|
||||
|
||||
Mirror invariant after patch `0049`:
|
||||
|
||||
- Applying all 40 LocalAI `patches/paged/*.patch` files to base pin
|
||||
`0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD
|
||||
`7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`.
|
||||
- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`.
|
||||
|
||||
## Clean Build
|
||||
|
||||
First clean build attempt:
|
||||
@@ -212,15 +276,16 @@ Second clean build attempt:
|
||||
|
||||
- Local llama.cpp fork: `/home/mudler/_git/llama.cpp`
|
||||
- Branch: `localai-paged`
|
||||
- Working tree: clean
|
||||
- HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
|
||||
- Working tree: clean after fork commit `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`
|
||||
- Phase 0 HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
|
||||
- Current HEAD: `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`
|
||||
- Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
|
||||
- Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
|
||||
- LocalAI patch count: `38` at Phase 0; current mirror count is `39` after
|
||||
patch `0048`.
|
||||
- LocalAI patch count: `38` at Phase 0; current mirror count is `40` after
|
||||
patch `0049`.
|
||||
- LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork
|
||||
HEAD.
|
||||
- Tree hash after patch application: `a73d759350277532a14e853e1fe78f08bbb74ce8`
|
||||
- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`
|
||||
|
||||
## Existing Artifact Gap Review
|
||||
|
||||
|
||||
@@ -58,7 +58,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
|
||||
- **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
|
||||
- **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
|
||||
- The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
|
||||
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `4b0cc1163` is mirrored by the new worktree `0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch`; the f32-only M5 tensor-core scan is worktree patch `0047`.)
|
||||
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `7dfa0e175` is mirrored by worktree patch `0049-feat-paged-tune-W4A16-grouped-tile-shape.patch`; W4A16 packed metadata is worktree patch `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.)
|
||||
|
||||
### 2.6 Bench hygiene gates
|
||||
- **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
|
||||
@@ -253,14 +253,14 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
|
||||
## 7. KEY FILE / ARTIFACT INDEX
|
||||
|
||||
### Fork (canonical source of truth)
|
||||
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17` ("pack W4A16 grouped tile metadata", patch `0048`). The DGX checkout itself may still be dirty and must not be treated as canonical.
|
||||
- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` ("tune W4A16 grouped tile shape", patch `0049`). The DGX checkout itself may still be dirty and must not be treated as canonical.
|
||||
- `dgx:~/llama-paged-dev` (experimental dev/build tree), branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). **Dirty** + many untracked profiling artifacts. This tree's `build-cuda/bin/` produced the benchmarked binaries; `COMBINED_DEFINITIVE` recorded `GIT_HEAD=a7d439e` (the M8 bf16 dev config), NOT the fork HEAD. The dev tree carries bf16/hybrid M6/M7/M8 machinery deliberately EXCLUDED from the shipped f32-only series.
|
||||
|
||||
### LocalAI worktree
|
||||
- Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (199 ahead, 25 behind origin/master; the ahead count grows with each new commit).
|
||||
- Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
|
||||
- `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
|
||||
- `patches/paged/`: **39** `.patch` files spanning 0001-0048 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048.
|
||||
- `patches/paged/`: **40** `.patch` files spanning 0001-0049 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default.
|
||||
|
||||
### Bench artifacts (DGX)
|
||||
- `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - the definitive same-session both-engine run.
|
||||
@@ -276,7 +276,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
|
||||
|
||||
### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
|
||||
1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
|
||||
2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`4b0cc1163`, patch 0048) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
|
||||
2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`7dfa0e175`, patch 0049) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers.
|
||||
3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`.
|
||||
4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
|
||||
5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
|
||||
|
||||
@@ -0,0 +1,93 @@
|
||||
From 7dfa0e17548c5f04f83d2cc2a057b0a9941b599a Mon Sep 17 00:00:00 2001
|
||||
From: Ettore Di Giacinto <mudler@localai.io>
|
||||
Date: Tue, 30 Jun 2026 21:44:54 +0000
|
||||
Subject: [PATCH] feat(paged): tune W4A16 grouped tile shape
|
||||
|
||||
Select a BM=32 grouped W4A16 MoE prefill shape by default after the GB10 shape sweep, while keeping the prior 64x128 shape and additional diagnostics selectable through LLAMA_W4A16_SHAPE.
|
||||
|
||||
Assisted-by: Codex:gpt-5
|
||||
---
|
||||
ggml/src/ggml-cuda/w4a16-gemm.cu | 42 +++++++++++++++++++++++++++++---
|
||||
1 file changed, 38 insertions(+), 4 deletions(-)
|
||||
|
||||
diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu
|
||||
index 899e1a23f..ca8864292 100644
|
||||
--- a/ggml/src/ggml-cuda/w4a16-gemm.cu
|
||||
+++ b/ggml/src/ggml-cuda/w4a16-gemm.cu
|
||||
@@ -4,6 +4,7 @@
|
||||
#include <algorithm>
|
||||
#include <cstdint>
|
||||
#include <cstdlib>
|
||||
+#include <cstring>
|
||||
#include <vector>
|
||||
|
||||
// ===========================================================================
|
||||
@@ -281,7 +282,23 @@ bool ggml_cuda_w4a16_moe_grouped_should_engage(
|
||||
return true;
|
||||
}
|
||||
|
||||
-void ggml_cuda_mul_mat_id_w4a16_grouped(
|
||||
+static bool ggml_cuda_w4a16_use_base_shape() {
|
||||
+ static const bool use_base = [] {
|
||||
+ const char * e = getenv("LLAMA_W4A16_SHAPE");
|
||||
+ if (e == nullptr || e[0] == '\0' || strcmp(e, "default") == 0 || strcmp(e, "bm32") == 0 || strcmp(e, "32x128") == 0) {
|
||||
+ return false;
|
||||
+ }
|
||||
+ if (strcmp(e, "base") == 0 || strcmp(e, "64x128") == 0) {
|
||||
+ return true;
|
||||
+ }
|
||||
+ fprintf(stderr, "[w4a16] unknown LLAMA_W4A16_SHAPE=%s, using default bm32\n", e);
|
||||
+ return false;
|
||||
+ }();
|
||||
+ return use_base;
|
||||
+}
|
||||
+
|
||||
+template<int BM, int BN, int WARPS_M, int WARPS_N, int STAGES>
|
||||
+static void ggml_cuda_mul_mat_id_w4a16_grouped_impl(
|
||||
ggml_backend_cuda_context & ctx,
|
||||
const ggml_tensor * src0,
|
||||
const float * src1_sorted,
|
||||
@@ -290,9 +307,7 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
|
||||
int64_t n_experts, int64_t K, int64_t N,
|
||||
cudaStream_t stream) {
|
||||
GGML_ASSERT(src0->type == GGML_TYPE_NVFP4);
|
||||
- GGML_ASSERT(N % 128 == 0 && K % 64 == 0);
|
||||
-
|
||||
- constexpr int BM = 64, BN = 128, WARPS_M = 2, WARPS_N = 4, STAGES = 2;
|
||||
+ GGML_ASSERT(N % BN == 0 && K % 64 == 0);
|
||||
|
||||
// host: build the per-M-tile expert map (ragged, no tile crosses an expert boundary)
|
||||
int64_t total_rows = 0;
|
||||
@@ -327,6 +342,8 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
|
||||
}
|
||||
fprintf(stderr, "[w4a16] engaged: total_rows=%lld n_experts=%lld K=%lld N=%lld n_tiles=%d max_tpe=%d multi_tile_experts=%d\n",
|
||||
(long long) total_rows, (long long) n_experts, (long long) K, (long long) N, n_tiles, max_tpe, multi);
|
||||
+ fprintf(stderr, "[w4a16] shape: BM=%d BN=%d WARPS_M=%d WARPS_N=%d STAGES=%d\n",
|
||||
+ BM, BN, WARPS_M, WARPS_N, STAGES);
|
||||
}
|
||||
|
||||
// device: packed tile map; one pageable H2D copy instead of three tiny copies
|
||||
@@ -363,3 +380,20 @@ void ggml_cuda_mul_mat_id_w4a16_grouped(
|
||||
(int) N, (int) K, expert_stride_blocks);
|
||||
CUDA_CHECK(cudaGetLastError());
|
||||
}
|
||||
+
|
||||
+void ggml_cuda_mul_mat_id_w4a16_grouped(
|
||||
+ ggml_backend_cuda_context & ctx,
|
||||
+ const ggml_tensor * src0,
|
||||
+ const float * src1_sorted,
|
||||
+ float * dst_sorted,
|
||||
+ const int * tokens_per_expert,
|
||||
+ int64_t n_experts, int64_t K, int64_t N,
|
||||
+ cudaStream_t stream) {
|
||||
+ if (ggml_cuda_w4a16_use_base_shape()) {
|
||||
+ ggml_cuda_mul_mat_id_w4a16_grouped_impl<64, 128, 2, 4, 2>(
|
||||
+ ctx, src0, src1_sorted, dst_sorted, tokens_per_expert, n_experts, K, N, stream);
|
||||
+ } else {
|
||||
+ ggml_cuda_mul_mat_id_w4a16_grouped_impl<32, 128, 1, 4, 2>(
|
||||
+ ctx, src0, src1_sorted, dst_sorted, tokens_per_expert, n_experts, K, N, stream);
|
||||
+ }
|
||||
+}
|
||||
--
|
||||
2.43.0
|
||||
|
||||
105
docs/superpowers/plans/2026-06-30-w4a16-kernel-shape-phase2.md
Normal file
105
docs/superpowers/plans/2026-06-30-w4a16-kernel-shape-phase2.md
Normal file
@@ -0,0 +1,105 @@
|
||||
# W4A16 Kernel Shape Phase 2 Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing.
|
||||
|
||||
**Goal:** Attack the remaining W4A16 prefill gap at the grouped kernel body, not metadata.
|
||||
|
||||
**Scope:** Fork-first in `/home/mudler/_git/llama.cpp`; LocalAI patch series is regenerated only after the fork commit is validated. Keep W4A16 default-off unless `LLAMA_W4A16_PREFILL_M > 0`.
|
||||
|
||||
## Task 1: Profile-Guided Target Selection
|
||||
|
||||
- [x] Run `nsys` for default FP4-MMQ and forced W4A16 at `npp=512`.
|
||||
- [x] Compare kernel attribution for metadata/cast/body costs.
|
||||
- [x] Decide next implementation target from measured cost, not speculation.
|
||||
|
||||
Result: `w4a16_grouped_kernel` is the dominant forced-W4A16 cost (`5231.667 ms`, `47.8%` of profiled GPU kernel time). `w4a16_cast_act_f32_bf16` is visible but much smaller (`517.195 ms`, `4.7%`). Phase 2 targets grouped-kernel tile shape/body first.
|
||||
|
||||
## Task 2: Runtime Shape Selector
|
||||
|
||||
**Files:**
|
||||
- Modify fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu`
|
||||
|
||||
- [x] Add a small runtime selector for W4A16 grouped-kernel shape experiments.
|
||||
- [x] Preserve the current `64x128` shape as the default path.
|
||||
- [x] Add multiple candidate specializations behind an environment selector: a vLLM-inspired wider-`N` candidate, a ragged-M candidate, an occupancy candidate, and a deeper pipeline candidate.
|
||||
- [x] Keep launch and shared-memory calculations template-safe for each specialization.
|
||||
|
||||
## Task 3: DGX Validation And Kill Gate
|
||||
|
||||
- [x] Build the fork on DGX from the updated source snapshot.
|
||||
- [x] Run canonical paged MoE and dense greedy md5 gates after the final code change.
|
||||
- [x] Confirm gate hashes match the established inferencing references before committing.
|
||||
- [x] Run forced W4A16 A/B for default shape and candidate shape at `npp=512,2048`.
|
||||
- [x] Run forced W4A16 `MUL_MAT_ID` op checks for selected `bm32` and old `base`.
|
||||
- [x] Profile the winning candidate if it improves enough to understand the new bottleneck.
|
||||
- [x] Record whether the candidate improves, regresses, or is neutral.
|
||||
|
||||
Initial candidates:
|
||||
|
||||
- `default` / `64x128`: current Phase 1 shape.
|
||||
- `bn256`: wider N reuse, inspired by vLLM large-batch Marlin config.
|
||||
- `bm32`: smaller M tiles for ragged MoE expert tails.
|
||||
- `bn64`: smaller N tiles to test occupancy/latency limits.
|
||||
- `stages3`: current tile shape with deeper `cp.async` pipeline.
|
||||
|
||||
Kill gate: keep a shape candidate as the new default only if it improves forced W4A16 prefill throughput by at least 3% at either `npp=512` or `npp=2048` without regressing the other by more than 1%. Otherwise revert or leave it as an off-by-env diagnostic only if it is useful for future sweeps.
|
||||
|
||||
## Task 4: Mirror And Document
|
||||
|
||||
- [x] Commit the accepted fork-first result with `Assisted-by: Codex:gpt-5`.
|
||||
- [x] Regenerate only the new LocalAI patch mirror entry.
|
||||
- [x] Verify the full LocalAI patch mirror applies to the base pin and matches fork HEAD.
|
||||
- [x] Update `PARITY_HANDOFF.md` and phase results with artifact paths and decision.
|
||||
- [x] Commit the LocalAI mirror/docs result with `Assisted-by: Codex:gpt-5`.
|
||||
|
||||
Artifacts:
|
||||
|
||||
- Profile directory: `~/bench/w4a16_phase1/profile`
|
||||
- Candidate build directory: `~/llama-w4a16-phase2`
|
||||
- Candidate benchmark directory: `~/bench/w4a16_phase2`
|
||||
|
||||
Result:
|
||||
|
||||
| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
|
||||
|-------|--------------|---------------|----------|
|
||||
| `base` / `64x128` | 1308.02 | 1339.46 | old baseline |
|
||||
| `bn256` | 1286.99 | 1311.56 | rejected |
|
||||
| `bm32` / `32x128` | 1442.99 | 1475.65 | selected |
|
||||
| `bn64` | 1334.80 | 1362.55 | diagnostic only |
|
||||
| `stages3` | 1271.01 | 1295.96 | rejected |
|
||||
| `bn256x16` | 1084.66 | 1100.95 | rejected |
|
||||
|
||||
Only `bm32` and the old `base` selector are shipped in patch `0049`. The other
|
||||
candidate shapes were benchmarked in the Phase 2 build and then deliberately
|
||||
left out to keep the upstream conflict surface small.
|
||||
|
||||
Follow-up default verification with `LLAMA_W4A16_SHAPE` unset:
|
||||
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|----|----|---|------|--------|----------|--------|----------|-----|-------|
|
||||
| 512 | 4 | 32 | 16512 | 11.360 | 1442.28 | 0.321 | 397.00 | 11.682 | 1413.43 |
|
||||
| 2048 | 4 | 32 | 65664 | 44.529 | 1471.77 | 0.331 | 386.06 | 44.860 | 1463.75 |
|
||||
|
||||
Profile:
|
||||
|
||||
- `bm32` `w4a16_grouped_kernel`: `4107.355 ms` (`41.7%`) at profiled `npp=512`.
|
||||
- Phase 1 `64x128` `w4a16_grouped_kernel`: `5231.667 ms` (`47.8%`) at profiled `npp=512`.
|
||||
|
||||
Canonical post-change gates:
|
||||
|
||||
- MoE command: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null | md5sum`
|
||||
- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0` (matched canonical paged MoE reference).
|
||||
- Dense command: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 ./llama-completion -m /home/mudler/bench/q36-27b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" </dev/null | md5sum`
|
||||
- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439` (matched canonical dense reference).
|
||||
- Forced W4A16 `bm32` md5 with `LLAMA_W4A16_PREFILL_M=1`: `07db32c2bcb78d17a43ed18bc22705cd`.
|
||||
- Forced W4A16 `base` md5 with `LLAMA_W4A16_PREFILL_M=1 LLAMA_W4A16_SHAPE=base`: `07db32c2bcb78d17a43ed18bc22705cd`.
|
||||
- Forced W4A16 shape md5 status: PASS, selected `bm32` is byte-identical to old `base` on the gate prompt.
|
||||
- Forced W4A16 `MUL_MAT_ID` op check: `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` passed `806/806` for both `bm32` and `base`.
|
||||
- Inference gate status: PASS before fork commit and LocalAI mirror commit.
|
||||
|
||||
Mirror verification:
|
||||
|
||||
- Applying all 40 `patches/paged/*.patch` files to base pin
|
||||
`0ed235ea2c17a19fc8238668653946721ed136fd` reproduces fork HEAD
|
||||
`7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` by tree hash:
|
||||
`dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`.
|
||||
Reference in New Issue
Block a user