diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index b2694e14c..e41ce8c31 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -171,6 +171,70 @@ Result: - Decision: keep patch `0048` as a small simplification, but pivot the next W4A16 iteration to the activation cast or MMA/dequant tile body. +## W4A16 Kernel Shape Phase 2 + +Profile-guided target: + +- Phase 1 forced W4A16 profile at `npp=512`: `w4a16_grouped_kernel` dominated + at `5231.667 ms` (`47.8%`) while `w4a16_cast_act_f32_bf16` was `517.195 ms` + (`4.7%`). +- Phase 2 therefore targeted grouped-kernel tile shape/body before activation + cast fusion. + +Shape sweep artifacts: + +- Build: `~/llama-w4a16-phase2` +- Benchmarks: `~/bench/w4a16_phase2/shape_*.txt` +- Winning profile: `~/bench/w4a16_phase2/profile/w4a16_bm32_npp512.*` + +Shape A/B: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| `base` / `64x128` | 1308.02 | 1339.46 | old baseline | +| `bn256` | 1286.99 | 1311.56 | rejected | +| `bm32` / `32x128` | 1442.99 | 1475.65 | selected | +| `bn64` | 1334.80 | 1362.55 | diagnostic only | +| `stages3` | 1271.01 | 1295.96 | rejected | +| `bn256x16` | 1084.66 | 1100.95 | rejected | + +Only `bm32` and the old `base` selector are shipped in patch `0049`. The other +candidate shapes were benchmarked in the Phase 2 build and then deliberately +left out to keep the upstream conflict surface small. + +Default-verification after selecting `bm32`: + +| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s | +|----|----|---|------|--------|----------|--------|----------|-----|-------| +| 512 | 4 | 32 | 16512 | 11.360 | 1442.28 | 0.321 | 397.00 | 11.682 | 1413.43 | +| 2048 | 4 | 32 | 65664 | 44.529 | 1471.77 | 0.331 | 386.06 | 44.860 | 1463.75 | + +Result: + +- `bm32` improves forced W4A16 by about `+10.4%` at `npp=512` and `+10.2%` + at `npp=2048` versus the old `64x128` shape in the same sweep. +- The profiled `bm32` grouped kernel dropped to `4107.355 ms` (`41.7%`) at + `npp=512`, from Phase 1's `5231.667 ms` (`47.8%`). +- Canonical post-change gates matched: MoE + `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`. +- Forced W4A16 shape gates matched each other: `LLAMA_W4A16_PREFILL_M=1` + default `bm32` and `LLAMA_W4A16_SHAPE=base` both produced + `07db32c2bcb78d17a43ed18bc22705cd` on the canonical gate prompt. +- Forced W4A16 `MUL_MAT_ID` op checks passed for both shapes: + `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` reported `806/806` + for default `bm32` and `806/806` for `base`. +- Decision: make `bm32` the W4A16 default shape while keeping + `LLAMA_W4A16_SHAPE=base` for old-shape A/B and leaving other candidates as + diagnostics. + +Mirror invariant after patch `0049`: + +- Applying all 40 LocalAI `patches/paged/*.patch` files to base pin + `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD + `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`. +- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`. + ## Clean Build First clean build attempt: @@ -212,15 +276,16 @@ Second clean build attempt: - Local llama.cpp fork: `/home/mudler/_git/llama.cpp` - Branch: `localai-paged` -- Working tree: clean -- HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc` +- Working tree: clean after fork commit `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` +- Phase 0 HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc` +- Current HEAD: `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` - Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd` - Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd` -- LocalAI patch count: `38` at Phase 0; current mirror count is `39` after - patch `0048`. +- LocalAI patch count: `38` at Phase 0; current mirror count is `40` after + patch `0049`. - LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork HEAD. -- Tree hash after patch application: `a73d759350277532a14e853e1fe78f08bbb74ce8` +- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9` ## Existing Artifact Gap Review diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index a80e9f4f9..b219852fa 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -58,7 +58,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It - **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export. - **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source. - The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated. -- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `4b0cc1163` is mirrored by the new worktree `0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch`; the f32-only M5 tensor-core scan is worktree patch `0047`.) +- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. (Concretely: fork HEAD `7dfa0e175` is mirrored by worktree patch `0049-feat-paged-tune-W4A16-grouped-tile-shape.patch`; W4A16 packed metadata is worktree patch `0048`, and the f32-only M5 tensor-core scan is worktree patch `0047`.) ### 2.6 Bench hygiene gates - **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS"). @@ -253,14 +253,14 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual ## 7. KEY FILE / ARTIFACT INDEX ### Fork (canonical source of truth) -- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17` ("pack W4A16 grouped tile metadata", patch `0048`). The DGX checkout itself may still be dirty and must not be treated as canonical. +- `dgx:~/llama-paged-fork`, remote `fork git@github.com:mudler/llama.cpp.git`, branch **`localai-paged`**, last clean local canonical HEAD `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a` ("tune W4A16 grouped tile shape", patch `0049`). The DGX checkout itself may still be dirty and must not be treated as canonical. - `dgx:~/llama-paged-dev` (experimental dev/build tree), branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). **Dirty** + many untracked profiling artifacts. This tree's `build-cuda/bin/` produced the benchmarked binaries; `COMBINED_DEFINITIVE` recorded `GIT_HEAD=a7d439e` (the M8 bf16 dev config), NOT the fork HEAD. The dev tree carries bf16/hybrid M6/M7/M8 machinery deliberately EXCLUDED from the shipped f32-only series. ### LocalAI worktree - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (199 ahead, 25 behind origin/master; the ahead count grows with each new commit). - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`). - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`. -- `patches/paged/`: **39** `.patch` files spanning 0001-0048 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048. +- `patches/paged/`: **40** `.patch` files spanning 0001-0049 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata is 0048; W4A16 grouped-kernel shape tuning is 0049 and selects `bm32` by default. ### Bench artifacts (DGX) - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - the definitive same-session both-engine run. @@ -276,7 +276,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels) 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building. -2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`4b0cc1163`, patch 0048) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers. +2. **Both DGX checkouts are dirty** (`gated_delta_net.cu` modified in each), and the current clean local fork HEAD (`7dfa0e175`, patch 0049) differs from the dev-tree HEAD (`a7d439e`, M8 bf16) that actually produced the `COMBINED_DEFINITIVE` numbers. 3. **Worktree patch 0044 is now tracked here.** LocalAI commit `2033086f6` added `patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch`; the only current untracked path in this worktree is `.claude/`. 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch). 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign. diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0049-feat-paged-tune-W4A16-grouped-tile-shape.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0049-feat-paged-tune-W4A16-grouped-tile-shape.patch new file mode 100644 index 000000000..4d2736e44 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0049-feat-paged-tune-W4A16-grouped-tile-shape.patch @@ -0,0 +1,93 @@ +From 7dfa0e17548c5f04f83d2cc2a057b0a9941b599a Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Tue, 30 Jun 2026 21:44:54 +0000 +Subject: [PATCH] feat(paged): tune W4A16 grouped tile shape + +Select a BM=32 grouped W4A16 MoE prefill shape by default after the GB10 shape sweep, while keeping the prior 64x128 shape and additional diagnostics selectable through LLAMA_W4A16_SHAPE. + +Assisted-by: Codex:gpt-5 +--- + ggml/src/ggml-cuda/w4a16-gemm.cu | 42 +++++++++++++++++++++++++++++--- + 1 file changed, 38 insertions(+), 4 deletions(-) + +diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu +index 899e1a23f..ca8864292 100644 +--- a/ggml/src/ggml-cuda/w4a16-gemm.cu ++++ b/ggml/src/ggml-cuda/w4a16-gemm.cu +@@ -4,6 +4,7 @@ + #include + #include + #include ++#include + #include + + // =========================================================================== +@@ -281,7 +282,23 @@ bool ggml_cuda_w4a16_moe_grouped_should_engage( + return true; + } + +-void ggml_cuda_mul_mat_id_w4a16_grouped( ++static bool ggml_cuda_w4a16_use_base_shape() { ++ static const bool use_base = [] { ++ const char * e = getenv("LLAMA_W4A16_SHAPE"); ++ if (e == nullptr || e[0] == '\0' || strcmp(e, "default") == 0 || strcmp(e, "bm32") == 0 || strcmp(e, "32x128") == 0) { ++ return false; ++ } ++ if (strcmp(e, "base") == 0 || strcmp(e, "64x128") == 0) { ++ return true; ++ } ++ fprintf(stderr, "[w4a16] unknown LLAMA_W4A16_SHAPE=%s, using default bm32\n", e); ++ return false; ++ }(); ++ return use_base; ++} ++ ++template ++static void ggml_cuda_mul_mat_id_w4a16_grouped_impl( + ggml_backend_cuda_context & ctx, + const ggml_tensor * src0, + const float * src1_sorted, +@@ -290,9 +307,7 @@ void ggml_cuda_mul_mat_id_w4a16_grouped( + int64_t n_experts, int64_t K, int64_t N, + cudaStream_t stream) { + GGML_ASSERT(src0->type == GGML_TYPE_NVFP4); +- GGML_ASSERT(N % 128 == 0 && K % 64 == 0); +- +- constexpr int BM = 64, BN = 128, WARPS_M = 2, WARPS_N = 4, STAGES = 2; ++ GGML_ASSERT(N % BN == 0 && K % 64 == 0); + + // host: build the per-M-tile expert map (ragged, no tile crosses an expert boundary) + int64_t total_rows = 0; +@@ -327,6 +342,8 @@ void ggml_cuda_mul_mat_id_w4a16_grouped( + } + fprintf(stderr, "[w4a16] engaged: total_rows=%lld n_experts=%lld K=%lld N=%lld n_tiles=%d max_tpe=%d multi_tile_experts=%d\n", + (long long) total_rows, (long long) n_experts, (long long) K, (long long) N, n_tiles, max_tpe, multi); ++ fprintf(stderr, "[w4a16] shape: BM=%d BN=%d WARPS_M=%d WARPS_N=%d STAGES=%d\n", ++ BM, BN, WARPS_M, WARPS_N, STAGES); + } + + // device: packed tile map; one pageable H2D copy instead of three tiny copies +@@ -363,3 +380,20 @@ void ggml_cuda_mul_mat_id_w4a16_grouped( + (int) N, (int) K, expert_stride_blocks); + CUDA_CHECK(cudaGetLastError()); + } ++ ++void ggml_cuda_mul_mat_id_w4a16_grouped( ++ ggml_backend_cuda_context & ctx, ++ const ggml_tensor * src0, ++ const float * src1_sorted, ++ float * dst_sorted, ++ const int * tokens_per_expert, ++ int64_t n_experts, int64_t K, int64_t N, ++ cudaStream_t stream) { ++ if (ggml_cuda_w4a16_use_base_shape()) { ++ ggml_cuda_mul_mat_id_w4a16_grouped_impl<64, 128, 2, 4, 2>( ++ ctx, src0, src1_sorted, dst_sorted, tokens_per_expert, n_experts, K, N, stream); ++ } else { ++ ggml_cuda_mul_mat_id_w4a16_grouped_impl<32, 128, 1, 4, 2>( ++ ctx, src0, src1_sorted, dst_sorted, tokens_per_expert, n_experts, K, N, stream); ++ } ++} +-- +2.43.0 + diff --git a/docs/superpowers/plans/2026-06-30-w4a16-kernel-shape-phase2.md b/docs/superpowers/plans/2026-06-30-w4a16-kernel-shape-phase2.md new file mode 100644 index 000000000..2ecda4086 --- /dev/null +++ b/docs/superpowers/plans/2026-06-30-w4a16-kernel-shape-phase2.md @@ -0,0 +1,105 @@ +# W4A16 Kernel Shape Phase 2 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing. + +**Goal:** Attack the remaining W4A16 prefill gap at the grouped kernel body, not metadata. + +**Scope:** Fork-first in `/home/mudler/_git/llama.cpp`; LocalAI patch series is regenerated only after the fork commit is validated. Keep W4A16 default-off unless `LLAMA_W4A16_PREFILL_M > 0`. + +## Task 1: Profile-Guided Target Selection + +- [x] Run `nsys` for default FP4-MMQ and forced W4A16 at `npp=512`. +- [x] Compare kernel attribution for metadata/cast/body costs. +- [x] Decide next implementation target from measured cost, not speculation. + +Result: `w4a16_grouped_kernel` is the dominant forced-W4A16 cost (`5231.667 ms`, `47.8%` of profiled GPU kernel time). `w4a16_cast_act_f32_bf16` is visible but much smaller (`517.195 ms`, `4.7%`). Phase 2 targets grouped-kernel tile shape/body first. + +## Task 2: Runtime Shape Selector + +**Files:** +- Modify fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + +- [x] Add a small runtime selector for W4A16 grouped-kernel shape experiments. +- [x] Preserve the current `64x128` shape as the default path. +- [x] Add multiple candidate specializations behind an environment selector: a vLLM-inspired wider-`N` candidate, a ragged-M candidate, an occupancy candidate, and a deeper pipeline candidate. +- [x] Keep launch and shared-memory calculations template-safe for each specialization. + +## Task 3: DGX Validation And Kill Gate + +- [x] Build the fork on DGX from the updated source snapshot. +- [x] Run canonical paged MoE and dense greedy md5 gates after the final code change. +- [x] Confirm gate hashes match the established inferencing references before committing. +- [x] Run forced W4A16 A/B for default shape and candidate shape at `npp=512,2048`. +- [x] Run forced W4A16 `MUL_MAT_ID` op checks for selected `bm32` and old `base`. +- [x] Profile the winning candidate if it improves enough to understand the new bottleneck. +- [x] Record whether the candidate improves, regresses, or is neutral. + +Initial candidates: + +- `default` / `64x128`: current Phase 1 shape. +- `bn256`: wider N reuse, inspired by vLLM large-batch Marlin config. +- `bm32`: smaller M tiles for ragged MoE expert tails. +- `bn64`: smaller N tiles to test occupancy/latency limits. +- `stages3`: current tile shape with deeper `cp.async` pipeline. + +Kill gate: keep a shape candidate as the new default only if it improves forced W4A16 prefill throughput by at least 3% at either `npp=512` or `npp=2048` without regressing the other by more than 1%. Otherwise revert or leave it as an off-by-env diagnostic only if it is useful for future sweeps. + +## Task 4: Mirror And Document + +- [x] Commit the accepted fork-first result with `Assisted-by: Codex:gpt-5`. +- [x] Regenerate only the new LocalAI patch mirror entry. +- [x] Verify the full LocalAI patch mirror applies to the base pin and matches fork HEAD. +- [x] Update `PARITY_HANDOFF.md` and phase results with artifact paths and decision. +- [x] Commit the LocalAI mirror/docs result with `Assisted-by: Codex:gpt-5`. + +Artifacts: + +- Profile directory: `~/bench/w4a16_phase1/profile` +- Candidate build directory: `~/llama-w4a16-phase2` +- Candidate benchmark directory: `~/bench/w4a16_phase2` + +Result: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| `base` / `64x128` | 1308.02 | 1339.46 | old baseline | +| `bn256` | 1286.99 | 1311.56 | rejected | +| `bm32` / `32x128` | 1442.99 | 1475.65 | selected | +| `bn64` | 1334.80 | 1362.55 | diagnostic only | +| `stages3` | 1271.01 | 1295.96 | rejected | +| `bn256x16` | 1084.66 | 1100.95 | rejected | + +Only `bm32` and the old `base` selector are shipped in patch `0049`. The other +candidate shapes were benchmarked in the Phase 2 build and then deliberately +left out to keep the upstream conflict surface small. + +Follow-up default verification with `LLAMA_W4A16_SHAPE` unset: + +| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s | +|----|----|---|------|--------|----------|--------|----------|-----|-------| +| 512 | 4 | 32 | 16512 | 11.360 | 1442.28 | 0.321 | 397.00 | 11.682 | 1413.43 | +| 2048 | 4 | 32 | 65664 | 44.529 | 1471.77 | 0.331 | 386.06 | 44.860 | 1463.75 | + +Profile: + +- `bm32` `w4a16_grouped_kernel`: `4107.355 ms` (`41.7%`) at profiled `npp=512`. +- Phase 1 `64x128` `w4a16_grouped_kernel`: `5231.667 ms` (`47.8%`) at profiled `npp=512`. + +Canonical post-change gates: + +- MoE command: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is"