From f9e015d8e22ce4e31078bb2aeddaadd549b505c4 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Tue, 30 Jun 2026 22:23:14 +0000
Subject: [PATCH] docs(paged): record W4A16 Wq padding rejection

Record the Phase 5 Wq shared-memory padding experiment, its gates,
sub-threshold benchmark gain, and the decision to ship no 0051 patch.

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        | 34 +++++++++++
 .../plans/2026-06-30-w4a16-wq-pad-phase5.md   | 56 +++++++++++++++++++
 2 files changed, 90 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-06-30-w4a16-wq-pad-phase5.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index b060caa49..66ba1ffc3 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -311,6 +311,40 @@ Result: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`.
 
 - Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087`.
 
+## W4A16 Wq Padding Phase 5
+
+Goal: test whether padding the quantized-weight shared-memory row stride gives
+another low-conflict W4A16 grouped-kernel body win after `0050`.
+
+Artifacts:
+
+- Build: `~/llama-w4a16-phase5`
+- Logs: `~/bench/w4a16_phase5`
+
+Gates:
+
+- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Forced W4A16 `bm32` and old `base` shape md5s matched each other:
+  `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Performance:
+
+| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
+|-------|--------------|---------------|----------|
+| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | baseline |
+| Phase 5 Wq-pad `bm32` | 1472.36 | 1504.82 | rejected: below 1% gate |
+| Phase 4 A-pad `base` | 1337.88 | 1364.98 | baseline |
+| Phase 5 Wq-pad `base` | 1337.70 | 1368.48 | diagnostic |
+
+Result:
+
+- Rejected. No fork commit and no LocalAI patch `0051`.
+- The local fork experiment was reverted.
+- Do not ship Wq padding alone; the measured `+0.4%` / `+0.6%` default-shape
+  gain is below the maintenance threshold.
+
 ## Clean Build
 
 First clean build attempt:
diff --git a/docs/superpowers/plans/2026-06-30-w4a16-wq-pad-phase5.md b/docs/superpowers/plans/2026-06-30-w4a16-wq-pad-phase5.md
new file mode 100644
index 000000000..15ad19937
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-30-w4a16-wq-pad-phase5.md
@@ -0,0 +1,56 @@
+# W4A16 Wq Shared-Memory Padding Phase 5 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing.
+
+**Goal:** Test whether padding the grouped W4A16 quantized-weight shared-memory row stride improves the post-`0050` kernel.
+
+**Scope:** Fork-first experiment on top of `0050`. Keep it separate and incremental. Ship no patch unless it passes md5/op gates and improves prefill.
+
+## Task 1: Implement Wq Padding
+
+- [x] Add a Wq shared-memory row-stride constant.
+- [x] Pad Wq rows by 4 `uint32_t` slots.
+- [x] Update only Wq copy and Wq byte-indexing; do not change A padding, Wd layout, dequant math, MMA order, metadata, or launch shape.
+
+## Task 2: Gates
+
+- [x] Build `llama-batched-bench`, `llama-completion`, and `test-backend-ops` on DGX.
+- [x] Run canonical default-off paged MoE and dense greedy md5 gates.
+- [x] Run forced W4A16 `bm32` vs `base` md5 gates.
+- [x] Run forced W4A16 `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1`.
+- [x] Run W4A16 default `bm32` A/B against Phase 4 at `npp=512,2048`.
+
+## Task 3: Disposition
+
+- [x] Keep only if it improves W4A16 prefill by at least 1% at either `npp=512` or `npp=2048` without regressing the other by more than 1%.
+- [x] If kept, commit fork-first with `Assisted-by: Codex:gpt-5`, generate patch `0051`, verify mirror tree hash, update docs, and commit LocalAI. Not taken: perf gate did not clear 1%.
+- [x] If rejected, revert the fork experiment and record the result without adding a patch.
+
+Result: rejected, no fork commit and no LocalAI patch `0051`.
+
+Artifacts:
+
+- Build: `~/llama-w4a16-phase5`
+- Logs: `~/bench/w4a16_phase5`
+
+Gates:
+
+- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Forced W4A16 `bm32` md5: `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `base` md5: `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Performance:
+
+| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
+|-------|--------------|---------------|----------|
+| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | baseline |
+| Phase 5 Wq-pad `bm32` | 1472.36 | 1504.82 | rejected: below 1% gate |
+| Phase 4 A-pad `base` | 1337.88 | 1364.98 | baseline |
+| Phase 5 Wq-pad `base` | 1337.70 | 1368.48 | diagnostic |
+
+Disposition:
+
+- Reverted local fork experiment in `/home/mudler/_git/llama.cpp`.
+- Do not ship Wq padding alone; the measured gain is below the maintenance threshold.
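
Reviewer note (not part of the patch): the Task 3 perf gate can be checked against the `bm32` rows of the Performance table with a small script. This is a minimal sketch, not the tooling used for the experiment. The helper names `pct_gain` and `passes_perf_gate` are hypothetical, and it assumes the gate means "at least one prompt size gains 1% or more, and no prompt size regresses by more than 1%".

```python
# Hypothetical helpers; throughput values are copied from the `bm32` rows
# of the Performance table (S_PP tokens/s at npp=512 and npp=2048).

def pct_gain(baseline: float, candidate: float) -> float:
    """Relative throughput change of candidate vs. baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


def passes_perf_gate(gains: dict, min_gain: float = 1.0,
                     max_regression: float = 1.0) -> bool:
    """Assumed reading of the Task 3 rule: some shape gains >= min_gain
    percent and no shape regresses by more than max_regression percent."""
    improves = any(g >= min_gain for g in gains.values())
    regresses = any(g < -max_regression for g in gains.values())
    return improves and not regresses


phase4_bm32 = {512: 1466.62, 2048: 1495.93}  # Phase 4 A-pad baseline
phase5_bm32 = {512: 1472.36, 2048: 1504.82}  # Phase 5 Wq-pad candidate

gains = {npp: pct_gain(phase4_bm32[npp], phase5_bm32[npp])
         for npp in phase4_bm32}
keep = passes_perf_gate(gains)

for npp, gain in gains.items():
    print(f"npp={npp}: {gain:+.2f}%")
print("keep" if keep else "reject")
```

Both gains land under the 1% threshold, so the sketch reaches the same "reject" disposition that the plan records.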