From 8b413d1cbd8215cac3f91a42fb0db99da014f08c Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Tue, 30 Jun 2026 22:06:17 +0000
Subject: [PATCH] docs(paged): record W4A16 scale broadcast rejection

Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID
gates, the prefill regression, and the decision to ship no 0050 patch.

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        | 35 ++++++++++++
 ...2026-06-30-w4a16-scale-broadcast-phase3.md | 56 +++++++++++++++++++
 2 files changed, 91 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-06-30-w4a16-scale-broadcast-phase3.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index e41ce8c31..b98eb1f64 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -235,6 +235,41 @@ Mirror invariant after patch `0049`: `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`.
 - Tree hash after patch application:
   `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`.
 
+## W4A16 Scale Broadcast Phase 3
+
+Goal: reduce duplicate FP4 scale conversion inside `w4a16_grouped_kernel` by
+having one lane per 4-lane group convert the `ue4m3` scale and broadcast it with
+`__shfl_sync`.
+
+Artifacts:
+
+- Build: `~/llama-w4a16-phase3`
+- Logs: `~/bench/w4a16_phase3`
+
+Gates:
+
+- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Forced W4A16 `bm32` and old `base` shape md5s matched each other:
+  `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Performance:
+
+| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
+|-------|--------------|---------------|----------|
+| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline |
+| Phase 3 scale-broadcast `bm32` | 1392.46 | 1422.74 | rejected |
+| Phase 2 `base` | 1310.13 | 1336.02 | baseline |
+| Phase 3 scale-broadcast `base` | 1201.69 | 1221.25 | rejected |
+
+Result:
+
+- Rejected. No fork commit and no LocalAI patch `0050`.
+- The local fork experiment was reverted.
+- Do not retry this exact scale-broadcast approach; on GB10 the shuffle and/or
+  scheduling cost exceeds the saved duplicate scale conversion.
+
 ## Clean Build
 
 First clean build attempt:
diff --git a/docs/superpowers/plans/2026-06-30-w4a16-scale-broadcast-phase3.md b/docs/superpowers/plans/2026-06-30-w4a16-scale-broadcast-phase3.md
new file mode 100644
index 000000000..432d1d597
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-30-w4a16-scale-broadcast-phase3.md
@@ -0,0 +1,56 @@
+# W4A16 Scale Broadcast Phase 3 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing.
+
+**Goal:** Test a minimal W4A16 grouped-kernel body optimization after Phase 2 selected `bm32`.
+
+**Scope:** Fork-first in `/home/mudler/_git/llama.cpp`; mirror into LocalAI only after build, md5, op, perf, and mirror gates pass. Keep patch `0050` incremental on top of `0049`, and keep the source diff small.
+
+## Task 1: Implement Scale Broadcast
+
+- [x] In `ggml/src/ggml-cuda/w4a16-gemm.cu`, replace per-lane duplicate `ggml_cuda_ue4m3_to_fp32` scale conversion with one conversion per 4-lane `n_local` group plus `__shfl_sync`.
+- [x] Keep the existing dequant and MMA order unchanged.
+- [x] Do not add broad diagnostic variants or extra launch shapes.
+
+## Task 2: Gates
+
+- [x] Build `llama-batched-bench`, `llama-completion`, and `test-backend-ops` on DGX.
+- [x] Run canonical default-off paged MoE and dense greedy md5 gates.
+- [x] Run forced W4A16 `bm32` vs `base` md5 gates on the canonical prompt.
+- [x] Run forced W4A16 `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1`.
+- [x] Run W4A16 default `bm32` A/B against Phase 2 at `npp=512,2048`.
+
+## Task 3: Disposition
+
+- [x] Keep only if it improves W4A16 prefill by at least 1% at either `npp=512` or `npp=2048` without regressing the other by more than 1%.
+- [x] If kept, commit fork-first with `Assisted-by: Codex:gpt-5`, generate patch `0050`, verify mirror tree hash, update docs, and commit LocalAI. Not taken: perf gate failed.
+- [x] If rejected, revert the fork experiment and record the result without adding a patch.
+
+Result: rejected, no fork commit and no LocalAI patch `0050`.
+
+Artifacts:
+
+- Build: `~/llama-w4a16-phase3`
+- Logs: `~/bench/w4a16_phase3`
+
+Gates:
+
+- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Forced W4A16 `bm32` md5: `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `base` md5: `07db32c2bcb78d17a43ed18bc22705cd`.
+- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Performance:
+
+| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
+|-------|--------------|---------------|----------|
+| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline |
+| Phase 3 scale-broadcast `bm32` | 1392.46 | 1422.74 | rejected |
+| Phase 2 `base` | 1310.13 | 1336.02 | baseline |
+| Phase 3 scale-broadcast `base` | 1201.69 | 1221.25 | rejected |
+
+Disposition:
+
+- Reverted local fork experiment in `/home/mudler/_git/llama.cpp`.
+- Do not retry this exact scale-broadcast approach; shuffle overhead and/or compiler scheduling cost exceeds saved FP8 scale conversion on GB10.
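
Reviewer note: the Task 3 keep/reject rule can be checked against the Performance table with a few lines of arithmetic. The sketch below is not part of the patch or of either repository. The names `PHASE2`, `PHASE3`, `pct_change`, and `keep` are hypothetical, and the throughput values are copied from the table rather than measured.

```python
# Hypothetical helper for re-checking the Task 3 perf gate.
# S_PP t/s values are copied from the Performance table in this patch;
# this script measures nothing itself. Keys are (shape, npp).
PHASE2 = {("bm32", 512): 1442.28, ("bm32", 2048): 1471.77,
          ("base", 512): 1310.13, ("base", 2048): 1336.02}
PHASE3 = {("bm32", 512): 1392.46, ("bm32", 2048): 1422.74,
          ("base", 512): 1201.69, ("base", 2048): 1221.25}


def pct_change(old, new):
    """Relative change from old to new, in percent (negative = slower)."""
    return (new - old) / old * 100.0


def keep(shape, min_gain=1.0, max_loss=1.0):
    """Apply the Task 3 rule: keep only if at least one npp gains
    min_gain percent or more and no npp loses more than max_loss percent."""
    deltas = [pct_change(PHASE2[(shape, n)], PHASE3[(shape, n)])
              for n in (512, 2048)]
    improved = any(d >= min_gain for d in deltas)
    regressed = any(d < -max_loss for d in deltas)
    return improved and not regressed, deltas


if __name__ == "__main__":
    for shape in ("bm32", "base"):
        kept, deltas = keep(shape)
        verdict = "keep" if kept else "reject"
        print(shape, verdict, [f"{d:+.2f}%" for d in deltas])
```

With the table values this gives roughly -3.45% and -3.33% for `bm32` and -8.28% and -8.59% for `base`, so both shapes fail the gate, consistent with the recorded rejection.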