docs(paged): record W4A16 direct activation rejection

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 11:28:11 +00:00
parent 4645935fa5
commit f7d76389b0
5 changed files with 234 additions and 13 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3404,3 +3404,35 @@ Decision:
 - Any future W4A16 parity work must be a larger redesign that improves the
  grouped kernel body and removes or fuses the sorted activation gather. Do not
  reopen the low-conflict micro-patch track.
+
+## W4A16 Direct-Activation Phase61 Result
+
+Phase61 tested the larger W4A16 direct-activation redesign. It passed default
+inference gates and opt-in direct-A correctness:
+
+- Default gates artifact:
+  `/home/mudler/bench/phase61_direct_default_gates/20260701_132057`
+- A/B artifact: `/home/mudler/bench/phase61_direct_ab/20260701_132237`
+- Default MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+- Default dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT`: `1146/1146`
+- `MUL_MAT_ID`: `806/806`
+- Forced W4A16 and direct-A MoE md5:
+  `07db32c2bcb78d17a43ed18bc22705cd`
+
+The direct path had to mirror `get_rows_cuda` flat-row source addressing.
+Decoding `ids_to_sorted` entries as token/slot pairs failed the `b=1` NVFP4 op
+cases; treating each entry as a flat row at `src_row*nb11` fixed the gate.
+
+MoE prefill A/B (`npl=32`, `ntg=4`):
+
+| path | npp512 S_PP | npp2048 S_PP |
+|------|-------------|--------------|
+| default FP4-MMQ | `2325.45` | `2423.18` |
+| forced W4A16 | `1471.05` | `1502.46` |
+| forced W4A16 direct-A | `1566.30` | `1605.82` |
+
+Decision: reject. Direct-A improved forced W4A16 by only `+6.5%` / `+6.9%` and
+reached only `0.67x` / `0.66x` of default FP4-MMQ, below the `+12%` and `0.75x`
+keep gates. The rejected direct kernel diff was saved to `/tmp/phase61-w4a16-direct-a-rejected.diff`
+and not committed. Do not continue W4A16 body tuning on GB10 as the next parity lever.
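
Note on the Phase61 decision recorded in this hunk: the reject verdict follows from simple arithmetic on the S_PP table against the keep gates stated in PARITY_HANDOFF.md (at least `+12%` over forced W4A16 and at least `0.75x` of default FP4-MMQ). A minimal Python sketch of that check, using only the numbers recorded in the table; the names `KEEP_MIN_GAIN`, `KEEP_MIN_RATIO`, and `evaluate` are illustrative assumptions and are not part of the repository or its bench tooling:

```python
# Keep gates as stated in PARITY_HANDOFF.md (names here are hypothetical).
KEEP_MIN_GAIN = 0.12   # direct-A must beat forced W4A16 S_PP by at least +12%
KEEP_MIN_RATIO = 0.75  # direct-A must reach at least 0.75x default FP4-MMQ

# S_PP values copied from the MoE prefill A/B table (npl=32, ntg=4).
S_PP = {
    "npp512": {"default": 2325.45, "forced": 1471.05, "direct_a": 1566.30},
    "npp2048": {"default": 2423.18, "forced": 1502.46, "direct_a": 1605.82},
}


def evaluate(row):
    """Return (gain over forced W4A16, ratio to default FP4-MMQ, keep?)."""
    gain = row["direct_a"] / row["forced"] - 1.0
    ratio = row["direct_a"] / row["default"]
    keep = gain >= KEEP_MIN_GAIN and ratio >= KEEP_MIN_RATIO
    return gain, ratio, keep


results = {name: evaluate(row) for name, row in S_PP.items()}

for name, (gain, ratio, keep) in results.items():
    print(f"{name}: gain={gain:+.1%} ratio={ratio:.2f}x keep={keep}")
```

Both prompt lengths miss both gates, so the result does not hinge on which threshold is treated as primary.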
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -742,6 +742,16 @@ grouped-kernel body rewrite. Keep it only if it improves forced W4A16 S_PP by at
 least `+12%` and reaches at least `0.75x` default FP4-MMQ; otherwise reject and
 do not continue W4A16 body tuning.

+Phase61 result: rejected. The direct-A kernel passed correctness after matching
+`get_rows_cuda` flat-row addressing (`MUL_MAT_ID` `806/806`; forced/direct-A
+MoE transcript md5 both `07db32c2bcb78d17a43ed18bc22705cd`) and default gates
+remained green (`8cb0ce23`, `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`). But direct-A only improved forced W4A16 S_PP `1471.05 -> 1566.30`
+at `npp=512` and `1502.46 -> 1605.82` at `npp=2048` (`+6.5%` / `+6.9%`), still
+just `0.67x` / `0.66x` of default FP4-MMQ. The direct kernel diff was not
+committed; only the safe policy/routing stub remains in the fork. Do not pursue
+more W4A16 body tuning on GB10 as the next parity lever.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1518,6 +1518,43 @@ close a `37-39%` S_PP loss, and the dominant loss is the grouped kernel body
 plus sorted activation movement. Future W4A16 parity work must be a larger
 design that changes those structures, not another metadata/body shortcut.

+### Phase 61 W4A16 direct activation kill-gate
+
+Phase61 implemented the larger direct-activation experiment behind
+`LLAMA_W4A16_DIRECT_A=1`, consuming original `src1` and `ids_to_sorted` directly
+instead of materializing `src1_sorted` and then casting it to bf16. The correct
+source addressing matched `get_rows_cuda`: `ids_to_sorted` is a flat source-row
+index addressed with `nb11`. The initial token/slot decode failed `b=1` op
+tests; the flat-row fix passed forced direct-A `MUL_MAT_ID` `806/806`.
+
+Artifacts:
+
+- default gates: `/home/mudler/bench/phase61_direct_default_gates/20260701_132057`
+- A/B: `/home/mudler/bench/phase61_direct_ab/20260701_132237`
+
+Gates:
+
+- default MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+- default dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT` `1146/1146`
+- `MUL_MAT_ID` `806/806`
+- forced W4A16 and direct-A MoE transcripts both
+  `07db32c2bcb78d17a43ed18bc22705cd`
+
+MoE prefill A/B (`npl=32`, `ntg=4`):
+
+| path | npp512 S_PP | npp2048 S_PP |
+|------|-------------|--------------|
+| default FP4-MMQ | `2325.45` | `2423.18` |
+| forced W4A16 | `1471.05` | `1502.46` |
+| forced W4A16 direct-A | `1566.30` | `1605.82` |
+
+Decision: reject. Direct-A improved forced W4A16 by only `+6.5%` / `+6.9%` and
+remained `0.67x` / `0.66x` of default FP4-MMQ, below the `+12%` and `0.75x`
+keep gates. The direct kernel diff was saved to
+`/tmp/phase61-w4a16-direct-a-rejected.diff` and not committed. W4A16 body
+tuning is no longer the next GB10 parity lever.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update