docs(paged): scope W4A16 direct activation experiment

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 10:59:56 +00:00
parent fc5d5e4ff3
commit ef578866c8
2 changed files with 554 additions and 0 deletions

@@ -733,6 +733,15 @@ work needs a larger redesign that improves the grouped kernel body and removes
or fuses sorted activation movement. Near-term GB10 parity work should return to
broader prefill/GDN/MoE design or hardware-pivot benchmarking.
Phase61 is scoped as that larger W4A16 kill-gate, not as a committed code
change: `docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md`.
It proposes a default-off `LLAMA_W4A16_DIRECT_A=1` experiment that consumes the
original activation tensor plus the existing `ids_to_sorted` map directly,
removing Phase60's sorted activation gather and separate cast kernels before any
grouped-kernel body rewrite. Keep the experiment only if it improves forced W4A16
S_PP by at least `+12%` and reaches at least `0.75x` the default FP4-MMQ figure;
otherwise reject it and do not continue W4A16 body tuning.
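
The keep/reject rule above can be sketched as a small check. This is only an
illustration, not part of the plan or of any existing benchmark script: the
function name `phase61_keep` and its parameters are hypothetical, S_PP is
assumed to be a higher-is-better prefill throughput, and the `+12%` baseline is
assumed to be forced W4A16 without `LLAMA_W4A16_DIRECT_A=1`.

```python
def phase61_keep(baseline_spp: float, direct_spp: float, fp4_mmq_spp: float,
                 min_gain: float = 0.12, min_ratio: float = 0.75) -> bool:
    """Return True only if both Phase61 gates pass (hypothetical helper).

    baseline_spp: forced W4A16 S_PP without the direct-activation flag (assumed baseline)
    direct_spp:   forced W4A16 S_PP with LLAMA_W4A16_DIRECT_A=1
    fp4_mmq_spp:  S_PP of the default FP4-MMQ path
    """
    if baseline_spp <= 0 or fp4_mmq_spp <= 0:
        raise ValueError("reference throughputs must be positive")
    gain = direct_spp / baseline_spp - 1.0   # relative improvement over the baseline
    ratio = direct_spp / fp4_mmq_spp         # fraction of default FP4-MMQ reached
    return gain >= min_gain and ratio >= min_ratio


# Illustrative numbers only, not measured results.
print(phase61_keep(100.0, 115.0, 140.0))  # both gates pass
print(phase61_keep(100.0, 110.0, 140.0))  # gain gate fails
```

Both conditions must hold at once, so a large relative gain that still leaves
W4A16 below `0.75x` of FP4-MMQ is a reject under this reading.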
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)