mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 08:08:52 -04:00
kernel: dense MXFP4 test = free 1.44x (765->1153) but FP4-MMA untuned (~17% of ceiling)
MXFP4 dense moves prefill off int8-MMQ onto the FP4-MMA path (existing kernel) for a free 1.44x - shippable as a Blackwell dense-quant recommendation. But it's ~17% of the FP4 roofline, so the FP4-MMA kernel is itself untuned: ~4-6x still in the kernel. Sharpens the target to TUNING the FP4-MMA (serves dense+MoE, only path to beat vLLM). Marlin-style W4A16 BF16 is the alt to match on the BF16 ceiling.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -72,12 +72,29 @@ ceiling, duplicates work the FP4-MMA path already does, and FP4 on sm_121 is a *
 problem. The `fp4-grouped-moe.cu` scaffold/hook stays as a useful dispatch seam, but the kernel behind it
 should be one of (1)/(2)/(3), not a greenfield CUTLASS collective.
 
-## 6. Cheap experiment worth running next
+## 6. Cheap experiment - RESULT: MXFP4 dense = free 1.44×, but not parity (kernel still untuned)
 
-Quantize a **dense** model to **MXFP4/NVFP4** and benchmark prefill: does the existing FP4-MMA path lift dense
-from ~765 (Q4_K int8-MMQ) toward the FP4 ceiling, as it does for MoE (3441)? If yes, **dense parity may be a
-quantization choice + the existing kernel**, no new kernel, modulo the sm_121 build/miscompile fixes (3).
-(Needs an F16 source or a lossy Q4_K→MXFP4 requant for a speed-only test.)
+Requantized Qwen3-32B dense → MXFP4 (forced attn+ffn to mxfp4 via `--tensor-type`, `--allow-requantize`,
+speed-only test) and benched prefill:
+
+| quant | kernel | pp512 (t/s) | pp2048 (t/s) | vs Q4_K (pp512) |
+|---|---|---|---|---|
+| Q4_K_M | int8-MMQ | 765 | 763 | 1.0× |
+| **MXFP4** | **FP4-MMA** | **1099** | **1153** | **1.44×** |
+
+**Findings:**
+- **MXFP4 dense is a real, free 1.44× over Q4_K**: it takes only a requantize, and the existing FP4-MMA path engages for
+dense weights on GB10. Worth shipping as a **Blackwell dense-quant recommendation** in the gallery (no kernel work).
+- **But it is NOT parity.** 1153 t/s = **~17% of the FP4 ceiling (~6,600)** / ~35% of the BF16 ceiling. So the
+**FP4-MMA kernel is itself untuned** (consistent with the MoE measurement, ~5% of FP4 peak). MXFP4 moves dense
+from the int8 path (765) onto the FP4 path (1153), but the FP4 kernel leaves ~4–6× on the table.
+- **So the kernel work is confirmed and now precise: tune the FP4-MMA kernel** (it's the highest-value target, since it
+serves both dense-MXFP4 and MoE, and FP4 is the only path that can *beat* vLLM). Strategy item (3), fix and
+tune the existing FP4-MMA on sm_121, is the priority; a Marlin-style W4A16 BF16 kernel (2) is the alternative
+to *match* on the BF16 ceiling if FP4 tuning stalls.
+
+Conclusion: the cheap test did NOT collapse the kernel problem (the kernels are untuned, not just the quant), but
+it (a) gives a free 1.44× to ship now, and (b) sharpens the target to **tuning the FP4-MMA kernel**.
 
 ## Sources
 GB10 peaks (measured): forums.developer.nvidia.com/t/351993, /360142, /373618. Marlin: github.com/IST-DASLab/marlin,
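Review note: the arithmetic behind the headline numbers can be re-derived from the table. The sketch below is only a sanity check, not part of the commit. It uses the four throughput figures and the ~6,600 t/s FP4 ceiling exactly as quoted in the hunk. The BF16 ceiling is never stated, so the value here is an assumption back-derived from the "~35%" remark, and all variable and function names are hypothetical. On these inputs the 1.44× figure matches the pp512 column (765 → 1099), while the pp2048 column (763 → 1153) works out to roughly 1.51×.

```python
# Throughput in tokens/s, copied from the benchmark table in the hunk.
Q4K = {"pp512": 765, "pp2048": 763}
MXFP4 = {"pp512": 1099, "pp2048": 1153}

# FP4 ceiling (~6,600 t/s) is quoted in the findings.
FP4_CEILING = 6600
# ASSUMPTION: the BF16 ceiling is not stated; back-derive it from "~35% of the BF16 ceiling".
BF16_CEILING_ASSUMED = MXFP4["pp2048"] / 0.35


def speedup(new_tps: float, old_tps: float) -> float:
    """Ratio of new to old throughput."""
    return new_tps / old_tps


# Per-column speedup of MXFP4 (FP4-MMA) over Q4_K_M (int8-MMQ).
speedups = {col: round(speedup(MXFP4[col], Q4K[col]), 2) for col in Q4K}

# Fraction of the FP4 ceiling reached at pp2048, and the remaining headroom factor.
fp4_fraction = round(MXFP4["pp2048"] / FP4_CEILING, 3)
fp4_headroom = round(FP4_CEILING / MXFP4["pp2048"], 1)

print("speedup per column:", speedups)
print("fraction of FP4 ceiling:", fp4_fraction)
print("headroom to FP4 ceiling (x):", fp4_headroom)
print("implied BF16 ceiling (assumed, t/s):", round(BF16_CEILING_ASSUMED))
```

The headroom factor of about 5.7× sits inside the "~4–6×" range quoted in the findings, and 1153 / 6600 is consistent with "~17% of the FP4 ceiling".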