mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-25 00:59:28 -04:00
Rewrite the track-B scope into the definitive build-ready plan for the
NVFP4 FP4-MMA decode GEMM toward vLLM GB10 parity. Source-read of the
mmq.cuh/mma.cuh/quantize.cu FP4 path on the dgx paged dev tree settles two
load-bearing facts the prior draft got partly wrong:
- llama's dense path is already TRUE W4A4 (block_fp4_mmq packs 256 e2m1
values + ue4m3 scales; the MMA is kind::mxf4nvf4 e2m1.e2m1...ue4m3), so
there is no activation-bit-width work to do; the whole dense deficit is
scheduling/occupancy.
- the mmq_x selector minimizes ntiles_x, which PINS dense decode at
mmq_x=128 (weights read once). Shrinking mmq_x re-reads the 18 GB
weights, so the dense occupancy lever is mmq_y-down (BW-neutral), NOT
mmq_x-down; MoE's free lever is the per-expert mmq_x-down (patch 0015).
Adds the explicit kernel-approach decision (tune the existing FP4-MMA
mul_mat_q; reject the cutlass-SM120 rewrite, dead on GB10 and broken on
sm_121; reject the BF16-Marlin descent), the concrete build-ready changes
(mmq_y/granularity/stream-k knobs, FP4-MMA fragment invariants, the
ue4m3 scale path, and the block_fp4_mmq y-tile ABI contract for the
track-A act-quant fusion handoff), the GB10-fit rules, the bit-exact
test-backend-ops gate with decode-shape + ragged-M cases, and per-phase
expected decode_agg tables.
Verdict (honest, roofline-grounded): the decode GEMM is bandwidth-bound on
the hardware roofline (M=128 << crossover 611; weight-read floors 4-6x
above vLLM) but compute-bound in practice at ~3% FP4 eff, so 273 GB/s is
not the wall. DENSE: GO (conditional) - B+A reaches 376-394 tok/s =
90-103% of vLLM 391, gated by a P2 occupancy kill-gate (<15% FP4 eff ->
parity off). MoE: PARTIAL/NO-GO - ceiling ~76% of 811 (618) from the GEMM
alone; full MoE parity needs the non-GEMM tracks too.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>