LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Commit Graph

Author	SHA1	Message	Date

Ettore Di Giacinto

faeb5b457c

analysis: NVFP4 closes the decode gap too (547->619, ~93% of vLLM)

Measured npl=128 cold A/B: NVFP4 decode 619 vs Q4_K 547 (+13%), closing the gap to
vLLM (667) from ~22% to ~7%. NVFP4's FP4-MMA kernel is more bandwidth-efficient at
the thin n=128 decode shape than Q4_K int8-MMQ (which ran 2.1x above the floor), so
it IS the better int4 decode GEMM the diagnosis called for - no multi-day
Marlin-for-K-quants needed. With NVFP4, llama.cpp on GB10 is ahead on prefill
(1209 vs 800) and within ~7% on decode. Remaining 7% = optional FP4 kernel tuning.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 21:42:17 +00:00
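
The percentages in this commit message can be rechecked from the quoted figures alone. The sketch below is illustrative and not part of the commit; the variable and function names are made up here, and the message does not state the throughput unit, so the numbers are treated as plain ratios. One detail it surfaces: the remaining gap is about 7.8% when expressed as "vLLM is faster by" and about 7.2% when expressed as "NVFP4 falls short of vLLM by", so the "~7%" in the message is consistent with either reading.

```python
# Figures quoted in the commit message (unit not stated there).
Q4_K_DECODE = 547
NVFP4_DECODE = 619
VLLM_DECODE = 667
LLAMACPP_PREFILL = 1209
VLLM_PREFILL = 800


def pct_faster(a: float, b: float) -> float:
    """Return how much faster a is than b, in percent of b."""
    return (a / b - 1.0) * 100.0


nvfp4_gain = pct_faster(NVFP4_DECODE, Q4_K_DECODE)      # NVFP4 over Q4_K
gap_before = pct_faster(VLLM_DECODE, Q4_K_DECODE)       # vLLM over Q4_K
gap_after = pct_faster(VLLM_DECODE, NVFP4_DECODE)       # vLLM over NVFP4
share_of_vllm = NVFP4_DECODE / VLLM_DECODE * 100.0      # NVFP4 as % of vLLM
shortfall = 100.0 - share_of_vllm                       # NVFP4 below vLLM
prefill_lead = pct_faster(LLAMACPP_PREFILL, VLLM_PREFILL)

for name, value in [
    ("NVFP4 vs Q4_K decode gain", nvfp4_gain),
    ("vLLM lead over Q4_K", gap_before),
    ("vLLM lead over NVFP4", gap_after),
    ("NVFP4 as share of vLLM", share_of_vllm),
    ("NVFP4 shortfall vs vLLM", shortfall),
    ("llama.cpp prefill lead", prefill_lead),
]:
    print(f"{name}: {value:.1f}%")
```
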

Ettore Di Giacinto

6e0b910210

analysis: decode gap is GPU/kernel-bound, NOT host overhead (corrects premise)

Rigorous re-measurement on pr24423: concurrent decode is GPU-compute-bound (~96%
util, sampled), CUDA graphs ARE enabled at npl=128 (94/98 calls replay a captured
graph; n_kv padded to 256 keeps topology stable), and graphs ON vs OFF is only
+1.5% at npl=128. The earlier '20% GPU util / 170ms host' read was a windowing
error (whole-run nsys vs decode-windowed). So no host/graph patch helps. The real
547->667 gap is the quantized DECODE GEMM: mul_mat_q (Q4_K/Q6_K) is ~68% of decode
GPU time and runs ~2.1x above the GB10 bandwidth floor (poorly tuned for the thin
n=128 shape); vLLM's Marlin int4 runs closer. Lever = a Marlin-style int4 decode
kernel for K-quants (or a Marlin-friendly int4 serving format), not host work.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-21 21:32:58 +00:00
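
As a rough illustration of what "runs ~2.1x above the bandwidth floor" means: for a memory-bound kernel, the floor is the time needed just to move its data at the device's memory bandwidth, and the quoted factor is measured time divided by that floor. The sketch below is not from the commit. The function names are invented for illustration, and the byte count, bandwidth, and kernel time are placeholder values chosen to give a round result, not GB10 specifications or measurements. Only the 94/98 graph-replay count is taken from the message.

```python
def bandwidth_floor_seconds(bytes_moved: float, bandwidth_bytes_per_s: float) -> float:
    """Minimum time to stream bytes_moved at the given memory bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return bytes_moved / bandwidth_bytes_per_s


def ratio_above_floor(measured_seconds: float, bytes_moved: float,
                      bandwidth_bytes_per_s: float) -> float:
    """How many times slower than the bandwidth floor a kernel ran."""
    floor = bandwidth_floor_seconds(bytes_moved, bandwidth_bytes_per_s)
    return measured_seconds / floor


# Placeholder inputs (assumptions, NOT measured values or hardware specs):
# 1 GB of quantized weights read per step, 100 GB/s bandwidth, 21 ms measured.
example_ratio = ratio_above_floor(
    measured_seconds=0.021,
    bytes_moved=1e9,
    bandwidth_bytes_per_s=100e9,
)

# From the commit message: 94 of 98 calls replayed a captured CUDA graph.
graph_replay_share = 94 / 98

print(f"ratio above floor: {example_ratio:.2f}x")
print(f"graph replay share: {graph_replay_share:.1%}")
```

A ratio near 1.0 would mean the kernel is already limited by memory traffic; a ratio around 2 suggests headroom in the kernel itself, which is the argument the commit makes for looking at the decode GEMM rather than host overhead.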

2 Commits