Run a same-day, steady-state aggregate-decode sweep at npl 8/32/64/128
for three model classes, replacing the stale carried figure of ~75-80%
of vLLM with a full concurrency curve.
Findings:
- Dense 32B (NVFP4 vs NVFP4A16): parity at batch-8 (97% of vLLM),
72-86% at mid/high concurrency.
- Small 0.6B: parity at batch-8 (99%), 49-67% at high concurrency
(llama plateaus at ~2.0k tok/s, vLLM scales to 4.2k tok/s;
runtime/scheduler-bound).
- MoE 30B-A3B: llama-only at 290-1041 tok/s. vLLM cannot serve it on GB10
(bf16 hangs at MoE warmup and reboots the box, twice; mxfp4 GGUF expert
tensors unmappable by vLLM 0.23.0).
Batch-8 anomaly resolved: clean, isolated dense batch-8 decode runs at
~88-90 tok/s (~89 ms/step) for both paged and stock (within 2%, paged
slightly faster) and for ctx 65536 and 163840 (within 1%). The prior
471 ms/step was a mixed-load decode/prefill contention artifact, not
paged overhead, ctx allocation, or NVFP4 cost. That contention is the
case that patch 0013 (LLAMA_PREFILL_BUDGET) bounds.
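Sanity check on the step-time figures above. This is a minimal sketch,
not part of the benchmark harness: it assumes each decode step emits one
token per sequence, and the helper name is hypothetical.

```python
def aggregate_tok_per_s(batch: int, ms_per_step: float) -> float:
    """Aggregate decode rate in tok/s.

    Assumption: every decode step produces exactly one token for each
    of the `batch` sequences, so rate = batch / step_time.
    """
    return batch * 1000.0 / ms_per_step


# Clean isolated run vs. the earlier mixed-load measurement, both batch-8.
clean = aggregate_tok_per_s(8, 89.0)
contended = aggregate_tok_per_s(8, 471.0)
print(f"clean={clean:.1f} tok/s contended={contended:.1f} tok/s")
```

Under that assumption, 89 ms/step at batch-8 works out to ~89.9 tok/s
aggregate, in line with the measured ~88-90 tok/s, while 471 ms/step
would correspond to only ~17 tok/s.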
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>