LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-24 16:49:06 -04:00

Commit Graph

Author	SHA1	Message	Date
Ettore Di Giacinto	a3abd60ae0	docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM)	2026-06-23 12:22:15 +00:00

docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM)

Same-day steady-state aggregate-decode sweep at npl 8/32/64/128 for three
model classes, replacing the stale, carried-over ~75-80%-of-vLLM figure with
a full concurrency curve.

Findings:
- Dense 32B (NVFP4 vs NVFP4A16): parity at batch-8 (97% of vLLM), 72-86%
  at mid/high concurrency.
- Small 0.6B: parity at batch-8 (99%), 49-67% at high concurrency
  (llama plateaus at ~2.0k tok/s, vLLM scales to 4.2k; runtime/scheduler-bound).
- MoE 30B-A3B: llama-only at 290-1041 tok/s. vLLM cannot serve it on GB10
  (bf16 hangs at MoE warmup and reboots the box, twice; mxfp4 GGUF expert
  tensors unmappable by vLLM 0.23.0).

Batch-8 anomaly resolved: clean, isolated dense batch-8 decode runs at ~88-90
tok/s (~89 ms/step) for both paged and stock (within 2%, paged slightly
faster) and for ctx 65536 and 163840 (within 1%). The prior 471 ms/step was a
mixed-load decode/prefill contention artifact, not paged overhead, ctx
allocation, or NVFP4 cost. That contention is the case that patch 0013
(LLAMA_PREFILL_BUDGET) bounds.
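
The step-time and throughput figures above can be cross-checked with a short
calculation. This is a minimal sketch, assuming each decode step emits exactly
one token per active sequence; the function name is hypothetical and not part
of the repository.

```python
def aggregate_decode_tok_s(batch_size: int, ms_per_step: float) -> float:
    """Aggregate decode throughput in tokens per second.

    Assumption: one decode step produces one token for each of the
    `batch_size` sequences, so throughput = batch_size / step_time.
    """
    if batch_size <= 0 or ms_per_step <= 0:
        raise ValueError("batch_size and ms_per_step must be positive")
    return batch_size * 1000.0 / ms_per_step


# Clean isolated batch-8 decode at ~89 ms/step.
clean = aggregate_decode_tok_s(8, 89.0)

# The earlier mixed-load measurement at 471 ms/step.
contended = aggregate_decode_tok_s(8, 471.0)

print(f"clean:     {clean:.1f} tok/s")
print(f"contended: {contended:.1f} tok/s")
print(f"slowdown:  {clean / contended:.2f}x")
```

Under that assumption, ~89 ms/step at batch 8 is consistent with the reported
~88-90 tok/s, while 471 ms/step would correspond to roughly a fifth of that.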

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>


1 Commits