Files
LocalAI/backend/cpp/llama-cpp
Ettore Di Giacinto 122df1c620 analysis: vLLM throughput gap decomposed - spec-dec is the per-user lever
Per-user decode is at parity without spec-dec (10.2 vs 11.7, bandwidth-bound).
vLLM's per-user speed = speculative decoding (lossless, target-verified). GB10 is
best-case (bandwidth-bound + idle compute); llama.cpp spec-dec measured 2.9x on
dense Qwen2.5-32B. Qwen3-32B has no native MTP - use Qwen3-1.7B draft or EAGLE3
head. Recommendation: make spec-dec easy for dense >=14B on Blackwell (keeps
Q4_K_M quality, no kernel). Prefill-kernel + continuous-batching are separate
(TTFT / aggregate). Our own DGX run pending (box rebooted, llama-cli hangs).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-20 08:40:20 +00:00
..
2026-04-12 08:51:30 +02:00