LocalAI/backend/cpp/llama-cpp
Ettore Di Giacinto 2975a74fb4 docs(paged): Qwen3.6 NVFP4 apples-to-apples scorecard (llama vs vLLM, dense + MoE)
Full 4-way sweep (npl 8/32/64/128): dense Qwen3.6-27B (clean W4A4) + MoE
Qwen3.6-35B-A3B (vLLM Marlin NvFp4). Parity at npl 8; vLLM is ~2.8-2.9x ahead
on decode at npl 128. llama TTFT explodes at high concurrency: this run was
made WITHOUT max_prefill_tokens (0013), and the prefill starvation also drags
down decode_agg. A fair re-run with the QoS budget is pending. llama wins on
on-demand (paged) memory.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-23 20:21:50 +00:00