LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-24 08:38:51 -04:00

Files

Ettore Di Giacinto c8b1f16507 docs(paged): dense NVFP4 fair re-run with max_prefill_tokens budget sweep

Re-run the dense Qwen3.6-27B NVFP4 vs vLLM A/B with patch 0013's QoS
prefill budget enabled (LLAMA_PREFILL_BUDGET swept over 256/512/1024),
fixing the prior run that left prefill unbounded and let high-concurrency
prefills starve each other.

At the saturated npl128 point budget=256 is the best lever: decode_agg
134.6 -> 161.2 tok/s (+19.8%) and TTFT 491.2 s -> 305.4 s (-37.8%) vs the
starved stock run, moving llama from 34.5% to 41.3% of vLLM decode. Larger
budgets help less; at light/moderate concurrency the budget is net-negative
for TTFT because this all-at-once workload has no in-flight decode to protect
at t=0. Documented honestly: a real but narrow high-concurrency lever, not a
gap-closer (vLLM still ~2.4x decode / ~12x lower TTFT at npl128).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-23 21:22:07 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378 )

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847 )

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395 )

2026-06-19 00:03:37 +02:00

llama-cpp

docs(paged): dense NVFP4 fair re-run with max_prefill_tokens budget sweep