docs(paged): dense NVFP4 fair re-run with max_prefill_tokens budget sweep

Re-run the dense Qwen3.6-27B NVFP4 vs vLLM A/B with patch 0013's QoS
prefill budget enabled (LLAMA_PREFILL_BUDGET swept over 256/512/1024),
fixing the prior run that left prefill unbounded and let high-concurrency
prefills starve each other.

At the saturated npl128 point budget=256 is the best lever: decode_agg
134.6 -> 161.2 tok/s (+19.8%) and TTFT 491.2 s -> 305.4 s (-37.8%) vs the
starved stock run, moving llama from 34.5% to 41.3% of vLLM decode. Larger
budgets help less; at light/moderate concurrency the budget is net-negative
for TTFT because this all-at-once workload has no in-flight decode to protect
at t=0. Documented honestly: a real but narrow high-concurrency lever, not a
gap-closer (vLLM still ~2.4x decode / ~12x lower TTFT at npl128).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-06-23 21:22:07 +00:00
parent 2975a74fb4
commit c8b1f16507

View File

@@ -106,3 +106,63 @@ batching efficiency at high concurrency** - the runtime/scheduler lever (the sma
regime). The kernel itself is at parity (npl8). Next step: a fair re-run with the prefill budget
on, plus decode-batch tuning, to get llama's true high-concurrency numbers before concluding the
absolute gap.
---
## Fair re-run (max_prefill_tokens on)
The prior tables ran llama-server **without** the QoS prefill budget (patch 0013). This section
re-runs the same A/B with `LLAMA_PREFILL_BUDGET` set, sweeping the per-step prompt-token cap over
**256 / 512 / 1024**. Everything else is byte-identical to the prior run: dev-tree llama-server
(branch paged, HEAD `151343b`), `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
`LLAMA_KV_PAGED=1`, same workload (512-token unique prompt, `max_tokens=256`, `temperature=0`,
`ignore_eos`), same harness (`h2h_moe_sweep.sh` -> `h2h_cli.py`). vLLM numbers are unchanged
(carried over from the committed dense table, not re-run).
### DENSE Qwen3.6-27B - budget sweep (decode agg tok/s | TTFT mean ms | peak GB)
| npl | metric | stock (no budget) | budget 256 | budget 512 | budget 1024 | vLLM |
|----:|--------|------------------:|-----------:|-----------:|------------:|-----:|
| 8 | decode agg | 63.8 | 63.5 | 63.8 | 63.5 | 64.3 |
| 8 | TTFT ms | 2029 | 4255 | 3756 | 2653 | 2593 |
| 32 | decode agg | 108.9 | 105.7 | 107.7 | 108.8 | 189.8 |
| 32 | TTFT ms | 13212 | 23114 | 18934 | 13912 | 7477 |
| 64 | decode agg | 126.2 | 132.0 | 131.2 | 118.2 | 284.2 |
| 64 | TTFT ms | 53818 | 109455 | 74272 | 92450 | 12942 |
| 128 | decode agg | 134.6 | **161.2** | 146.9 | 128.3 | 390.7 |
| 128 | TTFT ms | 491195| **305423**| 543448| 424058| 24806 |
Peak host GB is budget-independent (on-demand paged KV grows with concurrency): ~51.5 (npl8) ->
~61.5 (npl32) -> ~74.7 (npl64) -> ~93.5 (npl128) for every budget, vs vLLM's flat ~112.1.
### Best budget = 256 (only the saturated npl128 regime benefits)
At the fully-saturated point (npl128), **budget 256 is the clear winner on both axes**:
- **decode_agg: 134.6 -> 161.2 tok/s (+19.8%)** vs the starved stock run.
- **TTFT mean: 491.2 s -> 305.4 s (-37.8%, -186 s)** vs stock.
- llama decode as % of vLLM at npl128: **34.5% -> 41.3%**. TTFT still ~12x vLLM's 24.8 s.
Larger budgets help less at npl128 (512 -> 146.9 tok/s; 1024 -> 128.3, i.e. ~stock) because a
looser cap lets a long prefill grab a bigger slice per step and re-introduce decode jitter. So
the tightest cap (256) protects in-flight decode the most when the box is saturated.
### Honest caveat: this bursty workload is the worst case for TTFT
At npl 8 / 32 / 64 the budget **raised** TTFT (e.g. npl8 2029 -> 4255 ms at budget 256) and left
decode_agg roughly flat. Reason: the harness fires all N requests simultaneously, so at t=0 there
is **no in-flight decode to protect** - capping prefill purely defers first tokens. The budget
only pays off once enough slots are decoding that an unbounded prefill would starve them, which on
this box happens only at npl128. Budget 1024 tracks stock closely at light load (npl8 TTFT 2653 ~
stock 2029) because a 512-token prompt fits in one <=1024 step. In a steadier (staggered) arrival
pattern the budget would protect decode jitter without the burst-TTFT penalty; that regime is not
exercised here.
### Bottom line (dense)
The prefill budget is a **real but narrow** lever on this workload: at maximum saturation
(npl128) budget=256 lifts decode_agg ~20% and cuts TTFT ~38% vs the starved run, moving llama
from 34.5% to 41.3% of vLLM decode. It does **not** close the gap - vLLM still decodes ~2.4x
faster and keeps TTFT ~12x lower at npl128, and scales monotonically where llama plateaus. At
light/moderate concurrency the budget is net-negative for TTFT in this all-at-once workload, so it
should be applied selectively (high-concurrency serving), not as an unconditional default.