From c8b1f165076ca80016a0403789bbf888aa684829 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Tue, 23 Jun 2026 21:22:07 +0000 Subject: [PATCH] docs(paged): dense NVFP4 fair re-run with max_prefill_tokens budget sweep Re-run the dense Qwen3.6-27B NVFP4 vs vLLM A/B with patch 0013's QoS prefill budget enabled (LLAMA_PREFILL_BUDGET swept over 256/512/1024), fixing the prior run that left prefill unbounded and let high-concurrency prefills starve each other. At the saturated npl128 point budget=256 is the best lever: decode_agg 134.6 -> 161.2 tok/s (+19.8%) and TTFT 491.2 s -> 305.4 s (-37.8%) vs the starved stock run, moving llama from 34.5% to 41.3% of vLLM decode. Larger budgets help less; at light/moderate concurrency the budget is net-negative for TTFT because this all-at-once workload has no in-flight decode to protect at t=0. Documented honestly: a real but narrow high-concurrency lever, not a gap-closer (vLLM still ~2.4x decode / ~12x lower TTFT at npl128). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- .../patches/paged/QWEN36_NVFP4_BENCH.md | 60 +++++++++++++++++++ 1 file changed, 60 insertions(+) diff --git a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md index 6b45f2e17..dcf284e94 100644 --- a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md +++ b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md @@ -106,3 +106,63 @@ batching efficiency at high concurrency** - the runtime/scheduler lever (the sma regime). The kernel itself is at parity (npl8). Next step: a fair re-run with the prefill budget on, plus decode-batch tuning, to get llama's true high-concurrency numbers before concluding the absolute gap. + +--- + +## Fair re-run (max_prefill_tokens on) + +The prior tables ran llama-server **without** the QoS prefill budget (patch 0013). This section +re-runs the same A/B with `LLAMA_PREFILL_BUDGET` set, sweeping the per-step prompt-token cap over +**256 / 512 / 1024**. Everything else is byte-identical to the prior run: dev-tree llama-server +(branch paged, HEAD `151343b`), `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`, +`LLAMA_KV_PAGED=1`, same workload (512-token unique prompt, `max_tokens=256`, `temperature=0`, +`ignore_eos`), same harness (`h2h_moe_sweep.sh` -> `h2h_cli.py`). vLLM numbers are unchanged +(carried over from the committed dense table, not re-run). + +### DENSE Qwen3.6-27B - budget sweep (decode agg tok/s | TTFT mean ms | peak GB) + +| npl | metric | stock (no budget) | budget 256 | budget 512 | budget 1024 | vLLM | +|----:|--------|------------------:|-----------:|-----------:|------------:|-----:| +| 8 | decode agg | 63.8 | 63.5 | 63.8 | 63.5 | 64.3 | +| 8 | TTFT ms | 2029 | 4255 | 3756 | 2653 | 2593 | +| 32 | decode agg | 108.9 | 105.7 | 107.7 | 108.8 | 189.8 | +| 32 | TTFT ms | 13212 | 23114 | 18934 | 13912 | 7477 | +| 64 | decode agg | 126.2 | 132.0 | 131.2 | 118.2 | 284.2 | +| 64 | TTFT ms | 53818 | 109455 | 74272 | 92450 | 12942 | +| 128 | decode agg | 134.6 | **161.2** | 146.9 | 128.3 | 390.7 | +| 128 | TTFT ms | 491195| **305423**| 543448| 424058| 24806 | + +Peak host GB is budget-independent (on-demand paged KV grows with concurrency): ~51.5 (npl8) -> +~61.5 (npl32) -> ~74.7 (npl64) -> ~93.5 (npl128) for every budget, vs vLLM's flat ~112.1. + +### Best budget = 256 (only the saturated npl128 regime benefits) + +At the fully-saturated point (npl128), **budget 256 is the clear winner on both axes**: + +- **decode_agg: 134.6 -> 161.2 tok/s (+19.8%)** vs the starved stock run. +- **TTFT mean: 491.2 s -> 305.4 s (-37.8%, -186 s)** vs stock. +- llama decode as % of vLLM at npl128: **34.5% -> 41.3%**. TTFT still ~12x vLLM's 24.8 s. + +Larger budgets help less at npl128 (512 -> 146.9 tok/s; 1024 -> 128.3, i.e. ~stock) because a +looser cap lets a long prefill grab a bigger slice per step and re-introduce decode jitter. So +the tightest cap (256) protects in-flight decode the most when the box is saturated. + +### Honest caveat: this bursty workload is the worst case for TTFT + +At npl 8 / 32 / 64 the budget **raised** TTFT (e.g. npl8 2029 -> 4255 ms at budget 256) and left +decode_agg roughly flat. Reason: the harness fires all N requests simultaneously, so at t=0 there +is **no in-flight decode to protect** - capping prefill purely defers first tokens. The budget +only pays off once enough slots are decoding that an unbounded prefill would starve them, which on +this box happens only at npl128. Budget 1024 tracks stock closely at light load (npl8 TTFT 2653 ~ +stock 2029) because a 512-token prompt fits in one <=1024 step. In a steadier (staggered) arrival +pattern the budget would protect decode jitter without the burst-TTFT penalty; that regime is not +exercised here. + +### Bottom line (dense) + +The prefill budget is a **real but narrow** lever on this workload: at maximum saturation +(npl128) budget=256 lifts decode_agg ~20% and cuts TTFT ~38% vs the starved run, moving llama +from 34.5% to 41.3% of vLLM decode. It does **not** close the gap - vLLM still decodes ~2.4x +faster and keeps TTFT ~12x lower at npl128, and scales monotonically where llama plateaus. At +light/moderate concurrency the budget is net-negative for TTFT in this all-at-once workload, so it +should be applied selectively (high-concurrency serving), not as an unconditional default.