From 362eea90ffd52411a62b1d487b51fc0b5db23116 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Tue, 23 Jun 2026 21:39:22 +0000 Subject: [PATCH] docs(paged): fair re-run verdict - synthesize NVFP4 llama vs vLLM scorecard Phase 3 synthesis of the max_prefill_tokens (patch 0013) fair re-run: how much of the gap was prefill starvation, the genuine remaining gap to vLLM, and where par-or-beat stands per concurrency/model. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- .../patches/paged/QWEN36_NVFP4_BENCH.md | 102 ++++++++++++++++++ 1 file changed, 102 insertions(+) diff --git a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md index aba4fabc4..b9b9b0b7b 100644 --- a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md +++ b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md @@ -227,3 +227,105 @@ decode - the same ~41% ceiling the dense run hit. It does **not** close the gap: monotonically and steeply where llama only partially recovers. Net: apply the budget to saturated MoE serving when decode throughput is the objective and some extra TTFT is acceptable; for latency-sensitive MoE serving leave it off (stock TTFT was already not the bottleneck here). + +--- + +## Fair re-run verdict + +This is the synthesis after patch 0013 (`max_prefill_tokens` / `LLAMA_PREFILL_BUDGET`) was turned +on for both models. It answers three questions: how much of the apparent gap was prefill +starvation, what genuine gap to vLLM remains after that artifact is removed, and where that leaves +the "par-or-beat vLLM" goal. + +### 1. How much did patch 0013 close the gap? + +The original (stock) tables blamed two things on llama: an exploding TTFT and a flat decode curve +at high concurrency. The budget re-run shows these were **two different problems with two +different root causes**, and only one was prefill starvation. + +**Dense 27B - was genuinely prefill-starved.** Dense prefill is expensive (full 28B weights per +token), so 128 simultaneous 512-token prefills truly starved both first-tokens and decode. Budget +256 @npl128: + +| metric @npl128 | stock | budget 256 | vLLM | what closed | +|----------------|------:|-----------:|-----:|-------------| +| TTFT mean | 491.2 s | **305.4 s** (-37.8%) | 24.8 s | starvation real; -186 s recovered | +| decode_agg | 134.6 | **161.2** (+19.8%) | 390.7 | freed slots now decode | +| llama as % of vLLM decode | 34.5% | **41.3%** | 100% | +6.8 pts | + +Dense llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **99 / 56 / 46 / 41** (was 99/57/44/34). +The fix moved only the saturated tail; npl 8/32 were never starved and are unchanged. + +**MoE 35B-A3B - was NOT prefill-starved (the inversion).** Only ~3B active params, so prefill was +already cheap and stock TTFT @npl128 was 84.8 s, not dense's 491 s. There was no starvation to +rescue, so the budget could not cut TTFT - it instead converted deferred prefill into decode +steps. Budget 256 @npl128: + +| metric @npl128 | stock | budget 256 | vLLM | direction | +|----------------|------:|-----------:|-----:|-----------| +| TTFT mean | 84.8 s | 98.1 s (+15.7%, WORSE) | 7.98 s | budget costs latency here | +| decode_agg | 292.2 | **333.5** (+14.1%) | 811.1 | plateau removed | +| llama as % of vLLM decode | 36.0% | **41.1%** | 100% | +5.1 pts | + +MoE llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **84 / 52 / 44 / 41** (was 84/51/44/36). +The decisive MoE finding is the scaling curve, not the point: stock decode plateaued over the last +doubling (64->128 = +7.4%); budget 256 restored monotonic scaling (+20.4%), proving the stock flat +curve was unbounded prefill stealing steps from ready decode slots, not a kernel ceiling. + +**Combined takeaway.** Both models converge to the **same ~41% of vLLM decode at npl128** after the +fix. That convergence is the signal: once prefill starvation is removed, dense and a 12x-cheaper- +prefill MoE land on the identical ceiling, which means the remaining gap is **not** about prefill +at all - it is the decode scheduler. + +### 2. The honest remaining gap to vLLM + +After patch 0013, the residual gap is the **continuous-batched-decode efficiency** lever, and it is +real, not an artifact: + +- vLLM still decodes **~2.4x faster** at npl128 on both models (390.7 vs 161.2 dense; 811.1 vs + 333.5 MoE). +- vLLM holds TTFT **~12x lower** at npl128 (24.8 vs 30.5 s dense; 8.0 vs 98.1 s MoE) - and does so + while decoding faster, i.e. no latency/throughput trade. +- **vLLM scales monotonically and steeply** (dense 64->391, MoE 202->811 across npl 8->128); llama, + even with the budget, only **partially** recovers its scaling (dense 64->161, MoE 170->334). + +The mechanism: vLLM's scheduler interleaves prefill and decode at token granularity (chunked +prefill + paged continuous batching) every step, keeping the GPU saturated with a near-optimal mix. +Patch 0013 is a coarser tool - a static per-step prefill **cap** - which protects in-flight decode +but does not actively schedule the prefill/decode mix, and on the bursty all-at-once harness it +defers first tokens (the TTFT penalty at npl 8/32/64, and the MoE TTFT regression @npl128). The gap +that remains is the **quality of the step-by-step batching decision**, not raw kernel speed: at +npl8 the kernels are at parity (dense 99%, MoE 84%), so the per-token math is competitive - what +vLLM does better is keeping more sequences productively in-flight every step as concurrency rises. + +### 3. Where this leaves "par-or-beat vLLM", and the last lever + +**Where llama is competitive today (NVFP4, GB10):** + +- **Low concurrency (npl<=8): at parity.** Dense 99%, MoE 84% of vLLM decode, comparable TTFT. + For single-user / few-stream local serving - LocalAI's dominant mode - llama.cpp is already + there on matched NVFP4. +- **Memory efficiency: llama wins outright at every concurrency.** On-demand paged KV (dense + 52->94 GB, MoE 39->61 GB) vs vLLM's flat ~112 GB pre-reservation. On a 128 GB unified box this is + the difference between multi-tenant headroom and OOM - a genuine product advantage, not a + consolation. + +**Where llama is not competitive:** high-concurrency decode throughput (npl>=32), where vLLM is +~2-2.4x ahead and the budget only narrows it to ~41%. + +**The last lever** is therefore *not* another prefill knob (0013 has extracted what a static cap +can give) and *not* the kernel (at parity @npl8). It is **token-granular continuous-batch +scheduling**: actively interleaving chunked prefill with decode every step rather than capping +prefill, so all live slots decode while new prefills trickle in - exactly what closes vLLM's +monotonic-scaling advantage. A staggered (non-burst) arrival pattern would also let 0013 protect +decode jitter without the burst-TTFT penalty seen here, narrowing the practical gap for real +serving traffic that does not arrive all-at-once. + +### Bottom line + +Patch 0013 is validated and worth shipping as a **selective, high-concurrency QoS lever**: it +recovers dense TTFT 38% and lifts saturated decode +14-20%, converging both models to ~41% of +vLLM. But it is honestly **not a gap-closer**. The "par-or-beat vLLM" goal is **met at low +concurrency and on memory efficiency, and not met at high-concurrency decode throughput.** The +remaining ~2.4x is a continuous-batched-decode scheduling gap, not a prefill-starvation or kernel +gap - and that is the next (harder) lever, distinct from anything 0013 can touch.