docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget

The prior all-at-once BURST H2H is adversarial to any prefill budget (TTFT is prefill-rate-bound, a cap only slows the drain) and showed 0016 ~= 0013.

Run a STAGGERED-arrival benchmark on the GB10 DGX (patch 0016 built @253cbae): a steady-rate client that keeps a mix of in-flight decoders + newly-arriving prefills, capturing per-request TTFT and the full inter-token-latency series. Append the metrics (in-flight decode protection + new-request TTFT, per arm) and an honest verdict to P1_DYNAMIC_BUDGET_RESULTS.md.

On staggered traffic stock's in-flight decoders freeze multi-second on every prefill admission while both budget arms keep ITL flat; 0016 (mbt512) sits at a strictly better point on the protection/TTFT frontier than 0013-256 (equal spike-free protection, materially lower TTFT and wall, higher throughput) and adds a decode-adaptive single-T knob. It does not strictly dominate stock (Pareto tradeoff: smoothness vs raw TTFT).

Verdict: 0016 earns its keep over 0013 on staggered traffic; recommend LLAMA_MAX_BATCH_TOKENS=512.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-24 08:38:51 -04:00 · 2026-06-24 10:56:13 +00:00
parent 24ce7d0823
commit f7500df64e
1 changed files with 143 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md
@@ -160,3 +160,146 @@ supersedes 0013 (legacy knob preserved). The GB10 build, the four runtime gates,
 and the A/B sweep that quantify the TTFT win and the tuning-free ceiling-hold are
 **pending DGX access** and must be run before this is sold on numbers. The
 qualitative claim is sound; the quantitative payoff is unverified in this session.
+
+## Staggered-arrival evaluation
+
+Ran on the GB10 DGX (`dgx.casa`, dev tree `~/llama-paged-dev` @ `253cbae`, patch
+0016 BUILT, `build-cuda` sm_121). The prior all-at-once **BURST** H2H (all N
+requests at t=0) is structurally adversarial to *any* prefill budget: under a
+burst, TTFT is prefill-rate-bound, so a per-step prefill cap can only slow the
+drain. That burst showed 0016 ~= 0013, no win. A **STAGGERED** arrival (requests
+trickle in while others are already decoding) is the regime 0016 is designed for:
+when a new prefill arrives, the decode-first budget should keep the
+already-decoding slots flowing (low/flat inter-token latency) while the new
+prefill takes only the leftover `T - D`. This section measures exactly that.
+
+### Harness (staggered client, dev-tree-only)
+
+`~/bench/stagger_cli.py` issues N requests at a **fixed inter-arrival rate** (not
+all at once) against `/v1/completions`, `stream=true`, `temperature 0`,
+`ignore_eos`, 512 unique-prefix tokens per prompt (unique leading token defeats
+prefix caching). It records, per request, the send time, the TTFT, and the
+absolute timestamp of **every** generated token (full ITL series); raw dumps go to
+`~/bench/stag_*/raw_*.json`, analysed by `~/bench/stagger_agg.py`. Server flags are
+**identical to the prior H2H** (`abrun.sh`): `--parallel 128 -b 2048 -ub 512 -ngl
+99 -fa on -c 131072 --no-kv-unified` with `LLAMA_KV_PAGED=1` (verified
+`n_ctx_seq=1024`, i.e. `n_stream=128` per-sequence KV, kv_unified=false; checkpoints
+at the default max=32, identical across all arms). Three to four arms per model,
+**env-only** difference, sequenced on the single GPU with PID-file stop between
+arms: **stock** (no knobs), **0013** static (`LLAMA_PREFILL_BUDGET=256`), **0016**
+dynamic (`LLAMA_MAX_BATCH_TOKENS=512`, and `1024`).
+
+**Metric definitions.** *Arrival window* = `[first send, last send]`. *In-window
+ITL* = inter-token gaps whose token lands inside the arrival window = the ITL seen
+by already-decoding slots **while new prefills are still arriving** -> the
+decode-protection metric (mean/p95/max). *freezes >Ns* = count of in-window gaps
+exceeding N seconds (decode stalls caused by a prefill admission). *TTFT* =
+first-token latency per newly-arriving request. *decode agg* = total generated /
+decode span (a staggered-run aggregate, **not** the saturated kernel ceiling; it
+is depressed by the arrival ramp + checkpoint overhead and is not the P1 figure of
+merit). *wall* = last token - first send.
+
+### Dense `q36-27b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival (~19 s window) - the discriminating regime
+
+| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
+|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
+| stock            | 1494 / 2691 / 2693 | 45 / 35 | 26891 / 46083 | 94.1 | 174.4 |
+| 0013 (pb256)     |  527 /  640 /  650 |  0 /  0 | 44763 / 90338 | 81.2 | 201.8 |
+| 0016 (mbt512)    |  730 /  897 /  901 |  0 /  0 | 33320 / 66595 | 88.4 | 185.8 |
+| 0016 (mbt1024)   | 1320 / 2050 / 2051 | 46 /  5 | 33402 / 62636 | 72.4 | 226.8 |
+
+**Read:** stock's in-flight decoders **freeze ~2.7 s** every time a new prefill is
+admitted (35 freezes >2 s, in-window p95 2691 ms). Both small-cap budget arms
+(0013, mbt512) keep the in-flight ITL **flat and spike-free** (0 freezes >1 s).
+`mbt512` beats `0013` on **TTFT** (p95 66.6 s vs 90.3 s, mean 33.3 s vs 44.8 s),
+**throughput** (88.4 vs 81.2) and **wall** (186 s vs 202 s) at the same spike-free
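+protection. `mbt1024` admits bigger prefill chunks, so it reintroduces spikes (46
+freezes >1 s, 5 >2 s) for only a marginal TTFT gain (p95 62.6 s vs 66.6 s, mean
+unchanged at 33.4 s vs 33.3 s) and at lower throughput (72.4 vs 88.4) and longer
+wall (227 s vs 186 s) -> the per-step prefill-chunk size is the protection/TTFT
+dial.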
+
+### Dense, light load: 32 reqs, max_tokens 64, 400 ms inter-arrival (~12 s window) - non-saturated control
+
+| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
+|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
+| stock         | 810 / 2324 / 2324 | 25 / 15 | 10604 / 18872 | 49.0 | 42.3 |
+| 0013 (pb256)  | 443 /  572 /  607 |  0 /  0 | 18608 / 38347 | 38.0 | 54.7 |
+| 0016 (mbt512) | 597 /  858 /  863 |  0 /  0 | 14506 / 28055 | 43.9 | 47.4 |
+
+Same shape with shorter, churning requests: stock 15 freezes >2 s, both budget
+arms 0; `mbt512` again beats `0013` on TTFT (p95 28.1 s vs 38.3 s), throughput and
+wall at equal protection.
+
+### MoE `q36-35b-a3b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival
+
+| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
+|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
+| stock         | 706 / 1146 / 1148 | 132 / 0 |  2774 /  5105 | 202.4 | 81.1 |
+| 0013 (pb256)  | 194 /  273 /  280 |   0 / 0 | 18205 / 36023 | 170.3 | 96.5 |
+| 0016 (mbt512) | 275 /  366 /  373 |   0 / 0 | 11940 / 22453 | 191.4 | 85.8 |
+
+MoE decode is ~2x faster (3 B active), so the baseline ITL is ~240 ms and stock's
+prefill freezes are shorter (~1.1 s, 132 of them >1 s, none >2 s) but **still
+present**; budget arms hold the in-flight ITL near baseline (p95 273-366 ms).
+`mbt512` again dominates `0013` (TTFT mean 11.9 s vs 18.2 s, p95 22.5 s vs 36.0 s,
+throughput 191 vs 170, wall 86 vs 96). Because MoE prefill is cheap, **stock's
+TTFT is far lower** (2.8 s mean) - the TTFT cost of decode protection is most
+visible here.
+
+### Near-burst control: dense, 64 reqs, 150 ms inter-arrival (~9.5 s window)
+
+At 150 ms the 64 prompts pile in faster than the ~94-127 tok/s drain, so the run
+degenerates into a **burst** (window 9.5 s << per-request TTFT of 240-308 s; no
+token lands inside the window, so the in-window protection metric is empty). This
+reproduces the prior burst null: TTFT stock 267 s / 0013 291 s / mbt512 279 s /
+mbt1024 240 s, decode agg 127 / 102 / 106 / 122, wall 401 / 443 / 432 / 375 s -
+budget ~= stock: stock is marginally better than 0013 and mbt512 on TTFT and has
+the best throughput, while mbt1024 is marginally lower on TTFT and wall. This is
+the control, not 0016's target regime.
+
+### Structural note (intellectual honesty)
+
+At `T = 512 = n_ubatch`, `prefill_budget_step = max(n_ubatch, T - D) = 512`
+**constant**, so `mbt512` behaves as a *static* 512-token prefill cap - the dynamic
+floor binds and the `T - D` term never bites. Its edge over `0013`'s 256 is
+therefore mostly "a larger, `n_ubatch`-aligned cap", not the adaptivity per se. The
+genuine decode-adaptive `T - D` is exercised only at `T >= 1024` (`mbt1024`:
+prefill chunk ~`1024 - D`, auto-shrinking as decode load `D` rises). Across all
+settings the per-step prefill-chunk size is a monotonic protection/TTFT dial:
+256 (0013) -> 512 (mbt512) -> ~960 (mbt1024) trades flatter decode (dense
+in-window ITL p95 640 -> 897 -> 2050 ms) for lower TTFT (p95 90.3 -> 66.6 ->
+62.6 s), with most of the TTFT gain already realised at 512. The distinctive
+value of the dynamic budget is the **safety property**: it
+lets you set a *high* `T` for low-load TTFT while guaranteeing the per-step token
+count auto-shrinks so decode is never starved when load rises - which is precisely
+what stock lacks (stock = unbounded prefill chunk = the freezes).
+
+### Verdict (honest)
+
+- **Does 0016 keep the in-flight decoders' ITL low/flat when new prefills arrive,
+  vs stock's spikes?** **Yes, decisively, on staggered traffic.** Stock's
+  already-decoding slots freeze on every prefill admission (dense: 35 freezes >2 s,
+  in-window ITL p95 2.7 s; light: 15 >2 s; MoE: 132 >1 s). Both small-cap budget
+  arms (0013, mbt512) eliminate them (0 freezes >1 s, flat in-window ITL); the
+  larger `mbt1024` cap does not (dense: 46 freezes >1 s, 5 >2 s). This is the real
+  P1 win and it shows **only** under staggered arrival, never under the burst.
+- **Does it bound new-request TTFT?** Relative to **0013**, yes (`mbt512` TTFT is
+  26-38 % lower on the 64-request dense and MoE runs, 22-27 % lower on the light
+  dense run). Relative to **stock**, **no** - stock has the lowest TTFT
+  precisely because it lets prefill stampede the decoders (that stampede *is* the
+  freeze). New-req TTFT vs in-flight ITL is a genuine Pareto tradeoff, not a free
+  lunch; this does not manufacture a TTFT-beats-stock claim.
+- **Does the dynamic budget beat BOTH stock AND 0013, or is it ~= 0013 here too?**
+  It **does not tie 0013 here** (unlike the burst): at `T=512`, 0016 sits at a
+  better point on the protection/TTFT frontier than 0013-256 (the same zero-freeze
+  protection at a somewhat higher but still flat in-window ITL, with materially
+  lower TTFT and wall and higher throughput), and it adds a principled,
+  decode-adaptive, single-`T` way to move along that frontier (one config across
+  dense and MoE) that 0013's hand-picked 256 cannot. It does **not** strictly
+  dominate stock: 0016 wins decode smoothness (no multi-second freezes), stock
+  wins raw TTFT/throughput. Decode **throughput** stays kernel-capped (staggered
+  aggregate ~72-94 dense / ~170-202 MoE, ordering stock > `mbt512` > 0013, with
+  `mbt1024` lowest on dense, from prefill-interleaving cost, not a kernel
+  difference) - the P1 win is latency-under-load, as expected.
+
+**Bottom line:** 0016 **earns its keep over 0013 on staggered traffic** - same
+spike-free decode protection at a strictly better TTFT/throughput/wall point, plus
+a decode-adaptive knob that holds one config across loads and model types. Against
+stock it is a deliberately different operating point that trades new-request TTFT
+(at `mbt512`, roughly 4-9 s higher mean and 9-21 s higher p95 than stock) to
+remove the multi-second in-flight decode freezes stock cannot avoid. Keep 0016;
+recommend `LLAMA_MAX_BATCH_TOKENS=512` as the default protective setting and
+higher `T` when low-load TTFT matters more than ITL flatness.
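
Two pieces of the write-up above are compact enough to sketch in code: the per-step budget rule quoted in the structural note, `prefill_budget_step = max(n_ubatch, T - D)`, and the in-window ITL / freeze-count metric from the metric definitions. The Python below is a minimal illustration under stated assumptions, not the code in patch 0016 or in `stagger_agg.py` (neither is shown in this commit). The function names, the `RequestTrace` record, and the reading of "token lands inside the arrival window" as "the later token of the gap is timestamped inside `[first send, last send]`" are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


def prefill_budget_step(max_batch_tokens: int, decode_tokens: int,
                        n_ubatch: int = 512) -> int:
    """Per-step prefill budget, max(n_ubatch, T - D), as quoted in the note.

    max_batch_tokens is T, decode_tokens is D. Names are hypothetical.
    """
    if max_batch_tokens <= 0 or n_ubatch <= 0 or decode_tokens < 0:
        raise ValueError("T and n_ubatch must be positive, D non-negative")
    return max(n_ubatch, max_batch_tokens - decode_tokens)


@dataclass
class RequestTrace:
    """Hypothetical per-request record: send time plus every token timestamp."""
    send_time: float
    token_times: List[float]  # absolute seconds, ascending


def in_window_gaps(traces: List[RequestTrace]) -> List[float]:
    """Inter-token gaps whose later token lands in [first send, last send]."""
    window_start = min(t.send_time for t in traces)
    window_end = max(t.send_time for t in traces)
    gaps = []
    for trace in traces:
        # Pair each token with its successor; the send -> first-token gap is
        # TTFT, not ITL, so it is deliberately excluded here.
        for prev, cur in zip(trace.token_times, trace.token_times[1:]):
            if window_start <= cur <= window_end:
                gaps.append(cur - prev)
    return gaps


def count_freezes(gaps: List[float], threshold_s: float) -> int:
    """Number of in-window gaps strictly longer than threshold_s seconds."""
    return sum(1 for gap in gaps if gap > threshold_s)


if __name__ == "__main__":
    # Budget rule at the two T values benchmarked above, for a few decode loads.
    for t in (512, 1024):
        steps = [prefill_budget_step(t, d) for d in (0, 64, 256, 768)]
        print(f"T={t}: {steps}")

    # Tiny synthetic trace (made-up numbers, not benchmark data).
    demo = [
        RequestTrace(send_time=0.0, token_times=[1.0, 1.2, 3.5, 3.7]),
        RequestTrace(send_time=3.6, token_times=[5.0, 5.2]),
    ]
    gaps = in_window_gaps(demo)
    print("in-window gaps:", gaps, "freezes >1s:", count_freezes(gaps, 1.0))
```

With the default `n_ubatch=512`, the first loop prints `[512, 512, 512, 512]` for `T=512` and `[1024, 960, 768, 512]` for `T=1024`, which matches the note: at `T = n_ubatch` the floor always binds, while at `T = 1024` the budget shrinks with `D` until it reaches the floor.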