docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget

The prior all-at-once BURST H2H is adversarial to any prefill budget (TTFT is
prefill-rate-bound, a cap only slows the drain) and showed 0016 ~= 0013. Run a
STAGGERED-arrival benchmark on the GB10 DGX (patch 0016 built @253cbae): a
steady-rate client that keeps a mix of in-flight decoders + newly-arriving
prefills, capturing per-request TTFT and the full inter-token-latency series.

Append the metrics (in-flight decode protection + new-request TTFT, per arm) and
an honest verdict to P1_DYNAMIC_BUDGET_RESULTS.md. On staggered traffic stock's
in-flight decoders freeze multi-second on every prefill admission while both
budget arms keep ITL flat; 0016 (mbt512) sits at a strictly better point on the
protection/TTFT frontier than 0013-256 (equal spike-free protection, materially
lower TTFT and wall, higher throughput) and adds a decode-adaptive single-T knob.
It does not
strictly dominate stock (Pareto tradeoff: smoothness vs raw TTFT). Verdict: 0016
earns its keep over 0013 on staggered traffic; recommend LLAMA_MAX_BATCH_TOKENS=512.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-24 10:56:13 +00:00
parent 24ce7d0823
commit f7500df64e


@@ -160,3 +160,146 @@ supersedes 0013 (legacy knob preserved). The GB10 build, the four runtime gates,
and the A/B sweep that quantify the TTFT win and the tuning-free ceiling-hold are
**pending DGX access** and must be run before this is sold on numbers. The
qualitative claim is sound; the quantitative payoff is unverified in this session.
## Staggered-arrival evaluation
Ran on the GB10 DGX (`dgx.casa`, dev tree `~/llama-paged-dev` @ `253cbae`, patch
0016 BUILT, `build-cuda` sm_121). The prior all-at-once **BURST** H2H (all N
requests at t=0) is structurally adversarial to *any* prefill budget: under a
burst, TTFT is prefill-rate-bound, so a per-step prefill cap can only slow the
drain. That burst showed 0016 ~= 0013, no win. A **STAGGERED** arrival (requests
trickle in while others are already decoding) is the regime 0016 is designed for:
when a new prefill arrives, the decode-first budget should keep the
already-decoding slots flowing (low/flat inter-token latency) while the new
prefill takes only the leftover `T - D`. This section measures exactly that.
### Harness (staggered client, dev-tree-only)
`~/bench/stagger_cli.py` issues N requests at a **fixed inter-arrival rate** (not
all at once) against `/v1/completions`, `stream=true`, `temperature 0`,
`ignore_eos`, 512 unique-prefix tokens per prompt (unique leading token defeats
prefix caching). It records, per request, the send time, the TTFT, and the
absolute timestamp of **every** generated token (full ITL series); raw dumps go to
`~/bench/stag_*/raw_*.json`, analysed by `~/bench/stagger_agg.py`. Server flags are
**identical to the prior H2H** (`abrun.sh`): `--parallel 128 -b 2048 -ub 512 -ngl
99 -fa on -c 131072 --no-kv-unified` with `LLAMA_KV_PAGED=1` (verified
`n_ctx_seq=1024`, i.e. `n_stream=128` per-sequence KV, kv_unified=false; checkpoints
at the default max=32, identical across all arms). Three to four arms per model,
**env-only** difference, sequenced on the single GPU with PID-file stop between
arms: **stock** (no knobs), **0013** static (`LLAMA_PREFILL_BUDGET=256`), **0016**
dynamic (`LLAMA_MAX_BATCH_TOKENS=512`, and `1024`).
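
The fixed inter-arrival dispatch can be sketched as follows. This is a minimal illustration, not the contents of `stagger_cli.py`: `arrival_offsets`, `run_staggered` and `fake_send` are hypothetical names, and the real HTTP streaming call against `/v1/completions` is replaced by a stub.

```python
import asyncio
import time


def arrival_offsets(n_requests, interval_s):
    """Send offsets (seconds from start) for a fixed inter-arrival rate."""
    return [i * interval_s for i in range(n_requests)]


async def run_staggered(send, n_requests, interval_s):
    """Launch send(i) at each offset without waiting for earlier requests.

    `send` is any coroutine function taking the request index; requests
    overlap, so later prefills arrive while earlier ones are decoding.
    """
    start = time.monotonic()
    tasks = []
    for i, offset in enumerate(arrival_offsets(n_requests, interval_s)):
        delay = start + offset - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        tasks.append(asyncio.create_task(send(i)))
    # gather preserves submission order in its result list
    return await asyncio.gather(*tasks)


async def fake_send(i):
    """Stand-in for the streaming request (assumption, not the real client)."""
    await asyncio.sleep(0.01)
    return i


if __name__ == "__main__":
    # 64 requests at 300 ms: the last send lands ~18.9 s after the first.
    print(arrival_offsets(64, 0.3)[-1])
    print(asyncio.run(run_staggered(fake_send, 3, 0.0)))
```

With 64 requests at 300 ms the arrival window is 63 x 0.3 s, which is the "~19 s window" quoted for the dense run.
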
**Metric definitions.** *Arrival window* = `[first send, last send]`. *In-window
ITL* = inter-token gaps whose token lands inside the arrival window = the ITL seen
by already-decoding slots **while new prefills are still arriving** -> the
decode-protection metric (mean/p95/max). *freezes >Ns* = count of in-window gaps
exceeding N seconds (decode stalls caused by a prefill admission). *TTFT* =
first-token latency per newly-arriving request. *decode agg* = total generated /
decode span (a staggered-run aggregate, **not** the saturated kernel ceiling; it
is depressed by the arrival ramp + checkpoint overhead and is not the P1 figure of
merit). *wall* = last token - first send.
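
A minimal sketch of how these definitions could be computed from per-request timestamps. It is not `stagger_agg.py`: the record layout (`send`, `tokens`), the function names and the nearest-rank percentile are assumptions made for illustration.

```python
import math


def nearest_rank(values, pct):
    """Nearest-rank percentile (assumed method; the real script may differ)."""
    ordered = sorted(values)
    return ordered[math.ceil(pct / 100 * len(ordered)) - 1]


def staggered_metrics(requests, freeze_thresholds_s=(1.0, 2.0)):
    """requests: list of {'send': float, 'tokens': [absolute timestamps]}."""
    sends = [r["send"] for r in requests]
    win_start, win_end = min(sends), max(sends)  # arrival window

    # TTFT: first-token latency per request
    ttft = [r["tokens"][0] - r["send"] for r in requests if r["tokens"]]

    # In-window ITL: gaps whose later token lands inside the arrival window
    gaps = []
    for r in requests:
        toks = r["tokens"]
        for prev, cur in zip(toks, toks[1:]):
            if win_start <= cur <= win_end:
                gaps.append(cur - prev)

    all_tokens = [t for r in requests for t in r["tokens"]]
    return {
        "ttft_mean": sum(ttft) / len(ttft),
        # Empty when no token lands in the window (the near-burst case)
        "itl_mean": sum(gaps) / len(gaps) if gaps else None,
        "itl_p95": nearest_rank(gaps, 95) if gaps else None,
        "itl_max": max(gaps) if gaps else None,
        "freezes": {t: sum(g > t for g in gaps) for t in freeze_thresholds_s},
        "wall": max(all_tokens) - win_start,  # last token - first send
    }


if __name__ == "__main__":
    demo = [
        {"send": 0.0, "tokens": [1.0, 1.2, 1.4, 3.9, 4.1]},  # stalls 2.5 s
        {"send": 4.0, "tokens": [5.0, 5.2]},
    ]
    print(staggered_metrics(demo))
```

In the demo, the first request's 2.5 s gap ends inside the `[0.0, 4.0]` arrival window, so it counts as one freeze at both thresholds; the second request's tokens all land after the window and contribute only to TTFT and wall.
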
### Dense `q36-27b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival (~19 s window) - the discriminating regime
| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 1494 / 2691 / 2693 | 45 / 35 | 26891 / 46083 | 94.1 | 174.4 |
| 0013 (pb256) | 527 / 640 / 650 | 0 / 0 | 44763 / 90338 | 81.2 | 201.8 |
| 0016 (mbt512) | 730 / 897 / 901 | 0 / 0 | 33320 / 66595 | 88.4 | 185.8 |
| 0016 (mbt1024) | 1320 / 2050 / 2051 | 46 / 5 | 33402 / 62636 | 72.4 | 226.8 |

**Read:** stock's in-flight decoders **freeze ~2.7 s** every time a new prefill is
admitted (35 freezes >2 s, in-window p95 2691 ms). Both small-cap budget arms
(0013, mbt512) keep the in-flight ITL **flat and spike-free** (0 freezes >1 s),
with 0013's tighter cap giving the lower steady ITL (p95 640 ms vs 897 ms).
`mbt512` beats `0013` on **TTFT** (p95 66.6 s vs 90.3 s, mean 33.3 s vs 44.8 s),
**throughput** (88.4 vs 81.2 tok/s) and **wall** (186 s vs 202 s) at the same
spike-free protection. `mbt1024` admits bigger prefill chunks, so it reintroduces
spikes (46 freezes >1 s, 5 >2 s) for only a marginal TTFT gain over `mbt512` (p95
62.6 s vs 66.6 s, mean unchanged at 33.4 s vs 33.3 s) -> the per-step prefill-chunk
size is the protection/TTFT dial.
### Dense, light load: 32 reqs, max_tokens 64, 400 ms inter-arrival (~12 s window) - non-saturated control
| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 810 / 2324 / 2324 | 25 / 15 | 10604 / 18872 | 49.0 | 42.3 |
| 0013 (pb256) | 443 / 572 / 607 | 0 / 0 | 18608 / 38347 | 38.0 | 54.7 |
| 0016 (mbt512) | 597 / 858 / 863 | 0 / 0 | 14506 / 28055 | 43.9 | 47.4 |

Same shape with shorter, churning requests: stock has 15 freezes >2 s, both budget
arms 0; `mbt512` again beats `0013` on TTFT (p95 28.1 s vs 38.3 s), throughput
(43.9 vs 38.0 tok/s) and wall (47.4 s vs 54.7 s) at equal spike-free protection.
### MoE `q36-35b-a3b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival
| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 706 / 1146 / 1148 | 132 / 0 | 2774 / 5105 | 202.4 | 81.1 |
| 0013 (pb256) | 194 / 273 / 280 | 0 / 0 | 18205 / 36023 | 170.3 | 96.5 |
| 0016 (mbt512) | 275 / 366 / 373 | 0 / 0 | 11940 / 22453 | 191.4 | 85.8 |

MoE decode is ~2x faster (3 B active), so the baseline ITL is ~240 ms and stock's
prefill freezes are shorter (~1.1 s, 132 of them >1 s, none >2 s) but **still
present**; budget arms hold the in-flight ITL near baseline (p95 273-366 ms).
`mbt512` again beats `0013` on TTFT (mean 11.9 s vs 18.2 s, p95 22.5 s vs 36.0 s),
throughput (191 vs 170 tok/s) and wall (85.8 s vs 96.5 s) at equal spike-free
protection. Because MoE prefill is cheap, **stock's TTFT is far lower** (2.8 s
mean) - the TTFT cost of decode protection is most visible here.
### Near-burst control: dense, 64 reqs, 150 ms inter-arrival (~9.5 s window)
At 150 ms the 64 prompts pile in faster than the ~94-127 tok/s drain, so the run
degenerates into a **burst** (window 9.5 s << per-request TTFT of 240-308 s; no
token lands inside the window, so the in-window protection metric is empty). This
reproduces the prior burst null: TTFT stock 267 s / 0013 291 s / mbt512 279 s /
mbt1024 240 s, decode agg 127 / 102 / 106 / 122, wall 401 / 443 / 432 / 375 s -
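budget ~= stock: stock has the highest decode aggregate, `mbt1024` the lowest TTFT
and wall, and neither small-cap arm (0013, mbt512) beats stock on any of the three.
This is the control, not 0016's target regime.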
### Structural note (intellectual honesty)
At `T = 512 = n_ubatch`, `prefill_budget_step = max(n_ubatch, T - D) = 512`
**constant**, so `mbt512` behaves as a *static* 512-token prefill cap - the dynamic
floor binds and the `T - D` term never bites. Its edge over `0013`'s 256 is
therefore mostly "a larger, `n_ubatch`-aligned cap", not the adaptivity per se. The
genuine decode-adaptive `T - D` is exercised only at `T >= 1024` (`mbt1024`:
prefill chunk ~`1024 - D`, auto-shrinking as decode load `D` rises). Across all
settings the per-step prefill-chunk size is a clean, monotonic protection/TTFT
dial: 256 (0013) -> 512 (mbt512) -> ~960 (mbt1024) trades flatter decode for lower
TTFT. The distinctive value of the dynamic budget is the **safety property**: it
lets you set a *high* `T` for low-load TTFT while guaranteeing the per-step token
count auto-shrinks so decode is never starved when load rises - which is precisely
what stock lacks (stock = unbounded prefill chunk = the freezes).
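
The formula above can be checked with a few lines. This is a sketch that mirrors the expression `max(n_ubatch, T - D)` quoted in this note; the Python function is hypothetical and is not the patch's code, and `D = 64` is an assumed decode load chosen to match the 64-request runs.

```python
def prefill_budget_step(T, D, n_ubatch=512):
    """Per-step prefill token budget: leftover T - D, floored at n_ubatch.

    T: LLAMA_MAX_BATCH_TOKENS, D: tokens taken by in-flight decoders this step.
    """
    return max(n_ubatch, T - D)


if __name__ == "__main__":
    for T in (512, 1024):
        for D in (0, 64, 128, 600):
            print(f"T={T:5d} D={D:4d} -> prefill chunk {prefill_budget_step(T, D)}")
```

At `T = 512` every row prints 512, because the floor binds for any `D >= 0`. At `T = 1024` the chunk shrinks from 1024 to 960 at `D = 64` and hits the 512 floor once `D` exceeds 512.
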
### Verdict (honest)
- **Does 0016 keep the in-flight decoders' ITL low/flat when new prefills arrive,
vs stock's spikes?** **Yes, decisively, on staggered traffic.** Stock's
already-decoding slots freeze on every prefill admission (dense: 35 freezes >2 s,
in-window ITL p95 2.7 s; light: 15 >2 s; MoE: 132 >1 s). Both small-cap budget
arms (0013, mbt512) eliminate them (0 freezes >1 s, flat in-window ITL); `mbt1024`
does not (dense: 46 freezes >1 s, 5 >2 s). This is the
real P1 win and it shows **only** under staggered arrival, never under the burst.
- **Does it bound new-request TTFT?** Relative to **0013**, yes (26-38 % lower TTFT
across dense and MoE). Relative to **stock**, **no** - stock has the lowest TTFT
precisely because it lets prefill stampede the decoders (that stampede *is* the
freeze). New-req TTFT vs in-flight ITL is a genuine Pareto tradeoff, not a free
lunch; this does not manufacture a TTFT-beats-stock claim.
- **Does the dynamic budget beat BOTH stock AND 0013, or is it ~= 0013 here too?**
It **does not tie 0013 here** (unlike the burst): at `T=512`, 0016 sits at a
strictly better point on the protection/TTFT frontier than 0013-256 (equal
spike-free protection, materially lower TTFT and wall, higher throughput), and it adds a
principled, decode-adaptive, single-`T` way to move along that frontier (one
config across dense and MoE) that 0013's hand-picked 256 cannot. It does **not**
strictly dominate stock: 0016 wins decode smoothness (no multi-second freezes),
stock wins raw TTFT/throughput. Decode **throughput** stays kernel-capped
(staggered aggregate ~72-94 tok/s dense / ~170-202 tok/s MoE, ordering stock >
mbt512 > 0013 from prefill-interleaving cost, not a kernel difference; `mbt1024`
is lowest on dense at 72.4) - the P1 win is
latency-under-load, as expected.
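
The frontier claims can be checked mechanically against the dense 64-request table. The sketch below is an illustration, not part of the benchmark tooling: the dictionary keys and the `dominates` helper are hypothetical, and protection is measured by the freeze count (as in this verdict), not by raw in-window ITL, on which 0013 is lower.

```python
# Dense q36-27b-nvfp4, 64 reqs, 300 ms inter-arrival (values from the table).
ARMS = {
    "stock": {"freezes_gt1s": 45, "ttft_p95_ms": 46083, "decode_tok_s": 94.1, "wall_s": 174.4},
    "0013-pb256": {"freezes_gt1s": 0, "ttft_p95_ms": 90338, "decode_tok_s": 81.2, "wall_s": 201.8},
    "0016-mbt512": {"freezes_gt1s": 0, "ttft_p95_ms": 66595, "decode_tok_s": 88.4, "wall_s": 185.8},
}

# Direction of "better" per metric.
LOWER_IS_BETTER = {"freezes_gt1s": True, "ttft_p95_ms": True, "decode_tok_s": False, "wall_s": True}


def dominates(a, b):
    """True if arm a is no worse than b on every metric and better on one."""
    def cmp(metric):
        diff = a[metric] - b[metric]
        return -diff if LOWER_IS_BETTER[metric] else diff  # >0 means a better

    scores = [cmp(m) for m in LOWER_IS_BETTER]
    return all(s >= 0 for s in scores) and any(s > 0 for s in scores)


def ttft_reduction_pct(a, b):
    """Percent by which a's p95 TTFT is lower than b's."""
    return round((1 - a["ttft_p95_ms"] / b["ttft_p95_ms"]) * 100)


if __name__ == "__main__":
    mbt512, pb256, stock = ARMS["0016-mbt512"], ARMS["0013-pb256"], ARMS["stock"]
    print(dominates(mbt512, pb256))   # mbt512 vs 0013
    print(dominates(mbt512, stock))   # mbt512 vs stock
    print(dominates(stock, mbt512))   # stock vs mbt512
    print(ttft_reduction_pct(mbt512, pb256))
```

On these four metrics `mbt512` dominates `0013`, while `mbt512` and stock do not dominate each other, which is the Pareto tradeoff described above.
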
**Bottom line:** 0016 **earns its keep over 0013 on staggered traffic** - same
spike-free decode protection at a strictly better TTFT/throughput/wall point, plus
a decode-adaptive knob that holds one config across loads and model types. Against
stock it is a deliberately different operating point that trades roughly 4-9 s of
mean new-request TTFT (9-21 s at p95, `mbt512` vs stock across the three staggered
runs) to remove the multi-second in-flight decode freezes stock cannot
avoid. Keep 0016; recommend `LLAMA_MAX_BATCH_TOKENS=512` as the default
protective setting and higher `T` when low-load TTFT matters more than ITL
flatness.