mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-24 08:38:51 -04:00
docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget

The prior all-at-once BURST H2H is adversarial to any prefill budget (TTFT is prefill-rate-bound, a cap only slows the drain) and showed 0016 ~= 0013.

Run a STAGGERED-arrival benchmark on the GB10 DGX (patch 0016 built @253cbae): a steady-rate client that keeps a mix of in-flight decoders + newly-arriving prefills, capturing per-request TTFT and the full inter-token-latency series. Append the metrics (in-flight decode protection + new-request TTFT, per arm) and an honest verdict to P1_DYNAMIC_BUDGET_RESULTS.md.

On staggered traffic stock's in-flight decoders freeze multi-second on every prefill admission while both small-cap budget arms keep ITL flat; 0016 (mbt512) sits at a strictly better point on the protection/TTFT frontier than 0013-256 (equal spike-free protection, materially lower TTFT and wall, higher throughput) and adds a decode-adaptive single-T knob. It does not strictly dominate stock (Pareto tradeoff: smoothness vs raw TTFT).

Verdict: 0016 earns its keep over 0013 on staggered traffic; recommend LLAMA_MAX_BATCH_TOKENS=512.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -160,3 +160,146 @@ supersedes 0013 (legacy knob preserved). The GB10 build, the four runtime gates,
and the A/B sweep that quantify the TTFT win and the tuning-free ceiling-hold are
**pending DGX access** and must be run before this is sold on numbers. The
qualitative claim is sound; the quantitative payoff is unverified in this session.

## Staggered-arrival evaluation

Ran on the GB10 DGX (`dgx.casa`, dev tree `~/llama-paged-dev` @ `253cbae`, patch
0016 BUILT, `build-cuda` sm_121). The prior all-at-once **BURST** H2H (all N
requests at t=0) is structurally adversarial to *any* prefill budget: under a
burst, TTFT is prefill-rate-bound, so a per-step prefill cap can only slow the
drain. That burst showed 0016 ~= 0013, no win. A **STAGGERED** arrival (requests
trickle in while others are already decoding) is the regime 0016 is designed for:
when a new prefill arrives, the decode-first budget should keep the
already-decoding slots flowing (low/flat inter-token latency) while the new
prefill takes only the leftover `T - D`. This section measures exactly that.

### Harness (staggered client, dev-tree-only)

`~/bench/stagger_cli.py` issues N requests at a **fixed inter-arrival rate** (not
all at once) against `/v1/completions`, `stream=true`, `temperature 0`,
`ignore_eos`, 512 unique-prefix tokens per prompt (unique leading token defeats
prefix caching). It records, per request, the send time, the TTFT, and the
absolute timestamp of **every** generated token (full ITL series); raw dumps go to
`~/bench/stag_*/raw_*.json`, analysed by `~/bench/stagger_agg.py`. Server flags are
**identical to the prior H2H** (`abrun.sh`): `--parallel 128 -b 2048 -ub 512 -ngl
99 -fa on -c 131072 --no-kv-unified` with `LLAMA_KV_PAGED=1` (verified
`n_ctx_seq=1024`, i.e. `n_stream=128` per-sequence KV, `kv_unified=false`; checkpoints
at the default max=32, identical across all arms). Three to four arms per model,
**env-only** difference, sequenced on the single GPU with PID-file stop between
arms: **stock** (no knobs), **0013** static (`LLAMA_PREFILL_BUDGET=256`), **0016**
dynamic (`LLAMA_MAX_BATCH_TOKENS=512`, and `1024`).
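
The source of `~/bench/stagger_cli.py` is not reproduced here, so the following is only a minimal sketch of the fixed inter-arrival idea, not the actual harness. `RequestTrace`, `arrival_window_s`, `run_staggered` and the `stream_tokens` callable are hypothetical names introduced for illustration; `stream_tokens(i)` stands in for whatever streaming HTTP call the real client makes and is assumed to yield one item per generated token.

```python
import threading
import time
from dataclasses import dataclass, field


@dataclass
class RequestTrace:
    send_time: float  # when the request was issued
    token_times: list = field(default_factory=list)  # absolute time of every token

    def ttft(self):
        """Time to first token for this request."""
        return self.token_times[0] - self.send_time

    def itl(self):
        """Full inter-token-latency series (gaps between consecutive tokens)."""
        return [b - a for a, b in zip(self.token_times, self.token_times[1:])]


def arrival_window_s(n_requests, inter_arrival_s):
    """Span between the first and last send at a fixed inter-arrival rate."""
    return (n_requests - 1) * inter_arrival_s


def run_staggered(n_requests, inter_arrival_s, stream_tokens):
    """Issue request i at t0 + i * inter_arrival_s and timestamp every token.

    stream_tokens(i) is a hypothetical callable that yields one item per
    generated token (a stand-in for the real streaming client).
    """
    t0 = time.monotonic()
    traces = [None] * n_requests

    def worker(i):
        delay = t0 + i * inter_arrival_s - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        trace = RequestTrace(send_time=time.monotonic())
        for _ in stream_tokens(i):
            trace.token_times.append(time.monotonic())
        traces[i] = trace

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return traces


if __name__ == "__main__":
    # The three arrival patterns used in this section.
    for n, gap in ((64, 0.300), (32, 0.400), (64, 0.150)):
        print(f"{n} reqs @ {gap * 1000:.0f} ms -> window {arrival_window_s(n, gap):.2f} s")
```

With N requests the arrival window is `(N - 1) * interval`, which is where the ~19 s, ~12 s and ~9.5 s windows quoted in the run headings come from.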

**Metric definitions.** *Arrival window* = `[first send, last send]`. *In-window
ITL* = the inter-token gaps whose closing token lands inside the arrival window,
i.e. the ITL seen by already-decoding slots **while new prefills are still
arriving**; this is the decode-protection metric (reported as mean/p95/max).
*freezes >Ns* = count of in-window gaps exceeding N seconds (decode stalls caused
by a prefill admission). *TTFT* = first-token latency per newly-arriving request.
*decode agg* = total generated tokens / decode span (a staggered-run aggregate,
**not** the saturated kernel ceiling; it is depressed by the arrival ramp +
checkpoint overhead and is not the P1 figure of merit). *wall* = last token time -
first send time.
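
To make the definitions concrete, here is a small sketch of how the in-window metrics could be computed from per-request records. It is not `stagger_agg.py`: the record layout (`send`, `tokens`), the function names and the nearest-rank percentile are all assumptions made for illustration, and the real aggregator may differ on each point.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (assumed; the real aggregator's method is not stated)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def in_window_metrics(requests, freeze_thresholds_s=(1.0, 2.0)):
    """requests: list of {'send': float, 'tokens': [absolute token times]} in seconds."""
    first_send = min(r["send"] for r in requests)
    last_send = max(r["send"] for r in requests)

    # Keep only gaps whose closing token lands inside [first send, last send].
    gaps = []
    for r in requests:
        toks = r["tokens"]
        for prev, cur in zip(toks, toks[1:]):
            if first_send <= cur <= last_send:
                gaps.append(cur - prev)

    ttfts = [r["tokens"][0] - r["send"] for r in requests if r["tokens"]]
    last_token = max(r["tokens"][-1] for r in requests if r["tokens"])

    return {
        "itl_count": len(gaps),
        # Empty when no token lands inside the window (the near-burst case).
        "itl_mean": sum(gaps) / len(gaps) if gaps else None,
        "itl_p95": percentile(gaps, 95) if gaps else None,
        "itl_max": max(gaps) if gaps else None,
        "freezes": {n: sum(1 for g in gaps if g > n) for n in freeze_thresholds_s},
        "ttft_mean": sum(ttfts) / len(ttfts),
        "ttft_p95": percentile(ttfts, 95),
        "wall": last_token - first_send,
    }


if __name__ == "__main__":
    demo = [
        {"send": 0.0, "tokens": [0.5, 1.0, 2.5, 3.0, 5.0]},  # already decoding
        {"send": 4.0, "tokens": [6.0, 6.5]},                 # late arrival
    ]
    print(in_window_metrics(demo))
```

In the toy input the window is `[0.0, 4.0]`, so only the first request's gaps that close at 1.0, 2.5 and 3.0 count, and the single 1.5 s gap registers as one freeze >1 s.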

### Dense `q36-27b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival (~19 s window) - the discriminating regime

| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 1494 / 2691 / 2693 | 45 / 35 | 26891 / 46083 | 94.1 | 174.4 |
| 0013 (pb256) | 527 / 640 / 650 | 0 / 0 | 44763 / 90338 | 81.2 | 201.8 |
| 0016 (mbt512) | 730 / 897 / 901 | 0 / 0 | 33320 / 66595 | 88.4 | 185.8 |
| 0016 (mbt1024) | 1320 / 2050 / 2051 | 46 / 5 | 33402 / 62636 | 72.4 | 226.8 |

**Read:** stock's in-flight decoders **freeze ~2.7 s** every time a new prefill is
admitted (35 freezes >2 s, in-window p95 2691 ms). Both small-cap budget arms
(0013, mbt512) keep the in-flight ITL **flat and spike-free** (0 freezes >1 s).
`mbt512` beats `0013` on **TTFT** (p95 66.6 s vs 90.3 s, mean 33.3 s vs 44.8 s),
**throughput** (88.4 vs 81.2 tok/s) and **wall** (186 s vs 202 s) at the same
spike-free protection. `mbt1024` admits bigger prefill chunks, so it reintroduces
spikes (46 freezes >1 s, 5 >2 s) for only a marginal p95 TTFT gain over `mbt512`
(62.6 s vs 66.6 s; mean unchanged at 33.4 s vs 33.3 s) and at worse throughput and
wall -> the per-step prefill-chunk size is the protection/TTFT dial.

### Dense, light load: 32 reqs, max_tokens 64, 400 ms inter-arrival (~12 s window) - non-saturated control

| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 810 / 2324 / 2324 | 25 / 15 | 10604 / 18872 | 49.0 | 42.3 |
| 0013 (pb256) | 443 / 572 / 607 | 0 / 0 | 18608 / 38347 | 38.0 | 54.7 |
| 0016 (mbt512) | 597 / 858 / 863 | 0 / 0 | 14506 / 28055 | 43.9 | 47.4 |

Same shape with shorter, churning requests: stock 15 freezes >2 s, both budget
arms 0; `mbt512` again beats `0013` on TTFT (p95 28.1 s vs 38.3 s), throughput and
wall at equal protection.

### MoE `q36-35b-a3b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival

| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
| stock | 706 / 1146 / 1148 | 132 / 0 | 2774 / 5105 | 202.4 | 81.1 |
| 0013 (pb256) | 194 / 273 / 280 | 0 / 0 | 18205 / 36023 | 170.3 | 96.5 |
| 0016 (mbt512) | 275 / 366 / 373 | 0 / 0 | 11940 / 22453 | 191.4 | 85.8 |

MoE decode is ~2x faster (3 B active), so the baseline ITL is ~240 ms and stock's
prefill freezes are shorter (~1.1 s, 132 of them >1 s, none >2 s) but **still
present**; budget arms hold the in-flight ITL near baseline (p95 273-366 ms).
`mbt512` again dominates `0013` (TTFT mean 11.9 s vs 18.2 s, p95 22.5 s vs 36.0 s,
throughput 191 vs 170, wall 86 vs 96). Because MoE prefill is cheap, **stock's
TTFT is far lower** (2.8 s mean) - the TTFT cost of decode protection is most
visible here.

### Near-burst control: dense, 64 reqs, 150 ms inter-arrival (~9.5 s window)

At 150 ms the 64 prompts pile in faster than the ~94-127 tok/s drain, so the run
degenerates into a **burst** (window 9.5 s << per-request TTFT of 240-308 s; no
token lands inside the window, so the in-window protection metric is empty). This
reproduces the prior burst null: TTFT stock 267 s / 0013 291 s / mbt512 279 s /
mbt1024 240 s, decode agg 127 / 102 / 106 / 122 tok/s, wall 401 / 443 / 432 / 375 s -
budget ~= stock. Stock is marginally ahead of the two small-cap arms on TTFT,
throughput and wall, while `mbt1024` is marginally ahead of stock on TTFT and
wall. This is the control, not 0016's target regime.

### Structural note (intellectual honesty)

At `T = 512 = n_ubatch`, `prefill_budget_step = max(n_ubatch, T - D) = 512`
**constant**, so `mbt512` behaves as a *static* 512-token prefill cap - the dynamic
floor binds and the `T - D` term never bites. Its edge over `0013`'s 256 is
therefore mostly "a larger, `n_ubatch`-aligned cap", not the adaptivity per se. The
genuine decode-adaptive `T - D` is exercised only at `T >= 1024` (`mbt1024`:
prefill chunk ~`1024 - D`, auto-shrinking as decode load `D` rises). Across all
settings the per-step prefill-chunk size is a clean, monotonic protection/TTFT
dial: 256 (0013) -> 512 (mbt512) -> ~960 (mbt1024) trades flatter decode for lower
TTFT. The distinctive value of the dynamic budget is the **safety property**: it
lets you set a *high* `T` for low-load TTFT while guaranteeing the per-step token
count auto-shrinks so decode is never starved when load rises - which is precisely
what stock lacks (stock = unbounded prefill chunk = the freezes).
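
The formula quoted in this note is easy to tabulate. The sketch below only evaluates `max(n_ubatch, T - D)` as written above; it is not the patch's code, and the Python function and its parameter names are illustrative assumptions.

```python
def prefill_budget_step(max_batch_tokens, decode_tokens, n_ubatch=512):
    """Per-step prefill allowance max(n_ubatch, T - D), as quoted in the note.

    max_batch_tokens is T, decode_tokens is D (tokens taken by in-flight
    decode in the step). Names are illustrative, not taken from the patch.
    """
    return max(n_ubatch, max_batch_tokens - decode_tokens)


if __name__ == "__main__":
    for T in (512, 1024):
        for D in (0, 64, 128, 512, 768):
            print(f"T={T:4d} D={D:3d} -> prefill chunk {prefill_budget_step(T, D)}")
```

At `T = 512` every row is 512, which is the "static cap" behaviour described above. At `T = 1024` the chunk is `1024 - D` (960 at `D = 64`) until `D` reaches 512, after which the `n_ubatch` floor binds.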

### Verdict (honest)

- **Does 0016 keep the in-flight decoders' ITL low/flat when new prefills arrive,
  vs stock's spikes?** **Yes, decisively, on staggered traffic.** Stock's
  already-decoding slots freeze on every prefill admission (dense: 35 freezes >2 s,
  in-window ITL p95 2.7 s; light: 15 >2 s; MoE: 132 >1 s). Both small-cap budget
  arms (0013, mbt512) eliminate them (0 freezes >1 s, flat in-window ITL);
  `mbt1024` does not (dense: 46 freezes >1 s, 5 >2 s). This is the real P1 win and
  it shows **only** under staggered arrival, never under the burst.
- **Does it bound new-request TTFT?** Relative to **0013**, yes (roughly 26-38 %
  lower TTFT on the 64-request dense and MoE runs, 22-27 % on the light-load run).
  Relative to **stock**, **no** - stock has the lowest TTFT precisely because it
  lets prefill stampede the decoders (that stampede *is* the freeze). New-req TTFT
  vs in-flight ITL is a genuine Pareto tradeoff, not a free lunch; this section
  does not claim that 0016 beats stock on TTFT.
- **Does the dynamic budget beat BOTH stock AND 0013, or is it ~= 0013 here too?**
  It **does not tie 0013 here** (unlike the burst): at `T=512`, 0016 sits at a
  strictly better point on the protection/TTFT frontier than 0013-256 (equal
  spike-free protection, materially lower TTFT and wall, higher throughput), and
  it adds a principled, decode-adaptive, single-`T` way to move along that
  frontier (one config across dense and MoE) that 0013's hand-picked 256 cannot.
  It does **not** strictly dominate stock: 0016 wins decode smoothness (no
  multi-second freezes), stock wins raw TTFT/throughput. Decode **throughput**
  stays kernel-capped (staggered aggregate ~72-94 dense / ~170-202 MoE, ordering
  stock > 0016 (mbt512) > 0013 from prefill-interleaving cost, not a kernel
  difference) - the P1 win is latency-under-load, as expected.
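
The TTFT comparison against 0013 can be rechecked directly from the three result tables. The snippet below is only a convenience for that arithmetic; the dictionary layout and function name are illustrative, and the numbers are copied from the `TTFT mean / p95 (ms)` columns above.

```python
# (mean, p95) TTFT in ms, copied from the result tables in this section.
TTFT_MS = {
    "dense, 64 reqs": {"0013": (44763, 90338), "mbt512": (33320, 66595)},
    "dense light, 32 reqs": {"0013": (18608, 38347), "mbt512": (14506, 28055)},
    "MoE, 64 reqs": {"0013": (18205, 36023), "mbt512": (11940, 22453)},
}


def reduction_pct(baseline, candidate):
    """Relative reduction of candidate vs baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


if __name__ == "__main__":
    for run, arms in TTFT_MS.items():
        mean_cut = reduction_pct(arms["0013"][0], arms["mbt512"][0])
        p95_cut = reduction_pct(arms["0013"][1], arms["mbt512"][1])
        print(f"{run}: mean TTFT -{mean_cut:.1f} %, p95 TTFT -{p95_cut:.1f} %")
```

The same helper applied to the stock rows gives the size of the TTFT penalty that the budget arms pay for decode protection.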

**Bottom line:** 0016 **earns its keep over 0013 on staggered traffic** - same
spike-free decode protection at a strictly better TTFT/throughput/wall point, plus
a decode-adaptive knob that holds one config across loads and model types. Against
stock it is a deliberately different operating point that trades new-request TTFT
(roughly 4-9 s at the mean and 9-21 s at p95 in these runs) to remove the
multi-second in-flight decode freezes stock cannot avoid. Keep 0016; recommend
`LLAMA_MAX_BATCH_TOKENS=512` as the default protective setting and higher `T` when
low-load TTFT matters more than ITL flatness.