mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-30 03:17:01 -04:00
docs(paged): record padded/fixed-slot decode shape as tested-and-rejected
The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push serving graph reuse from ~72% toward ~100%) was implemented in an isolated worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10. Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving throughput at every concurrency and does not close the vLLM gap.

Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step (n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a release()-side guard keeps a finished slot warm under padding. Each dummy is its own sequence (private recurrent state, per-stream paged attention, logits discarded), so it cannot perturb a real stream.

Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed: ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing.

Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16 / PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD (reuse 38%) 558.0.

Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound - baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys ~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding adds dummy-row compute proportional to pad_width - real_load, catastrophic at low load; (3) in continuous serving padding cannot hold a constant width (perpetual prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink padding prevents is itself a throughput win in a compute-bound regime. The residual burst gap is GPU-compute, which a host-side reuse lever cannot close.

Patch series unchanged: this rejected lever is NOT added to patches/paged/.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -147,8 +147,13 @@ mechanism: paged decode reuse 0% -> 95.5% (throughput flat there, since the stat
regime is GPU-bound). **S2 (double-buffer `set_inputs`) was dropped**: the Phase-0
profile put `set_inputs` at ~0.05 ms/step (the cost is the rebuild, not the input
copy), so it has nothing to recover. The remaining ~28% serving rebuilds are
request-boundary D/seq-set churn + the prefill-cadence steps; a padded/fixed-slot
decode shape to capture them is scoped in `docs/DECODE_SERVING_SCOPE.md`.
request-boundary D/seq-set churn + the prefill-cadence steps. A **padded/fixed-slot
decode shape** to capture them was then implemented and GPU-tested (2026-06-28) and
**REJECTED** - it is bit-exact/inert but regresses serving throughput at every
concurrency, because this serving decode is GPU-compute-bound (baseline reuse 0% ~=
S1+S3 reuse 72% on aggregate tok/s), so the dummy-row compute it adds costs more
than the reuse it recovers. Full record + numbers in `docs/DECODE_SERVING_SCOPE.md`
("Padded-shape lever - rejected").

### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)

@@ -22,14 +22,94 @@ graph rebuild; `set_inputs` 0.047 ms and block-table 0.002 ms are negligible.
decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9).**
- **S2 (double-buffer `set_inputs`) - DROPPED.** Phase 0 put `set_inputs` at
~0.05 ms/step: it is not the cost (the rebuild is), so S2 has nothing to recover.
- **Follow-up to ~100% reuse:** the remaining ~28% serving rebuilds are
request-boundary D/seq-set churn + the S3 prefill-cadence steps. Capturing them
needs a **padded/fixed-slot decode shape** (pad the decode width to a fixed
bucket with masked-inert dummy slots so `n_tokens` and the seq-id set stay
constant across arrivals/completions - the lever S1 section (a) describes).
Deferred: S1+S3 already reach vLLM-parity on the mean; padding is server-side,
invasive, and not exercised by the single-sequence md5 gate (needs a per-stream
serving-determinism gate). It is the next lever, not a shipped one.
- **Follow-up to ~100% reuse - PADDED/FIXED-SLOT DECODE SHAPE: IMPLEMENTED,
GPU-TESTED, REJECTED (not shipped).** See the "Padded-shape lever - rejected"
block below. Summary: it does NOT close the serving gap. Padding holds the
pure-decode width constant by emitting masked-inert dummy decodes for idle
slots, and it is provably inert (single-seq md5 bit-exact + per-stream
noise-floor determinism), but it **regresses throughput at every concurrency**
(catastrophically at low load) because the serving decode here is
**GPU-compute-bound, not host-rebuild-bound** - so the dummy-row compute it adds
costs more than the graph-reuse it recovers. The original "remaining ~28% is
request-boundary churn -> pad it" hypothesis stands mechanically, but the payoff
premise (closing reuse pulls decode toward vLLM) is **not supported by
measurement**.

---

## Padded-shape lever - rejected (implemented + GPU-tested, 2026-06-28)

The S1 section-(a) **padded / fixed-slot decode shape** was implemented in an
isolated worktree off the committed S1/S3/tail base (paged HEAD `05eceb4`), built
CUDA-only, and benched on GB10. **Verdict: REJECTED - it regresses serving
throughput and does not close the vLLM gap.** Recorded here so it is not re-tried.

**Implementation** (default-off, `LLAMA_PAGED_PAD_DECODE=1`; `LLAMA_PAGED_PAD_WIDTH`
caps the slot range): at the end of `pre_decode()`, on any step where no prompt
tokens were admitted (`n_prompt_budgeted == 0`) and there is decode load, emit a
masked-inert dummy decode for **every IDLE slot** (`batch.add(slot.id, 0,
pos_max+1, /*output=*/true)`; cold slot -> fresh pos-0). This holds `n_tokens`,
`n_seqs`, `n_seqs_unq`, `n_outputs` and the participating seq-id SET constant
across arrivals/completions. A `release()`-side guard keeps a finished slot warm
under padding (else patch 0024's reclaim-on-idle frees its KV and the next-step
pos-0 re-warm churns paged-block allocation, destroying reuse). Each dummy is its
OWN sequence, so its recurrent (gated-DeltaNet) state is private and its paged
attention reads only its own cells; its logits are computed but never read
(`post_decode()` only consumes `slot.i_batch` of GENERATING slots).
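
The padding rule can be sketched as a small Python model. This is an illustration only, not the `llama-server` code: `Slot`, `build_decode_batch`, and the row tuple layout are hypothetical names, and the `release()`-side guard and `LLAMA_PAGED_PAD_WIDTH` cap are not modelled. It only shows why the row count and the seq-id set stay constant on pure-decode steps.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical stand-in for a server slot (not the real struct)."""
    id: int
    generating: bool   # True while the slot serves a real request
    pos_max: int = -1  # last cached position; -1 means cold (no KV yet)


def build_decode_batch(slots, n_prompt_budgeted, pad_decode=True):
    """Return sorted (seq_id, pos, is_dummy) rows for one decode step."""
    rows = [(s.id, s.pos_max + 1, False) for s in slots if s.generating]
    # Pad only on pure-decode steps that carry some real decode load.
    pure_decode = n_prompt_budgeted == 0 and len(rows) > 0
    if pad_decode and pure_decode:
        # One dummy row per idle slot; a cold slot lands on pos 0.
        rows += [(s.id, s.pos_max + 1, True) for s in slots if not s.generating]
    return sorted(rows)
```

With four slots, the padded batch has four rows whether two streams or one stream are still generating, while a step that admits prompt tokens falls back to the real rows only.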

**Gates.** (1) Single-seq greedy md5 **bit-exact PASS** - dense
`5951a5b4d624ce891e22ab5fca9bc439`, paged-MoE `8cb0ce23777bf55f92f63d0292c756b0`
(the lever lives only in `llama-server`'s `update_slots()`, never in
`llama-completion`). (2) **Per-stream serving determinism**: the literal
"ON-vs-OFF token sequences identical" gate is **unachievable** - concurrent
cuBLAS/FA decode is **not bit-reproducible run-to-run** even with padding OFF
(OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16, lockstep K=16). The
**achievable inertness gate PASSED**: per-stream prefix-agreement ON-vs-OFF equals
the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), i.e. the
dummy slots inject no systematic divergence beyond the pre-existing concurrent FP
noise. So padding is provably inert; it just does not help.
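
One way to compute such a noise-floor gate is sketched below. The exact definition of prefix-agreement used for the numbers above is not stated here, so this assumes it is the common-prefix length divided by the longer stream's length, averaged over streams. All function names are hypothetical.

```python
def prefix_agreement(run_a, run_b):
    """Mean over streams of common-prefix length / longer stream length.

    run_a and run_b are lists of token-id lists, one list per stream,
    in the same stream order. The metric definition is an assumption.
    """
    if not run_a or len(run_a) != len(run_b):
        raise ValueError("runs must be non-empty and have the same stream count")
    scores = []
    for a, b in zip(run_a, run_b):
        common = 0
        for x, y in zip(a, b):
            if x != y:
                break
            common += 1
        scores.append(common / max(len(a), len(b), 1))
    return sum(scores) / len(scores)


def diverging_streams(run_a, run_b):
    """Count streams whose token sequences differ anywhere."""
    return sum(1 for a, b in zip(run_a, run_b) if a != b)


def inertness_gate(on_vs_off, off_vs_off, tol=0.0):
    """Pass when ON-vs-OFF agreement is no worse than the OFF-vs-OFF floor."""
    return on_vs_off >= off_vs_off - tol
```

The design point is that the comparison target is a second padding-OFF run rather than bit-identity, since identity already fails without padding.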

**Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10).** Burst h2h, decode tok/s/seq:

| n   | S1+S3 | PAD  | vLLM  |
|-----|-------|------|-------|
| 8   | 28.16 | 6.05 | 44.8  |
| 32  | 11.66 | 4.84 | 17.45 |
| 64  | 7.16  | 4.33 | 11.07 |
| 128 | 4.53  | 4.32 | 6.87  |

Staggered (`serve_bench.py` k=128 n=160 stagger0.25), aggregate decode tok/s and
graph-reuse: baseline (reuse 0%) **757.6**; S1+S3 (reuse 72%) **763.3**; **PAD
(reuse 38%) 558.0**.
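
The ratios implied by these numbers can be recomputed with a few lines of Python. The values are copied from the table and the staggered run above; the helper names are illustrative and nothing here is new measurement.

```python
# Burst decode tok/s/seq copied from the table: n -> (S1+S3, PAD, vLLM).
BURST = {
    8: (28.16, 6.05, 44.8),
    32: (11.66, 4.84, 17.45),
    64: (7.16, 4.33, 11.07),
    128: (4.53, 4.32, 6.87),
}
# Staggered aggregate decode tok/s copied from the paragraph above.
STAGGERED = {"baseline": 757.6, "S1+S3": 763.3, "PAD": 558.0}


def pad_slowdown(n):
    """How many times slower PAD is than S1+S3 at concurrency n."""
    s1s3, pad, _ = BURST[n]
    return s1s3 / pad


def share_of_vllm(n):
    """S1+S3 per-sequence throughput as a fraction of vLLM's."""
    s1s3, _, vllm = BURST[n]
    return s1s3 / vllm


def vs_baseline(name):
    """Relative change in staggered aggregate tok/s against the baseline."""
    return STAGGERED[name] / STAGGERED["baseline"] - 1.0


if __name__ == "__main__":
    for n in BURST:
        print(f"n={n:3d}  PAD slowdown {pad_slowdown(n):.2f}x  "
              f"S1+S3 at {share_of_vllm(n):.0%} of vLLM")
    for name in ("S1+S3", "PAD"):
        print(f"{name}: {vs_baseline(name):+.1%} vs baseline")
```

The PAD penalty shrinks as n grows toward the pad width, and S1+S3 moves the staggered aggregate by less than 1% despite the jump in reuse.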

**Why it fails (four independent reasons):**

1. **Serving decode is GPU-compute-bound, not host-rebuild-bound (this run).**
Baseline reuse 0% (757.6 agg) is within 1% of S1+S3 reuse 72% (763.3 agg):
`hostproc` is only ~4-8% of the per-step wall, so eliminating the host
graph rebuild buys ~nothing. (This **corrects the host-bound hypothesis** above
for this hardware: the earlier 542->762 host-bound delta did **not** reproduce;
it was GPU-state/contention variance, not a stable reuse effect.)
2. **Padding ADDS dummy-row compute** (full-width decode), costing throughput in
direct proportion to `pad_width - real_load`: catastrophic at low concurrency
(n=8: 28.16 -> 6.05, ~4.6x slower, because 8 real streams pay for a 128-wide
step).
3. **In continuous serving padding can't even hold the width constant**: arrivals
are perpetually mid-prefill, so the idle-slot count varies and reuse DROPS
72% -> 38% (the opposite of the goal). It only stabilises the pure-decode
*tail* of a burst (verified: width pinned at 64 as real decoders fell 49->5),
which is exactly where the dummy compute is most wasteful.
4. **The completion-driven batch shrink that padding prevents is itself a
throughput WIN** in a compute-bound regime (fewer real streams -> cheaper
steps -> survivors finish faster); forcing constant width forfeits it.
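
Reasons 1 and 2 can be put in rough numbers. The sketch below is a back-of-envelope model, not a timing model: it assumes the whole `hostproc` share could be removed (an Amdahl-style upper bound) and counts dummy rows only, which says nothing about how GPU time scales with batch width. The function names are hypothetical.

```python
def max_speedup_from_removing(fraction):
    """Amdahl-style bound: best-case speedup if `fraction` of step time vanished."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


def dummy_row_share(pad_width, real_load):
    """Fraction of a padded step's rows that are dummies."""
    if pad_width <= 0 or not 0 <= real_load <= pad_width:
        raise ValueError("need 0 <= real_load <= pad_width and pad_width > 0")
    return (pad_width - real_load) / pad_width


if __name__ == "__main__":
    # hostproc is reported as ~4-8% of the per-step wall.
    for f in (0.04, 0.08):
        print(f"hostproc {f:.0%}: at most {max_speedup_from_removing(f):.3f}x")
    # 8 real streams inside a 128-wide padded step.
    print(f"dummy rows at n=8, width 128: {dummy_row_share(128, 8):.1%}")
```

Even the best case for removing the host share is a single-digit-percent gain, while at low load the large majority of rows in a padded step are dummies.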

**Conclusion.** The residual burst gap (paged 4.53 vs vLLM 6.87 at n=128 ~= 66%)
is a **GPU-compute** gap (vLLM's MoE decode kernel + scheduler are ~1.3x faster on
aggregate), not a host-loop gap. A host-side graph-reuse lever cannot close it.
Do not re-pursue padded/fixed-slot shapes for throughput; if the host loop is ever
re-confirmed dominant on other hardware (re-run reason 1's baseline-vs-S1+S3 A/B
first), revisit - but only with an *adaptive* width matched to live load, never a
fixed pad-to-`--parallel`.

---