From ed17fc804e6870cc42fa34678b060c65cf7948f4 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Tue, 23 Jun 2026 22:36:15 +0000
Subject: [PATCH] docs(paged): scope token-granular continuous-batch scheduler for llama-server

Build-ready plan (not implemented) for a vLLM-v1-style token-granular
continuous-batch scheduler in tools/server/server-context.cpp
update_slots(), the last lever after patch 0013 on the GB10 NVFP4
llama-vs-vLLM gap.

Key findings that shape the scope:
- The unified mixed batch already exists: Phase 1 (2604-2719) claims every
  ready decode token unconditionally, Phase 2 (2753-3330) fills prefill into
  the same llama_batch. Decode-first is structural, not a thing to build.
- The chunked-prefill slot state already persists across steps (a
  PROCESSING_PROMPT slot with prompt.n_tokens() < task->n_tokens() resumes).
  No slot-state rewrite is needed - the feared big risk does not materialize.
- The only missing piece is the budget POLICY: convert 0013's static per-step
  prefill cap into a dynamic, decode-first, per-slot-fair token budget (one
  total T, decode claims D, prefill gets leftover T-D, capped per slot).
- Honest ceiling: the residual ~2.4x decode gap is a decode-KERNEL batch
  scaling ceiling (~157-161 dense / ~333 MoE @npl128), NOT a scheduler defect.
  The scheduler closes the 12x TTFT gap and holds that ceiling tuning-free;
  the throughput residual is a separate, named decode-kernel lever (P3).

Phased P0-P3 with per-phase payoff, files, risks, and GB10 considerations.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md | 375 ++++++++++++++++++
 1 file changed, 375 insertions(+)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md

diff --git a/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md b/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md
new file mode 100644
index 000000000..c1030c5e7
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md
@@ -0,0 +1,375 @@
+# Durable scope: token-granular continuous-batch scheduler for llama-server on GB10
+
+Build-ready plan. **Not implemented in this workflow** (serving-loop rewrite). This
+document scopes the durable path to give llama-server's `update_slots()` a vLLM-v1-style
+token-granular continuous-batch scheduler, and records the single honest finding that
+re-shapes what the change can and cannot buy.
+
+Hardware: NVIDIA GB10 / DGX Spark (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
+LPDDR5x ~273 GB/s. Models: dense Qwen3.6-27B NVFP4 (`~/bench/q36-27b-nvfp4.gguf`),
+MoE Qwen3.6-35B-A3B NVFP4 (`~/bench/q36-35b-a3b-nvfp4.gguf`). Dev tree `~/llama-paged-dev`
+(branch `paged`, HEAD `151343b`, patch 0015), `build-cuda` sm_121, `LLAMA_KV_PAGED=1`.
+Scheduler code: `tools/server/server-context.cpp::update_slots()` (LocalAI override that
+`#include`s it: `backend/cpp/llama-cpp/grpc-server.cpp`).
+
+## TL;DR (the honest reframe)
+
+Three findings, read directly from the source at HEAD `151343b` and from the committed
+NVFP4 re-run (`QWEN36_NVFP4_BENCH.md`), collapse the apparent size of this work and reset
+what it is allowed to claim:
+
+1. **The unified mixed batch already exists.** `update_slots()` already builds ONE
+   `llama_batch` per step = {every ready decode token} **+** {a bounded chunk of prefill
+   tokens}, in a fixed two-phase order: Phase 1 (lines 2604-2719) appends every
+   `SLOT_STATE_GENERATING` slot's sampled token **unconditionally** (no budget gate), then
+   Phase 2 (lines 2753-3330) fills the remaining batch capacity with prompt tokens. Decode
+   is therefore **already claimed first and never dropped or capped** - the exact property
+   vLLM's "RUNNING-before-WAITING" pass works to guarantee is **free** here by construction.
+
+2. **The chunked-prefill slot state already exists and already persists across steps.** A
+   slot in `SLOT_STATE_PROCESSING_PROMPT` with `slot.prompt.n_tokens() < slot.task->n_tokens()`
+   is a partial prefill; it stays in that state and resumes next step until its prompt is
+   fully ingested, at which point it flips to `SLOT_STATE_DONE_PROMPT` -> `GENERATING`
+   (line 3252, then 3502). Multiple slots can be `PROCESSING_PROMPT` and `GENERATING`
+   simultaneously; there is **no global "one prefill at a time" gate**. So the mission's
+   "allow a slot to be mid-prefill while others decode in the same step" is **not a state
+   machine to build - it is already the behaviour.** This is the single biggest de-risking
+   fact in this document.
+
+3. **What is genuinely missing is the budget POLICY, and it is small.** Patch 0013
+   (`LLAMA_PREFILL_BUDGET`) is a single **static** per-step prefill cap, consumed greedily by
+   slots in iteration order. It is not decode-load-aware (does not subtract the live decode
+   count `D`), not adaptive (one constant across npl 8..128), and not fair (the first
+   `PROCESSING_PROMPT` slot can eat the whole budget). The durable delta is to convert that
+   static cap into vLLM's **dynamic, decode-first, per-slot-fair token budget**: one total
+   per-step budget `T`, decode claims its `D` tokens first, prefill gets the **leftover**
+   `T - D` distributed across waiting prompts with a per-slot cap. That is ~the only
+   behavioural change. **No new slot states, no batch-formation rewrite.**
+
+### The honest ceiling (this is load-bearing for how the work is scoped and sold)
+
+The committed re-run and a dedicated profiling pass (`QWEN36_NVFP4_BENCH.md`, plus
+`~/bench/stag_128.json`) establish that **the residual ~2.4x high-concurrency decode gap is a
+decode-KERNEL batch-scaling ceiling, not a scheduler defect**:
+
+- At npl8 the kernels are **at parity** (dense 99%, MoE 84% of vLLM decode).
+- A clean staggered full-batch-128 run, with **all 128 slots cleanly decoding and zero
+  prefill starvation**, still tops out at **decode_agg 157.4 tok/s** (dense) - the same
+  ~157-161 ceiling that four independent measurements converge on. vLLM does **390.7** at the
+  same effective batch. With a *perfect* scheduler the kernel still gives ~157. **The
+  scheduler cannot lift this.**
+- Patch 0013 budget-256 **already reaches ~161** (the ceiling) at npl128. So a token-granular
+  scheduler buys **little additional steady-state decode_agg** over 0013 on the all-at-once
+  workload.
+
+Therefore this scheduler's deliverable is **NOT "match vLLM's 391/811 decode."** It is:
+
+- **Close the 12x TTFT gap** (dense 305 s @ 0013 / 491 s stock -> vLLM's ~25 s, and ~2 s on
+  staggered arrival) - the genuine, large win.
+- **Robustly HOLD the decode ceiling** (~161 dense / ~333 MoE @npl128) **without
+  per-workload budget tuning** - 0013 needs a hand-picked constant (256 for dense, costs MoE
+  TTFT, net-negative at low npl); the dynamic `T - D` budget is self-tuning across the whole
+  npl range and across dense vs MoE.
+- **Burst-robustness**: bounded TTFT for *all* concurrently-arriving prompts (kill the
+  burst-TTFT spread), and no admission collapse under sustained load.
+
+Closing the residual 2.4x decode-throughput gap is a **separate, named lever**: the
+paged-attention **decode-kernel** batch-scaling work (patches 0009-0011 territory) and/or
+CUDA-graphed decode. It is called out explicitly in P3 and is **out of this scope's
+scheduler mandate**. We must measure and sell this work on **TTFT + burst-robustness +
+self-tuning hold of the ceiling**, never on a decode_agg number the kernel forbids.
+
+## The gap, precisely localized (recap of the committed bench)
+
+At matched NVFP4 on one GB10 box (`QWEN36_NVFP4_BENCH.md`), llama (patch 0015) vs vLLM 0.23.0,
+decode_agg tok/s | TTFT mean, npl swept 8/32/64/128:
+
+| npl | dense llama (0013 b256) | dense vLLM | MoE llama (0013 b256) | MoE vLLM |
+|----:|------------------------:|-----------:|----------------------:|---------:|
+| 8 | 63.5 / 4.3 s | 64.3 / 2.6 s | 169.3 / 1.7 s | 202.0 / 0.8 s |
+| 32 | 105.7 / 23.1 s | 189.8 / 7.5 s | 239.0 / 9.0 s | 462.0 / 2.3 s |
+| 64 | 132.0 / 109 s | 284.2 / 13 s | 277.0 / 16.2 s | 624.5 / 4.1 s |
+| 128 | **161.2 / 305 s** | 390.7 / 24.8 s | **333.5 / 98 s** | 811.1 / 8.0 s |
+
+Both models converge to the **same ~41% of vLLM decode at npl128** after 0013. That
+convergence is the signal: once prefill starvation is removed, a dense model and a
+12x-cheaper-prefill MoE land on the **identical** ceiling -> the residual is **not prefill**
+and **not the kernel-at-parity-@npl8** - it is the **quality of the per-step batching
+decision** (TTFT/robustness) plus the **kernel decode ceiling** (the throughput residual).
+This scope addresses the first; it names the second as the separate lever.
+
+## What already exists (reuse, do NOT rebuild)
+
+All line numbers verified at `tools/server/server-context.cpp` HEAD `151343b`.
+
+- **[A] decode-first co-batch** - Phase 1, lines 2604-2719. Iterates `slots`; every
+  `SLOT_STATE_GENERATING` slot (gated only by `can_batch_with`, line 2611) is pushed to
+  `generating[]`; line 2715-2719 `for (slot : generating) slot.update_batch(batch)` appends
+  its sampled token (+ draft tokens) via `common_batch_add`. After this loop,
+  `batch.n_tokens == D` (the decode-token count). **No budget gate** - decode always goes in.
+- **[B] chunked-prefill state per slot** - the pair `slot.prompt.n_tokens()` (=
+  `num_computed_tokens`) vs `slot.task->n_tokens()` (= `num_tokens`). A `PROCESSING_PROMPT`
+  slot with `prompt.n_tokens() < task->n_tokens()` resumes next step (Phase 2 re-enters it).
+  Transition to `DONE_PROMPT` at line 3252 when the prompt is exhausted; to `GENERATING` at
+  line 3502. **This is exactly vLLM's "leave the request in `running`, advance
+  `num_computed_tokens` next step" - already implemented.**
+- **[C] single shared batch + compute chunking** - one `llama_batch` holds decode+prefill;
+  the compute loop (lines ~3366-3378) `for (i=0; i
+all decode claimed before
+any prefill is sized); Pass 2 admits `waiting` (new prompts) only with leftover budget, each
+chunked by `min(remaining_prompt, long_prefill_token_threshold, leftover_budget)`. Caps:
+`max_num_seqs` (concurrent sequences), `long_prefill_token_threshold` (~4% of max_model_len,
+per-request prompt-chunk cap so one giant prompt cannot monopolize a step). Net: decode batch
+maximal every step (-> the GEMM-batching throughput vLLM gets), prefill always makes bounded
+progress (-> low, flat TTFT), one `model.forward()` per step.
+
+The mapping to llama is clean because [A]+[B] already give us "running visited first" and
+"prefiller resumes next step."
We are missing only: **one total budget `T`, leftover `T - D`
+sizing, and the per-request chunk cap with fair distribution.**
+
+## The unified per-step batch-formation algorithm (the design)
+
+New knobs (all default to current behaviour; env set before context init like `LLAMA_KV_PAGED`):
+
+- `T` = `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step token
+  budget (decode + prefill), the analogue of `max_num_batched_tokens`. Default `n_batch`
+  (2048). Clamped `T = min(T, n_batch)` so the existing single-`llama_decode` chunking is
+  unchanged.
+- `PREFILL_CAP` = `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot max prompt tokens per
+  step, the `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04 * n_ctx))`,
+  floored at `n_ubatch` (512) so a single prompt still makes a full ubatch of progress.
+- Back-compat: if only the legacy `LLAMA_PREFILL_BUDGET` is set (new knobs unset), behave
+  exactly as 0013 (static cap) - 0013 is the degenerate `T = n_batch`, no-leftover case.
+
+Pseudocode, mapping to real variables and seams (the `>>` lines are the change vs today):
+
+```
+common_batch_clear(batch);                          // line 2594
+
+// PASS 1 - DECODE FIRST (unchanged: lines 2604-2719)
+for (slot : slots) if (slot.state == GENERATING && can_batch_with) generating.push(slot);
+... speculative draft ...
+for (slot : generating) slot.update_batch(batch);   // appends decode (+draft) tokens
+
+>> D = batch.n_tokens;                    // NEW seam: decode load is now final (after 2719)
+>> T = min(LLAMA_MAX_BATCH_TOKENS ? : n_batch, n_batch);
+>> prefill_budget_step  = max(0, T - D);  // DYNAMIC leftover, auto-shrinks with D
+>> prefill_cap_per_slot = PREFILL_CAP;    // long_prefill_token_threshold analogue
+>> n_prompt_budgeted    = 0;              // total prompt tokens added this step (subsumes 0013)
+
+// PASS 2 - PREFILL FILLS THE LEFTOVER (lines 2753-3330, budget made dynamic + per-slot fair)
+if (cont_batching || batch.n_tokens == 0) {
+>>  for (k = 0; k < n_slots; ++k) {       // round-robin start offset (fairness, see P2)
+>>    slot = slots[(rr_start + k) % n_slots];
+      if (!slot.is_processing() || !can_batch_with) continue;
+      if (slot.state == STARTED) slot.state = PROCESSING_PROMPT;   // line 2782 (unchanged)
+>>    slot_prompt_added = 0;              // NEW: per-slot chunk counter (reset each slot)
+      // inner prompt-fill (lines 3187-3239), guard now triple-bounded:
+      while (slot.prompt.n_tokens() < slot.task->n_tokens()
+>>           && batch.n_tokens < T                          // was: < n_batch
+>>           && n_prompt_budgeted < prefill_budget_step     // was: 0013 static n_prefill_budget
+>>           && slot_prompt_added < prefill_cap_per_slot) { // NEW: per-slot cap -> fair distribution
+          common_batch_add(batch, cur_tok, pos_next, {slot.id}, need_embd);
+          slot.prompt.tokens.push_back(cur_tok);
+          slot.n_prompt_tokens_processed++;
+          n_prompt_budgeted++; slot_prompt_added++;
+          ... checkpoint-boundary breaks (unchanged) ...
+      }
+      if (slot.prompt.n_tokens() == slot.task->n_tokens()) slot.state = DONE_PROMPT; // line 3252
+      ... checkpoint creation (unchanged) ...
+>>    if (batch.n_tokens >= T) break;                       // was: >= n_batch (line 3320)
+>>    if (n_prompt_budgeted >= prefill_budget_step) break;  // was: 0013 break (line 3326)
+    }
+}
+
+for (i=0; i
+```
+
+bounds step compute time -> decode steps fire at a steady high rate (high decode-steps/sec).
+As decode load `D` rises, `prefill_budget_step = T - D` auto-shrinks, so prefill never inflates
+the step beyond `T` even at npl128.
This is the mechanism by which 0013's hand-tuned 256
+reaches 161; here it is reached **automatically across the npl range** because the budget is
+`T - D`, not a constant.
+
+**Why this closes TTFT.** Prefill gets whatever is left after decode. The leftover is never
+negative (`prefill_budget_step >= 0`), and `T` is sized so that it stays positive until the box
+is fully decode-saturated. It is distributed across waiting prompts by `prefill_cap_per_slot`,
+so every prompt makes bounded progress every step instead of waiting for a dedicated prefill
+burst.
+
+## Slot state machine changes (minimal - this is the headline de-risk)
+
+**No new states. No state-transition rewrite.** The existing 6-state machine
+(`IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT / GENERATING`, lines 67-72)
+already encodes everything:
+
+- "mid-prefill while others decode" = a `PROCESSING_PROMPT` slot coexisting with `GENERATING`
+  slots in the same step. **Already happens** (Phase 1 and Phase 2 populate one batch).
+- "chunked-prefill state per slot" = `(state == PROCESSING_PROMPT) && (prompt.n_tokens() <
+  task->n_tokens())`. **Already persisted** across `update_slots()` calls; Phase 2 re-enters
+  the slot and resumes from `prompt.n_tokens()`.
+
+The only **additions** are scheduler scratch, not slot lifecycle state:
+
+1. `slot_prompt_added` - a per-slot, per-step counter (local to the Phase-2 loop body), for
+   the per-slot chunk cap. Not stored on the slot across steps.
+2. A `rr_start` round-robin offset (one `size_t` on the server, advanced each step) so the
+   leftover budget is distributed fairly across `PROCESSING_PROMPT` slots rather than always
+   draining the lowest-index slot first (this is what kills the burst-TTFT *spread* - without
+   it, slot 0's prompt finishes first every time and the last slots starve).
+3. Optional, P2: a per-step admission cap `K` on how many `STARTED -> PROCESSING_PROMPT`
+   transitions begin in one step. This falls out of the budget arithmetic already (a bounded
+   `prefill_budget_step` with a per-slot floor admits only `~budget/floor` prompts/step), so it
+   may need no explicit code; if made explicit it is the `max_num_seqs`-style "don't admit a
+   new prefill if the step is full" guard, mapped onto the pre-allocated `n_parallel` slots.
+
+That is the entire state-machine footprint: two pieces of scheduler scratch (one per-step
+counter, one server-level offset) and an optional cap. The mission's feared "slot-state
+rewrite" does not materialize.
+
+## How it supersedes / subsumes patch 0013
+
+| property | 0013 (static cap) | this scheduler (dynamic `T - D`) |
+|----------|-------------------|----------------------------------|
+| per-step prefill bound | constant `n_prefill_budget` | `T - D`, shrinks as decode load rises |
+| decode-load aware | no (ignores `D`) | yes (leftover after decode) |
+| works across npl 8..128 with one config | no (256 best @128, net-negative @8) | yes (self-tuning) |
+| fair across multiple waiting prompts | no (greedy, slot 0 wins) | yes (`prefill_cap_per_slot` + round-robin) |
+| TTFT on bursty arrival | raises it (defers first tokens) | bounded for all prompts |
+| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - **kept** |
+
+0013 is the **degenerate case** `T = n_batch` with `prefill_budget_step` pinned to a constant
+and no per-slot cap. The patch keeps `LLAMA_PREFILL_BUDGET` working for back-compat (when the
+new knobs are unset). When `LLAMA_MAX_BATCH_TOKENS` is set, the static path is replaced by the
+dynamic one. **Default (all knobs unset) = byte-identical stock**, exactly like 0013.
+
+## Correctness
+
+- **KV cache during chunked prefill** - unchanged from today. A `PROCESSING_PROMPT` slot already
+  advances `slot.prompt.tokens` / `pos_next()` chunk by chunk across steps; we only change the
+  chunk SIZE per step, not how positions or sequence ids are assigned. `common_batch_add`
+  receives the same `(tok, pos, {slot.id})` tuples in the same order. No new KV state.
+- **Determinism** - greedy (temp 0) output can differ from a single-`n_batch`-chunk run only by
+  the **intrinsic flash-attn chunk-size FP grouping** that 0013 already documented and bounded:
+  pure stock `-b256` diverges from `-b2048` the same way with this patch inactive; output stays
+  coherent and answers correctly. The op-level math per token is position-determined and
+  unchanged; only the FA reduction grouping over a step's token mix shifts. The deterministic
+  oracle is the CPU backend / the op test (bit-exact); the GB10 CUDA greedy-decode band applies
+  to end-to-end only, never to the op test.
+- **Paged KV (patches 0001-0011)** - **orthogonal**. Paged on-demand block allocation is keyed
+  by sequence position and slot/stream, which this change does not touch; it changes only which
+  tokens are in a given `llama_decode`. The in-kernel paged decode read (0009-0011) operates
+  per-token via the block tables regardless of what prefill tokens are co-batched. Required gate:
+  run the full P0-P2 suite with `LLAMA_KV_PAGED=1` **and** `=0` and confirm **identical
+  scheduling decisions** (same per-step token counts, same admission order) - paged must be a
+  no-op on the scheduler.
+- **`can_batch_with` constraint** (line 302) - a batch admits only slots with the same
+  `task->type` and equal LoRA. Homogeneous-completion serving (the benchmark and the dominant
+  LocalAI case) satisfies it, so the mixed decode+prefill batch forms freely. Mixed task types /
+  per-request LoRA fall back to separate batches - a pre-existing bound, not a regression; note
+  it, do not try to lift it here.
+- **Checkpoint interaction (a real, orthogonal serving defect to account for)** - each slot that
+  reaches `DONE_PROMPT` may call `create_checkpoint` (line 2147), ~149 MiB per checkpoint on the
+  dense 27B, gated by `n_ctx_checkpoints > 0` (line 3133).
Profiling found that under sustained
+  heavy load the checkpoint subsystem **thrashes**: admission collapsed to one slot every ~13 s,
+  zero decoding for 290 s, while `/slots` itself serialized behind a 13 s `update_slots` step.
+  This is **independent** of the decode/prefill mix but it **masks** the scheduler's win if left
+  on. **P0 must isolate it** (run with `n_ctx_checkpoints=0`), and **P2's admission decision
+  should be checkpoint-cost-aware** on the 128 GB unified box (do not admit a fresh prefill whose
+  checkpoint would thrash the pool). Treat as a named co-defect, not part of the core batching
+  change.
+
+## Phased plan P0 -> P3 (work, files, payoff, risk)
+
+| Phase | Work | Files | Expected payoff (dense / MoE @npl128 unless noted) | Risk |
+|-------|------|-------|-----------------------------------------------------|------|
+| **P0** baseline + metrics harness | Per-step effective-decode-batch poller (`/slots`), TTFT percentiles (p50/p90/p99/max), `decode_agg` over the fully-overlapped window, decode-ITL (worst freeze / median), **step-time histogram**, admission rate (slots/s reaching GENERATING), checkpoint-event log. Lock the staggered-arrival ceiling (**157.4** dense, all-128 clean) and the all-at-once burst pathology as the two reference traces. Isolate checkpoints (`n_ctx_checkpoints=0`). | dev-tree only: `~/bench/` (reuse `stag.py`, `slot_poll.py`, `h2h_cli.py`, `h2h_moe_sweep.sh`; `stag_128.json`, `h2h_real128b.json`) | **None** (gate). Locks correctness + the 157/333 ceiling so any regression is caught. | Low |
+| **P1** unified mixed-batch formation | Replace the static budget read (2737-2747) with the **dynamic `T - D`** computed at the new seam after line 2719; bound the inner/outer Phase-2 loops by `T` (3188, 3320) and `prefill_budget_step` (3326) instead of `n_batch` and the static cap. No per-slot cap, no round-robin yet (that is P2). | `tools/server/server-context.cpp` (seam @2719, knob read, 3188, 3320, 3326); mirror to `0016-paged-continuous-batch-scheduler.patch` | **TTFT**: removes the burst penalty 0013 inflicts - staggered TTFT ~2 s, burst TTFT collapses toward vLLM's ~25 s / 8 s. **Decode**: holds the ceiling **(~161 / ~333)** *without per-workload tuning* (0013 needed 256 hand-picked). No new throughput beyond the ceiling - by design. | Low-Med (loop-bound edits in a hot path; default-off gate makes it byte-identical stock) |
+| **P2** scheduling policy / fairness | Add `slot_prompt_added` + `prefill_cap_per_slot` (the `long_prefill_token_threshold` analogue) and the **round-robin start offset**; optional explicit per-step admission cap `K` + checkpoint-cost-aware admission. Tune `T`, `PREFILL_CAP` on GB10 (dense vs MoE, npl 8/32/64/128). | `server-context.cpp` (Phase-2 loop body @2753-3330, server-level `rr_start`); `grpc-server.cpp` (options `max_batch_tokens`/`mbt`, `prefill_cap` @781-791) | **TTFT spread**: bounds first-token latency for **all** concurrently-arriving prompts (kills the burst-TTFT spread, e.g. dense max 305 s -> single-digit-s on staggered, bounded on burst). **Robustness**: no admission collapse under sustained load; decode batch stays maximal so the *time-averaged* decode_agg on real (non-burst) traffic rises toward the staggered 157/333 because slots reach GENERATING fast. | Med (fairness + admission logic; e2e coherence + A/B vs 0013 required) |
+| **P3** residual decode throughput | **Honest boundary: this is the decode-KERNEL lever, NOT the scheduler.** The scheduler has delivered TTFT + robustness + ceiling-hold. Closing the residual 2.4x (161 -> 391 dense, 333 -> 811 MoE) requires paged-attention **decode-kernel** batch-scaling (patches 0009-0011 territory) and/or **CUDA-graphed decode** (the now-uniform decode-only step is graph-capturable). Scope/track separately. | (separate scope) `ggml/src/ggml-cuda/` decode-read kernels; optional CUDA-graph capture seam in `update_slots` | This is **where 391/811 would come from**; it is **out of this scope's mandate** and must not be charged against the scheduler. The scheduler makes the decode step uniform (a precondition that *helps* a future graph capture). | High (kernel work; the GB10 occupancy wall, see below) |
+
+**Per-phase payoff vs the mission targets (TTFT 25 s / 8 s, decode 391 / 811 @npl128):**
+
+- **TTFT 25 s / 8 s** - reached by **P1 + P2** (the 12x gap is the scheduler's to close; on
+  staggered arrival it goes below the vLLM burst figure to ~2 s).
+- **Decode 391 / 811** - **NOT a P1/P2 deliverable.** P1/P2 hold **161 / 333** (= ~41% of vLLM,
+  the kernel ceiling) robustly and tuning-free. The remaining ~2.4x is **P3 kernel**, a separate
+  lever. Pre-registering this split is the point: the scheduler is judged on TTFT + holding the
+  ceiling, the kernel on the throughput residual.
+
+## GB10 considerations
+
+- **Bandwidth floor ~273 GB/s** is the *cause* of the decode ceiling (NVFP4 weight-read +
+  paged-KV gather per step). The scheduler cannot lift a bandwidth/kernel floor - it can only
+  keep the batch *at* the ceiling. Size `T ~= n_batch` (2048) so the compute step stays a single
+  `llama_decode`; `n_ubatch` (512) governs the internal split.
+- **`T` is the ITL/TTFT trade knob** (vLLM's `max_num_batched_tokens`): larger `T` = more
+  prefill/step = faster TTFT but bigger per-step ITL spike; smaller `T` = smoother ITL, slower
+  TTFT. Because the budget is `T - D`, the spike is bounded at `T` regardless of decode load.
+  Default `T = n_batch`; expect to tune down toward ~1024 for ITL-sensitive serving.
+- **Checkpoint ~149 MiB/slot thrash** on the 128 GB unified box - admission must be
+  checkpoint-cost-aware (P2); P0 measures with checkpoints off to isolate the batching win.
+- **Memory**: paged on-demand KV (dense 52->94 GB, MoE 39->61 GB across npl) vs vLLM's flat
+  ~112 GB pre-reservation - llama's standing multi-tenant advantage, unaffected by this change.
+- **Eager mode** both engines today; **CUDA-graphed decode** is the P3 kernel lever, and the
+  scheduler's uniform decode-only step is a precondition that *helps* a future capture.
+
+## Biggest risks and how to de-risk
+
+1. **"Slot-state rewrite" (the feared big risk) = actually LOW.** The mid-prefill-while-others-
+   decode state and the chunked-prefill resume already exist ([B]); we add only scheduler
+   scratch (`slot_prompt_added`, `rr_start`), not lifecycle states. **De-risk**: keep all 6
+   states untouched; gate every change behind the new knobs; default-off = byte-identical
+   0013/stock, verified by an A/B diff of per-step token counts.
+2. **Correctness regression in the mixed batch = the FA chunk-grouping nondeterminism.** Already
+   documented and bounded by 0013 (stock `-b256` vs `-b2048` diverge identically). **De-risk**:
+   op-test bit-exact where deterministic; greedy-coherence e2e on both models; A/B vs 0013 with
+   the new knobs set to reproduce 0013 (`T = n_batch`, no leftover) and confirm **byte-identical**
+   to 0013.
+3. **Paged-KV interaction = LOW (orthogonal positions).** **De-risk**: run the whole P0-P2 suite
+   with `LLAMA_KV_PAGED=1` and `=0`; assert identical scheduling decisions (paged must be a
+   no-op on batch formation). This is a hard gate, not a spot check.
+4. **Checkpoint thrash masks the win = MEDIUM.** A real serving defect that can swamp the
+   scheduler's signal. **De-risk**: P0 isolates it (`n_ctx_checkpoints=0`); P2 makes admission
+   checkpoint-cost-aware; report the scheduler metrics both with and without checkpoints so the
+   batching win is legible independent of the checkpoint co-defect.
+5. **Honest-payoff risk = the decode_agg number barely moves over 0013 (kernel ceiling), so the
+   work can be mis-judged as "no win."** This is the most important risk to manage. **De-risk**:
+   frame and measure on **TTFT percentiles, burst-TTFT spread, step-time histogram, admission
+   rate, and tuning-free ceiling-hold across npl/dense/MoE** - the axes the scheduler actually
+   moves - and **pre-register the decode-kernel as the separate residual-closer** (P3) so the
+   scheduler is never charged with the 391/811 number the kernel forbids.
+
+## Commit / hygiene
+
+Scope doc only (this file). **No engine change committed in this workflow.** Bench and parity
+scripts stay dev-tree-only (`~/bench/`, `~/llama-paged-dev/benches/`). When P1/P2 are
+implemented they mirror to `backend/cpp/llama-cpp/patches/paged/0016-paged-continuous-batch-
+scheduler.patch` (next free slot after 0015) and the LocalAI option lands in `grpc-server.cpp`
+beside `max_prefill_tokens`. Commit with `git commit -s`, trailer
+`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes. Do not push
+(human pushes).
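
The `T - D` leftover arithmetic and the per-slot cap from the design section can be exercised in isolation. The following is a minimal standalone C++ sketch, not the patch itself: `SlotSketch` and `fill_prefill` are hypothetical names standing in for the server slot and the Phase-2 loop, and checkpoint breaks, `can_batch_with` and the actual `llama_batch` are deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a server slot; only the two counters the budget reads.
struct SlotSketch {
    int n_prompt_total; // plays the role of slot.task->n_tokens()
    int n_prompt_done;  // plays the role of slot.prompt.n_tokens()
};

// One Phase-2 pass under the dynamic budget. Returns the prompt tokens admitted.
//   T        total per-step token budget (decode + prefill)
//   D        decode tokens already placed in the batch by Phase 1
//   cap      per-slot prompt-token cap for this step
//   rr_start round-robin start offset, advanced by the caller each step
int fill_prefill(std::vector<SlotSketch> & slots, int T, int D, int cap, std::size_t rr_start) {
    const int budget = std::max(0, T - D); // leftover after decode, never negative
    int added = 0;
    for (std::size_t k = 0; k < slots.size() && added < budget; ++k) {
        SlotSketch & s = slots[(rr_start + k) % slots.size()];
        const int remaining = s.n_prompt_total - s.n_prompt_done;
        // A slot takes the smallest of: what it still needs, its cap, what is left.
        const int chunk = std::min({remaining, cap, budget - added});
        s.n_prompt_done += chunk;
        added += chunk;
    }
    return added;
}
```

With made-up inputs `T = 2048`, `D = 1500`, a cap of 256 and three slots holding 1000, 300 and 50 pending prompt tokens, the leftover is 548 and the slots receive 256, 256 and 36 tokens in that order. With `D = 2048` the leftover is 0 and nothing is admitted, which is the decode-saturated case the design text describes.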