mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 04:16:56 -04:00
docs(paged): scope the continuous-serving decode gap (host-bound, design-only)
Add DECODE_SERVING_SCOPE.md: the decode KERNEL is at parity in static batched-bench (~6.1 tok/s/seq ~ vLLM ~5.9 at npl128) but continuous serving through llama-server update_slots() drops to ~3.7 (-39%) while vLLM sustains ~5.9. Scope shows the gap is the scheduler/host loop, not the kernel.

Root-cause hypothesis from source: continuous batching's batch-shape + seq-set churn breaks BOTH graph-reuse layers every step - llama-context can_reuse/allow_reuse (n_tokens + seq-set must match) and the CUDA ggml_cuda_graph update_required memcmp (ne/nb/data ptrs) - so the GPU idles while the host rebuilds + re-captures the graph and runs un-graphed set_inputs. vLLM avoids this with padded/bucketed decode shapes + piecewise CUDA graphs.

Documents that the shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill freezing + burst collapse, NOT decode-step graph reuse, which is why the serving gap survives them; notes the README s.5 'lever 2 graph coverage FLAT' verdict was static-regime and is reopened here for serving only.

Ranks host-side, bit-exact-safe levers: S1 bucketed/padded decode-step shape for graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable scheduling (extend 0016). Specifies a Phase-0 profile to confirm host-bound before any build, reusing the in-tree [L5INSTR] hostproc/set_inputs/get_block_table timers, the 'graphs reused' perf counter, LLAMA_GRAPH_REUSE_DISABLE and nsys GPU-busy%, with vLLM ground-truthed at the same concurrency. No kernel code; no GPU run in this pass.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
310
backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md
Normal file
@@ -0,0 +1,310 @@
# DECODE_SERVING_SCOPE - the continuous-serving decode gap (design only)

**Status: DESIGN + SCOPE + RANKED LEVER PLAN ONLY. No kernel written, no GPU
run in this pass (the GPU was busy with prefill agents).** Per the
"profile-don't-assume" rule in
[`.agents/vllm-parity-methodology.md`](../../../../.agents/vllm-parity-methodology.md),
**Phase 0 (section 5) is to confirm the bottleneck on GPU before touching any
code.** Everything that depends on Phase 0 (the section 2 root cause and the
section 4 levers) is a hypothesis ranked by value/effort/risk, not a measured
result.

> **Regime warning (read first).** Every "decode is at the BW floor / ties vLLM"
> and "host scheduling loop is the structural residual" conclusion in
> [`README.md`](../README.md) section 5 was measured with **`llama-batched-bench`**:
> a STATIC serving width (fixed `npl`, all sequences in lockstep, constant
> batch shape every step). That is the **decode KERNEL** regime, and there the
> patch series is at parity (paged ~6.1 tok/s/seq vs vLLM ~5.9 at npl128). This
> document is about a **different regime**: real **continuous SERVING** through
> `llama-server`'s `update_slots()` loop, where requests arrive and complete
> asynchronously, the batch shape churns every step, and paged drops to ~3.7
> tok/s/seq (-39% vs its own static ~6.1) while vLLM sustains ~5.9. The working
> hypothesis is that the gap is the **scheduler / host loop**, not the kernel.
> This is the serving analogue of the prefill-GEMM regime split called out in
> [`PREFILL_GEMM_SCOPE.md`](PREFILL_GEMM_SCOPE.md).

Cross-links: [`README.md`](../README.md) sections 2 (scheduler), 3 (patches
0008/0013/0016/0024/0025/0029), 5 (rejected levers - lever 2 graph coverage was
FLAT *in the static regime*; this doc reopens it for the *serving* regime);
[`.agents/llama-cpp-localai-paged-backend.md`](../../../../.agents/llama-cpp-localai-paged-backend.md)
(bit-exact gate);
[`.agents/vllm-parity-methodology.md`](../../../../.agents/vllm-parity-methodology.md)
(both-engine ground-truth, per-lever A/B, record-rejected-levers).

---
## 1. The two regimes, and why the kernel-parity result does not carry over

`llama-batched-bench` and a real serving workload exercise the **same decode
kernels** but **different host loops**:

| | `llama-batched-bench` (kernel regime) | `llama-server` continuous serving |
|---|---|---|
| batch shape per step | **constant** (fixed `npl`, lockstep) | **churns** (arrivals/completions, interleaved prefill) |
| participating seq-set | **fixed** for the whole run | **changes** as requests start/finish |
| graph reuse (see s.2) | holds after warmup -> 1 capture, replayed | breaks nearly every step -> rebuild + re-capture |
| measured | paged ~6.1 tok/s/seq ~ vLLM ~5.9 | paged ~3.7 vs vLLM ~5.9 (-39% vs static paged, about -37% vs vLLM) |

The README's decode parity, BW-floor, and "host loop is the irreducible
residual" findings are all **kernel-regime** findings. They prove the *kernels*
are not the serving gap. They do **not** prove the host loop is irreducible *in
serving* - the static bench holds the batch shape constant, which is exactly the
condition that lets both graph-reuse layers (section 2) stay hot. Serving
violates that condition. So the serving gap is reopened here as a host /
scheduler problem, orthogonal to the kernel.

---
## 2. Root-cause hypothesis (from source, pin `9d5d882d` + the dev tree)

There are **two independent graph-reuse layers**, and continuous batching breaks
**both** on nearly every step. This is the leading hypothesis for the -39%.

### 2a. Layer A - llama-context graph reuse (`can_reuse` / `allow_reuse`)

`llama_context::process_ubatch` (`src/llama-context.cpp` ~L1366) only **reuses
the built ggml graph** when `res->can_reuse(gparams)` holds. `allow_reuse`
(`src/llama-graph.h` ~L631) requires, among others:

```
ubatch.n_tokens == other.ubatch.n_tokens &&
ubatch.n_seqs == other.ubatch.n_seqs &&
ubatch.n_seqs_unq == other.ubatch.n_seqs_unq &&
ubatch.equal_seqs() == other.ubatch.equal_seqs()
// + (when equal_seqs) the participating sequence-id SET must match
```

In serving, `n_tokens` changes whenever the decode load `D` changes or a prefill
chunk is co-batched, and the **sequence-id set** changes whenever a request
starts or finishes. Either makes `can_reuse` return false, so `process_ubatch`
falls into the `else` branch: **rebuild the graph** (`model.build_graph`) +
`ggml_backend_sched_reset` + `ggml_backend_sched_alloc_graph` - full host-side
graph construction + allocation, on **nearly every step**. In batched-bench all
sequences run in lockstep, so `n_tokens` and the seq-set are constant and
`can_reuse` is true after warmup (the `graphs reused = N` perf line covers
nearly all steps).
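
To make the effect of churn on the reuse key concrete, here is a minimal sketch of the comparison described above. `StepShape`, `shape_allows_reuse` and `reuse_rate` are hypothetical names invented for illustration; they only mirror the fields in the excerpt and are not the llama.cpp types or the real `allow_reuse` implementation, which checks more than this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for the ubatch fields the reuse check compares.
// Not the llama.cpp struct; field names only mirror the excerpt above.
struct StepShape {
    std::uint32_t n_tokens;
    std::uint32_t n_seqs;
    std::uint32_t n_seqs_unq;
    bool          equal_seqs;
    std::set<std::int32_t> seq_ids;  // participating sequence ids
};

// True when a graph built for `prev` could be reused for `cur`
// (only the shape/seq-set part of the key is modelled here).
inline bool shape_allows_reuse(const StepShape & prev, const StepShape & cur) {
    if (prev.n_tokens   != cur.n_tokens)   return false;
    if (prev.n_seqs     != cur.n_seqs)     return false;
    if (prev.n_seqs_unq != cur.n_seqs_unq) return false;
    if (prev.equal_seqs != cur.equal_seqs) return false;
    // The seq-set comparison only applies when equal_seqs is set.
    if (cur.equal_seqs && prev.seq_ids != cur.seq_ids) return false;
    return true;
}

// Fraction of steps (after the first) whose shape matches the previous step.
inline double reuse_rate(const std::vector<StepShape> & steps) {
    assert(steps.size() >= 2);
    std::size_t reused = 0;
    for (std::size_t i = 1; i < steps.size(); ++i) {
        if (shape_allows_reuse(steps[i - 1], steps[i])) ++reused;
    }
    return static_cast<double>(reused) / static_cast<double>(steps.size() - 1);
}
```

In this model a lockstep run (identical `StepShape` every step) reuses on every step, while a run in which one request finishes and another takes its place fails the check even though `n_tokens` is unchanged, because the seq-set differs.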
### 2b. Layer B - CUDA graph capture (`ggml_cuda_graph_*`)

Even when layer A reuses, the CUDA backend re-checks
`ggml_cuda_graph_update_required` (`ggml-cuda.cu` ~L3367): it `memcmp`s every
node's `ne`, `nb`, and `src[]->data` pointers against the captured graph. Any
mismatch (shape or pointer) -> `cudaGraphExecUpdate` / re-instantiate. Two
serving-specific triggers:

- **shape churn** (same root cause as layer A): different `n_tokens` -> different
  node `ne` -> update required.
- **paged data-pointer churn**: when a co-batched prefill allocates new KV blocks
  (or a finished sequence frees them), the per-step KV view tensors' `data`
  pointers move, so even a constant-shape decode step can trip the `memcmp`. (The
  block-table *contents* live in a fixed device buffer filled by `set_inputs`, so
  the table tensor pointer itself is stable - 0029 keeps that cheap - but the K/V
  cache views are not.)

Net: under serving, the GPU sits idle between launches while the host rebuilds
the graph (layer A) and re-instantiates the CUDA graph (layer B), then runs an
un-graphed `set_inputs` (H2D input copies) before each launch. vLLM avoids this
with **padded/bucketed decode batch shapes + piecewise CUDA graphs**: it pads the
decode batch to a fixed set of sizes and captures one persistent graph per
bucket, so the steady-state decode step is a single `cudaGraphLaunch` with no
host rebuild. Its scheduler also interleaves chunked prefill with decode, which
keeps the GPU fed.
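
The layer-B check can be pictured with the following sketch. It is an illustration of a memcmp-style comparison over per-node properties, not the ggml-cuda code: `NodeSnapshot`, `node_changed` and `graph_update_required` are hypothetical names, and the real check covers more fields and more source tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-node snapshot; NOT the ggml-cuda struct.
struct NodeSnapshot {
    std::int64_t ne[4];        // element counts per dimension
    std::size_t  nb[4];        // byte strides per dimension
    const void * src_data[2];  // data pointers of the source tensors
};

// True when any compared property differs between capture time and now.
inline bool node_changed(const NodeSnapshot & a, const NodeSnapshot & b) {
    return std::memcmp(a.ne, b.ne, sizeof(a.ne)) != 0 ||
           std::memcmp(a.nb, b.nb, sizeof(a.nb)) != 0 ||
           std::memcmp(a.src_data, b.src_data, sizeof(a.src_data)) != 0;
}

// True when the captured graph cannot be replayed as-is.
inline bool graph_update_required(const std::vector<NodeSnapshot> & captured,
                                  const std::vector<NodeSnapshot> & current) {
    if (captured.size() != current.size()) return true;
    for (std::size_t i = 0; i < captured.size(); ++i) {
        if (node_changed(captured[i], current[i])) return true;
    }
    return false;
}
```

The point of the sketch is the second trigger: a node whose `ne` and `nb` are unchanged still forces an update if one of its source pointers moved, which is what a relocated K/V cache view would look like.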
### 2c. Per-step host work that runs un-graphed regardless (already instrumented)

The dev tree carries a built-in `[L5INSTR]` profiler (`src/paged-attn.cpp`,
hooks in `src/llama-context.cpp` and `src/llama-kv-cache.cpp`) that already
isolates the host buckets we care about, printed at process exit:

```
[L5INSTR] get_block_table n=.. sum=..ms mean=..ms | set_inputs n=.. mean=..ms | hostproc n=.. mean=..ms
```

- `hostproc` = `mctx->apply()` + graph reuse-check/rebuild + `set_inputs`, i.e.
  the whole host window **before** `graph_compute` (it does NOT include the GPU
  launch). Prior profiles put this near ~1.4 ms/step.
- `set_inputs` = the H2D input fills (positions, masks, block table, idxs).
- `get_block_table` = the paged block-table host build (0029 caches it
  within-step; `LLAMA_PAGED_NO_BT_CACHE` A/B-toggles that).

If `hostproc` per step is a large fraction of the serving per-step wall time
(and the `graphs reused` count is low), the gap is host-bound, not kernel-bound.
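
To put a scale on "a large fraction", the sketch below converts the tok/s/seq figures into per-step wall time. It rests on one loud assumption that the document does not state: every step emits exactly one token per participating sequence, so step time is `1000 / tok_per_s_per_seq` ms. The function names are illustrative, not part of the tree.

```cpp
#include <cassert>

// Wall time of one decode step in ms, ASSUMING each step emits exactly
// one token per sequence (an assumption, not a measured property).
inline double step_wall_ms(double tok_per_s_per_seq) {
    assert(tok_per_s_per_seq > 0.0);
    return 1000.0 / tok_per_s_per_seq;
}

// Share of a step spent in the host window before graph_compute.
inline double host_fraction(double hostproc_ms, double tok_per_s_per_seq) {
    return hostproc_ms / step_wall_ms(tok_per_s_per_seq);
}

// Extra wall time per step that the serving regime has to account for,
// relative to the static regime.
inline double extra_ms_per_step(double static_rate, double serving_rate) {
    return step_wall_ms(serving_rate) - step_wall_ms(static_rate);
}
```

Under that assumption, `extra_ms_per_step(6.1, 3.7)` is roughly 106 ms (about 164 ms per static step versus about 270 ms per serving step), and `host_fraction(1.4, 3.7)` is about 0.5%. So if the hypothesis is right, the serving-regime `hostproc` (or host time outside the instrumented window) would have to be far larger than the ~1.4 ms/step seen in prior profiles; that is the magnitude Phase 0 should look for.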
### 2d. The serial-SSM host loop (named in README s.5, secondary here)

The gated-DeltaNet decode advances recurrent state per step; sampling cannot
start until logits land. The README already names this as a structural floor in
the *kernel* regime. It is the same in serving but is the *smaller* term - the
graph-rebuild/re-capture overhead (2a/2b) is the new, serving-specific cost the
static bench hides, and it is the one to attack first.

---
## 3. What the already-shipped scheduler patches do (and do NOT do)

These exist; understand them before proposing anything. **None of them touch the
two graph-reuse layers** - they target prefill freezing and burst collapse, not
steady-state decode-step host overhead. That is why the serving gap survives them.

| Patch | What it does | What it does NOT do |
|---|---|---|
| 0008 cross-request prefix-share (server loop) | Concurrent shared-prefix requests prefill only the divergent suffix (fewer prefill tokens). | Does not stabilise decode batch shape; does not graph-reuse. |
| 0013 `LLAMA_PREFILL_BUDGET` | Static per-step prefill-token cap (vLLM `--max-num-batched-tokens` analogue); flattens the ITL spike a long prefill inflicts on co-batched decode. | Ignores decode load; per-workload tuning; no effect on decode-step graph reuse. |
| 0016 dynamic decode-first budget | `max(n_ubatch, T-D)` leftover-after-decode budget + per-slot chunk cap; decode claimed first, auto-shrinks as `D` rises. Stops a prefill chunk from inflating the step past `T`. | **Still lets the per-step decode `n_tokens` and seq-set vary**, so it does not make the decode step graph-reusable; it shapes prefill admission, not decode-shape stability. |
| 0024 paged-pool burst-reclaim | Truncate/defrag/release KV blocks; fixes long-server prefill burst collapse (488->44->532 t/s). | Host accounting only; nothing about decode-step graph capture. |
| 0025 `LLAMA_MOE_FORCE_GRAPHS` | Keeps CUDA graphs ON for the grouped-MMQ MoE decode step (lifts the conservative `MUL_MAT_ID` graph-disable). | Helps the CUDA-graph *eligibility* of one op; does **not** make layer-A/B *reuse* hold across churning steps. It is necessary-not-sufficient: a step that rebuilds anyway gets recaptured regardless. |
| 0029 block-table within-step cache | `get_block_table` computed once per step, memcpy'd to other full-attn layers (-87/-91%). | Shrinks one `set_inputs`/`hostproc` sub-term; does not address rebuild/re-capture. |

**README s.5 "lever 2 (graph/stream coverage): FLAT"** was concluded **in the
static batched-bench regime**, where graphs already reuse - so more graph
coverage was correctly a no-op there. That conclusion does **not** apply to the
serving regime, where graphs do **not** reuse. This doc reopens graph coverage
**for serving only**; record it as a regime-scoped reopening, not a contradiction.

---
## 4. Ranked lever plan (hypotheses - gate on Phase 0 first)

Ranked by value/effort with bit-exactness/risk called out. All are **host-side /
scheduler** levers (no decode-kernel changes), so all are *bit-exact-safe by
construction* provided padding tokens are masked-inert and verified against the
per-path md5 gate.

### Lever S1 (TOP) - bucketed/padded decode-step shape for graph reuse

**Value: high (targets the dominant -39% mechanism). Effort: medium-high. Risk:
medium (correctness of padding inertness; seq-set churn is harder than n_tokens).**

Make the steady-state decode step present a **stable, bucketed shape** to both
reuse layers, mirroring vLLM's padded decode batch + piecewise CUDA graphs:

- Pad the per-step decode `n_tokens` (and the stream/seq count the graph sees) up
  to the next bucket in a small fixed set (e.g. powers of two or a fixed grid), so
  `allow_reuse` (layer A) and `update_required` (layer B) hold across steps with
  the same bucket. Padding tokens are dummy, masked positions that contribute
  nothing to any real sequence's logits.
- Bound the number of distinct live buckets so a handful of persistent CUDA
  graphs cover steady decode (vLLM captures ~tens).
- Handle the seq-set component of `allow_reuse`: bucketing `n_tokens` alone is
  insufficient because the *participating sequence-id set* must also match. Either
  (a) pad to a fixed stream-slot layout so the seq-set is stable across
  arrivals/completions, or (b) relax/extend the reuse key so a pure-decode step
  keyed on bucket+slot-layout reuses regardless of which slots are occupied. (b)
  is the higher-leverage but more invasive option.

Bit-exact gate: greedy md5 per path with padding ON must equal the recorded
references (`5951a5b4` dense, `8cb0ce23` paged-MoE); `test-backend-ops`
unaffected (no op changes). The risk is that masked/padded positions leak into a
real logit (off-by-one in the mask) - the md5 gate catches it.
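
The bucket rounding in the first S1 bullet could look like the sketch below. It assumes a power-of-two grid capped at a maximum bucket; the grid choice, the cap and the names `decode_bucket` and `padding_tokens` are assumptions for illustration, since the lever leaves the bucket set open.

```cpp
#include <cassert>
#include <cstdint>

// Round a decode step's token count up to the next power-of-two bucket,
// capped at max_bucket. The power-of-two grid is an assumed example grid.
inline std::uint32_t decode_bucket(std::uint32_t n_tokens, std::uint32_t max_bucket) {
    assert(n_tokens >= 1 && n_tokens <= max_bucket);
    assert(max_bucket <= (1u << 31));  // keeps the shift below from overflowing
    std::uint32_t bucket = 1;
    while (bucket < n_tokens) bucket <<= 1;
    return bucket < max_bucket ? bucket : max_bucket;
}

// Number of dummy, masked tokens needed to fill the step up to its bucket.
inline std::uint32_t padding_tokens(std::uint32_t n_tokens, std::uint32_t max_bucket) {
    return decode_bucket(n_tokens, max_bucket) - n_tokens;
}
```

With a cap of 128, every decode width from 65 to 128 lands on the same bucket, so the shape part of the reuse key stays constant while the load drifts in that range; a step with 97 live sequences would carry 31 padding tokens. The seq-set part of the key still needs option (a) or (b) above.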
### Lever S2 - overlap per-step host work with GPU decode (double-buffer inputs)

**Value: medium-high (recovers the `hostproc` window even when S1 is only
partially effective). Effort: medium. Risk: low (host-side reordering only,
bit-exact-safe).**

Even with graphs reused, `set_inputs` (+ the pre-`set_inputs` sync) runs
un-graphed and serially *before* each launch (`hostproc` ~1.4 ms/step in prior
profiles). Overlap the host scheduling + input build of step N+1 with the GPU
decode of step N: double-buffer the input device tensors so the host can fill
N+1's inputs while N's graph is in flight, and prepare the next ubatch / block
table on the host concurrently. This is the llama.cpp analogue of vLLM keeping
the GPU fed. Strictly host-side, no numeric change -> bit-exact. (0029 already
banks part of this for the block table within a step; S2 extends it across
steps.)
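
A simple cost model shows what S2 can and cannot recover. The sketch assumes the host work for step N+1 does not depend on anything step N produces (in practice the sampled token ids do, so only part of `hostproc` is overlappable) and that the GPU runs one step at a time. `serial_wall_ms` and `overlapped_wall_ms` are illustrative names, and this is a timing model, not an implementation of double-buffering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// host_ms[i]: host preparation time for step i; gpu_ms[i]: GPU time for step i.

// Today's ordering: host prep, then GPU, strictly alternating.
inline double serial_wall_ms(const std::vector<double> & host_ms,
                             const std::vector<double> & gpu_ms) {
    assert(host_ms.size() == gpu_ms.size());
    double total = 0.0;
    for (std::size_t i = 0; i < host_ms.size(); ++i) total += host_ms[i] + gpu_ms[i];
    return total;
}

// Overlapped ordering: while the GPU runs step i, the host prepares step i+1.
// Step i+1 starts when both the GPU step and that preparation have finished.
inline double overlapped_wall_ms(const std::vector<double> & host_ms,
                                 const std::vector<double> & gpu_ms) {
    assert(host_ms.size() == gpu_ms.size());
    if (host_ms.empty()) return 0.0;
    double total = host_ms[0];  // the first step's prep cannot be hidden
    for (std::size_t i = 0; i < gpu_ms.size(); ++i) {
        const double next_host = (i + 1 < host_ms.size()) ? host_ms[i + 1] : 0.0;
        total += std::max(gpu_ms[i], next_host);
    }
    return total;
}
```

When the GPU step is longer than the host window, overlap hides the host work entirely after the first step. When the host window is the longer one (the rebuild case in section 2), overlap only hides the GPU time, which is why S2 complements S1 rather than replacing it.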
### Lever S3 - graph-shape-stable scheduling (bridge from 0016)

**Value: medium (multiplies S1; low marginal value without S1). Effort: low-medium
(extends the existing 0016 policy). Risk: low (scheduler policy, bit-exact when
the decode result is unchanged).**

Extend the existing decode-first budget (0016) so the scheduler actively *prefers
graph-reusable steps*: keep prefill chunks out of the decode step (run prefill in
its own steps, or at a fixed chunk size) so the decode batch shape stays on a
bucket rather than being perturbed by interleaved prefill tokens every step. This
is the policy half of S1 - S1 makes a bucketed step reusable; S3 makes the
scheduler emit bucketed steps. Pair them.

**Rejected/deferred (record so they are not re-tried):**

- **More CUDA-graph *coverage* alone (the README lever-2 redo): still FLAT
  without S1.** Forcing more ops graph-eligible (beyond 0025) does nothing while
  layer A rebuilds the graph every step - the recapture dominates. Only valuable
  *after* S1 makes reuse hold.
- **`GGML_CUDA_DISABLE_GRAPHS` / disabling graphs in serving: REJECTED a priori
  as a fix** (it is an A/B *probe* for Phase 0, not a lever) - it removes capture
  cost but also removes replay benefit; expected net-negative.
- **Precision levers (W4A16, bf16-SSM): out of scope** - this gap is host-bound,
  not GEMM/BW-bound (see README s.5 rejections; do not reopen).

---
## 5. Phase 0 - confirm it is host-bound BEFORE building (run when the GPU frees)

Do NOT build any lever until this confirms host-bound. The dev tree already has
all the instrumentation; this is a measurement, not a code change. **One GPU
bencher at a time** (GPU-contention rule).

**Workload.** Real continuous serving, not batched-bench: run `llama-server`
(paged build) with the paged config and drive it with a steady concurrent
streaming load (e.g. a K-client async generator hitting `/completion` with
staggered arrivals so requests start/finish asynchronously - the regime
batched-bench cannot produce). Use the same models/flags as README s.4:
`-fa on -ngl 99`, `LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for MoE),
dense Qwen3.6-27B-NVFP4 and MoE Qwen3.6-35B-A3B-NVFP4. Pick K so the *effective
decode width* matches a static `npl` you have a kernel-regime number for (e.g.
~128) - that gives the apples-to-apples comparison: static 6.1 vs serving 3.7
tok/s/seq.

**Signals to capture (all already exist):**

1. **Graph reuse rate.** The `graphs reused = N` perf line (`llama-context.cpp`
   ~L4146, from `data.n_reused`) over total decode steps. Hypothesis: ~100% in
   batched-bench, near 0% in serving. This is the single most decisive number.
   A/B with `LLAMA_GRAPH_REUSE_DISABLE=1` (forces the rebuild path) - if serving
   is already near that floor, layer-A reuse is the gap.
2. **`[L5INSTR]` host buckets** (printed at exit): `hostproc`, `set_inputs`,
   `get_block_table` mean ms/step. Compare serving vs batched-bench. A/B the
   block-table cache with `LLAMA_PAGED_NO_BT_CACHE`.
3. **GPU-busy %** in a steady-state serving window via nsys (sum of kernel
   durations / wall) and the **inter-launch host gap** (time between consecutive
   `cudaGraphLaunch`/kernel launches). Hypothesis: batched-bench ~96-99% busy
   (README/methodology note the early "low util" was a window artifact); serving
   materially lower, with the gap ~= `hostproc`/step. *Watch the same window
   artifact* the methodology warns about - measure a clean steady-state span.
4. **CUDA-graph re-instantiation count** - confirm layer B is also re-capturing
   (nsys shows `cudaGraphInstantiate`/`cudaGraphExecUpdate` per step, or add a
   host-side counter print - host-side only, no kernel code).
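
Signal 3 reduces to two small computations over exported kernel intervals. The sketch below assumes the intervals have already been extracted from an nsys trace into start/end timestamps, sorted by start time and non-overlapping (a single stream); `Interval`, `gpu_busy_fraction` and `mean_inter_launch_gap_ms` are illustrative names, and nothing here parses nsys output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One GPU kernel (or graph launch) execution interval, in milliseconds.
struct Interval {
    double start_ms;
    double end_ms;
};

// Sum of kernel durations inside [win_start_ms, win_end_ms] divided by the
// window length. Intervals are clipped to the window so a partial kernel at
// either edge does not distort the result.
inline double gpu_busy_fraction(const std::vector<Interval> & kernels,
                                double win_start_ms, double win_end_ms) {
    assert(win_end_ms > win_start_ms);
    double busy = 0.0;
    for (const Interval & k : kernels) {
        const double lo = std::max(k.start_ms, win_start_ms);
        const double hi = std::min(k.end_ms, win_end_ms);
        if (hi > lo) busy += hi - lo;
    }
    return busy / (win_end_ms - win_start_ms);
}

// Mean idle time between the end of one interval and the start of the next.
inline double mean_inter_launch_gap_ms(const std::vector<Interval> & kernels) {
    assert(kernels.size() >= 2);
    double gap = 0.0;
    for (std::size_t i = 1; i < kernels.size(); ++i) {
        gap += kernels[i].start_ms - kernels[i - 1].end_ms;
    }
    return gap / static_cast<double>(kernels.size() - 1);
}
```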

**Decision rule.** Host-bound (proceed with S1/S2/S3) if: serving `graphs reused`
is low AND `hostproc`/step is a large fraction of serving per-step wall AND
GPU-busy% drops vs batched-bench by ~the observed throughput ratio (~3.7/6.1).
If instead GPU-busy% stays high and per-kernel time grows, the cause is
elsewhere (e.g. serving runs a worse effective batch shape into the kernels) -
re-scope before building.
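
The decision rule can be written down as a predicate so the three conditions are evaluated the same way on every run. The document does not define what counts as "low", "large" or "~the ratio", so the thresholds are explicit caller-supplied parameters here; `Phase0Signals`, `Phase0Thresholds` and `looks_host_bound` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Measured inputs to the Phase-0 decision (all names are illustrative).
struct Phase0Signals {
    double graphs_reused;  // serving "graphs reused" count
    double decode_steps;   // total serving decode steps
    double hostproc_ms;    // [L5INSTR] hostproc mean per step, serving
    double step_wall_ms;   // serving wall time per step
    double busy_serving;   // GPU-busy fraction, serving
    double busy_static;    // GPU-busy fraction, batched-bench
    double rate_serving;   // tok/s/seq, serving
    double rate_static;    // tok/s/seq, batched-bench
};

// Caller-chosen cut-offs; the document leaves these unspecified.
struct Phase0Thresholds {
    double reuse_low;        // reuse rate below this counts as "low"
    double host_frac_large;  // hostproc share above this counts as "large"
    double ratio_tol;        // allowed |busy ratio - throughput ratio|
};

inline bool looks_host_bound(const Phase0Signals & s, const Phase0Thresholds & t) {
    assert(s.decode_steps > 0.0 && s.step_wall_ms > 0.0);
    assert(s.busy_static > 0.0 && s.rate_static > 0.0);
    const double reuse_rate = s.graphs_reused / s.decode_steps;
    const double host_frac  = s.hostproc_ms / s.step_wall_ms;
    const double busy_ratio = s.busy_serving / s.busy_static;
    const double rate_ratio = s.rate_serving / s.rate_static;
    return reuse_rate < t.reuse_low &&
           host_frac > t.host_frac_large &&
           std::fabs(busy_ratio - rate_ratio) <= t.ratio_tol;
}
```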

**Ground-truth vLLM (both-engine rule).** Capture vLLM at the same concurrency:
GPU-busy% / step cadence (nsys) and its scheduler step time. Confirm vLLM stays
GPU-bound (persistent graphs) where paged goes host-bound - that is the
direct evidence the gap is the host loop, and it sizes the achievable win.

---
## 6. Summary

- The serving gap (paged ~3.7 tok/s/seq in serving, -39% vs its static ~6.1,
  while vLLM sustains ~5.9) is hypothesised to be a **host/scheduler** problem,
  distinct from the decode **kernel** (at parity in batched-bench). The
  README's BW-floor/host-loop-residual findings are kernel-regime and do not
  bound the serving regime.
- Leading mechanism: continuous batching's **batch-shape + seq-set churn breaks
  both graph-reuse layers** (llama-context `can_reuse`, CUDA `update_required`)
  on nearly every step, so the GPU idles while the host rebuilds + re-captures +
  runs un-graphed `set_inputs`. vLLM avoids this with padded/bucketed decode
  shapes + piecewise CUDA graphs.
- The shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill
  freezing + burst collapse, **not** decode-step graph reuse - which is why the
  serving gap survives them.
- Top levers (all host-side, bit-exact-safe): **S1** bucketed/padded decode-step
  shape for graph reuse, **S2** double-buffer/overlap per-step host work, **S3**
  graph-shape-stable scheduling (extend 0016). Gate everything on **Phase 0**:
  the `graphs reused` rate + `[L5INSTR]` host buckets + nsys GPU-busy% in real
  `llama-server` serving vs batched-bench, with vLLM ground-truthed at the same
  concurrency.