docs(paged): record P4 CBv2 NO-GO at the perf kill-gate

P4 (token-granular continuous-batching scheduler, LLAMA_CONTINUOUS_BATCH_V2,
default-off) stopped honestly at the P0 perf kill-gate. The kill-gate subset
(per-seq chunked-prefill cursors + adaptive decode bucketing, server-side only,
zero ggml/ files, ~68 LOC + a new unit-tested server-admission-policy.h) was
implemented and correctness-proven green (canonical md5 both models default-off
AND cbv2-on: MoE 8cb0ce23, dense 5951a5b4; test-backend-ops MUL_MAT 1146/1146,
MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46; cursor-interleave PROVEN via
LLAMA_CBV2_TRACE with decode+prefill co-batched and per-seq cursors advancing
across steps, dbucket==n_decode no-pad; determinism-NEUTRAL: CBv2 diverges from
control no more than control diverges from itself, the paged concurrent-greedy
path being inherently non-deterministic run-to-run in the baseline too).

The kill-gate GO criterion - a >20% TTFT-under-load drop with md5 green and
serving-aggregate not regressed - was NOT demonstrated: the staggered/burst TTFT
A/B was force-terminated by the harness mid-run (CONTROL-only, 30/60 raws), so
the TTFT deltas are not-yet-measured placeholders, not measured neutrality. Per
the phased contract, go=false was the kill-gate default: nothing built beyond P0
(no SLOT_STATE_PREEMPTED, no aging/starvation-freedom), nothing landed. This is
the scope-anticipated outcome - P4 is a GB10 TTFT/fairness/enabler lever, not a
throughput lever (decode is GPU-compute-bound), so a NO-GO on the TTFT gate is
expected and any throughput payoff is non-GB10.

Records the honest rejection in EXECUTION_REARCH_SCOPE.md (P4 RESULT subsection)
and PARITY_HANDOFF.md chronology, including the re-score path: read the finalized
DGX ~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md once the CANDIDATE arm
completes; a genuine >20% staggered-TTFT drop clearing max(2%, 3*stdev) re-scores
go=true and triggers the full P4 build-out. Fork localai-paged untouched at
653bb2f3d; LocalAI series stays at 46 patches; topic branch p4-cbv2 retained on
the DGX fork at ebb649335 (base 653bb2f3d, not pushed).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-07-02 18:03:34 +00:00
parent 586639d016
commit 865e77c4ec
2 changed files with 177 additions and 0 deletions


@@ -555,6 +555,102 @@ records P2-at-this-granularity as a confirmed floor.
- **Upstream-clash / rebase-safety:** safest area. `tools/server/server-context.cpp` is
a fork-owned tool, not ggml core; upstream churns it less and conflicts are mechanical.
#### P4 RESULT (NO-GO at the P0 perf kill-gate, recorded 2026-07-02, `LLAMA_CONTINUOUS_BATCH_V2`, default-off)
The CBv2 P0 kill-gate subset (per-seq chunked-prefill cursors + adaptive decode
bucketing) was **implemented and correctness-proven green**, but the P0 kill-gate's
stated GO criterion - a **> 20% TTFT-under-load drop** with md5 green and
serving-aggregate not regressed - was **NOT demonstrated**, so per the phased
contract `go=false` was the kill-gate default, **nothing was built beyond P0**
(no `SLOT_STATE_PREEMPTED`, no aging/starvation-freedom), and **nothing landed.**
The topic branch `p4-cbv2` is retained on the DGX fork at
`ebb649335fe7686524a3630ee2fdffce44be6d52` (base `localai-paged` `653bb2f3d`, NOT
pushed); the fork `localai-paged` HEAD is **untouched at `653bb2f3d`** and the
LocalAI series stays at 46 patches (`0001-0055`). **This is the scope-anticipated
outcome:** the P4 section frames CBv2 on GB10 as a TTFT + fairness +
architecture-enabler lever, **not** a throughput lever (decode is GPU-compute-bound; the
host-loop-dead measurement is real), so a NO-GO on the TTFT perf gate is the
expected result and any throughput payoff lives on non-GB10 silicon (out of scope).
- **WHY THE PERF GATE DID NOT FIRE GO (honest caveat, not a measured neutrality).**
The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
only the CONTROL arm complete (30/60 raws; CANDIDATE arm never started).
Consequently `ttft_n32_stag`, `ttft_n128_stag`, `ttft_n8`, `ttft_burst`, and
`agg_delta_pct_worst` are **NOT-YET-MEASURED `0.0` placeholders**, not measured
neutrality. No affirmative `> 20%` staggered-TTFT drop was demonstrated, so the
kill-gate default (`go=false`) stands.
- **CORRECTNESS GATES ALL GREEN (DGX GB10, arch sm_121a), the substantive P0
result.** Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
- **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`.
- **(c) `test-backend-ops` GREEN (zero-ggml side-effect proof):** MUL_MAT
1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46.
- **(c) CURSOR INTERLEAVE PROVEN** (`LLAMA_CBV2_TRACE`, staggered N=20): steps
carry decode AND prefill tokens in the SAME batch with per-slot cursors
advancing across steps, not slot-exclusive. Verbatim step=6: `n_decode_toks=5
n_prefill_toks=1535 n_seqs=20` with 15 partial cursors; slot s112 advances
144/523 -> 281 -> 418 -> 519 over steps 6-9 while decode runs; adaptive
fair-share cap tracks live load (410@5waiting, 171@12, 137@15, 291@7, 508@4);
`dbucket==n_decode` confirms **no fixed pad-to-parallel** (per
`DECODE_SERVING_SCOPE.md` net-negative-on-GB10).
- **(b) SERVER DETERMINISM = CBv2 is NEUTRAL / correctness-preserving.** The
literal exact-reproducibility gate is unsatisfiable by ANY scheduler here: the
paged CONCURRENT greedy path is inherently non-deterministic run-to-run in the
BASELINE too (the control default scheduler diverges from itself), a pre-existing
benign near-tied-argmax / co-batch FP-reduction-order property
(`PAGED_BITEXACT_NOTE`), on both dense and MoE. The discriminating test - does
CBv2 diverge from control MORE than control diverges from itself - **PASSES**:
across 8 configs {dense,moe} x {degenerate,natural} x {gen8,gen64}, per-request
cross-arm divergence tracks the within-arm run-to-run baseline to +/-1-3 of 32
(small-count noise; e.g. MoE-natural gen64 base 31/32 worst-cross 31/32;
dense-degenerate base 14 cross 12-17). Single-sequence greedy is fully
deterministic (the md5 gate above).
- **Implementation (kill-gate subset only; correct, committed on `p4-cbv2`, NOT
pushed; server-side only, ZERO `ggml/` files, ~68 LOC in `server-context.cpp` +
a new unit-tested header).** (1) Per-seq chunked-prefill cursors with a
**load-adaptive fair-share cap** = `ceil(prefill_leftover / n_waiting)` floored at
`LLAMA_CBV2_CHUNK_MIN` (default 128, deliberately NOT `n_ubatch` so a 512-token
prompt actually chunks under load); CBv2 activates the shipped 0016 decode-first
budget by default (`T=n_batch`, no `LLAMA_MAX_BATCH_TOKENS` needed) and replaces
0016's fixed cap with this fair-share cap; cursor = `slot.prompt.n_tokens()`
advancing across steps. (2) Adaptive decode bucket policy (`LLAMA_CBV2_DECODE_PAD`
default 0 => `bucket==n_decode`, no padding; policy computed+traced only, never
fed to batch formation, so bit-exact-safe; row-emission for host-bound silicon is
the deferred [Build phase]). Pure math lives in the NEW unit-tested header
`tools/server/server-admission-policy.h` (namespace `cbv2`) +
`server-admission-policy-test.cpp` (host-side unit tests ALL PASS local + DGX);
`server-context.cpp` is the thin integration; step trace under `LLAMA_CBV2_TRACE=1`.
- **Honest delta vs expectation.** Kill-gate GO required TTFT-under-load to drop
`> 20%`; **delivered: not demonstrated** (perf A/B force-terminated control-only).
The correctness substrate (bit-exact md5, proven decode+prefill co-batching with
per-seq cursors, determinism-neutrality) is real and is the enabler the scope
values, but the perf axis that gates the phase was never measured to GO.
- **WHAT WOULD CHANGE THE VERDICT (re-score path).** Read the finalized DGX
`~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md` once the CANDIDATE arm completes
(the perf driver `p4_agg.py` auto-writes medians+stdev deltas with the
`> 20%`-TTFT-drop GO logic baked in). **IF** it shows a genuine `> 20%`
staggered-TTFT drop clearing `max(2%, 3*stdev)` with md5 green and aggregate not
regressed, re-score `go=true` and trigger the **full P4 build-out**:
`SLOT_STATE_PREEMPTED` + release-KV-keep-prompt-tokens re-admit (reusing the paged
burst-reclaim patch 0024 + `paged-alloc.cpp` defrag), aging/starvation-freedom with
a constructed starvation test, preemption-transition + aging unit tests, and a
forced-preemption byte-identical-resume determinism gate. **ELSE** (the
scope-expected case) this NO-GO stands and P4 is deferred as a GB10
TTFT/fairness/enabler lever whose throughput payoff is non-GB10.
- **Series-numbering flag (for whoever lands a future GO).** The P0 code comments
label `[paged 0056]` per the pinned fork's next slot (46 patches), but the LocalAI
worktree README is already ahead at `0056-0061` (the MoE MMQ trace series) -
reconcile the actual series number on landing (likely `0062`).
- **Artifacts (DGX `~/bench/p4_cbv2/`):** `build_20260702_192141/` (build.log);
`gates_20260702_192632/` (SUMMARY.txt: md5 x4, test-backend-ops, cbv2_trace.txt,
determinism tsvs); `det2_20260702_193123/` + `det3_20260702_193649/` +
`det4_20260702_194040/` (determinism diff-matrix: degenerate / natural / gen8);
`perf_20260702_194359/` (raw_*.json + auto-written RESULTS.md). Environment:
`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset,
sm_121a, GPU lock held. Code on `p4-cbv2` `ebb649335`:
`tools/server/server-admission-policy.h`, `server-admission-policy-test.cpp`,
`server-context.cpp` (+68).
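The load-adaptive fair-share cap described in the Implementation bullet can be sketched as below. This is a minimal illustration of the stated formula only, not the code in `server-admission-policy.h`: the function names, the zero-waiting behaviour, and the 2048-token leftover used in the usage note are assumptions made for the example.

```python
# Hypothetical sketch of the fair-share prefill cap; names are illustrative,
# not the identifiers used in server-admission-policy.h.

CHUNK_MIN_DEFAULT = 128  # default of LLAMA_CBV2_CHUNK_MIN per the notes above


def fair_share_cap(prefill_leftover: int, n_waiting: int,
                   chunk_min: int = CHUNK_MIN_DEFAULT) -> int:
    """Per-sequence prefill cap: ceil(prefill_leftover / n_waiting), floored at chunk_min."""
    if n_waiting <= 0:
        # Assumption: with no sequence waiting for prefill there is nothing to cap.
        return 0
    share = -(-prefill_leftover // n_waiting)  # integer ceiling division
    return max(chunk_min, share)


def advance_cursor(cursor: int, prompt_len: int, cap: int) -> int:
    """Advance one sequence's prefill cursor by at most cap tokens in this step."""
    return min(prompt_len, cursor + cap)


if __name__ == "__main__":
    # Assumed leftover budget of 2048 tokens, shared by 5, 12, 15 and 32 waiting sequences.
    for n in (5, 12, 15, 32):
        print(n, fair_share_cap(2048, n))
```

With the assumed 2048-token leftover, the cap shrinks as more sequences wait and stops shrinking at the 128-token floor, which is why a 512-token prompt is split across several steps under load instead of monopolising a step.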
### P5: FLA-faithful GDN prefill scan (blocked solve_tril port; the algorithm never actually tested in-backend)
- **Goal:** replace the hand f32 chunked scan (`gdn_core`, 95.7 us/tok, 2.62x vLLM) with


@@ -2474,3 +2474,84 @@ pushed). Artifacts on the DGX: `~/bench/p2_moe_region/focused_20260702_172644/`
(sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`,
`.../killgate_20260702_171826/` (engagement proof, 0x on both models),
`.../build_20260702_145928/` (build logs).
## P4 token-granular continuous-batching scheduler (CBv2) - NO-GO at the perf kill-gate (recorded 2026-07-02)
Fourth phase of the `EXECUTION_REARCH_SCOPE.md` additive program. The P0 kill-gate
subset for `LLAMA_CONTINUOUS_BATCH_V2` (default-off) was **implemented and
correctness-proven green**, but the kill-gate's stated GO criterion - a **> 20%
TTFT-under-load drop** with md5 green and serving-aggregate not regressed - was
**NOT demonstrated**, so per the phased contract `go=false` was the kill-gate
default, nothing was built beyond P0, and nothing landed. See the "P4 RESULT"
subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; summary and
provenance:
- **Verdict: NO-GO / DO-NOT-SHIP at the perf gate (scope-anticipated).** The P4
section frames CBv2 on GB10 as a TTFT + fairness + architecture-enabler lever,
**not** a throughput lever (decode is GPU-compute-bound; the host-loop-dead
measurement is real), so a NO-GO on the TTFT perf gate is the expected outcome and
any throughput payoff is non-GB10 (out of scope).
- **Why the perf gate did not fire GO (honest caveat, not measured neutrality).**
The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
only the CONTROL arm complete (30/60 raws; CANDIDATE never started), so the four
`ttft_*_delta_pct` and `agg_delta_pct_worst` fields are **NOT-YET-MEASURED `0.0`
placeholders**, not measured neutrality. No affirmative `> 20%` staggered-TTFT drop
was demonstrated; the kill-gate default (`go=false`) stands.
- **Correctness gates all GREEN (DGX GB10, sm_121a), the substantive P0 result.**
Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
- **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE
`8cb0ce23`, dense `5951a5b4`.
- **(c) `test-backend-ops` GREEN (zero-ggml side-effect proof):** MUL_MAT
1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46.
- **(c) cursor-interleave PROVEN** (`LLAMA_CBV2_TRACE`, staggered N=20): steps
co-batch decode AND prefill tokens with per-slot cursors advancing across steps
(step=6: `n_decode_toks=5 n_prefill_toks=1535 n_seqs=20`, 15 partial cursors;
slot s112 144/523 -> 281 -> 418 -> 519 over steps 6-9 while decode runs); adaptive
fair-share cap tracks live load (410@5waiting, 171@12, 137@15); `dbucket==n_decode` =>
no fixed pad-to-parallel.
- **(b) determinism = CBv2 NEUTRAL / correctness-preserving.** The paged concurrent
greedy path is inherently non-deterministic run-to-run in the BASELINE too (a
benign near-tied-argmax / co-batch FP-reduction-order property,
`PAGED_BITEXACT_NOTE`), so the literal exact-match gate is unsatisfiable by any
scheduler (control fails it too). The discriminating test - does CBv2 diverge
from control more than control diverges from itself - PASSES across 8 configs
{dense,moe} x {degenerate,natural} x {gen8,gen64}: cross-arm divergence tracks
the within-arm baseline to +/-1-3 of 32. Single-sequence greedy is fully
deterministic (the md5 gate).
- **What would change the verdict (re-score path).** Read the finalized DGX
`~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md` once the CANDIDATE arm completes
(`p4_agg.py` auto-writes medians+stdev with the `> 20%`-drop GO logic baked in). If
it shows a genuine `> 20%` staggered-TTFT drop clearing `max(2%, 3*stdev)` with md5
green and aggregate not regressed, re-score `go=true` and trigger the full P4
build-out: `SLOT_STATE_PREEMPTED` + release-KV-keep-prompt re-admit (paged
burst-reclaim 0024 + `paged-alloc.cpp` defrag), aging/starvation-freedom + a
constructed starvation test, preemption/aging unit tests, and a forced-preemption
byte-identical-resume determinism gate. Else this NO-GO stands.
Implementation (kill-gate subset only; correct, committed, NOT pushed; server-side
only, ZERO `ggml/` files, ~68 LOC): `tools/server/server-context.cpp` thin
integration + a NEW pure unit-tested header `tools/server/server-admission-policy.h`
(namespace `cbv2`) + `server-admission-policy-test.cpp`. (1) Per-seq chunked-prefill
cursors with a load-adaptive fair-share cap `ceil(prefill_leftover/n_waiting)`
floored at `LLAMA_CBV2_CHUNK_MIN` (default 128, NOT `n_ubatch`, so a 512-token prompt
actually chunks under load); CBv2 activates the shipped 0016 decode-first budget by
default (`T=n_batch`) and replaces 0016's fixed cap; cursor =
`slot.prompt.n_tokens()`. (2) Adaptive decode bucket policy (`LLAMA_CBV2_DECODE_PAD`
default 0 => `bucket==n_decode`, no padding per `DECODE_SERVING_SCOPE.md`
net-negative; policy computed+traced only, never fed to batch formation =>
bit-exact-safe; row-emission is the deferred [Build phase]). Trace under
`LLAMA_CBV2_TRACE=1`.
Series-numbering flag: P0 comments label `[paged 0056]` per the fork's next slot,
but the LocalAI worktree README is already ahead at `0056-0061` (MoE MMQ trace
series) - reconcile on landing (likely `0062`).
Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; LocalAI series stays at 46
patches (`0001-0055`). Topic branch `mudler/llama.cpp:p4-cbv2` retained at
`ebb649335fe7686524a3630ee2fdffce44be6d52` (base `653bb2f3d`, NOT pushed). Artifacts
on the DGX `~/bench/p4_cbv2/`: `build_20260702_192141/` (build.log),
`gates_20260702_192632/` (SUMMARY.txt: md5 x4, test-backend-ops, cbv2_trace.txt,
determinism tsvs), `det2_20260702_193123/` + `det3_20260702_193649/` +
`det4_20260702_194040/` (determinism diff-matrix), `perf_20260702_194359/`
(raw_*.json + auto-written RESULTS.md). Environment: `LLAMA_KV_PAGED=1
LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, GPU lock held.
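The re-score rule above can be sketched as a small decision function. This is a hedged illustration, not the logic in `p4_agg.py`: the function name, the use of the control arm's relative stdev as the noise estimate, and the 2% aggregate tolerance are assumptions. It makes one point explicit: an unmeasured CANDIDATE arm yields `go=false` by default and is not read as a `0.0` delta.

```python
# Hypothetical sketch of the kill-gate re-score rule; not the p4_agg.py implementation.
from statistics import median, stdev


def rescore_go(control_ttft, candidate_ttft, md5_green, agg_delta_pct_worst,
               agg_tolerance_pct=2.0):
    """Return True only for a measured >20% staggered-TTFT drop that clears the noise floor."""
    # An arm that never ran (or has too few samples) is "not measured": default go=false.
    if len(control_ttft) < 2 or len(candidate_ttft) < 2:
        return False
    if agg_delta_pct_worst is None:
        return False

    ctrl = median(control_ttft)
    cand = median(candidate_ttft)
    drop_pct = 100.0 * (ctrl - cand) / ctrl

    # Assumption: noise is the control arm's stdev expressed as a percentage of its median.
    noise_pct = 100.0 * stdev(control_ttft) / ctrl
    floor_pct = max(2.0, 3.0 * noise_pct)

    # Assumption: "aggregate not regressed" means no worse than -agg_tolerance_pct.
    agg_ok = agg_delta_pct_worst >= -agg_tolerance_pct

    return md5_green and agg_ok and drop_pct > 20.0 and drop_pct > floor_pct


if __name__ == "__main__":
    control = [100.0, 102.0, 98.0, 101.0, 99.0]  # illustrative TTFT samples, ms
    print(rescore_go(control, [], True, None))   # candidate arm missing
    print(rescore_go(control, [70.0, 72.0, 69.0, 71.0, 70.0], True, 0.0))
```

The sample values are invented for the example and are not measurements from the DGX runs.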