mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record P4 CBv2 NO-GO at the perf kill-gate
P4 (token-granular continuous-batching scheduler, LLAMA_CONTINUOUS_BATCH_V2, default-off) stopped honestly at the P0 perf kill-gate. The kill-gate subset (per-seq chunked-prefill cursors + adaptive decode bucketing, server-side only, zero ggml/ files, ~68 LOC + a new unit-tested server-admission-policy.h) was implemented and correctness-proven green (canonical md5 both models default-off AND cbv2-on: MoE 8cb0ce23, dense 5951a5b4; test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46; cursor-interleave PROVEN via LLAMA_CBV2_TRACE with decode+prefill co-batched and per-seq cursors advancing across steps, dbucket==n_decode no-pad; determinism-NEUTRAL: CBv2 diverges from control no more than control diverges from itself, the paged concurrent-greedy path being inherently non-deterministic run-to-run in the baseline too). The kill-gate GO criterion - a >20% TTFT-under-load drop with md5 green and serving-aggregate not regressed - was NOT demonstrated: the staggered/burst TTFT A/B was force-terminated by the harness mid-run (CONTROL-only, 30/60 raws), so the TTFT deltas are not-yet-measured placeholders, not measured neutrality. Per the phased contract go=false was the kill-gate default: nothing built beyond P0 (no SLOT_STATE_PREEMPTED, no aging/starvation-freedom), nothing landed. This is the scope-anticipated outcome - P4 is a GB10 TTFT/fairness/enabler lever, not a throughput lever (decode is GPU-compute-bound), so a NO-GO on the TTFT gate is expected and any throughput payoff is non-GB10. Records the honest rejection in EXECUTION_REARCH_SCOPE.md (P4 RESULT subsection) and PARITY_HANDOFF.md chronology, including the re-score path: read the finalized DGX ~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md once the CANDIDATE arm completes; a genuine >20% staggered-TTFT drop clearing max(2%, 3*stdev) re-scores go=true and triggers the full P4 build-out. Fork localai-paged untouched at 653bb2f3d; LocalAI series stays at 46 patches; topic branch p4-cbv2 retained on the DGX fork at ebb649335 (base 653bb2f3d, not pushed). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -555,6 +555,102 @@ records P2-at-this-granularity as a confirmed floor.
|
||||
- **Upstream-clash / rebase-safety:** safest area. `tools/server/server-context.cpp` is
|
||||
a fork-owned tool, not ggml core; upstream churns it less and conflicts are mechanical.
|
||||
|
||||
#### P4 RESULT (NO-GO at the P0 perf kill-gate, recorded 2026-07-02, `LLAMA_CONTINUOUS_BATCH_V2`, default-off)
|
||||
|
||||
The CBv2 P0 kill-gate subset (per-seq chunked-prefill cursors + adaptive decode
|
||||
bucketing) was **implemented and correctness-proven green**, but the P0 kill-gate's
|
||||
stated GO criterion - a **> 20% TTFT-under-load drop** with md5 green and
|
||||
serving-aggregate not regressed - was **NOT demonstrated**, so per the phased
|
||||
contract `go=false` was the kill-gate default, **nothing was built beyond P0**
|
||||
(no `SLOT_STATE_PREEMPTED`, no aging/starvation-freedom), and **nothing landed.**
|
||||
The topic branch `p4-cbv2` is retained on the DGX fork at
|
||||
`ebb649335fe7686524a3630ee2fdffce44be6d52` (base `localai-paged` `653bb2f3d`, NOT
|
||||
pushed); the fork `localai-paged` HEAD is **untouched at `653bb2f3d`** and the
|
||||
LocalAI series stays at 46 patches (`0001-0055`). **This is the scope-anticipated
|
||||
outcome:** the P4 section frames CBv2 on GB10 as a TTFT + fairness + architecture-
|
||||
enabler lever, **not** a throughput lever (decode is GPU-compute-bound; the
|
||||
host-loop-dead measurement is real), so a NO-GO on the TTFT perf gate is the
|
||||
expected result and any throughput payoff lives on non-GB10 silicon (out of scope).
|
||||
|
||||
- **WHY THE PERF GATE DID NOT FIRE GO (honest caveat, not a measured neutrality).**
|
||||
The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
|
||||
only the CONTROL arm complete (30/60 raws; CANDIDATE arm never started).
|
||||
Consequently `ttft_n32_stag`, `ttft_n128_stag`, `ttft_n8`, `ttft_burst`, and
|
||||
`agg_delta_pct_worst` are **NOT-YET-MEASURED `0.0` placeholders**, not measured
|
||||
neutrality. No affirmative `> 20%` staggered-TTFT drop was demonstrated, so the
|
||||
kill-gate default (`go=false`) stands.
|
||||
- **CORRECTNESS GATES ALL GREEN (DGX GB10, arch sm_121a), the substantive P0
|
||||
result.** Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
|
||||
- **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE
|
||||
`8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`.
|
||||
- **(c) `test-backend-ops` GREEN (zero-ggml side-effect proof):** MUL_MAT
|
||||
1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46.
|
||||
- **(c) CURSOR INTERLEAVE PROVEN** (`LLAMA_CBV2_TRACE`, staggered N=20): steps
|
||||
carry decode AND prefill tokens in the SAME batch with per-slot cursors
|
||||
advancing across steps, not slot-exclusive. Verbatim step=6: `n_decode_toks=5
|
||||
n_prefill_toks=1535 n_seqs=20` with 15 partial cursors; slot s112 advances
|
||||
144/523 -> 281 -> 418 -> 519 over steps 6-9 while decode runs; adaptive
|
||||
fair-share cap tracks live load (410@5waiting, 171@12, 137@15, 291@7, 508@4);
|
||||
`dbucket==n_decode` confirms **no fixed pad-to-parallel** (per
|
||||
`DECODE_SERVING_SCOPE.md` net-negative-on-GB10).
|
||||
- **(b) SERVER DETERMINISM = CBv2 is NEUTRAL / correctness-preserving.** The
|
||||
literal exact-reproducibility gate is unsatisfiable by ANY scheduler here: the
|
||||
paged CONCURRENT greedy path is inherently non-deterministic run-to-run in the
|
||||
BASELINE too (the control default scheduler diverges from itself), a pre-existing
|
||||
benign near-tied-argmax / co-batch FP-reduction-order property
|
||||
(`PAGED_BITEXACT_NOTE`), on both dense and MoE. The discriminating test - does
|
||||
CBv2 diverge from control MORE than control diverges from itself - **PASSES**:
|
||||
across 8 configs {dense,moe} x {degenerate,natural} x {gen8,gen64}, per-request
|
||||
cross-arm divergence tracks the within-arm run-to-run baseline to +/-1-3 of 32
|
||||
(small-count noise; e.g. MoE-natural gen64 base 31/32 worst-cross 31/32;
|
||||
dense-degenerate base 14 cross 12-17). Single-sequence greedy is fully
|
||||
deterministic (the md5 gate above).
|
||||
- **Implementation (kill-gate subset only; correct, committed on `p4-cbv2`, NOT
|
||||
pushed; server-side only, ZERO `ggml/` files, ~68 LOC in `server-context.cpp` +
|
||||
a new unit-tested header).** (1) Per-seq chunked-prefill cursors with a
|
||||
**load-adaptive fair-share cap** = `ceil(prefill_leftover / n_waiting)` floored at
|
||||
`LLAMA_CBV2_CHUNK_MIN` (default 128, deliberately NOT `n_ubatch` so a 512-token
|
||||
prompt actually chunks under load); CBv2 activates the shipped 0016 decode-first
|
||||
budget by default (`T=n_batch`, no `LLAMA_MAX_BATCH_TOKENS` needed) and replaces
|
||||
0016's fixed cap with this fair-share cap; cursor = `slot.prompt.n_tokens()`
|
||||
advancing across steps. (2) Adaptive decode bucket policy (`LLAMA_CBV2_DECODE_PAD`
|
||||
default 0 => `bucket==n_decode`, no padding; policy computed+traced only, never
|
||||
fed to batch formation, so bit-exact-safe; row-emission for host-bound silicon is
|
||||
the deferred [Build phase]). Pure math lives in the NEW unit-tested header
|
||||
`tools/server/server-admission-policy.h` (namespace `cbv2`) +
|
||||
`server-admission-policy-test.cpp` (host-side unit tests ALL PASS local + DGX);
|
||||
`server-context.cpp` is the thin integration; step trace under `LLAMA_CBV2_TRACE=1`.
|
||||
- **Honest delta vs expectation.** Kill-gate GO required TTFT-under-load to drop
|
||||
`> 20%`; **delivered: not demonstrated** (perf A/B force-terminated control-only).
|
||||
The correctness substrate (bit-exact md5, proven decode+prefill co-batching with
|
||||
per-seq cursors, determinism-neutrality) is real and is the enabler the scope
|
||||
values, but the perf axis that gates the phase was never measured to GO.
|
||||
- **WHAT WOULD CHANGE THE VERDICT (re-score path).** Read the finalized DGX
|
||||
`~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md` once the CANDIDATE arm completes
|
||||
(the perf driver `p4_agg.py` auto-writes medians+stdev deltas with the
|
||||
`> 20%`-TTFT-drop GO logic baked in). **IF** it shows a genuine `> 20%`
|
||||
staggered-TTFT drop clearing `max(2%, 3*stdev)` with md5 green and aggregate not
|
||||
regressed, re-score `go=true` and trigger the **full P4 build-out**:
|
||||
`SLOT_STATE_PREEMPTED` + release-KV-keep-prompt-tokens re-admit (reusing the paged
|
||||
burst-reclaim patch 0024 + `paged-alloc.cpp` defrag), aging/starvation-freedom with
|
||||
a constructed starvation test, preemption-transition + aging unit tests, and a
|
||||
forced-preemption byte-identical-resume determinism gate. **ELSE** (the
|
||||
scope-expected case) this NO-GO stands and P4 is deferred as a GB10 TTFT/fairness/
|
||||
enabler lever whose throughput payoff is non-GB10.
|
||||
- **Series-numbering flag (for whoever lands a future GO).** The P0 code comments
|
||||
label `[paged 0056]` per the pinned fork's next slot (46 patches), but the LocalAI
|
||||
worktree README is already ahead at `0056-0061` (the MoE MMQ trace series) -
|
||||
reconcile the actual series number on landing (likely `0062`).
|
||||
- **Artifacts (DGX `~/bench/p4_cbv2/`):** `build_20260702_192141/` (build.log);
|
||||
`gates_20260702_192632/` (SUMMARY.txt: md5 x4, test-backend-ops, cbv2_trace.txt,
|
||||
determinism tsvs); `det2_20260702_193123/` + `det3_20260702_193649/` +
|
||||
`det4_20260702_194040/` (determinism diff-matrix: degenerate / natural / gen8);
|
||||
`perf_20260702_194359/` (raw_*.json + auto-written RESULTS.md). Environment:
|
||||
`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset,
|
||||
sm_121a, GPU lock held. Code on `p4-cbv2` `ebb649335`:
|
||||
`tools/server/server-admission-policy.h`, `server-admission-policy-test.cpp`,
|
||||
`server-context.cpp` (+68).
|
||||
|
||||
### P5: FLA-faithful GDN prefill scan (blocked solve_tril port; the algorithm never actually tested in-backend)
|
||||
|
||||
- **Goal:** replace the hand f32 chunked scan (`gdn_core`, 95.7 us/tok, 2.62x vLLM) with
|
||||
|
||||
@@ -2474,3 +2474,84 @@ pushed). Artifacts on the DGX: `~/bench/p2_moe_region/focused_20260702_172644/`
|
||||
(sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`,
|
||||
`.../killgate_20260702_171826/` (engagement proof, 0x on both models),
|
||||
`.../build_20260702_145928/` (build logs).
|
||||
|
||||
## P4 token-granular continuous-batching scheduler (CBv2) - NO-GO at the perf kill-gate (recorded 2026-07-02)
|
||||
|
||||
Fourth phase of the `EXECUTION_REARCH_SCOPE.md` additive program. The P0 kill-gate
|
||||
subset for `LLAMA_CONTINUOUS_BATCH_V2` (default-off) was **implemented and
|
||||
correctness-proven green**, but the kill-gate's stated GO criterion - a **> 20%
|
||||
TTFT-under-load drop** with md5 green and serving-aggregate not regressed - was
|
||||
**NOT demonstrated**, so per the phased contract `go=false` was the kill-gate
|
||||
default, nothing was built beyond P0, and nothing landed. See the "P4 RESULT"
|
||||
subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; summary and
|
||||
provenance:
|
||||
|
||||
- **Verdict: NO-GO / DO-NOT-SHIP at the perf gate (scope-anticipated).** The P4
|
||||
section frames CBv2 on GB10 as a TTFT + fairness + architecture-enabler lever,
|
||||
**not** a throughput lever (decode is GPU-compute-bound; the host-loop-dead
|
||||
measurement is real), so a NO-GO on the TTFT perf gate is the expected outcome and
|
||||
any throughput payoff is non-GB10 (out of scope).
|
||||
- **Why the perf gate did not fire GO (honest caveat, not measured neutrality).**
|
||||
The staggered/burst TTFT A/B was **force-terminated by the harness mid-run** with
|
||||
only the CONTROL arm complete (30/60 raws; CANDIDATE never started), so the four
|
||||
`ttft_*_delta_pct` and `agg_delta_pct_worst` fields are **NOT-YET-MEASURED `0.0`
|
||||
placeholders**, not measured neutrality. No affirmative `> 20%` staggered-TTFT drop
|
||||
was demonstrated; the kill-gate default (`go=false`) stands.
|
||||
- **Correctness gates all GREEN (DGX GB10, sm_121a), the substantive P0 result.**
|
||||
Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off):
|
||||
- **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE
|
||||
`8cb0ce23`, dense `5951a5b4`.
|
||||
- **(c) `test-backend-ops` GREEN (zero-ggml side-effect proof):** MUL_MAT
|
||||
1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46.
|
||||
- **(c) cursor-interleave PROVEN** (`LLAMA_CBV2_TRACE`, staggered N=20): steps
|
||||
co-batch decode AND prefill tokens with per-slot cursors advancing across steps
|
||||
(step=6: `n_decode_toks=5 n_prefill_toks=1535 n_seqs=20`, 15 partial cursors;
|
||||
slot s112 144/523 -> 281 -> 418 -> 519 over steps 6-9 while decode runs); adaptive
|
||||
fair-share cap tracks live load (410@5w, 171@12, 137@15); `dbucket==n_decode` =>
|
||||
no fixed pad-to-parallel.
|
||||
- **(b) determinism = CBv2 NEUTRAL / correctness-preserving.** The paged concurrent
|
||||
greedy path is inherently non-deterministic run-to-run in the BASELINE too (a
|
||||
benign near-tied-argmax / co-batch FP-reduction-order property,
|
||||
`PAGED_BITEXACT_NOTE`), so the literal exact-match gate is unsatisfiable by any
|
||||
scheduler (control fails it too). The discriminating test - does CBv2 diverge
|
||||
from control more than control diverges from itself - PASSES across 8 configs
|
||||
{dense,moe} x {degenerate,natural} x {gen8,gen64}: cross-arm divergence tracks
|
||||
the within-arm baseline to +/-1-3 of 32. Single-sequence greedy is fully
|
||||
deterministic (the md5 gate).
|
||||
- **What would change the verdict (re-score path).** Read the finalized DGX
|
||||
`~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md` once the CANDIDATE arm completes
|
||||
(`p4_agg.py` auto-writes medians+stdev with the `> 20%`-drop GO logic baked in). If
|
||||
it shows a genuine `> 20%` staggered-TTFT drop clearing `max(2%, 3*stdev)` with md5
|
||||
green and aggregate not regressed, re-score `go=true` and trigger the full P4
|
||||
build-out: `SLOT_STATE_PREEMPTED` + release-KV-keep-prompt re-admit (paged
|
||||
burst-reclaim 0024 + `paged-alloc.cpp` defrag), aging/starvation-freedom + a
|
||||
constructed starvation test, preemption/aging unit tests, and a forced-preemption
|
||||
byte-identical-resume determinism gate. Else this NO-GO stands.
|
||||
|
||||
Implementation (kill-gate subset only; correct, committed, NOT pushed; server-side
|
||||
only, ZERO `ggml/` files, ~68 LOC): `tools/server/server-context.cpp` thin
|
||||
integration + a NEW pure unit-tested header `tools/server/server-admission-policy.h`
|
||||
(namespace `cbv2`) + `server-admission-policy-test.cpp`. (1) Per-seq chunked-prefill
|
||||
cursors with a load-adaptive fair-share cap `ceil(prefill_leftover/n_waiting)`
|
||||
floored at `LLAMA_CBV2_CHUNK_MIN` (default 128, NOT `n_ubatch`, so a 512 prompt
|
||||
actually chunks under load); CBv2 activates the shipped 0016 decode-first budget by
|
||||
default (`T=n_batch`) and replaces 0016's fixed cap; cursor =
|
||||
`slot.prompt.n_tokens()`. (2) Adaptive decode bucket policy (`LLAMA_CBV2_DECODE_PAD`
|
||||
default 0 => `bucket==n_decode`, no padding per `DECODE_SERVING_SCOPE.md`
|
||||
net-negative; policy computed+traced only, never fed to batch formation =>
|
||||
bit-exact-safe; row-emission is the deferred [Build phase]). Trace under
|
||||
`LLAMA_CBV2_TRACE=1`.
|
||||
|
||||
Series-numbering flag: P0 comments label `[paged 0056]` per the fork's next slot,
|
||||
but the LocalAI worktree README is already ahead at `0056-0061` (MoE MMQ trace
|
||||
series) - reconcile on landing (likely `0062`).
|
||||
|
||||
Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; LocalAI series stays at 46
|
||||
patches (`0001-0055`). Topic branch `mudler/llama.cpp:p4-cbv2` retained at
|
||||
`ebb649335fe7686524a3630ee2fdffce44be6d52` (base `653bb2f3d`, NOT pushed). Artifacts
|
||||
on the DGX `~/bench/p4_cbv2/`: `build_20260702_192141/` (build.log),
|
||||
`gates_20260702_192632/` (SUMMARY.txt: md5 x4, test-backend-ops, cbv2_trace.txt,
|
||||
determinism tsvs), `det2_20260702_193123/` + `det3_20260702_193649/` +
|
||||
`det4_20260702_194040/` (determinism diff-matrix), `perf_20260702_194359/`
|
||||
(raw_*.json + auto-written RESULTS.md). Environment: `LLAMA_KV_PAGED=1
|
||||
LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, GPU lock held.
|
||||
|
||||
Reference in New Issue
Block a user