mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
P4 (token-granular continuous-batching scheduler, LLAMA_CONTINUOUS_BATCH_V2, default-off) stopped at the P0 perf kill-gate.

The kill-gate subset (per-seq chunked-prefill cursors + adaptive decode bucketing, server-side only, zero ggml/ files, ~68 LOC + a new unit-tested server-admission-policy.h) was implemented and its correctness proven green: canonical md5 matches for both models, default-off AND cbv2-on (MoE 8cb0ce23, dense 5951a5b4); test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46; cursor-interleave PROVEN via LLAMA_CBV2_TRACE, with decode+prefill co-batched, per-seq cursors advancing across steps, and dbucket==n_decode (no padding); determinism-NEUTRAL, meaning CBv2 diverges from control no more than control diverges from itself, since the paged concurrent-greedy path is inherently non-deterministic run-to-run in the baseline too.

The kill-gate GO criterion - a >20% TTFT-under-load drop with md5 green and serving-aggregate not regressed - was NOT demonstrated. The staggered/burst TTFT A/B was force-terminated by the harness mid-run (CONTROL-only, 30/60 raws), so the TTFT deltas are not-yet-measured placeholders, not measured neutrality. Per the phased contract, go=false is the kill-gate default: nothing was built beyond P0 (no SLOT_STATE_PREEMPTED, no aging/starvation-freedom) and nothing landed.

This is the scope-anticipated outcome. P4 is a GB10 TTFT/fairness/enabler lever, not a throughput lever (decode is GPU-compute-bound), so a NO-GO on the TTFT gate is expected and any throughput payoff is non-GB10.

Records the rejection in EXECUTION_REARCH_SCOPE.md (P4 RESULT subsection) and in the PARITY_HANDOFF.md chronology, including the re-score path: read the finalized DGX ~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md once the CANDIDATE arm completes; a genuine >20% staggered-TTFT drop clearing max(2%, 3*stdev) re-scores go=true and triggers the full P4 build-out.
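The re-score rule above can be sketched as a small scoring helper. This is an illustration only, not the harness that produced RESULTS.md: the function name, its parameters, and the input shape (lists of per-request TTFT samples in milliseconds) are hypothetical. It also assumes that "stdev" means the relative standard deviation of the CONTROL arm, in percent, and that the drop must clear both the 20% bar and the max(2%, 3*stdev) noise floor.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class GateResult:
    ttft_drop_pct: float    # positive means the candidate arm is faster
    noise_floor_pct: float  # max(2%, 3 * relative stdev of control), assumed reading
    go: bool


def score_kill_gate(control_ttft_ms, candidate_ttft_ms, md5_green,
                    aggregate_regressed, min_drop_pct=20.0,
                    min_noise_pct=2.0, sigma=3.0):
    """Hypothetical scorer for the P0 kill-gate described in this commit."""
    # A missing or too-small arm means "not yet measured": keep the default go=false.
    if len(control_ttft_ms) < 2 or not candidate_ttft_ms:
        return GateResult(float("nan"), float("nan"), False)

    control_mean = mean(control_ttft_ms)
    candidate_mean = mean(candidate_ttft_ms)

    # Relative TTFT drop of candidate versus control, in percent.
    drop_pct = 100.0 * (control_mean - candidate_mean) / control_mean

    # Assumption: noise is the control arm's relative sample stdev, in percent.
    rel_stdev_pct = 100.0 * stdev(control_ttft_ms) / control_mean
    noise_floor_pct = max(min_noise_pct, sigma * rel_stdev_pct)

    go = (md5_green
          and not aggregate_regressed
          and drop_pct > min_drop_pct
          and drop_pct > noise_floor_pct)
    return GateResult(drop_pct, noise_floor_pct, go)
```

With the A/B in its current state (CONTROL-only), a scorer of this shape returns go=false without computing a delta, which matches the "not-yet-measured, not measured neutrality" reading above.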
Fork localai-paged untouched at 653bb2f3d; LocalAI series stays at 46 patches; topic branch p4-cbv2 retained on the DGX fork at ebb649335 (base 653bb2f3d, not pushed).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>