docs(paged): record TTFT prefill-first A/B

Record Phase55 default-off scheduler A/B, DGX md5/op gates, dense serving results, and the pending fork push/mirror status. Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 09:57:55 +00:00
parent 3dbf34e739
commit 999cf09532
4 changed files with 463 additions and 5 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3055,3 +3055,90 @@ Mirror status:
 - The LocalAI `patches/paged/` series is not regenerated yet because the
  handoff requires pushing the fork branch first, and pushes require explicit
  approval.
+
+## Phase 55 TTFT Prefill-First Scheduler A/B
+
+Phase 55 tests the first targeted scheduler policy after Phase53 rejected static
+budget shrinkage. The policy is default-off behind
+`LLAMA_TTFT_PREFILL_FIRST=1`: while any prompt is still waiting for first-token
+admission, defer token 2+ decode rows from already-started streams. This shifts
+early compute toward prompt admission without lowering `LLAMA_MAX_BATCH_TOKENS`.
+
+Fork commits in the local stack:
+
+- `c6cb8460e feat(server): trace serving admission batches`
+- `bd7b2e952 feat(server): add admission trace histograms`
+- `8a97629a4 feat(server): add TTFT prefill-first scheduler mode`
+
+Artifact:
+
+- `/home/mudler/bench/phase55_ttft_prefill_first/20260701_114929`
+
+Pre/post/after-A-B gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| after A/B | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+Focused test/build:
+
+- Red test first: `test-server-admission-policy` failed because
+  `server-admission-policy.h` did not exist.
+- Local fork: `test-server-admission-policy` and
+  `test-server-admission-trace` passed, CTest passed, and `llama-server` built.
+- DGX `build-cuda`: policy and trace tests passed under CTest after the
+  temporary Phase51+Phase54+Phase55 patch stack was applied.
+
+Dense A/B shape:
+
+- Dense GGUF: `~/bench/q36-27b-nvfp4.gguf`
+- `LLAMA_SERVING_TRACE=1`
+- `N=128`, `PTOK=168`, `GEN=64`
+- `CTX=131072`, `PARALLEL=128`, `BATCH=2048`, `UBATCH=512`
+
+H2H result:
+
+| variant | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
+|---------|---------|-----------------|---------------------|-------------|--------------|-------------|--------|
+| default | `138.2` | `361.3` | `1.91` | `626.0` | `23231.9` | `36599.5` | `59.272` |
+| `LLAMA_TTFT_PREFILL_FIRST=1` | `142.9` | `336.9` | `1.86` | `694.2` | `21520.8` | `33008.2` | `57.323` |
+
+Delta:
+
+- Aggregate throughput: `+3.4%`
+- Prefill throughput: `+10.9%`
+- Mean TTFT: `-7.4%`
+- Max TTFT: `-9.8%`
+- Wall time: `-3.3%`
+- h2h decode-agg: `-6.8%`
+
+Default trace:
+
+```text
+serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 started_prompt_slots=128 continued_prompt_slots=139 ttft_deferred_decode_slots=0 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1
+```
+
+Opt-in trace:
+
+```text
+serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=35 started_prompt_slots=128 continued_prompt_slots=139 ttft_deferred_decode_slots=660 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 prompt_hist=0:63,1-64:1,257-512:1,513+:11 decode_hist=0:13,128-255:63 waiting_hist=0:63,1-7:1,8-15:3,16-31:8,32-63:1
+```
+
+Decision:
+
+- Keep Phase55 as a promising default-off scheduler A/B. It improves TTFT and
+  aggregate throughput on the dense `n=128` serving shape while all md5/op gates
+  remain green.
+- The drop in h2h `decode_agg_tps` is expected because the policy intentionally
+  defers token 2+ decode rows during prompt backlog. It should not be treated as
+  a correctness or kernel regression without a true decode profile.
+- Next phase should test the same opt-in policy on the MoE serving shape and at
+  another concurrency point before any default-on discussion.
+
+Mirror status:
+
+- The Phase55 fork commit is local and DGX-gated.
+- The LocalAI `patches/paged/` series is not regenerated yet because the fork
+  branch still requires explicit push approval first.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -20,16 +20,19 @@ Read order for a cold start:

 ## 1. TL;DR STATE

-> 2026-07-01 active update: Phase50-54 reopened the dense serving question.
+> 2026-07-01 active update: Phase50-55 reopened the dense serving question.
 > True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`)
 > than the Phase47 h2h aggregate suggested, while traced serving still shows
 > no pure decode-only steps and high TTFT. Phase53 rejected static lower
 > admission budgets; Phase54 histograms show prompt admission concentrated in a
 > few large chunks (`prompt_hist=513+:12`) with mostly full-width decode
-> (`decode_hist=128-255:53`). Next scheduler work should be a targeted
-> first-token admission or prompt-front-loading A/B, not another global
-> `LLAMA_MAX_BATCH_TOKENS` reduction. The trace commits are local and DGX-gated
-> but not pushed, so the LocalAI patch series has not been regenerated.
+> (`decode_hist=128-255:53`). Phase55 implemented that targeted
+> first-token A/B as `LLAMA_TTFT_PREFILL_FIRST=1`: on dense `n=128` it improved
+> aggregate throughput `138.2 -> 142.9`, mean TTFT `23231.9 -> 21520.8 ms`, and
+> wall `59.272 -> 57.323 s`, with md5/op gates green. Next scheduler work should
+> test the same opt-in policy on MoE and another concurrency point. The trace and
+> scheduler commits are local and DGX-gated but not pushed, so the LocalAI patch
+> series has not been regenerated.

 - Historical verdict: the older investigation marked GB10 parity **CLOSED** and
  unreachable. Treat that as superseded where Phase50-54 provide newer dense
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1323,6 +1323,48 @@ budget-shrinkage as the next path and points to a targeted first-token
 admission or prompt-front-loading A/B, gated by the same md5 and backend-op
 checks.

+### Phase 55 TTFT prefill-first scheduler A/B
+
+Phase55 implemented the targeted first-token admission A/B proposed by
+Phase54. It is default-off behind `LLAMA_TTFT_PREFILL_FIRST=1`; while any prompt
+is still waiting for first-token admission, it defers token 2+ decode rows from
+already-started streams. This does not lower `LLAMA_MAX_BATCH_TOKENS` and does
+not change default scheduling.
+
+Fork commit:
+`8a97629a4 feat(server): add TTFT prefill-first scheduler mode`.
+
+Artifact:
+`/home/mudler/bench/phase55_ttft_prefill_first/20260701_114929`.
+
+Pre/post/after-A-B md5 and op gates stayed green on the temporary DGX patch
+stack: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+Dense `n=128`, `ptok=168`, `gen=64` A/B:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|
+| default | `138.2` | `361.3` | `626.0` | `23231.9` | `36599.5` | `59.272` |
+| `LLAMA_TTFT_PREFILL_FIRST=1` | `142.9` | `336.9` | `694.2` | `21520.8` | `33008.2` | `57.323` |
+
+Trace comparison:
+
+- Default: `ttft_deferred_decode_slots=0`,
+  `prompt_hist=0:63,1-64:1,513+:12`,
+  `decode_hist=0:3,1-63:10,64-127:10,128-255:53`.
+- Opt-in: `ttft_deferred_decode_slots=660`,
+  `prompt_hist=0:63,1-64:1,257-512:1,513+:11`,
+  `decode_hist=0:13,128-255:63`.
+
+Decision: keep the policy as a promising default-off scheduler A/B. It improved
+dense aggregate throughput by `+3.4%`, mean TTFT by `-7.4%`, max TTFT by
+`-9.8%`, and wall time by `-3.3%`. The h2h decode-agg drop is expected because
+the policy shifts early compute from token 2+ decode to first-token prompt
+admission. Before any default-on discussion, test MoE serving and at least one
+additional concurrency point.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update