docs(paged): validate TTFT prefill-first A/B

Record Phase56 MoE and lower-concurrency validation for the TTFT prefill-first policy, including DGX gates and the opt-in-only decision.

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 10:05:23 +00:00
parent 999cf09532
commit 902bcc7717
4 changed files with 212 additions and 4 deletions
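
Reviewer note: the percentage deltas recorded in this commit can be recomputed from the table values in the diff. The sketch below assumes each delta is `(candidate - default) / default`, rounded to one decimal place. The helper name `pct_delta` and the dictionary names are hypothetical and are not part of the benchmark tooling.

```python
def pct_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs. baseline, in percent, rounded to 0.1."""
    return round((candidate - baseline) / baseline * 100.0, 1)


# (default, LLAMA_TTFT_PREFILL_FIRST=1) pairs copied from the Phase 56 tables.
MOE_N128 = {
    "agg t/s": (341.1, 339.9),
    "prefill t/s": (1555.9, 1622.7),
    "TTFT mean ms": (7168.1, 7615.3),
    "TTFT max ms": (11435.5, 10964.4),
    "wall s": (24.015, 24.098),
}
DENSE_N32 = {
    "agg t/s": (104.3, 106.7),
    "prefill t/s": (617.2, 662.1),
    "TTFT mean ms": (7687.7, 7284.3),
    "TTFT max ms": (9234.4, 8609.1),
    "wall s": (19.627, 19.194),
}

if __name__ == "__main__":
    for label, table in (("MoE n=128", MOE_N128), ("dense n=32", DENSE_N32)):
        print(label)
        for metric, (default, opt_in) in table.items():
            # Signed percent change; negative is better for TTFT and wall time.
            print(f"  {metric}: {pct_delta(default, opt_in):+.1f}%")
```

Under that assumption the recomputed values agree with the deltas listed in `GB10_PARITY_PHASE0_RESULTS.md`. Note that the sign convention differs by metric: a positive delta is an improvement for throughput but a regression for TTFT and wall time.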
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3142,3 +3142,60 @@ Mirror status:
 - The Phase55 fork commit is local and DGX-gated.
 - The LocalAI `patches/paged/` series is not regenerated yet because the fork
  branch still requires explicit push approval first.
+
+## Phase 56 TTFT Prefill-First Validation
+
+Phase 56 validates the Phase55 opt-in policy outside dense `n=128`. It makes no
+code changes; the same Phase51+Phase54+Phase55 stack was applied temporarily to
+the clean DGX mirror and reverted after the run.
+
+Artifact:
+
+- `/home/mudler/bench/phase56_ttft_prefill_first_validation/20260701_115852`
+
+Pre/post gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred decode slots |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|-----------------------|
+| default | `341.1` | `651.2` | `1555.9` | `7168.1` | `11435.5` | `24.015` | `0` |
+| `LLAMA_TTFT_PREFILL_FIRST=1` | `339.9` | `623.8` | `1622.7` | `7615.3` | `10964.4` | `24.098` | `441` |
+
+MoE deltas:
+
+- Aggregate throughput: `-0.4%`
+- Prefill throughput: `+4.3%`
+- Mean TTFT: `+6.2%`
+- Max TTFT: `-4.1%`
+- Wall time: `+0.3%`
+
+Dense `n=32`, `ptok=168`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred decode slots |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|-----------------------|
+| default | `104.3` | `197.1` | `617.2` | `7687.7` | `9234.4` | `19.627` | `0` |
+| `LLAMA_TTFT_PREFILL_FIRST=1` | `106.7` | `193.5` | `662.1` | `7284.3` | `8609.1` | `19.194` | `34` |
+
+Dense `n=32` deltas:
+
+- Aggregate throughput: `+2.3%`
+- Prefill throughput: `+7.3%`
+- Mean TTFT: `-5.2%`
+- Max TTFT: `-6.8%`
+- Wall time: `-2.2%`
+
+Decision:
+
+- Keep `LLAMA_TTFT_PREFILL_FIRST=1` as an opt-in A/B only. It helps dense
+  `n=128` and dense `n=32`, but MoE `n=128` regresses mean TTFT and slightly
+  regresses aggregate throughput.
+- Do not make this policy default-on or promote it as a universal parity lever.
+  The next scheduler work should either narrow the policy to dense/non-MoE
+  shapes or add a more selective condition that avoids the MoE mean-TTFT
+  regression.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -29,10 +29,11 @@ Read order for a cold start:
 > (`decode_hist=128-255:53`). Phase55 implemented that targeted
 > first-token A/B as `LLAMA_TTFT_PREFILL_FIRST=1`: on dense `n=128` it improved
 > aggregate throughput `138.2 -> 142.9`, mean TTFT `23231.9 -> 21520.8 ms`, and
-> wall `59.272 -> 57.323 s`, with md5/op gates green. Next scheduler work should
-> test the same opt-in policy on MoE and another concurrency point. The trace and
-> scheduler commits are local and DGX-gated but not pushed, so the LocalAI patch
-> series has not been regenerated.
+> wall `59.272 -> 57.323 s`, with md5/op gates green. Phase56 then showed the
+> policy helps dense `n=32` but regresses MoE `n=128` mean TTFT
+> `7168.1 -> 7615.3 ms` and aggregate `341.1 -> 339.9`; keep it opt-in only and
+> do not default it broadly. The trace and scheduler commits are local and
+> DGX-gated but not pushed, so the LocalAI patch series has not been regenerated.

 - Historical verdict: the older investigation marked GB10 parity **CLOSED** and
  unreachable. Treat that as superseded where Phase50-54 provide newer dense
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1365,6 +1365,37 @@ the policy shifts early compute from token 2+ decode to first-token prompt
 admission. Before any default-on discussion, test MoE serving and at least one
 additional concurrency point.

+### Phase 56 TTFT prefill-first validation
+
+Phase56 made no code changes. It reapplied the Phase55 stack temporarily on DGX
+and tested the opt-in policy on MoE `n=128` and dense `n=32`. Artifact:
+`/home/mudler/bench/phase56_ttft_prefill_first_validation/20260701_115852`.
+
+Pre/post md5 and op gates stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|
+| default | `341.1` | `651.2` | `1555.9` | `7168.1` | `11435.5` | `24.015` |
+| `LLAMA_TTFT_PREFILL_FIRST=1` | `339.9` | `623.8` | `1622.7` | `7615.3` | `10964.4` | `24.098` |
+
+Dense `n=32`, `ptok=168`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|
+| default | `104.3` | `197.1` | `617.2` | `7687.7` | `9234.4` | `19.627` |
+| `LLAMA_TTFT_PREFILL_FIRST=1` | `106.7` | `193.5` | `662.1` | `7284.3` | `8609.1` | `19.194` |
+
+Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` opt-in only. It helps dense
+serving at `n=128` and `n=32`, but on MoE `n=128` it raises mean TTFT by `6.2%`
+and lowers aggregate throughput by `0.4%`. Do not promote it as a broad default.
+Future scheduler work should either narrow the policy to dense/non-MoE shapes or
+make the defer condition more selective for MoE.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update