docs(paged): reject capped TTFT defer sweep

Record Phase57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 10:18:41 +00:00
parent 902bcc7717
commit 9be291e6b0
4 changed files with 195 additions and 2 deletions

@@ -3199,3 +3199,51 @@ Decision:
The next scheduler work should either narrow the policy to dense/non-MoE
shapes or add a more selective condition that avoids the MoE mean-TTFT
regression.
## Phase 57 TTFT Prefill-First Cap Sweep
Phase 57 adds an optional per-step cap to the Phase55 opt-in policy:
`LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER`. Unset or `0` preserves the Phase55
unlimited behavior. The goal was to keep some first-token relief while avoiding
the MoE `n=128` mean-TTFT regression from Phase56.
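The unset-or-`0` rule can be sketched as below. This is a minimal illustration, not the fork's implementation: the helper names `parse_max_defer` and `split_decodes` are hypothetical, and the sketch assumes the cap bounds how many ready decode slots may be deferred in a single scheduler step.

```python
import os
from typing import List, Mapping, Optional, Tuple

ENV_NAME = "LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER"


def parse_max_defer(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """Return the per-step cap, or None for the Phase55 unlimited behavior."""
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME, "").strip()
    if not raw:
        return None  # unset: unlimited
    value = int(raw)  # raises ValueError on non-numeric input
    # 0 means unlimited per the notes; treating negatives the same way
    # is an assumption of this sketch.
    return value if value > 0 else None


def split_decodes(ready: List[int], cap: Optional[int]) -> Tuple[List[int], List[int]]:
    """Split ready decode slots into (deferred, run_now) for one step.

    Hypothetical model: with no cap every ready decode is deferred behind
    prefill; with a cap, at most `cap` are deferred and the rest run.
    """
    if cap is None:
        return list(ready), []
    return list(ready[:cap]), list(ready[cap:])
```

Under this model, `split_decodes(slots, parse_max_defer())` defers everything when the variable is unset, which is the Phase55 behavior the notes describe.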
Fork commit:
- `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral`
Artifact:
- `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`
Pre/post gates:
| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
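The gate condition the table expresses is that the post-run md5 values equal the pre-run values and every op test passes. A minimal sketch of that check follows; the `Gate` record and `gates_green` helper are hypothetical (they are not part of the bench tooling), and the sketch assumes `1146/1146` means passed/total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    moe_md5: str
    dense_md5: str
    mul_mat: str      # assumed format: "passed/total"
    mul_mat_id: str   # assumed format: "passed/total"


def all_passed(ratio: str) -> bool:
    """True when the passed count equals the total count."""
    passed, total = ratio.split("/")
    return int(passed) == int(total)


def gates_green(pre: Gate, post: Gate) -> bool:
    """Outputs must be unchanged and all op tests must pass, pre and post."""
    same_output = pre.moe_md5 == post.moe_md5 and pre.dense_md5 == post.dense_md5
    ops_ok = all(
        all_passed(r)
        for r in (pre.mul_mat, pre.mul_mat_id, post.mul_mat, post.mul_mat_id)
    )
    return same_output and ops_ok


# Values copied from the pre/post table above.
pre = Gate(
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
    "1146/1146",
    "806/806",
)
post = Gate(
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
    "1146/1146",
    "806/806",
)
print(gates_green(pre, post))  # True
```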
MoE `n=128`, `ptok=128`, `gen=64`:
| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` | `0` |
| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` | `111` |
| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` | `236` |
| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` | `339` |
Dense `n=128`, `ptok=168`, `gen=64`:
| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` | `0` |
| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` | `322` |
| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` | `490` |
Decision:
- Reject capped TTFT defer as a parity lever. MoE cap32 improves mean TTFT
versus same-window default (`7425.5 -> 6994.0 ms`) but still lowers aggregate
throughput (`337.1 -> 335.3` t/s) and raises wall time (`24.299 -> 24.429 s`).
Dense cap32 improves mean TTFT and cap64 roughly preserves it, but both lower
aggregate throughput and raise wall time.
- Keep the cap as an A/B knob only; do not promote it as a default or parity
path.
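The trade-off behind this decision is easier to see as relative deltas against the same-window default. The sketch below recomputes them from the MoE table; the dictionary layout and the `pct_delta` and `deltas_vs_default` helper names are illustrative only.

```python
# Aggregate t/s, mean TTFT (ms) and wall time (s) from the MoE table above.
MOE = {
    "default": {"agg": 337.1, "ttft_mean": 7425.5, "wall": 24.299},
    "cap16": {"agg": 330.2, "ttft_mean": 7589.4, "wall": 24.802},
    "cap32": {"agg": 335.3, "ttft_mean": 6994.0, "wall": 24.429},
    "cap64": {"agg": 327.1, "ttft_mean": 7533.2, "wall": 25.025},
}


def pct_delta(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


def deltas_vs_default(rows: dict) -> dict:
    """Percent change of every metric for each non-default variant."""
    base = rows["default"]
    return {
        name: {metric: pct_delta(value, base[metric]) for metric, value in row.items()}
        for name, row in rows.items()
        if name != "default"
    }


for name, d in deltas_vs_default(MOE).items():
    print(
        f"{name}: agg {d['agg']:+.2f}%  "
        f"TTFT mean {d['ttft_mean']:+.2f}%  wall {d['wall']:+.2f}%"
    )
```

For cap32 this works out to roughly -0.5% aggregate throughput, -5.8% mean TTFT, and +0.5% wall time; cap16 and cap64 are worse than default on all three metrics.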

@@ -32,8 +32,11 @@ Read order for a cold start:
> wall `59.272 -> 57.323 s`, with md5/op gates green. Phase56 then showed the
> policy helps dense `n=32` but regresses MoE `n=128` mean TTFT
> `7168.1 -> 7615.3 ms` and aggregate `341.1 -> 339.9`; keep it opt-in only and
> do not default it broadly. The trace and scheduler commits are local and
> DGX-gated but not pushed, so the LocalAI patch series has not been regenerated.
> do not default it broadly. Phase57 tried a per-step defer cap; cap32 improved
> MoE mean TTFT in one same-window sweep but still lowered aggregate throughput
> and raised wall time, and dense caps lowered aggregate throughput. Do not
> repeat capped-defer sweeps as the next parity path. The trace and scheduler
> commits are local and DGX-gated but not pushed, so the LocalAI patch series
> has not been regenerated.
- Historical verdict: the older investigation marked GB10 parity **CLOSED** and
unreachable. Treat that as superseded where Phase50-54 provide newer dense

@@ -1396,6 +1396,39 @@ and aggregate throughput by `-0.4%`. Do not promote it as a broad default.
Future scheduler work should either narrow the policy to dense/non-MoE shapes or
make the defer condition more selective for MoE.
### Phase 57 capped TTFT defer sweep
Phase57 added `LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER` as an optional per-step cap
on the Phase55 policy. Unset or `0` keeps the Phase55 unlimited behavior.
Artifact: `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`.
Pre/post md5 and op gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.
MoE `n=128`, `ptok=128`, `gen=64`:
| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
|---------|---------|-----------------|-------------|--------------|-------------|--------|
| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` |
| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` |
| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` |
| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` |
Dense `n=128`, `ptok=168`, `gen=64`:
| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
|---------|---------|-----------------|-------------|--------------|-------------|--------|
| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` |
| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` |
| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` |
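Decision: reject capped defer as a parity lever. cap32 is the only interesting
MoE point, but it trades lower mean TTFT (`7425.5 -> 6994.0 ms`) for lower
aggregate throughput (`337.1 -> 335.3` t/s) and higher wall time
(`24.299 -> 24.429 s`). Dense caps also lose aggregate throughput (`141.4` down
to `139.7` and `136.3` t/s). Keep the cap as an opt-in A/B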
knob only.
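The rejection can be restated as a simple filter over the dense table. The rule encoded here is an assumption inferred from the decision text (a parity candidate must lower mean TTFT without lowering aggregate throughput or raising wall time); `Row` and `is_parity_candidate` are hypothetical names.

```python
from typing import NamedTuple


class Row(NamedTuple):
    agg_tps: float
    ttft_mean_ms: float
    wall_s: float


# Values copied from the dense table above.
DENSE = {
    "default": Row(141.4, 22423.5, 57.925),
    "cap32": Row(139.7, 20346.5, 58.645),
    "cap64": Row(136.3, 22461.1, 60.081),
}


def is_parity_candidate(variant: Row, base: Row) -> bool:
    """Assumed rule: better mean TTFT with no throughput or wall-time loss."""
    return (
        variant.ttft_mean_ms < base.ttft_mean_ms
        and variant.agg_tps >= base.agg_tps
        and variant.wall_s <= base.wall_s
    )


candidates = [
    name
    for name, row in DENSE.items()
    if name != "default" and is_parity_candidate(row, DENSE["default"])
]
print(candidates)  # []
```

No dense variant passes: cap32 lowers mean TTFT but loses aggregate throughput and wall time, and cap64 is worse on all three.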
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update