docs(paged): reject capped TTFT defer sweep

Record Phase57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 10:18:41 +00:00
parent 902bcc7717
commit 9be291e6b0
4 changed files with 195 additions and 2 deletions


@@ -3199,3 +3199,51 @@ Decision:
The next scheduler work should either narrow the policy to dense/non-MoE
shapes or add a more selective condition that avoids the MoE mean-TTFT
regression.
## Phase 57 TTFT Prefill-First Cap Sweep
Phase 57 adds an optional per-step cap to the Phase55 opt-in policy:
`LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER`. Unset or `0` preserves the Phase55
unlimited behavior. The goal was to keep some first-token relief while avoiding
the MoE `n=128` mean-TTFT regression from Phase56.
Fork commit:
- `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral`
Artifact:
- `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`
Pre/post gates:
| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
MoE `n=128`, `ptok=128`, `gen=64`:
| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` | `0` |
| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` | `111` |
| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` | `236` |
| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` | `339` |
Dense `n=128`, `ptok=168`, `gen=64`:
| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` | `0` |
| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` | `322` |
| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` | `490` |
Decision:
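- Reject capped TTFT defer as a parity lever. MoE cap32 improves mean TTFT
versus same-window default (`7425.5 -> 6994.0 ms`) but still loses aggregate
throughput (`337.1 -> 335.3 t/s`) and wall time (`24.299 -> 24.429 s`). Dense
cap32 improves mean TTFT (`22423.5 -> 20346.5 ms`) only by losing aggregate
throughput (`141.4 -> 139.7 t/s`) and wall time (`57.925 -> 58.645 s`); dense
cap64 is worse than default on every reported metric.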
- Keep the cap as an A/B knob only; do not promote it as a default or parity
path.
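The trade-off behind this decision can be checked as relative deltas against the same-window default. The following is a minimal sketch, not part of the fork or the bench tooling: `pct_change`, `SweepRow`, and `print_deltas` are hypothetical names introduced here, and the numbers are copied from the MoE and dense tables above.

```cpp
#include <cstdio>

// Percent change of `value` relative to `base`; negative means lower.
constexpr double pct_change(double base, double value) {
    return (value - base) / base * 100.0;
}

// Hypothetical row type holding the three columns the decision relies on.
struct SweepRow {
    const char * variant;
    double agg_tps;       // aggregate tokens per second
    double ttft_mean_ms;  // mean time to first token
    double wall_s;        // wall-clock seconds
};

// Values copied from the Phase 57 tables (same-window default vs cap32).
constexpr SweepRow moe_default   = {"MoE default",   337.1,  7425.5, 24.299};
constexpr SweepRow moe_cap32     = {"MoE cap32",     335.3,  6994.0, 24.429};
constexpr SweepRow dense_default = {"dense default", 141.4, 22423.5, 57.925};
constexpr SweepRow dense_cap32   = {"dense cap32",   139.7, 20346.5, 58.645};

// Prints the relative change of each metric for one variant versus its baseline.
inline void print_deltas(const SweepRow & base, const SweepRow & v) {
    std::printf("%s vs %s: agg %+.2f%%, TTFT mean %+.2f%%, wall %+.2f%%\n",
                v.variant, base.variant,
                pct_change(base.agg_tps, v.agg_tps),
                pct_change(base.ttft_mean_ms, v.ttft_mean_ms),
                pct_change(base.wall_s, v.wall_s));
}
```

Calling `print_deltas(moe_default, moe_cap32)` and `print_deltas(dense_default, dense_cap32)` from a `main` makes the pattern explicit: mean TTFT drops by several percent in both cases, while aggregate throughput falls and wall time rises, which is why the cap is not a parity path.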


@@ -32,8 +32,11 @@ Read order for a cold start:
> wall `59.272 -> 57.323 s`, with md5/op gates green. Phase56 then showed the
> policy helps dense `n=32` but regresses MoE `n=128` mean TTFT
> `7168.1 -> 7615.3 ms` and aggregate `341.1 -> 339.9`; keep it opt-in only and
> do not default it broadly. Phase57 tried a per-step defer cap; cap32 improved
> MoE mean TTFT in one same-window sweep but still lost aggregate and wall, and
> dense caps lost aggregate. Do not repeat capped-defer sweeps as the next parity
> path. The trace and scheduler commits are local and DGX-gated but not pushed,
> so the LocalAI patch series has not been regenerated.
- Historical verdict: the older investigation marked GB10 parity **CLOSED** and
unreachable. Treat that as superseded where Phase50-54 provide newer dense


@@ -1396,6 +1396,39 @@ and aggregate throughput by `-0.4%`. Do not promote it as a broad default.
Future scheduler work should either narrow the policy to dense/non-MoE shapes or
make the defer condition more selective for MoE.
### Phase 57 capped TTFT defer sweep
Phase57 added `LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER` as an optional per-step cap
on the Phase55 policy. Unset or `0` keeps the Phase55 unlimited behavior.
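A minimal sketch of how the "unset or `0` means unlimited" rule could be read from the environment. This is not the fork's code: `parse_max_defer` and `max_defer_from_env` are hypothetical names, and treating empty, non-numeric, or negative values as unlimited is an assumption made here, since only the unset and `0` cases are documented.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical parser. Returns 0, meaning "unlimited", when the text is
// null, empty, not a clean base-10 integer, or not positive (assumption:
// only unset and "0" are documented as unlimited).
inline int32_t parse_max_defer(const char * text) {
    if (text == nullptr || *text == '\0') {
        return 0;
    }
    char * end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (end == text || *end != '\0') {
        return 0;  // trailing garbage or no digits at all
    }
    if (value <= 0) {
        return 0;
    }
    if (value > INT32_MAX) {
        return INT32_MAX;  // clamp very large caps
    }
    return static_cast<int32_t>(value);
}

// Reads the cap from the environment variable named in this section.
inline int32_t max_defer_from_env() {
    return parse_max_defer(std::getenv("LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER"));
}
```

Keeping the parser separate from the `std::getenv` call lets the unlimited and capped cases be unit-tested without touching the process environment.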
Artifact: `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`.
Pre/post md5 and op gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.
MoE `n=128`, `ptok=128`, `gen=64`:
| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
|---------|---------|-----------------|-------------|--------------|-------------|--------|
| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` |
| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` |
| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` |
| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` |
Dense `n=128`, `ptok=168`, `gen=64`:
| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
|---------|---------|-----------------|-------------|--------------|-------------|--------|
| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` |
| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` |
| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` |
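Decision: reject capped defer as a parity lever. cap32 is the only interesting
MoE point, but it trades lower mean TTFT (`7425.5 -> 6994.0 ms`) for lower
aggregate throughput (`337.1 -> 335.3 t/s`) and higher wall time
(`24.299 -> 24.429 s`). Dense cap32 and cap64 both lose aggregate throughput
and wall time versus the same-window default. Keep the cap as an opt-in A/B
knob only.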
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -0,0 +1,109 @@
# Phase57 TTFT Prefill-First Cap Sweep Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Test whether a per-step cap on `LLAMA_TTFT_PREFILL_FIRST=1` avoids the MoE mean-TTFT regression seen in Phase56 while preserving dense gains.
**Architecture:** Add a small optional cap to the existing default-off Phase55 policy. Unset or zero cap keeps Phase55 unlimited behavior. Gate with focused unit tests, then temporarily apply the stack to DGX for md5/op gates and an A/B cap sweep.
**Tech Stack:** llama.cpp fork, `tools/server/server-admission-policy.h`, `tools/server/server-context.cpp`, DGX GB10, `h2h_cli.py`, `paged-inference-gates.sh`.
---
### Task 1: Add capped helper
- [x] **Step 1: Write red test**
Added test cases for:
- zero cap means unlimited
- below cap defers
- at cap stops deferring
Observed red failure: the helper accepted only three arguments.
- [x] **Step 2: Implement cap helper and env**
Added overload:
```cpp
server_admission_should_defer_decode_for_ttft(enabled, prompt_waiting, n_decoded, deferred_so_far, max_deferred)
```
Added `LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER`. Unset or `0` keeps unlimited
Phase55 behavior.
- [x] **Step 3: Verify local**
Commands passed:
```bash
cmake --build build --target test-server-admission-policy test-server-admission-trace llama-server -j2
./build/bin/test-server-admission-policy
./build/bin/test-server-admission-trace
ctest --test-dir build -R 'test-server-admission-(policy|trace)' --output-on-failure
```
- [x] **Step 4: Commit fork patch**
Local fork commit:
```text
3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral
```
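The cap semantics exercised by the three test cases in Step 1 can be sketched as follows. This is an illustration under stated assumptions, not the fork's implementation: `apply_defer_cap` is a hypothetical name, the uncapped Phase55 decision is passed in as `base_would_defer` because its exact condition is not shown in this plan, and treating a negative cap as unlimited is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: layers the cap on top of an already computed uncapped
// decision. max_deferred == 0 means "no cap" (Phase55 unlimited behavior).
constexpr bool apply_defer_cap(bool base_would_defer,
                               int32_t deferred_so_far,
                               int32_t max_deferred) {
    if (!base_would_defer) {
        return false;  // the uncapped policy would not defer anyway
    }
    if (max_deferred <= 0) {
        return true;   // unset or 0: unlimited (negative is an assumption)
    }
    return deferred_so_far < max_deferred;  // stop once the cap is reached
}

// Mirrors the three red-test cases listed in Task 1, Step 1.
inline void check_cap_cases() {
    assert(apply_defer_cap(true, 1000, 0));  // zero cap means unlimited
    assert(apply_defer_cap(true, 31, 32));   // below cap defers
    assert(!apply_defer_cap(true, 32, 32));  // at cap stops deferring
}
```

In this sketch a cap can only turn a "defer" into a "do not defer", never the reverse, so a capped run cannot defer more decodes than the unlimited policy would under the same conditions.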
### Task 2: DGX gate and cap sweep
- [x] **Step 1: Preflight and build**
Preflight: docker `0`, `local-ai-worker` `0`, compute `0`, lock
`FREE released-by-codex-phase56-validation 1782900217`, clean mirror at
`2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`.
Applied `/tmp/phase57-ttft-cap-stack.patch`, built focused tests,
`llama-server`, `llama-cli`, and `test-backend-ops`. DGX focused CTests passed.
- [x] **Step 2: Run pre/post gates**
Artifact: `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`.
Pre and post gates matched:
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT` `1146/1146`
- `MUL_MAT_ID` `806/806`
- [x] **Step 3: Run MoE cap sweep**
MoE `n=128`, `ptok=128`, `gen=64`:
| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` | `0` |
| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` | `111` |
| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` | `236` |
| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` | `339` |
- [x] **Step 4: Run dense cap sweep**
Dense `n=128`, `ptok=168`, `gen=64`:
| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` | `0` |
| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` | `322` |
| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` | `490` |
- [x] **Step 5: Revert DGX stack**
Reverted the temporary patch stack, removed introduced files, and released the
lock as `FREE released-by-codex-phase57-cap 1782901003`.
### Task 3: Decision
- [x] **Step 1: Record outcome**
Decision: reject the cap as a parity lever. MoE cap32 improves mean TTFT versus
same-window default (`7425.5 -> 6994.0 ms`) but still slightly loses aggregate
(`337.1 -> 335.3 t/s`) and wall (`24.299 -> 24.429 s`). Dense caps lose
aggregate versus the same-window default, and dense cap64 is worse than default
on every reported metric.