diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index f53702f20..07f2187b9 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3199,3 +3199,51 @@ Decision:
 The next scheduler work should either narrow the policy to dense/non-MoE
 shapes or add a more selective condition that avoids the MoE mean-TTFT
 regression.
+
+## Phase 57 TTFT Prefill-First Cap Sweep
+
+Phase 57 adds an optional per-step cap to the Phase55 opt-in policy:
+`LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER`. Unset or `0` preserves the Phase55
+unlimited behavior. The goal was to keep some first-token relief while avoiding
+the MoE `n=128` mean-TTFT regression from Phase56.
+
+Fork commit:
+
+- `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral`
+
+Artifact:
+
+- `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`
+
+Pre/post gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` | `0` |
+| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` | `111` |
+| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` | `236` |
+| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` | `339` |
+
+Dense `n=128`, `ptok=168`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` | `0` |
+| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` | `322` |
+| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` | `490` |
+
+Decision:
+
+- Reject capped TTFT defer as a parity lever. MoE cap32 improves mean TTFT
+  versus same-window default (`7425.5 -> 6994.0 ms`) but still loses aggregate
+  throughput and wall time. Dense caps improve or preserve TTFT only by losing
+  aggregate throughput and wall time.
+- Keep the cap as an A/B knob only; do not promote it as a default or parity
+  path.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index f231cfcb3..1340f39c2c 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -32,8 +32,11 @@ Read order for a cold start:
 > wall `59.272 -> 57.323 s`, with md5/op gates green. Phase56 then showed the
 > policy helps dense `n=32` but regresses MoE `n=128` mean TTFT
 > `7168.1 -> 7615.3 ms` and aggregate `341.1 -> 339.9`; keep it opt-in only and
-> do not default it broadly. The trace and scheduler commits are local and
-> DGX-gated but not pushed, so the LocalAI patch series has not been regenerated.
+> do not default it broadly. Phase57 tried a per-step defer cap; cap32 improved
+> MoE mean TTFT in one same-window sweep but still lost aggregate and wall, and
+> dense caps lost aggregate. Do not repeat capped-defer sweeps as the next parity
+> path. The trace and scheduler commits are local and DGX-gated but not pushed,
+> so the LocalAI patch series has not been regenerated.
 
 - Historical verdict: the older investigation marked GB10 parity **CLOSED** and
   unreachable. Treat that as superseded where Phase50-54 provide newer dense
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 620885924..fcb81cff2 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1396,6 +1396,39 @@ and aggregate throughput by `-0.4%`. Do not promote it as a broad default.
 Future scheduler work should either narrow the policy to dense/non-MoE
 shapes or make the defer condition more selective for MoE.
 
+### Phase 57 capped TTFT defer sweep
+
+Phase57 added `LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER` as an optional per-step cap
+on the Phase55 policy. Unset or `0` keeps the Phase55 unlimited behavior.
+Artifact: `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`.
+
+Pre/post md5 and op gates stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|
+| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` |
+| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` |
+| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` |
+| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` |
+
+Dense `n=128`, `ptok=168`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|
+| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` |
+| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` |
+| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` |
+
+Decision: reject capped defer as a parity lever. cap32 is the only interesting
+MoE point, but it trades lower mean TTFT for lower aggregate throughput and
+higher wall time. Dense caps also lose aggregate. Keep the cap as an opt-in A/B
+knob only.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-ttft-prefill-first-cap-phase57.md b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-cap-phase57.md
new file mode 100644
index 000000000..5d8b2c168
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-cap-phase57.md
@@ -0,0 +1,109 @@
+# Phase57 TTFT Prefill-First Cap Sweep Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Test whether a per-step cap on `LLAMA_TTFT_PREFILL_FIRST=1` avoids the MoE mean-TTFT regression seen in Phase56 while preserving dense gains.
+
+**Architecture:** Add a small optional cap to the existing default-off Phase55 policy. Unset or zero cap keeps Phase55 unlimited behavior. Gate with focused unit tests, then temporarily apply the stack to DGX for md5/op gates and an A/B cap sweep.
+
+**Tech Stack:** llama.cpp fork, `tools/server/server-admission-policy.h`, `tools/server/server-context.cpp`, DGX GB10, `h2h_cli.py`, `paged-inference-gates.sh`.
+
+---
+
+### Task 1: Add capped helper
+
+- [x] **Step 1: Write red test**
+
+Added test cases for:
+
+- zero cap means unlimited
+- below cap defers
+- at cap stops deferring
+
+Observed red failure: the helper accepted only three arguments.
+
+- [x] **Step 2: Implement cap helper and env**
+
+Added overload:
+
+```cpp
+server_admission_should_defer_decode_for_ttft(enabled, prompt_waiting, n_decoded, deferred_so_far, max_deferred)
+```
+
+Added `LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER`. Unset or `0` keeps unlimited
+Phase55 behavior.
+
+- [x] **Step 3: Verify local**
+
+Commands passed:
+
+```bash
+cmake --build build --target test-server-admission-policy test-server-admission-trace llama-server -j2
+./build/bin/test-server-admission-policy
+./build/bin/test-server-admission-trace
+ctest --test-dir build -R 'test-server-admission-(policy|trace)' --output-on-failure
+```
+
+- [x] **Step 4: Commit fork patch**
+
+Local fork commit:
+
+```text
+3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral
+```
+
+### Task 2: DGX gate and cap sweep
+
+- [x] **Step 1: Preflight and build**
+
+Preflight: docker `0`, `local-ai-worker` `0`, compute `0`, lock
+`FREE released-by-codex-phase56-validation 1782900217`, clean mirror at
+`2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`.
+
+Applied `/tmp/phase57-ttft-cap-stack.patch`, built focused tests,
+`llama-server`, `llama-cli`, and `test-backend-ops`. DGX focused CTests passed.
+
+- [x] **Step 2: Run pre/post gates**
+
+Artifact: `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`.
+
+Pre and post gates matched:
+
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT` `1146/1146`
+- `MUL_MAT_ID` `806/806`
+
+- [x] **Step 3: Run MoE cap sweep**
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` | `0` |
+| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` | `111` |
+| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` | `236` |
+| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` | `339` |
+
+- [x] **Step 4: Run dense cap sweep**
+
+Dense `n=128`, `ptok=168`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` | `0` |
+| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` | `322` |
+| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` | `490` |
+
+- [x] **Step 5: Revert DGX stack**
+
+Reverted the temporary patch stack, removed introduced files, and released the
+lock as `FREE released-by-codex-phase57-cap 1782901003`.
+
+### Task 3: Decision
+
+- [x] **Step 1: Record outcome**
+
+Decision: reject the cap as a parity lever. MoE cap32 improves mean TTFT versus
+same-window default but still slightly loses aggregate and wall. Dense caps lose
+aggregate versus the same-window default, and cap64 is broadly worse.
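
The capped helper described in Task 1 of the plan could look roughly like the sketch below. It is an illustration and not the fork's code: only the parameter names (`enabled`, `prompt_waiting`, `n_decoded`, `deferred_so_far`, `max_deferred`) and the env name `LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER` come from the plan. The function names `ttft_should_defer_sketch` and `parse_max_defer_sketch`, the parameter types, the reading of `n_decoded > 0` as "this slot already has a token", and the treatment of negative or unparsable values as "unlimited" are all assumptions.

```cpp
#include <cassert>   // available to callers that want runtime checks
#include <cstdint>
#include <cstdlib>
#include <limits>

// Hypothetical stand-in for the five-argument overload named in the plan.
// Types and body are assumptions; only the parameter names are from the source.
constexpr bool ttft_should_defer_sketch(bool enabled, bool prompt_waiting, std::int32_t n_decoded,
                                        std::int32_t deferred_so_far, std::int32_t max_deferred) {
    // Policy off, or no prompt waiting for prefill: never defer decode.
    if (!enabled || !prompt_waiting) {
        return false;
    }
    // Assumption: only a slot that has already produced a token may be deferred.
    if (n_decoded <= 0) {
        return false;
    }
    // A zero cap (or a negative one, by assumption) keeps the unlimited behavior.
    if (max_deferred <= 0) {
        return true;
    }
    // Below the cap: defer. At or above the cap: stop deferring.
    return deferred_so_far < max_deferred;
}

// The three red-test cases listed in the plan, checked at compile time.
static_assert( ttft_should_defer_sketch(true, true, 1, 1000, 0), "zero cap means unlimited");
static_assert( ttft_should_defer_sketch(true, true, 1, 31, 32),  "below cap defers");
static_assert(!ttft_should_defer_sketch(true, true, 1, 32, 32),  "at cap stops deferring");

// Hypothetical parser for the env value. Unset, empty, non-numeric, zero, and
// negative inputs all map to 0, which the helper above treats as "unlimited".
inline std::int32_t parse_max_defer_sketch(const char * value) {
    if (value == nullptr || *value == '\0') {
        return 0;
    }
    char * end = nullptr;
    const long parsed = std::strtol(value, &end, 10);
    if (end == value || parsed <= 0) {
        return 0;
    }
    if (parsed > std::numeric_limits<std::int32_t>::max()) {
        return std::numeric_limits<std::int32_t>::max();
    }
    return static_cast<std::int32_t>(parsed);
}
```

A caller would read the cap once, for example with `parse_max_defer_sketch(std::getenv("LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER"))`, and pass it as `max_deferred` together with a per-step deferral counter. Under this sketch a cap of `32` allows 32 deferrals in a step and refuses the 33rd request.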