From 902bcc7717231a3fb2ad09a835da4762208ae370 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 10:05:23 +0000
Subject: [PATCH] docs(paged): validate TTFT prefill-first A/B

Record Phase56 MoE and lower-concurrency validation for the TTFT
prefill-first policy, including DGX gates and the opt-in-only decision.

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        |  57 +++++++++
 .../docs/PARITY_HANDOFF.md                    |   9 +-
 .../docs/VLLM_PARITY_LEVER_MAP.md             |  31 +++++
 ...1-ttft-prefill-first-validation-phase56.md | 119 ++++++++++++++++++
 4 files changed, 212 insertions(+), 4 deletions(-)
 create mode 100644 docs/superpowers/plans/2026-07-01-ttft-prefill-first-validation-phase56.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 8a7e5ec4b..f53702f20 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3142,3 +3142,60 @@ Mirror status:
 - The Phase55 fork commit is local and DGX-gated.
 - The LocalAI `patches/paged/` series is not regenerated yet because the fork
   branch still requires explicit push approval first.
+
+## Phase 56 TTFT Prefill-First Validation
+
+Phase 56 validates the Phase55 opt-in policy outside dense `n=128`. It makes no
+code changes; the same Phase51+Phase54+Phase55 stack was applied temporarily to
+the clean DGX mirror and reverted after the run.
+
+Artifact:
+
+- `/home/mudler/bench/phase56_ttft_prefill_first_validation/20260701_115852`
+
+Pre/post gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred decode slots |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|-----------------------|
+| default | `341.1` | `651.2` | `1555.9` | `7168.1` | `11435.5` | `24.015` | `0` |
+| `LLAMA_TTFT_PREFILL_FIRST=1` | `339.9` | `623.8` | `1622.7` | `7615.3` | `10964.4` | `24.098` | `441` |
+
+MoE deltas:
+
+- Aggregate throughput: `-0.4%`
+- Prefill throughput: `+4.3%`
+- Mean TTFT: `+6.2%`
+- Max TTFT: `-4.1%`
+- Wall time: `+0.3%`
+
+Dense `n=32`, `ptok=168`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred decode slots |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|-----------------------|
+| default | `104.3` | `197.1` | `617.2` | `7687.7` | `9234.4` | `19.627` | `0` |
+| `LLAMA_TTFT_PREFILL_FIRST=1` | `106.7` | `193.5` | `662.1` | `7284.3` | `8609.1` | `19.194` | `34` |
+
+Dense `n=32` deltas:
+
+- Aggregate throughput: `+2.3%`
+- Prefill throughput: `+7.3%`
+- Mean TTFT: `-5.2%`
+- Max TTFT: `-6.8%`
+- Wall time: `-2.2%`
+
+Decision:
+
+- Keep `LLAMA_TTFT_PREFILL_FIRST=1` as an opt-in A/B only. It helps dense
+  `n=128` and dense `n=32`, but MoE `n=128` regresses mean TTFT and slightly
+  regresses aggregate throughput.
+- Do not make this policy default-on or promote it as a universal parity lever.
+  The next scheduler work should either narrow the policy to dense/non-MoE
+  shapes or add a more selective condition that avoids the MoE mean-TTFT
+  regression.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index e4d116d8c..f231cfcb3 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -29,10 +29,11 @@ Read order for a cold start:
 > (`decode_hist=128-255:53`). Phase55 implemented that targeted
 > first-token A/B as `LLAMA_TTFT_PREFILL_FIRST=1`: on dense `n=128` it improved
 > aggregate throughput `138.2 -> 142.9`, mean TTFT `23231.9 -> 21520.8 ms`, and
-> wall `59.272 -> 57.323 s`, with md5/op gates green. Next scheduler work should
-> test the same opt-in policy on MoE and another concurrency point. The trace and
-> scheduler commits are local and DGX-gated but not pushed, so the LocalAI patch
-> series has not been regenerated.
+> wall `59.272 -> 57.323 s`, with md5/op gates green. Phase56 then showed the
+> policy helps dense `n=32` but regresses MoE `n=128` mean TTFT
+> `7168.1 -> 7615.3 ms` and aggregate `341.1 -> 339.9`; keep it opt-in only and
+> do not default it broadly. The trace and scheduler commits are local and
+> DGX-gated but not pushed, so the LocalAI patch series has not been regenerated.
 
 - Historical verdict: the older investigation marked GB10 parity **CLOSED** and
   unreachable. Treat that as superseded where Phase50-54 provide newer dense
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index b02dfbf2d..620885924 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1365,6 +1365,37 @@ the policy shifts early compute from token 2+ decode to first-token prompt
 admission.
 Before any default-on discussion, test MoE serving and at least one additional concurrency point.
 
+### Phase 56 TTFT prefill-first validation
+
+Phase56 made no code changes. It reapplied the Phase55 stack temporarily on DGX
+and tested the opt-in policy on MoE `n=128` and dense `n=32`. Artifact:
+`/home/mudler/bench/phase56_ttft_prefill_first_validation/20260701_115852`.
+
+Pre/post md5 and op gates stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|
+| default | `341.1` | `651.2` | `1555.9` | `7168.1` | `11435.5` | `24.015` |
+| `LLAMA_TTFT_PREFILL_FIRST=1` | `339.9` | `623.8` | `1622.7` | `7615.3` | `10964.4` | `24.098` |
+
+Dense `n=32`, `ptok=168`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|
+| default | `104.3` | `197.1` | `617.2` | `7687.7` | `9234.4` | `19.627` |
+| `LLAMA_TTFT_PREFILL_FIRST=1` | `106.7` | `193.5` | `662.1` | `7284.3` | `8609.1` | `19.194` |
+
+Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` opt-in only. It helps dense
+serving at `n=128` and `n=32`, but MoE `n=128` regresses mean TTFT by `+6.2%`
+and aggregate throughput by `-0.4%`. Do not promote it as a broad default.
+Future scheduler work should either narrow the policy to dense/non-MoE shapes or
+make the defer condition more selective for MoE.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-ttft-prefill-first-validation-phase56.md b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-validation-phase56.md
new file mode 100644
index 000000000..60c3c41ea
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-validation-phase56.md
@@ -0,0 +1,119 @@
+# Phase56 TTFT Prefill-First Validation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Validate the Phase55 default-off `LLAMA_TTFT_PREFILL_FIRST=1` scheduler A/B beyond dense `n=128` before any default-on discussion.
+
+**Architecture:** Do not change code. Temporarily apply the already-local Phase51+Phase54+Phase55 fork stack to the clean DGX mirror, reuse the gated `build-cuda` path, bracket runs with md5/op gates, then compare default vs opt-in on MoE `n=128` and dense lower-concurrency `n=32`.
+
+**Tech Stack:** DGX GB10, llama.cpp `build-cuda`, `LLAMA_SERVING_TRACE=1`, `LLAMA_TTFT_PREFILL_FIRST=1`, `h2h_cli.py`, `paged-inference-gates.sh`.
+
+---
+
+### Task 1: Prepare DGX Stack
+
+- [x] **Step 1: Preflight**
+
+Require: Docker `0`, `local-ai-worker` `0`, GPU compute apps `0`, lock `FREE*`,
+and clean `~/llama-phase6-source`.
+
+Observed: docker `0`, `local-ai-worker` `0`, compute `0`, lock
+`FREE released-by-codex-phase55-ttft 1782899730`, DGX mirror clean at
+`2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`.
+
+- [x] **Step 2: Apply stack and build**
+
+Apply `/tmp/phase55-ttft-prefill-first-stack.patch` or regenerate the same stack
+from `/home/mudler/_git/llama.cpp`. Reconfigure CMake if needed, then build
+`llama-server`, `llama-cli`, and `test-backend-ops`.
+
+Observed: stack applied, CMake reconfigured, and requested targets built.
+
+### Task 2: Gate Before Validation
+
+- [x] **Step 1: Run canonical pre-validation gate**
+
+Expected:
+
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT` `1146/1146`
+- `MUL_MAT_ID` `806/806`
+
+Observed: all expected pre-validation gates matched.
+
+### Task 3: Run A/B Matrix
+
+- [x] **Step 1: Run MoE `n=128` default and opt-in**
+
+Model: `~/bench/q36-35b-a3b-nvfp4.gguf`.
+Shape: `--parallel 128`, `-c 131072`, `-b 2048`, `-ub 512`, `n=128`,
+`ptok=128`, `gen=64`.
+
+Default:
+
+```json
+{"n": 128, "reqs": 128, "gen_total": 8191, "prompt_tok_total": 17793, "gen_per_req": 64.0, "agg_tps": 341.1, "decode_agg_tps": 651.2, "decode_perseq_tps": 3.93, "prefill_tps": 1555.9, "ttft_mean_ms": 7168.1, "ttft_max_ms": 11435.5, "wall_s": 24.015}
+```
+
+`LLAMA_TTFT_PREFILL_FIRST=1`:
+
+```json
+{"n": 128, "reqs": 128, "gen_total": 8192, "prompt_tok_total": 17793, "gen_per_req": 64.0, "agg_tps": 339.9, "decode_agg_tps": 623.8, "decode_perseq_tps": 3.92, "prefill_tps": 1622.7, "ttft_mean_ms": 7615.3, "ttft_max_ms": 10964.4, "wall_s": 24.098}
+```
+
+- [x] **Step 2: Run dense `n=32` default and opt-in**
+
+Model: `~/bench/q36-27b-nvfp4.gguf`.
+Shape: `--parallel 128`, `-c 131072`, `-b 2048`, `-ub 512`, `n=32`,
+`ptok=168`, `gen=64`.
+
+Default:
+
+```json
+{"n": 32, "reqs": 32, "gen_total": 2048, "prompt_tok_total": 5700, "gen_per_req": 64.0, "agg_tps": 104.3, "decode_agg_tps": 197.1, "decode_perseq_tps": 5.42, "prefill_tps": 617.2, "ttft_mean_ms": 7687.7, "ttft_max_ms": 9234.4, "wall_s": 19.627}
+```
+
+`LLAMA_TTFT_PREFILL_FIRST=1`:
+
+```json
+{"n": 32, "reqs": 32, "gen_total": 2048, "prompt_tok_total": 5700, "gen_per_req": 64.0, "agg_tps": 106.7, "decode_agg_tps": 193.5, "decode_perseq_tps": 5.37, "prefill_tps": 662.1, "ttft_mean_ms": 7284.3, "ttft_max_ms": 8609.1, "wall_s": 19.194}
+```
+
+### Task 4: Gate After Validation and Clean DGX
+
+- [x] **Step 1: Run canonical post-validation gate**
+
+Expected md5/op values match Task 2.
+
+Observed: all expected post-validation gates matched.
+
+- [x] **Step 2: Revert temporary DGX stack**
+
+Reverse the patch, remove untracked files introduced by the stack, release the
+lock, and verify no compute apps remain.
+
+Observed: stack reverted, introduced files removed, lock released as
+`FREE released-by-codex-phase56-validation 1782900217`, and no compute apps
+were reported.
+
+### Task 5: Record Decision
+
+- [x] **Step 1: Update parity docs**
+
+Record the artifact, all A/B rows, trace counters, gates, and whether the policy
+remains promising, is rejected, or needs narrower gating.
+
+Decision: keep the policy opt-in only. Dense `n=32` improved aggregate and TTFT,
+but MoE `n=128` slightly regressed aggregate and mean TTFT, so the policy is not
+safe as a broad default.
+
+- [x] **Step 2: Commit LocalAI docs**
+
+Use:
+
+```text
+docs(paged): validate TTFT prefill-first A/B
+
+Assisted-by: Codex:gpt-5
+```
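
Reviewer note (not part of the patch): the percentage deltas quoted in the docs can be rechecked from the JSON rows recorded in the plan. The sketch below is illustrative only; it assumes each delta is the opt-in value relative to the default value, rounded to one decimal place. The names `METRICS` and `pct_deltas` and the dictionary constants are hypothetical helpers, not part of `h2h_cli.py` or the repository.

```python
# Hypothetical helper for rechecking the Phase56 deltas; not part of the repo.
# Field names and values are copied from the JSON rows recorded in the plan.
METRICS = ["agg_tps", "prefill_tps", "ttft_mean_ms", "ttft_max_ms", "wall_s"]

MOE_DEFAULT = {"agg_tps": 341.1, "prefill_tps": 1555.9, "ttft_mean_ms": 7168.1,
               "ttft_max_ms": 11435.5, "wall_s": 24.015}
MOE_OPT_IN = {"agg_tps": 339.9, "prefill_tps": 1622.7, "ttft_mean_ms": 7615.3,
              "ttft_max_ms": 10964.4, "wall_s": 24.098}

DENSE_DEFAULT = {"agg_tps": 104.3, "prefill_tps": 617.2, "ttft_mean_ms": 7687.7,
                 "ttft_max_ms": 9234.4, "wall_s": 19.627}
DENSE_OPT_IN = {"agg_tps": 106.7, "prefill_tps": 662.1, "ttft_mean_ms": 7284.3,
                "ttft_max_ms": 8609.1, "wall_s": 19.194}


def pct_deltas(default, opt_in):
    """Percent change of opt_in relative to default, rounded to one decimal.

    Assumption: this is how the docs derived their delta lists.
    """
    return {
        m: round((opt_in[m] - default[m]) / default[m] * 100.0, 1)
        for m in METRICS
    }


if __name__ == "__main__":
    for label, base, new in (
        ("MoE n=128", MOE_DEFAULT, MOE_OPT_IN),
        ("dense n=32", DENSE_DEFAULT, DENSE_OPT_IN),
    ):
        print(label, pct_deltas(base, new))
```

Under that assumption the computed values agree with the delta lists in `GB10_PARITY_PHASE0_RESULTS.md`. Note that positive is an improvement for the throughput fields and a regression for the TTFT and wall-time fields.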