mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): reject capped TTFT defer sweep
Record Phase57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path.

Assisted-by: Codex:gpt-5
@@ -3199,3 +3199,51 @@ Decision:
The next scheduler work should either narrow the policy to dense/non-MoE
shapes or add a more selective condition that avoids the MoE mean-TTFT
regression.

## Phase 57 TTFT Prefill-First Cap Sweep

Phase 57 adds an optional per-step cap to the Phase55 opt-in policy:
`LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER`. Unset or `0` preserves the Phase55
unlimited behavior. The goal was to keep some first-token relief while avoiding
the MoE `n=128` mean-TTFT regression from Phase56.
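
The cap semantics described above can be sketched in a few lines. This is a minimal illustration, not the fork's server code: only the variable name `LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER` and the "unset or `0` means unlimited" rule come from these notes. The helper names `read_max_defer` and `split_decodes`, the handling of unparsable or negative values, and the choice of which pending decodes get deferred are all assumptions.

```python
import os

ENV_NAME = "LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER"


def read_max_defer(env=None):
    """Return the per-step defer cap; 0 means "no cap" (Phase55 behavior)."""
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME, "").strip()
    try:
        value = int(raw)
    except ValueError:
        return 0  # assumption: unset or unparsable values behave like 0
    return max(value, 0)  # assumption: negative values also mean "no cap"


def split_decodes(pending, max_defer):
    """Split one step's pending decode slots into (run_now, deferred).

    Hypothetical model: with no cap, every pending decode may be deferred
    behind prefill; with a cap, at most max_defer are deferred per step.
    """
    pending = list(pending)
    if max_defer == 0:
        return [], pending
    return pending[max_defer:], pending[:max_defer]
```

For example, `split_decodes(range(5), 2)` defers the first two slots and lets the remaining three decode in the same step, whereas a cap of `0` defers all five.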

Fork commit:

- `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral`

Artifact:

- `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`

Pre/post gates:

| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |

MoE `n=128`, `ptok=128`, `gen=64`:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` | `0` |
| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` | `111` |
| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` | `236` |
| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` | `339` |

Dense `n=128`, `ptok=168`, `gen=64`:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` | `0` |
| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` | `322` |
| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` | `490` |

Decision:

- Reject capped TTFT defer as a parity lever. MoE cap32 improves mean TTFT
  versus same-window default (`7425.5 -> 6994.0 ms`) but still lowers aggregate
  throughput (`337.1 -> 335.3`) and lengthens wall time (`24.299 -> 24.429 s`).
  Dense cap32 improves mean TTFT (`22423.5 -> 20346.5 ms`) and cap64 leaves it
  roughly flat, but both lower aggregate throughput and lengthen wall time.
- Keep the cap as an A/B knob only; do not promote it as a default or parity
  path.
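
The trade-off behind this decision is easier to see as relative changes against the same-window default. The sketch below copies the aggregate throughput, mean TTFT, and wall-time columns from the two tables in this section and computes percentage deltas. The `is_parity_lever` rule (no metric may be worse than default) is a hypothetical acceptance criterion written for illustration; the notes do not state the exact rule used.

```python
# (agg t/s, TTFT mean ms, wall s) per variant, copied from the tables above.
MOE = {
    "default": (337.1, 7425.5, 24.299),
    "cap16": (330.2, 7589.4, 24.802),
    "cap32": (335.3, 6994.0, 24.429),
    "cap64": (327.1, 7533.2, 25.025),
}
DENSE = {
    "default": (141.4, 22423.5, 57.925),
    "cap32": (139.7, 20346.5, 58.645),
    "cap64": (136.3, 22461.1, 60.081),
}


def pct_change(base, new):
    """Relative change of new versus base, in percent."""
    return (new - base) / base * 100.0


def deltas(table, variant):
    """Percent change of (agg, TTFT mean, wall) versus the default row."""
    base, cand = table["default"], table[variant]
    return tuple(pct_change(b, c) for b, c in zip(base, cand))


def is_parity_lever(table, variant):
    """Hypothetical rule: accept only if no metric is worse than default."""
    d_agg, d_ttft, d_wall = deltas(table, variant)
    return d_agg >= 0 and d_ttft <= 0 and d_wall <= 0


if __name__ == "__main__":
    for name, table in (("MoE", MOE), ("dense", DENSE)):
        for variant in table:
            if variant == "default":
                continue
            d_agg, d_ttft, d_wall = deltas(table, variant)
            print(f"{name} {variant}: agg {d_agg:+.2f}%, "
                  f"TTFT mean {d_ttft:+.2f}%, wall {d_wall:+.2f}%, "
                  f"accept={is_parity_lever(table, variant)}")
```

Under this rule no capped variant is accepted: every cap lowers aggregate throughput and raises wall time, including the two cap32 runs that improve mean TTFT.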
@@ -32,8 +32,11 @@ Read order for a cold start:
> wall `59.272 -> 57.323 s`, with md5/op gates green. Phase56 then showed the
> policy helps dense `n=32` but regresses MoE `n=128` mean TTFT
> `7168.1 -> 7615.3 ms` and aggregate `341.1 -> 339.9`; keep it opt-in only and
-> do not default it broadly. The trace and scheduler commits are local and
-> DGX-gated but not pushed, so the LocalAI patch series has not been regenerated.
+> do not default it broadly. Phase57 tried a per-step defer cap; cap32 improved
+> MoE mean TTFT in one same-window sweep but still lost aggregate and wall, and
+> dense caps lost aggregate. Do not repeat capped-defer sweeps as the next parity
+> path. The trace and scheduler commits are local and DGX-gated but not pushed,
+> so the LocalAI patch series has not been regenerated.

- Historical verdict: the older investigation marked GB10 parity **CLOSED** and
  unreachable. Treat that as superseded where Phase50-54 provide newer dense
@@ -1396,6 +1396,39 @@ and aggregate throughput by `-0.4%`. Do not promote it as a broad default.
Future scheduler work should either narrow the policy to dense/non-MoE shapes or
make the defer condition more selective for MoE.

### Phase 57 capped TTFT defer sweep

Phase57 added `LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER` as an optional per-step cap
on the Phase55 policy. Unset or `0` keeps the Phase55 unlimited behavior.
Artifact: `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`.

Pre/post md5 and op gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.
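
As a rough illustration of what "green" means here, the following sketch compares a pre-run and a post-run gate record. It assumes that the md5 gates pass when the digests are unchanged between runs and that an op entry such as `1146/1146` reads as passed/total; neither convention is spelled out in these notes. The `gates_green` helper and the record layout are hypothetical.

```python
def gates_green(pre, post):
    """True when md5s are unchanged and every op test passed in both runs."""
    for key in ("moe_md5", "dense_md5"):
        if pre[key] != post[key]:
            return False  # output digest drifted between pre and post
    for run in (pre, post):
        for op in ("MUL_MAT", "MUL_MAT_ID"):
            # assumption: "passed/total" format, e.g. "1146/1146"
            passed, total = (int(part) for part in run[op].split("/"))
            if passed != total:
                return False
    return True


# Values copied from the gate summary above.
PRE = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT": "1146/1146",
    "MUL_MAT_ID": "806/806",
}
POST = dict(PRE)  # the post-run gates reported identical values
```

With these records `gates_green(PRE, POST)` holds; changing either digest or dropping a single op pass would flip it to `False`.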

MoE `n=128`, `ptok=128`, `gen=64`:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
|---------|---------|-----------------|-------------|--------------|-------------|--------|
| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` |
| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` |
| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` |
| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` |

Dense `n=128`, `ptok=168`, `gen=64`:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
|---------|---------|-----------------|-------------|--------------|-------------|--------|
| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` |
| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` |
| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` |

Decision: reject capped defer as a parity lever. cap32 is the only interesting
MoE point, but it trades lower mean TTFT for lower aggregate throughput and
higher wall time. Dense caps also lose aggregate. Keep the cap as an opt-in A/B
knob only.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update