mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record waiting-threshold TTFT defer
Record Phase58 prompt-backlog threshold A/B, DGX gates, MoE and dense serving results, and the repeat-before-default decision. Assisted-by: Codex:gpt-5
@@ -3247,3 +3247,53 @@ Decision:
aggregate throughput and wall time.
- Keep the cap as an A/B knob only; do not promote it as a default or parity
  path.

## Phase 58 TTFT Prefill-First Waiting Threshold

Phase 58 adds `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING`, a prompt-backlog threshold
for the Phase55 opt-in policy. Unset or `0` preserves prior behavior. The goal
was to activate decode deferral only during high prompt-backlog windows, rather
than for the entire lifetime of the prompt backlog.
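
As a rough illustration of the "unset or `0` preserves prior behavior" rule, the following Python sketch models how such a threshold could be read from the environment. It is not the fork's code: the real parsing lives in the C++ server and is not shown in this commit, `read_min_waiting` is a hypothetical name, and the handling of malformed or negative values is an assumption made only for this example.

```python
import os

ENV_NAME = "LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING"


def read_min_waiting(environ=os.environ) -> int:
    """Return the prompt-backlog threshold; 0 means no threshold (prior behavior)."""
    raw = environ.get(ENV_NAME, "").strip()
    if not raw:
        return 0  # unset or empty: keep prior behavior
    try:
        value = int(raw)
    except ValueError:
        return 0  # assumption: malformed values fall back to prior behavior
    return max(value, 0)  # assumption: negative values are treated as 0


if __name__ == "__main__":
    print(read_min_waiting({}))                # unset
    print(read_min_waiting({ENV_NAME: "32"}))  # the min32 variant
```

Passing a plain dict instead of `os.environ` keeps the sketch testable without touching the process environment.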

Fork commit:

- `8759213e3 feat(server): gate TTFT defer by prompt backlog`

Artifact:

- `/home/mudler/bench/phase58_ttft_waiting_sweep/20260701_122052`

Pre/post gates:

| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
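
The gate criterion in the table (identical md5s before and after, and every op test passing) can be expressed as a small check. This is a minimal sketch: `gates_match` and the dictionary layout are hypothetical and are not taken from `paged-inference-gates.sh`, whose actual output format is not shown here. The values are copied from the table.

```python
from typing import Dict


def gates_match(pre: Dict[str, str], post: Dict[str, str]) -> bool:
    """True when pre/post gate results are identical and all op tests passed."""
    if pre != post:
        return False  # any md5 or op-count drift fails the gate
    for key in ("MUL_MAT", "MUL_MAT_ID"):
        passed, total = post[key].split("/")
        if int(passed) != int(total):
            return False  # some backend-op cases failed
    return True


pre = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT": "1146/1146",
    "MUL_MAT_ID": "806/806",
}
post = dict(pre)  # the post row in the table is identical to the pre row

print(gates_match(pre, post))  # True
```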

MoE `n=128`, `ptok=128`, `gen=64`:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `339.0` | `648.4` | `1542.9` | `7743.1` | `11532.5` | `24.167` | `0` |
| min24 | `339.9` | `619.3` | `1637.0` | `7326.6` | `10868.8` | `24.095` | `323` |
| min32 | `341.9` | `635.0` | `1609.6` | `7420.1` | `11054.6` | `23.950` | `220` |
| min32+cap32 | `331.2` | `631.8` | `1512.1` | `7829.2` | `11767.1` | `24.733` | `140` |

Dense `n=128`, `ptok=168`, `gen=64`:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `140.3` | `362.7` | `639.8` | `21407.3` | `35811.6` | `58.399` | `0` |
| min24 | `140.4` | `347.6` | `658.7` | `22078.2` | `34783.3` | `58.353` | `420` |
| min32 | `139.7` | `350.2` | `650.1` | `21221.5` | `35246.3` | `58.642` | `386` |

Decision:

- Keep `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` as the best selective
  scheduler A/B so far, but still opt-in. On MoE `n=128`, min32 improved
  aggregate throughput (`339.0 -> 341.9`), mean TTFT (`7743.1 -> 7420.1 ms`),
  max TTFT (`11532.5 -> 11054.6 ms`), and wall time (`24.167 -> 23.950 s`).
- Dense `n=128` min32 was mixed: mean TTFT (`21407.3 -> 21221.5 ms`) and max
  TTFT (`35811.6 -> 35246.3 ms`) improved slightly, but aggregate
  (`140.3 -> 139.7`) and wall (`58.399 -> 58.642 s`) regressed slightly. Do not
  enable it by default yet.
- Next step: repeat the MoE min32 result and run the matching vLLM h2h
  comparison before treating this as real parity progress rather than run noise.
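
To put the min32 movement in relative terms, the sketch below computes percent changes from the default and min32 rows of the two tables. `pct_change` is a hypothetical helper written for this note and is not part of `h2h_cli.py`. The inputs are copied from the tables and no new measurements are introduced.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (default, min32) pairs copied from the MoE and dense tables.
moe = {
    "agg t/s": (339.0, 341.9),
    "TTFT mean ms": (7743.1, 7420.1),
    "TTFT max ms": (11532.5, 11054.6),
    "wall s": (24.167, 23.950),
}
dense = {
    "agg t/s": (140.3, 139.7),
    "TTFT mean ms": (21407.3, 21221.5),
    "TTFT max ms": (35811.6, 35246.3),
    "wall s": (58.399, 58.642),
}

for name, table in (("MoE", moe), ("dense", dense)):
    for metric, (default, min32) in table.items():
        print(f"{name:5s} {metric:13s} {pct_change(default, min32):+.2f}%")
```

On these single-run numbers the aggregate and wall changes are under 1% in both directions, which is consistent with the decision to repeat the run before reading it as progress.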
@@ -34,9 +34,13 @@ Read order for a cold start:
 > `7168.1 -> 7615.3 ms` and aggregate `341.1 -> 339.9`; keep it opt-in only and
 > do not default it broadly. Phase57 tried a per-step defer cap; cap32 improved
 > MoE mean TTFT in one same-window sweep but still lost aggregate and wall, and
-> dense caps lost aggregate. Do not repeat capped-defer sweeps as the next parity
-> path. The trace and scheduler commits are local and DGX-gated but not pushed,
-> so the LocalAI patch series has not been regenerated.
+> dense caps lost aggregate. Phase58 added a prompt-backlog threshold; min32
+> improved MoE `n=128` aggregate `339.0 -> 341.9`, mean TTFT
+> `7743.1 -> 7420.1 ms`, and wall `24.167 -> 23.950 s` in the same window, while
+> dense `n=128` was mixed. Next step should repeat min32 and run matching vLLM
+> h2h before any default-on discussion. The trace and scheduler commits are
+> local and DGX-gated but not pushed, so the LocalAI patch series has not been
+> regenerated.
 
 - Historical verdict: the older investigation marked GB10 parity **CLOSED** and
   unreachable. Treat that as superseded where Phase50-54 provide newer dense
@@ -1429,6 +1429,40 @@ MoE point, but it trades lower mean TTFT for lower aggregate throughput and
higher wall time. Dense caps also lose aggregate. Keep the cap as an opt-in A/B
knob only.

### Phase 58 waiting-threshold TTFT defer

Phase58 added `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING`, so TTFT prefill-first
only activates when the number of prompt-waiting slots is at or above a
threshold. Artifact:
`/home/mudler/bench/phase58_ttft_waiting_sweep/20260701_122052`.

Pre/post md5 and op gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.

MoE `n=128`, `ptok=128`, `gen=64`:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
|---------|---------|-----------------|-------------|--------------|-------------|--------|
| default | `339.0` | `648.4` | `1542.9` | `7743.1` | `11532.5` | `24.167` |
| min24 | `339.9` | `619.3` | `1637.0` | `7326.6` | `10868.8` | `24.095` |
| min32 | `341.9` | `635.0` | `1609.6` | `7420.1` | `11054.6` | `23.950` |
| min32+cap32 | `331.2` | `631.8` | `1512.1` | `7829.2` | `11767.1` | `24.733` |

Dense `n=128`, `ptok=168`, `gen=64`:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
|---------|---------|-----------------|-------------|--------------|-------------|--------|
| default | `140.3` | `362.7` | `639.8` | `21407.3` | `35811.6` | `58.399` |
| min24 | `140.4` | `347.6` | `658.7` | `22078.2` | `34783.3` | `58.353` |
| min32 | `139.7` | `350.2` | `650.1` | `21221.5` | `35246.3` | `58.642` |

Decision: `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` is the best selective
scheduler A/B so far: MoE `n=128` improved aggregate, TTFT, and wall in the same
window, while dense `n=128` was roughly neutral but slightly worse on aggregate
and wall. Keep it opt-in until repeated and compared against matching vLLM h2h.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
@@ -0,0 +1,106 @@
# Phase58 TTFT Prefill-First Waiting-Threshold Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Test whether activating TTFT prefill-first only during high prompt-backlog windows keeps the MoE benefit without the broad-defer regressions from Phase56-57.

**Architecture:** Add `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING` as a default-off refinement. Unset or zero keeps the existing Phase55/57 behavior. Gate with focused tests, then run DGX md5/op gates and same-window MoE/dense threshold sweeps.

**Tech Stack:** llama.cpp fork, `tools/server/server-admission-policy.h`, `tools/server/server-context.cpp`, DGX GB10, `h2h_cli.py`, `paged-inference-gates.sh`.

---

### Task 1: Add waiting-threshold helper

- [x] **Step 1: Write red test**

Added helper expectations:

- a zero waiting threshold defers (prior behavior)
- a waiting count at the threshold defers
- a waiting count below the threshold does not defer

Observed red failure: no helper overload accepted the waiting-slot threshold
signature.

- [x] **Step 2: Implement threshold helper and env**

Added `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING`. The scheduler now counts prompt
slots in `SLOT_STATE_STARTED` or `SLOT_STATE_PROCESSING_PROMPT` before collecting
decode rows and only defers if the waiting count is at or above the threshold.
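
The counting rule and the three helper expectations from Step 1 can be modeled as follows. This is a Python sketch for illustration only: the actual helper is C++ in `tools/server/server-admission-policy.h` and its signature is not shown in this plan. `SlotState`, `count_prompt_waiting`, `should_defer_decode`, and the `base_policy_defers` flag (standing in for whatever the Phase55 policy already decides) are all hypothetical names.

```python
from enum import Enum, auto
from typing import Iterable


class SlotState(Enum):
    # Only STARTED and PROCESSING_PROMPT are named in the plan;
    # OTHER is a placeholder for every remaining slot state.
    STARTED = auto()
    PROCESSING_PROMPT = auto()
    OTHER = auto()


PROMPT_WAITING = {SlotState.STARTED, SlotState.PROCESSING_PROMPT}


def count_prompt_waiting(states: Iterable[SlotState]) -> int:
    """Count slots that still have prompt work pending."""
    return sum(1 for state in states if state in PROMPT_WAITING)


def should_defer_decode(base_policy_defers: bool, n_waiting: int, min_waiting: int) -> bool:
    """Apply the waiting threshold on top of the existing opt-in policy."""
    if not base_policy_defers:
        return False  # the threshold never enables deferral by itself
    if min_waiting <= 0:
        return True  # zero threshold: prior behavior is preserved
    return n_waiting >= min_waiting  # defer only at or above the threshold


slots = [SlotState.STARTED, SlotState.PROCESSING_PROMPT, SlotState.OTHER]
n_waiting = count_prompt_waiting(slots)
print(n_waiting, should_defer_decode(True, n_waiting, 32))  # 2 False
```

Modeling the threshold as a pure function of the waiting count keeps it testable in isolation, which matches the plan's approach of gating the change with a focused policy test.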

- [x] **Step 3: Verify local**

Commands passed:

```bash
cmake --build build --target test-server-admission-policy test-server-admission-trace llama-server -j2
./build/bin/test-server-admission-policy
ctest --test-dir build -R 'test-server-admission-(policy|trace)' --output-on-failure
```

- [x] **Step 4: Commit fork patch**

Local fork commit:

```text
8759213e3 feat(server): gate TTFT defer by prompt backlog
```

### Task 2: DGX gate and threshold sweep

- [x] **Step 1: Preflight and build**

Preflight: docker `0`, `local-ai-worker` `0`, compute `0`, lock
`FREE released-by-codex-phase57-cap 1782901003`, clean mirror at
`2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`.

Applied `/tmp/phase58-ttft-waiting-stack.patch`, built focused tests,
`llama-server`, `llama-cli`, and `test-backend-ops`. DGX focused CTests passed.

- [x] **Step 2: Run pre/post gates**

Artifact: `/home/mudler/bench/phase58_ttft_waiting_sweep/20260701_122052`.

Pre and post gates matched:

- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT` `1146/1146`
- `MUL_MAT_ID` `806/806`

- [x] **Step 3: Run MoE threshold sweep**

MoE `n=128`, `ptok=128`, `gen=64`:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `339.0` | `648.4` | `1542.9` | `7743.1` | `11532.5` | `24.167` | `0` |
| min24 | `339.9` | `619.3` | `1637.0` | `7326.6` | `10868.8` | `24.095` | `323` |
| min32 | `341.9` | `635.0` | `1609.6` | `7420.1` | `11054.6` | `23.950` | `220` |
| min32+cap32 | `331.2` | `631.8` | `1512.1` | `7829.2` | `11767.1` | `24.733` | `140` |

- [x] **Step 4: Run dense threshold sweep**

Dense `n=128`, `ptok=168`, `gen=64`:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| default | `140.3` | `362.7` | `639.8` | `21407.3` | `35811.6` | `58.399` | `0` |
| min24 | `140.4` | `347.6` | `658.7` | `22078.2` | `34783.3` | `58.353` | `420` |
| min32 | `139.7` | `350.2` | `650.1` | `21221.5` | `35246.3` | `58.642` | `386` |

- [x] **Step 5: Revert DGX stack**

Reverted the temporary patch stack, removed introduced files, and released the
lock as `FREE released-by-codex-phase58-waiting 1782901748`.

### Task 3: Decision

- [x] **Step 1: Record outcome**

Decision: keep the threshold as the best selective TTFT-defer A/B so far, but
still opt-in. MoE min32 improved aggregate, mean/max TTFT, and wall in the same
window. Dense min32 was roughly neutral with a small TTFT gain but slight
aggregate/wall loss. Next step should repeat min32 and compare against vLLM h2h
before any default-on discussion.
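
The distinction drawn in the decision (MoE min32 better on every cited metric, dense min32 mixed) can be checked mechanically. The sketch below is a hypothetical helper, not part of the benchmark tooling. The choice of four metrics mirrors the ones named in the decision and is an assumption of this example; decode and prefill throughput are deliberately left out. Values are copied from the sweep tables.

```python
HIGHER_IS_BETTER = ("agg t/s",)
LOWER_IS_BETTER = ("TTFT mean ms", "TTFT max ms", "wall s")


def improves_everywhere(default: dict, variant: dict) -> bool:
    """True only if the variant beats default on every tracked metric."""
    higher_ok = all(variant[m] > default[m] for m in HIGHER_IS_BETTER)
    lower_ok = all(variant[m] < default[m] for m in LOWER_IS_BETTER)
    return higher_ok and lower_ok


moe_default = {"agg t/s": 339.0, "TTFT mean ms": 7743.1, "TTFT max ms": 11532.5, "wall s": 24.167}
moe_min32 = {"agg t/s": 341.9, "TTFT mean ms": 7420.1, "TTFT max ms": 11054.6, "wall s": 23.950}
dense_default = {"agg t/s": 140.3, "TTFT mean ms": 21407.3, "TTFT max ms": 35811.6, "wall s": 58.399}
dense_min32 = {"agg t/s": 139.7, "TTFT mean ms": 21221.5, "TTFT max ms": 35246.3, "wall s": 58.642}

print(improves_everywhere(moe_default, moe_min32))      # True
print(improves_everywhere(dense_default, dense_min32))  # False
```

A strict "better on every metric" test says nothing about run-to-run noise, so it complements the planned repeat run and does not replace it.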