mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): reject admission budget sweep
Assisted-by: Codex:gpt-5
@@ -2938,3 +2938,41 @@ Decision:
per-step histogram trace to test whether prefill chunking/admission can reduce
TTFT without regressing aggregate throughput. Do not move to another GDN/GEMM
rewrite until this scheduler hypothesis is tested.

## Phase 53 Admission Budget Sweep

Phase 53 tests the existing default-off admission knobs exposed by patch 0016:
`LLAMA_MAX_BATCH_TOKENS` and `LLAMA_PREFILL_CAP`. The question was whether a
simple smaller token budget improves dense `n=128` TTFT or aggregate throughput.

Artifact:

- `/home/mudler/bench/phase53_dense_admission_budget_sweep/20260701_111915`

Pre/post gates:

| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |

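The pre/post gate rows can be compared mechanically. The sketch below is only an illustration: it assumes "green" means the md5 values are identical before and after the sweep and that every `passed/total` op count is complete. The `GateResult` type and both helper functions are hypothetical names, not part of `paged-inference-gates.sh`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """One row of the pre/post gate table (field names are illustrative)."""
    moe_md5: str
    dense_md5: str
    mul_mat: str      # "passed/total", e.g. "1146/1146"
    mul_mat_id: str   # "passed/total", e.g. "806/806"


def ops_all_passed(ratio: str) -> bool:
    """Return True when a 'passed/total' string reports every op test passing."""
    passed, total = (int(part) for part in ratio.split("/"))
    return total > 0 and passed == total


def gates_green(pre: GateResult, post: GateResult) -> bool:
    """Assumed definition of green: unchanged md5s and complete op counts."""
    same_output = (pre.moe_md5, pre.dense_md5) == (post.moe_md5, post.dense_md5)
    ops_ok = all(
        ops_all_passed(ratio)
        for ratio in (pre.mul_mat, pre.mul_mat_id, post.mul_mat, post.mul_mat_id)
    )
    return same_output and ops_ok


# Values copied from the gate table above.
pre = GateResult(
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
    "1146/1146",
    "806/806",
)
post = GateResult(
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
    "1146/1146",
    "806/806",
)
print(gates_green(pre, post))
```
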

Results:

| variant | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | wall s | steps | max waiting prompt slots |
|---------|---------|-----------------|---------------------|-------------|--------------|--------|-------|--------------------------|
| default Phase52 | `139.0` | `360.5` | `1.93` | `629.5` | `23171.5` | `58.921` | `76` | `35` |
| `LLAMA_MAX_BATCH_TOKENS=1536 LLAMA_PREFILL_CAP=512` | `134.4` | `376.7` | `1.82` | `607.0` | `22263.7` | `60.968` | `81` | `26` |
| `LLAMA_MAX_BATCH_TOKENS=1024 LLAMA_PREFILL_CAP=512` | `130.0` | `392.4` | `1.82` | `565.2` | `23234.3` | `63.003` | `89` | `16` |

Decision:

- Smaller admission budgets reduce the maximum number of waiting prompt slots
and raise the h2h `decode_agg_tps` metric, but they reduce aggregate
throughput and prefill throughput.
- `T=1536` gave only a small TTFT improvement (`23171.5 -> 22263.7 ms`) while
worsening wall time and aggregate throughput.
- `T=1024` worsened TTFT and aggregate throughput despite the highest
`decode_agg_tps`.
- Do not promote simple budget shrinkage as a parity lever. The next useful
scheduler work is a richer per-step histogram trace or a targeted first-token
admission policy, not a static lower `LLAMA_MAX_BATCH_TOKENS`.

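The trade-off behind this decision is easier to see as relative change against the default Phase52 row. The following is a minimal sketch of that arithmetic using the table values; the dictionary keys and the `pct_change` and `deltas` helpers are illustrative names, not fields emitted by `h2h_cli3.py`.

```python
# Baseline and variant rows copied from the results table above.
BASELINE = {
    "agg": 139.0,
    "decode_agg": 360.5,
    "prefill": 629.5,
    "ttft_ms": 23171.5,
    "wall_s": 58.921,
}
VARIANTS = {
    "T=1536 cap=512": {
        "agg": 134.4, "decode_agg": 376.7, "prefill": 607.0,
        "ttft_ms": 22263.7, "wall_s": 60.968,
    },
    "T=1024 cap=512": {
        "agg": 130.0, "decode_agg": 392.4, "prefill": 565.2,
        "ttft_ms": 23234.3, "wall_s": 63.003,
    },
}


def pct_change(new: float, old: float) -> float:
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


def deltas(variant: dict) -> dict:
    """Percent change of every baseline metric for one variant row."""
    return {key: pct_change(variant[key], BASELINE[key]) for key in BASELINE}


for name, row in VARIANTS.items():
    summary = " ".join(f"{key}={value:+.1f}%" for key, value in deltas(row).items())
    print(f"{name}: {summary}")
```

With the table's numbers, `T=1536` gives up roughly 3.3% of aggregate throughput for roughly 3.9% lower mean TTFT, while `T=1024` gives up roughly 6.5% of aggregate throughput and ends with TTFT about 0.3% higher than the default.
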
@@ -652,6 +652,19 @@ matches h2h exactly, so this is the target request. The next GB10 lever should
be a default-off scheduler/admission A/B or a per-step histogram trace, not an
immediate GDN/GEMM rewrite.

Phase 53 tested the existing runtime admission-budget knobs instead of adding
new code. Artifact:
`/home/mudler/bench/phase53_dense_admission_budget_sweep/20260701_111915`.
Pre/post gates stayed green. Dense `n=128` results: default Phase52 `agg=139.0`,
`decode_agg=360.5`, `prefill=629.5`, `TTFT=23171.5ms`, wall `58.921s`;
`T=1536 cap=512` `agg=134.4`, `decode_agg=376.7`, `prefill=607.0`,
`TTFT=22263.7ms`, wall `60.968s`; `T=1024 cap=512` `agg=130.0`,
`decode_agg=392.4`, `prefill=565.2`, `TTFT=23234.3ms`, wall `63.003s`.
Decision: simple budget shrinkage is rejected. It raises h2h decode-agg while
lowering aggregate/prefill throughput, and it does not materially solve TTFT.
Next scheduler work should be per-step histograms or a targeted first-token
admission policy.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -747,6 +760,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
- `~/bench/phase51_serving_admission_trace/20260701_110130` - default-off `LLAMA_SERVING_TRACE=1` fork commit `c6cb8460e`; DGX patched `build-cuda` CTest and md5/op gates green; push and LocalAI patch-series mirror pending approval.
- `~/bench/phase52_dense_admission_trace/20260701_111017` - clean dense `n=128` admission trace; pre/post gates green; `decode_only_steps=0`, `prompt_tokens=22785`, `max_waiting_prompt_slots=35`; next lever is scheduler/admission A/B or per-step histogram trace.
- `~/bench/phase53_dense_admission_budget_sweep/20260701_111915` - runtime sweep of `LLAMA_MAX_BATCH_TOKENS=1536/1024` with `LLAMA_PREFILL_CAP=512`; pre/post gates green; simple budget shrinkage rejected because aggregate/prefill throughput regressed and TTFT did not materially improve.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -1267,6 +1267,30 @@ supports the Phase50 conclusion that the remaining high-N serving gap is
scheduler/admission and TTFT shaped. Next lever should be a default-off
admission-policy A/B or per-step histogram trace, not immediate kernel work.

### Phase 53 admission budget sweep

Phase53 tested the already-existing default-off budget knobs:
`LLAMA_MAX_BATCH_TOKENS=1536/1024` with `LLAMA_PREFILL_CAP=512`, using the same
dense `n=128`, `ptok=128`, `gen=64` traced serving shape. Artifact:
`/home/mudler/bench/phase53_dense_admission_budget_sweep/20260701_111915`.

Pre/post md5 and op gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
`MUL_MAT_ID` `806/806`.

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | wall s | max waiting prompt slots |
|---------|---------|-----------------|-------------|--------------|--------|--------------------------|
| default Phase52 | `139.0` | `360.5` | `629.5` | `23171.5` | `58.921` | `35` |
| `T=1536 cap=512` | `134.4` | `376.7` | `607.0` | `22263.7` | `60.968` | `26` |
| `T=1024 cap=512` | `130.0` | `392.4` | `565.2` | `23234.3` | `63.003` | `16` |

Decision: simple budget shrinkage is rejected as a parity lever. It improves
the h2h decode-agg metric by starving/slimming prompt admission, but aggregate
throughput and prefill throughput fall, and TTFT does not materially improve.
Next scheduler work should collect per-step histograms or test a targeted
first-token admission policy.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update

@@ -0,0 +1,99 @@
# Phase53 Admission Budget Sweep Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Test whether existing default-off scheduler knobs (`LLAMA_MAX_BATCH_TOKENS`, `LLAMA_PREFILL_CAP`) improve dense `n=128` serving enough to pursue a scheduler policy patch.

**Architecture:** Temporarily apply the Phase51 trace patch to the clean DGX mirror, build the patched server, bracket the sweep with canonical md5/op gates, run dense `n=128`, `ptok=128`, `gen=64` variants, parse h2h plus admission trace rows, then revert the DGX mirror.

**Tech Stack:** DGX GB10, llama.cpp `build-cuda`, `LLAMA_SERVING_TRACE=1`, `h2h_cli3.py`, `paged-inference-gates.sh`.

---

### Task 1: Prepare patched DGX trace build

- [x] **Step 1: Check preflight**

Artifact: `/home/mudler/bench/phase53_dense_admission_budget_sweep/20260701_111915`.
Preflight: docker `0`, `local-ai-worker` `0`, compute `0`, owner
`FREE released-by-codex-phase52-dense-admission-trace-clean 1782897309`.

- [x] **Step 2: Apply Phase51 patch and build**

Applied `/tmp/phase51-serving-admission-trace.patch` to
`~/llama-phase6-source`. Built `llama-server`, `llama-completion`, and
`test-backend-ops` in `build-cuda`.

### Task 2: Gate before sweep

- [x] **Step 1: Run canonical pre-sweep gate**

Observed:

- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT` `1146/1146`
- `MUL_MAT_ID` `806/806`

### Task 3: Run budget variants

- [x] **Step 1: Run `T=1536`, `cap=512`**

Environment: `LLAMA_MAX_BATCH_TOKENS=1536 LLAMA_PREFILL_CAP=512`.

Result:

```text
agg=134.4 decode_agg=376.7 perseq=1.82 prefill=607.0 ttft=22263.7 wall=60.968
steps=81 decode_only_steps=0 prompt_tokens=23809 max_waiting_prompt_slots=26 prefill_budget_step=1535 prefill_cap_per_slot=512
```

- [x] **Step 2: Run `T=1024`, `cap=512`**

Environment: `LLAMA_MAX_BATCH_TOKENS=1024 LLAMA_PREFILL_CAP=512`.

Result:

```text
agg=130.0 decode_agg=392.4 perseq=1.82 prefill=565.2 ttft=23234.3 wall=63.003
steps=89 decode_only_steps=0 prompt_tokens=23809 max_waiting_prompt_slots=16 prefill_budget_step=1021 prefill_cap_per_slot=512
```

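Result lines in this `key=value` shape can be folded into one record per variant before they are tabulated. The sketch below assumes the whitespace-separated form shown in the result blocks above, which may be this plan's condensed rendering rather than the raw `h2h_cli3.py` or `LLAMA_SERVING_TRACE=1` output. `parse_kv_line` is a hypothetical helper name.

```python
def parse_kv_line(line: str) -> dict[str, float]:
    """Parse whitespace-separated key=value tokens into a dict of floats."""
    fields: dict[str, float] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # skip anything that is not a key=value token
        fields[key] = float(value)
    return fields


# The two T=1024 result lines, copied from the block above.
h2h = parse_kv_line(
    "agg=130.0 decode_agg=392.4 perseq=1.82 prefill=565.2 ttft=23234.3 wall=63.003"
)
trace = parse_kv_line(
    "steps=89 decode_only_steps=0 prompt_tokens=23809 "
    "max_waiting_prompt_slots=16 prefill_budget_step=1021 prefill_cap_per_slot=512"
)

# Merge the h2h metrics and the admission trace counters into one row.
row = {**h2h, **trace}

# A derived figure that is not reported in the plan: mean prompt tokens per step.
prompt_tokens_per_step = row["prompt_tokens"] / row["steps"]
print(f"{prompt_tokens_per_step:.1f} prompt tokens per step")
```
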
### Task 4: Parse and decide

- [x] **Step 1: Write `summary.tsv`**

Summary:

| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | wall s | steps | max waiting prompt slots |
|---------|---------|-----------------|-------------|--------------|--------|-------|--------------------------|
| default Phase52 | `139.0` | `360.5` | `629.5` | `23171.5` | `58.921` | `76` | `35` |
| `T=1536 cap=512` | `134.4` | `376.7` | `607.0` | `22263.7` | `60.968` | `81` | `26` |
| `T=1024 cap=512` | `130.0` | `392.4` | `565.2` | `23234.3` | `63.003` | `89` | `16` |

Decision: simple budget shrinkage trades aggregate/prefill throughput for a
higher h2h decode-agg metric and does not materially solve TTFT. Do not promote
these knobs as a parity lever. The next step should be either per-step histogram
tracing or a more targeted policy that improves first-token admission without
starving prefill throughput.

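A summary like the one above could be written as tab-separated text with the standard `csv` module. This is a sketch under assumptions: the real `summary.tsv` layout is not shown in this plan, so the column names and their order below are guesses that mirror the Markdown table, and `render_summary_tsv` is a hypothetical helper.

```python
import csv
import io

# Assumed column names; the actual summary.tsv header is not shown in the plan.
COLUMNS = [
    "variant", "agg_tps", "decode_agg_tps", "prefill_tps",
    "ttft_mean_ms", "wall_s", "steps", "max_waiting_prompt_slots",
]

# Rows copied from the summary table above.
ROWS = [
    ("default Phase52", 139.0, 360.5, 629.5, 23171.5, 58.921, 76, 35),
    ("T=1536 cap=512", 134.4, 376.7, 607.0, 22263.7, 60.968, 81, 26),
    ("T=1024 cap=512", 130.0, 392.4, 565.2, 23234.3, 63.003, 89, 16),
]


def render_summary_tsv(rows) -> str:
    """Render a header plus data rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = render_summary_tsv(ROWS)
print(tsv_text, end="")
```

Rendering to a string first keeps the example free of filesystem side effects; writing `tsv_text` into the artifact directory is then a single file write.
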
### Task 5: Gate after sweep and clean DGX

- [x] **Step 1: Run canonical post-sweep gate**

Observed:

- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT` `1146/1146`
- `MUL_MAT_ID` `806/806`

- [x] **Step 2: Revert temporary DGX patch**

Reverted the Phase51 patch from `~/llama-phase6-source`. Final DGX state:
docker `0`, `local-ai-worker` `0`, compute `0`, owner
`FREE released-by-codex-phase53-budget-sweep 1782897825`.

- [x] **Step 3: Commit docs**

Commit this plan and parity doc updates with `Assisted-by: Codex:gpt-5`.