docs(paged): record serving admission trace

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 09:08:42 +00:00
parent c299dcd231
commit b5f65152e2
4 changed files with 228 additions and 0 deletions


@@ -2833,3 +2833,49 @@ Interpretation:
measure decode tokens admitted, prompt tokens admitted, waiting prompt slots,
graph reuse, and prefill starvation. Do not start with another GDN or GEMM
rewrite unless that trace rules the scheduler out.
## Phase 51 Serving Admission Trace
Phase 51 implements the Phase 50 next step in the llama.cpp fork. This is a
trace-only change, gated behind `LLAMA_SERVING_TRACE=1`; default inference and
batch scheduling are unchanged.
Fork commit:
- Repository `/home/mudler/_git/llama.cpp`, branch `localai-paged`
- `c6cb8460e feat(server): trace serving admission batches`
Change:
- Add `tools/server/server-admission-trace.h` with a small accumulator and
formatter.
- Add `tests/test-server-admission-trace.cpp` and CMake target coverage.
- Wire counters into `server_context_impl::pre_decode()` for decode tokens
already in the batch, prompt tokens admitted, waiting prompt slots,
started/continued prompt slots, decode-only steps, `n_batch`, `n_ubatch`,
`prefill_budget_step`, and `prefill_cap_per_slot`.
- Print one aggregate summary when the server context is destroyed, only when
`LLAMA_SERVING_TRACE=1` and at least one scheduler step was observed.
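The gating and the one-shot summary described above could look roughly like the sketch below. It is not the fork's code: `serving_trace_value_enables`, `serving_trace_enabled`, and `admission_trace_owner` are hypothetical names, the owner struct only stands in for the relevant part of `server_context_impl`, and treating exactly `"1"` as the enabling value is an assumption.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Assumption: only the exact value "1" enables the trace.
inline bool serving_trace_value_enables(const char * value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Reads the environment variable named in the notes above.
inline bool serving_trace_enabled() {
    return serving_trace_value_enables(std::getenv("LLAMA_SERVING_TRACE"));
}

// Hypothetical stand-in for the part of server_context_impl that owns the trace.
struct admission_trace_owner {
    const bool enabled;
    uint64_t   steps = 0;

    explicit admission_trace_owner(bool enabled_) : enabled(enabled_) {}

    // Called once per scheduler step; a no-op when the trace is off.
    void on_step() {
        if (enabled) {
            steps++;
        }
    }

    // Print only when enabled and at least one step was observed.
    bool should_print() const {
        return enabled && steps > 0;
    }

    // One aggregate summary at destruction, never per step.
    ~admission_trace_owner() {
        if (should_print()) {
            std::fprintf(stderr, "serving admission trace: steps=%llu\n",
                         (unsigned long long) steps);
        }
    }
};
```

Resolving the gate once at construction (for example `admission_trace_owner owner(serving_trace_enabled());`) keeps the default path to a single boolean check per step, which matches the claim that default inference is unchanged.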
Verification:
- Red test first: `test-server-admission-trace` failed to build before
`server-admission-trace.h` existed.
- Local fork: `test-server-admission-trace` built and passed, `llama-server`
built, and `ctest --test-dir build -R '^test-server-admission-trace$'`
passed.
- DGX artifact:
`/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`
- DGX `build-cuda`: `test-server-admission-trace` and `llama-server` built;
CTest passed.
- DGX inference gates on the patched `build-cuda` build passed: MoE md5
`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
`MUL_MAT_ID` `806/806`.
Mirror status:
- The fork commit is local and DGX-gated.
- The LocalAI `patches/paged/` series is not regenerated yet because the
handoff requires pushing the fork branch first, and pushes require explicit
approval.


@@ -626,6 +626,20 @@ prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next
GB10 code work should instrument batch composition/admission in
`server_context::pre_decode()` before attempting another kernel shortcut.
Phase 51 implements that admission trace in the llama.cpp fork. Local fork
commit: `c6cb8460e feat(server): trace serving admission batches`. The trace is
default-off behind `LLAMA_SERVING_TRACE=1`, adds a small unit-tested accumulator,
and records aggregate `pre_decode()` scheduler shape: decode tokens, prompt
tokens admitted, waiting prompt slots, started/continued prompt slots,
decode-only steps, `n_batch`, `n_ubatch`, `prefill_budget_step`, and
`prefill_cap_per_slot`. DGX artifact:
`/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`. The
patched `build-cuda` CTest passed and inference gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. Push and LocalAI patch-series regeneration are still pending because
push requires explicit approval.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -719,6 +733,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
- `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
- `~/bench/phase51_serving_admission_trace/20260701_110130` - fork commit `c6cb8460e` adding the default-off `LLAMA_SERVING_TRACE=1` admission trace; DGX patched `build-cuda` CTest and md5/op gates green; push and LocalAI patch-series mirror pending approval.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.


@@ -1216,6 +1216,33 @@ TTFT accounting. Next implementation target should be an opt-in
batch-composition/admission trace in `server_context::pre_decode()` before any
new GDN/GEMM shortcut.
### Phase 51 serving admission trace
Phase 51 adds that trace in the llama.cpp fork. Fork commit:
`c6cb8460e feat(server): trace serving admission batches`.
The change is default-off behind `LLAMA_SERVING_TRACE=1` and does not change
inference decisions. It records aggregate scheduler-shape counters from
`server_context_impl::pre_decode()`: decode tokens, prompt tokens admitted,
waiting prompt slots, started/continued prompt slots, decode-only steps,
`n_batch`, `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`.
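One way to read these aggregates is to reduce them to per-step ratios. The sketch below is an illustration only: `admission_totals_view`, `admission_ratios`, and `admission_ratios_from` are hypothetical names, and the choice of ratios is an assumption rather than something the trace prints.

```cpp
#include <cstdint>

// Hypothetical subset of the aggregate counters listed above.
struct admission_totals_view {
    uint64_t steps             = 0;
    uint64_t decode_only_steps = 0;
    uint64_t decode_tokens     = 0;
    uint64_t prompt_tokens     = 0;
};

// Derived per-step ratios; all zero when no step was observed.
struct admission_ratios {
    double decode_only_fraction   = 0.0;
    double decode_tokens_per_step = 0.0;
    double prompt_tokens_per_step = 0.0;
};

inline admission_ratios admission_ratios_from(const admission_totals_view & t) {
    admission_ratios r;
    if (t.steps == 0) {
        return r; // avoid dividing by zero on an empty trace
    }
    const double n = (double) t.steps;
    r.decode_only_fraction   = (double) t.decode_only_steps / n;
    r.decode_tokens_per_step = (double) t.decode_tokens     / n;
    r.prompt_tokens_per_step = (double) t.prompt_tokens     / n;
    return r;
}
```

A high decode-only fraction together with a non-zero maximum of waiting prompt slots would be one hint that prompts are queued while the scheduler keeps decoding, but that reading is a heuristic, not a gate.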
Verification:
- Red test first: `test-server-admission-trace` failed to build before
`server-admission-trace.h` existed.
- Local fork: unit test and `llama-server` build passed.
- DGX artifact:
`/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`
- DGX patched `build-cuda` CTest passed.
- DGX patched `build-cuda` inference gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
`MUL_MAT_ID` `806/806`.
Mirror status: pending explicit approval to push the fork branch, then
regenerate the LocalAI patch series from the pushed fork commit.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -0,0 +1,140 @@
# Phase51 Serving Admission Trace Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add an opt-in llama.cpp server trace that reports serving batch admission shape so dense high-N TTFT/aggregate gaps can be separated from true GPU decode speed.
**Architecture:** Implement fork-first on `mudler/llama.cpp:localai-paged`. Keep inference behavior unchanged by gating the trace behind `LLAMA_SERVING_TRACE`. Add a small unit-tested formatter/accumulator and wire counters into `server_context_impl::pre_decode()` without changing scheduling predicates.
**Tech Stack:** llama.cpp fork, `tools/server/server-context.cpp`, CMake unit test, DGX GB10 `build-cuda`, canonical md5 and backend-op gates.
---
### Task 1: Add red unit test
**Files:**
- Modify: `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt`
- Create: `/home/mudler/_git/llama.cpp/tests/test-server-admission-trace.cpp`
- [x] **Step 1: Add the test target and assertions**
Added `test-server-admission-trace.cpp`, asserting summary output includes
`steps`, `decode_only_steps`, `decode_tokens`, `prompt_tokens`,
`max_waiting_prompt_slots`, `started_prompt_slots`, `continued_prompt_slots`,
`last_n_batch`, `last_n_ubatch`, `last_prefill_budget_step`, and
`last_prefill_cap_per_slot`.
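The key-presence assertion can be sketched without the real header as a helper that reports which expected keys are absent from a summary string. The helper names are hypothetical, and the whitespace-separated `key=value` layout is an assumption about the formatter output, not something the plan specifies.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Keys the Phase51 test expects in the summary, in the order listed above.
static const std::vector<std::string> k_admission_trace_keys = {
    "steps", "decode_only_steps", "decode_tokens", "prompt_tokens",
    "max_waiting_prompt_slots", "started_prompt_slots", "continued_prompt_slots",
    "last_n_batch", "last_n_ubatch", "last_prefill_budget_step",
    "last_prefill_cap_per_slot",
};

// Collects the key part of every whitespace-separated key=value token.
inline std::set<std::string> admission_trace_keys_of(const std::string & summary) {
    std::set<std::string> keys;
    std::istringstream in(summary);
    std::string tok;
    while (in >> tok) {
        const std::string::size_type eq = tok.find('=');
        if (eq != std::string::npos) {
            keys.insert(tok.substr(0, eq));
        }
    }
    return keys;
}

// Returns the expected keys that are absent; an empty result means the check passes.
inline std::vector<std::string> admission_trace_missing_keys(const std::string & summary) {
    const std::set<std::string> present = admission_trace_keys_of(summary);
    std::vector<std::string> missing;
    for (const std::string & key : k_admission_trace_keys) {
        if (present.count(key) == 0) {
            missing.push_back(key);
        }
    }
    return missing;
}
```

Matching whole tokens rather than substrings matters here because `steps` is a suffix of `decode_only_steps`; a plain `find("steps")` would pass even if the standalone counter were dropped.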
- [x] **Step 2: Verify red**
Run:
```bash
cmake -S . -B build >/tmp/llama-phase51-cmake.log
cmake --build build --target test-server-admission-trace -j2
```
Expected and observed: build failed because
`../tools/server/server-admission-trace.h` did not exist.
### Task 2: Implement opt-in trace
**Files:**
- Create: `/home/mudler/_git/llama.cpp/tools/server/server-admission-trace.h`
- Modify: `/home/mudler/_git/llama.cpp/tools/server/CMakeLists.txt`
- Modify: `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
- [x] **Step 1: Add accumulator and formatter**
Added `server_admission_trace_step`, `server_admission_trace_totals`,
`server_admission_trace_accumulate()`, and `server_admission_trace_format()`.
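A minimal sketch of what such an accumulator and formatter might look like follows. The four type and function names come from this plan and the totals fields mirror the summary keys asserted in Task 1, but the per-step field layout, the signatures, the `key=value` output format, and the definition of a decode-only step are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// One scheduler step as observed in pre_decode(). Field layout is assumed.
struct server_admission_trace_step {
    int32_t decode_tokens          = 0; // decode tokens already in the batch
    int32_t prompt_tokens          = 0; // prompt tokens admitted this step
    int32_t waiting_prompt_slots   = 0;
    int32_t started_prompt_slots   = 0;
    int32_t continued_prompt_slots = 0;
    int32_t n_batch                = 0;
    int32_t n_ubatch               = 0;
    int32_t prefill_budget_step    = 0;
    int32_t prefill_cap_per_slot   = 0;
};

// Aggregates over all steps; names mirror the summary keys from Task 1.
struct server_admission_trace_totals {
    uint64_t steps                     = 0;
    uint64_t decode_only_steps         = 0;
    uint64_t decode_tokens             = 0;
    uint64_t prompt_tokens             = 0;
    int32_t  max_waiting_prompt_slots  = 0;
    uint64_t started_prompt_slots      = 0;
    uint64_t continued_prompt_slots    = 0;
    int32_t  last_n_batch              = 0;
    int32_t  last_n_ubatch             = 0;
    int32_t  last_prefill_budget_step  = 0;
    int32_t  last_prefill_cap_per_slot = 0;
};

inline void server_admission_trace_accumulate(server_admission_trace_totals & t,
                                              const server_admission_trace_step & s) {
    assert(s.decode_tokens >= 0 && s.prompt_tokens >= 0);
    t.steps++;
    // Assumed definition: decode work present and no prompt tokens admitted.
    if (s.decode_tokens > 0 && s.prompt_tokens == 0) {
        t.decode_only_steps++;
    }
    t.decode_tokens          += (uint64_t) s.decode_tokens;
    t.prompt_tokens          += (uint64_t) s.prompt_tokens;
    t.started_prompt_slots   += (uint64_t) s.started_prompt_slots;
    t.continued_prompt_slots += (uint64_t) s.continued_prompt_slots;
    t.max_waiting_prompt_slots = std::max(t.max_waiting_prompt_slots, s.waiting_prompt_slots);
    // Configuration-like values keep only the most recent observation.
    t.last_n_batch              = s.n_batch;
    t.last_n_ubatch             = s.n_ubatch;
    t.last_prefill_budget_step  = s.prefill_budget_step;
    t.last_prefill_cap_per_slot = s.prefill_cap_per_slot;
}

inline std::string server_admission_trace_format(const server_admission_trace_totals & t) {
    std::ostringstream os;
    os << "steps=" << t.steps
       << " decode_only_steps=" << t.decode_only_steps
       << " decode_tokens=" << t.decode_tokens
       << " prompt_tokens=" << t.prompt_tokens
       << " max_waiting_prompt_slots=" << t.max_waiting_prompt_slots
       << " started_prompt_slots=" << t.started_prompt_slots
       << " continued_prompt_slots=" << t.continued_prompt_slots
       << " last_n_batch=" << t.last_n_batch
       << " last_n_ubatch=" << t.last_n_ubatch
       << " last_prefill_budget_step=" << t.last_prefill_budget_step
       << " last_prefill_cap_per_slot=" << t.last_prefill_cap_per_slot;
    return os.str();
}
```

Keeping the accumulator as plain structs and free functions in a header is what lets the unit test exercise it without linking the server.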
- [x] **Step 2: Wire counters into `pre_decode()`**
`LLAMA_SERVING_TRACE=1` now tracks:
- decode tokens already in the batch
- prompt tokens admitted this step
- waiting prompt slots seen by the prompt-admission loop
- started and continued prompt slots that actually admitted prompt tokens
- decode-only steps
- `n_batch`, `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`
The aggregate summary is printed once, when `server_context_impl` is destroyed,
and only if the trace is enabled and at least one step was observed.
### Task 3: Verify locally and on DGX
**Files:**
- DGX artifact: `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`
- [x] **Step 1: Run local unit and server build**
Commands:
```bash
cmake -S . -B build >/tmp/llama-phase51-cmake.log
cmake --build build --target test-server-admission-trace -j2
./build/bin/test-server-admission-trace
cmake --build build --target llama-server -j2
ctest --test-dir build -R '^test-server-admission-trace$' --output-on-failure
```
Observed: unit test passed, `llama-server` built, CTest passed.
- [x] **Step 2: Apply patch to DGX mirror and build**
Applied the local patch to `dgx:~/llama-phase6-source`, then ran:
```bash
cmake -S . -B build-cuda
cmake --build build-cuda --target test-server-admission-trace llama-server -j2
ctest --test-dir build-cuda -R '^test-server-admission-trace$' --output-on-failure
```
Observed: DGX CTest passed.
- [x] **Step 3: Run canonical inference gate**
Run:
```bash
BIN=$HOME/llama-phase6-source/build-cuda/bin \
ART=$HOME/bench/phase51_serving_admission_trace/20260701_110130/gate_post \
OPS=MUL_MAT,MUL_MAT_ID \
$HOME/paged-inference-gates.sh
```
Observed:
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT` `1146/1146`
- `MUL_MAT_ID` `806/806`
### Task 4: Commit and mirror
**Files:**
- Modify later: `backend/cpp/llama-cpp-localai-paged/patches/paged/`
- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- [x] **Step 1: Commit on the llama.cpp fork**
Local fork commit:
```text
c6cb8460e feat(server): trace serving admission batches
```
- [ ] **Step 2: Push fork branch**
Blocked by policy: ask before every push. Do not push without explicit approval.
- [ ] **Step 3: Regenerate LocalAI patch series**
Pending until the fork branch is pushed, per the fork-first mirror invariant.
- [x] **Step 4: Record Phase51 status in LocalAI docs**
Record the fork commit, DGX artifact, gates, and pending push/mirror state.