docs(paged): record serving admission trace

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 09:08:42 +00:00
parent c299dcd231
commit b5f65152e2
4 changed files with 228 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2833,3 +2833,49 @@ Interpretation:
  measure decode tokens admitted, prompt tokens admitted, waiting prompt slots,
  graph reuse, and prefill starvation. Do not start with another GDN or GEMM
  rewrite unless that trace rules the scheduler out.
+
+## Phase 51 Serving Admission Trace
+
+Phase 51 implements the Phase 50 next step in the llama.cpp fork. This is a
+trace-only change, gated behind `LLAMA_SERVING_TRACE=1`; default inference and
+batch scheduling are unchanged.
+
+Fork commit:
+
+- `/home/mudler/_git/llama.cpp` `localai-paged`
+- `c6cb8460e feat(server): trace serving admission batches`
+
+Change:
+
+- Add `tools/server/server-admission-trace.h` with a small accumulator and
+  formatter.
+- Add `tests/test-server-admission-trace.cpp` and CMake target coverage.
+- Wire counters into `server_context_impl::pre_decode()` for:
+  decode tokens already in the batch, prompt tokens admitted, waiting prompt
+  slots, started/continued prompt slots, decode-only steps, `n_batch`,
+  `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`.
+- Print one aggregate summary when the server context is destroyed, only when
+  `LLAMA_SERVING_TRACE=1` and at least one scheduler step was observed.
+
+Verification:
+
+- Red test first: `test-server-admission-trace` failed to build before
+  `server-admission-trace.h` existed.
+- Local fork: `test-server-admission-trace` built and passed, `llama-server`
+  built, and `ctest --test-dir build -R '^test-server-admission-trace$'`
+  passed.
+- DGX artifact:
+  `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`
+- DGX `build-cuda`: `test-server-admission-trace` and `llama-server` built;
+  CTest passed.
+- DGX inference gates on the patched `build-cuda` build passed: MoE md5
+  `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+  `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+  `MUL_MAT_ID` `806/806`.
+
+Mirror status:
+
+- The fork commit is local and DGX-gated.
+- The LocalAI `patches/paged/` series is not regenerated yet because the
+  handoff requires pushing the fork branch first, and pushes require explicit
+  approval.
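
Editor's sketch (not part of the commit): the accumulator described under "Change" above might look roughly like the following. The real `tools/server/server-admission-trace.h` is not shown in this diff, so every type, field, and function name here is hypothetical. Only the `LLAMA_SERVING_TRACE=1` gate, the list of counters, and the "print once, only if at least one step was observed" rule come from the text. The static settings (`n_batch`, `n_ubatch`, `prefill_budget_step`, `prefill_cap_per_slot`) are left out for brevity, and the decode-only test is an assumed definition.

```cpp
// Hypothetical sketch of a serving admission trace accumulator.
// Names are illustrative and do not mirror the fork's actual header.
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <sstream>
#include <string>

// One scheduler step as seen from pre_decode() (assumed shape).
struct admission_step {
    int32_t n_decode_tokens;    // decode tokens already in the batch
    int32_t n_prompt_tokens;    // prompt tokens admitted this step
    int32_t n_waiting_slots;    // prompt slots still waiting after admission
    int32_t n_started_slots;    // prompt slots that began prefill this step
    int32_t n_continued_slots;  // prompt slots that continued prefill this step
};

struct admission_trace {
    uint64_t n_steps             = 0;
    uint64_t n_decode_only_steps = 0;
    uint64_t decode_tokens       = 0;
    uint64_t prompt_tokens       = 0;
    uint64_t waiting_slots       = 0;
    uint64_t started_slots       = 0;
    uint64_t continued_slots     = 0;

    // Fold one step into the running totals.
    void record(const admission_step & s) {
        n_steps++;
        // Assumed definition: a step with decode work and no prompt tokens.
        if (s.n_decode_tokens > 0 && s.n_prompt_tokens == 0) {
            n_decode_only_steps++;
        }
        decode_tokens   += static_cast<uint64_t>(s.n_decode_tokens);
        prompt_tokens   += static_cast<uint64_t>(s.n_prompt_tokens);
        waiting_slots   += static_cast<uint64_t>(s.n_waiting_slots);
        started_slots   += static_cast<uint64_t>(s.n_started_slots);
        continued_slots += static_cast<uint64_t>(s.n_continued_slots);
    }

    // The summary is only worth printing if at least one step was observed.
    bool has_steps() const { return n_steps > 0; }

    // Format the aggregate totals as a single line.
    std::string summary() const {
        std::ostringstream os;
        os << "steps=" << n_steps
           << " decode_only=" << n_decode_only_steps
           << " decode_tokens=" << decode_tokens
           << " prompt_tokens=" << prompt_tokens
           << " waiting_slots=" << waiting_slots
           << " started=" << started_slots
           << " continued=" << continued_slots;
        return os.str();
    }
};

// Default-off gate: tracing is active only when LLAMA_SERVING_TRACE is exactly "1".
static bool admission_trace_enabled() {
    const char * v = std::getenv("LLAMA_SERVING_TRACE");
    return v != nullptr && std::strcmp(v, "1") == 0;
}
```

Under these assumptions, a server context would call `record()` once per scheduler step when `admission_trace_enabled()` is true, and print `summary()` from its destructor only if `has_steps()` returns true. That keeps the default path free of any scheduling change.
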
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -626,6 +626,20 @@ prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next
 GB10 code work should instrument batch composition/admission in
 `server_context::pre_decode()` before attempting another kernel shortcut.

+Phase 51 implements that admission trace in the llama.cpp fork. Local fork
+commit: `c6cb8460e feat(server): trace serving admission batches`. The trace is
+default-off behind `LLAMA_SERVING_TRACE=1`, adds a small unit-tested accumulator,
+and records aggregate `pre_decode()` scheduler shape: decode tokens, prompt
+tokens admitted, waiting prompt slots, started/continued prompt slots,
+decode-only steps, `n_batch`, `n_ubatch`, `prefill_budget_step`, and
+`prefill_cap_per_slot`. DGX artifact:
+`/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`. The
+patched `build-cuda` CTest passed and inference gates stayed green: MoE md5
+`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. Push and LocalAI patch-series regeneration are still pending because
+push requires explicit approval.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -719,6 +733,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
 - `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
 - `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
+- `~/bench/phase51_serving_admission_trace/20260701_110130` - serving admission trace, default-off behind `LLAMA_SERVING_TRACE=1`, fork commit `c6cb8460e`; DGX patched `build-cuda` CTest and md5/op gates green; push and LocalAI patch-series mirror pending approval.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1216,6 +1216,33 @@ TTFT accounting. Next implementation target should be an opt-in
 batch-composition/admission trace in `server_context::pre_decode()` before any
 new GDN/GEMM shortcut.

+### Phase 51 serving admission trace
+
+Phase 51 adds that trace in the llama.cpp fork. Fork commit:
+`c6cb8460e feat(server): trace serving admission batches`.
+
+The change is default-off behind `LLAMA_SERVING_TRACE=1` and does not change
+inference decisions. It records aggregate scheduler-shape counters from
+`server_context_impl::pre_decode()`: decode tokens, prompt tokens admitted,
+waiting prompt slots, started/continued prompt slots, decode-only steps,
+`n_batch`, `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`.
+
+Verification:
+
+- Red test first: `test-server-admission-trace` failed to build before
+  `server-admission-trace.h` existed.
+- Local fork: unit test and `llama-server` build passed.
+- DGX artifact:
+  `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`
+- DGX patched `build-cuda` CTest passed.
+- DGX patched `build-cuda` inference gates stayed green: MoE md5
+  `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+  `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+  `MUL_MAT_ID` `806/806`.
+
+Mirror status: pending explicit approval to push the fork branch, then
+regenerate the LocalAI patch series from the pushed fork commit.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update