diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index f3dc3820d..bff1af61f 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2833,3 +2833,49 @@
 Interpretation: measure decode tokens admitted, prompt tokens admitted, waiting
 prompt slots, graph reuse, and prefill starvation. Do not start with another
 GDN or GEMM rewrite unless that trace rules the scheduler out.
+
+## Phase 51 Serving Admission Trace
+
+Phase 51 implements the Phase50 next step in the llama.cpp fork. This is a
+trace-only change, gated behind `LLAMA_SERVING_TRACE=1`; default inference and
+batch scheduling are unchanged.
+
+Fork commit:
+
+- `/home/mudler/_git/llama.cpp` `localai-paged`
+- `c6cb8460e feat(server): trace serving admission batches`
+
+Change:
+
+- Add `tools/server/server-admission-trace.h` with a small accumulator and
+  formatter.
+- Add `tests/test-server-admission-trace.cpp` and CMake target coverage.
+- Wire counters into `server_context_impl::pre_decode()` for:
+  decode tokens already in the batch, prompt tokens admitted, waiting prompt
+  slots, started/continued prompt slots, decode-only steps, `n_batch`,
+  `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`.
+- Print one aggregate summary when the server context is destroyed, only when
+  `LLAMA_SERVING_TRACE=1` and at least one scheduler step was observed.
+
+Verification:
+
+- Red test first: `test-server-admission-trace` failed to build before
+  `server-admission-trace.h` existed.
+- Local fork: `test-server-admission-trace` built and passed, `llama-server`
+  built, and `ctest --test-dir build -R '^test-server-admission-trace$'`
+  passed.
+- DGX artifact:
+  `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`
+- DGX `build-cuda`: `test-server-admission-trace` and `llama-server` built;
+  CTest passed.
+- DGX inference gates on the patched `build-cuda` build passed: MoE md5
+  `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+  `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+  `MUL_MAT_ID` `806/806`.
+
+Mirror status:
+
+- The fork commit is local and DGX-gated.
+- The LocalAI `patches/paged/` series is not regenerated yet because the
+  handoff requires pushing the fork branch first, and pushes require explicit
+  approval.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 23cc982c3..eb68bbb19 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -626,6 +626,20 @@ prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next
 GB10 code work should instrument batch composition/admission in
 `server_context::pre_decode()` before attempting another kernel shortcut.
 
+Phase 51 implements that admission trace in the llama.cpp fork. Local fork
+commit: `c6cb8460e feat(server): trace serving admission batches`. The trace is
+default-off behind `LLAMA_SERVING_TRACE=1`, adds a small unit-tested accumulator,
+and records aggregate `pre_decode()` scheduler shape: decode tokens, prompt
+tokens admitted, waiting prompt slots, started/continued prompt slots,
+decode-only steps, `n_batch`, `n_ubatch`, `prefill_budget_step`, and
+`prefill_cap_per_slot`. DGX artifact:
+`/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`. The
+patched `build-cuda` CTest passed and inference gates stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. Push and LocalAI patch-series regeneration are still pending because
+push requires explicit approval.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -719,6 +733,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
 - `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
 - `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
+- `~/bench/phase51_serving_admission_trace/20260701_110130` - default-off `LLAMA_SERVING_TRACE=1` fork commit `c6cb8460e`; DGX patched `build-cuda` CTest and md5/op gates green; push and LocalAI patch-series mirror pending approval.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index e665f558b..15bb6c7b7 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1216,6 +1216,33 @@ TTFT accounting. Next implementation target should be an opt-in
 batch-composition/admission trace in `server_context::pre_decode()` before any
 new GDN/GEMM shortcut.
 
+### Phase 51 serving admission trace
+
+Phase51 adds that trace in the llama.cpp fork. Fork commit:
+`c6cb8460e feat(server): trace serving admission batches`.
+
+The change is default-off behind `LLAMA_SERVING_TRACE=1` and does not change
+inference decisions. It records aggregate scheduler-shape counters from
+`server_context_impl::pre_decode()`: decode tokens, prompt tokens admitted,
+waiting prompt slots, started/continued prompt slots, decode-only steps,
+`n_batch`, `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`.
+
+Verification:
+
+- Red test first: `test-server-admission-trace` failed before
+  `server-admission-trace.h` existed.
+- Local fork: unit test and `llama-server` build passed.
+- DGX artifact:
+  `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`
+- DGX patched `build-cuda` CTest passed.
+- DGX patched `build-cuda` inference gates stayed green: MoE
+  `8cb0ce23777bf55f92f63d0292c756b0`, dense
+  `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+  `MUL_MAT_ID` `806/806`.
+
+Mirror status: pending explicit approval to push the fork branch, then
+regenerate the LocalAI patch series from the pushed fork commit.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-serving-admission-trace-phase51.md b/docs/superpowers/plans/2026-07-01-serving-admission-trace-phase51.md
new file mode 100644
index 000000000..daf712056
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-serving-admission-trace-phase51.md
@@ -0,0 +1,140 @@
+# Phase51 Serving Admission Trace Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Add an opt-in llama.cpp server trace that reports serving batch admission shape so dense high-N TTFT/aggregate gaps can be separated from true GPU decode speed.
+
+**Architecture:** Implement fork-first on `mudler/llama.cpp:localai-paged`. Keep inference behavior unchanged by gating the trace behind `LLAMA_SERVING_TRACE`. Add a small unit-tested formatter/accumulator and wire counters into `server_context_impl::pre_decode()` without changing scheduling predicates.
+
+**Tech Stack:** llama.cpp fork, `tools/server/server-context.cpp`, CMake unit test, DGX GB10 `build-cuda`, canonical md5 and backend-op gates.
+
+---
+
+### Task 1: Add red unit test
+
+**Files:**
+- Modify: `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt`
+- Create: `/home/mudler/_git/llama.cpp/tests/test-server-admission-trace.cpp`
+
+- [x] **Step 1: Add the test target and assertions**
+
+Added `test-server-admission-trace.cpp`, asserting summary output includes
+`steps`, `decode_only_steps`, `decode_tokens`, `prompt_tokens`,
+`max_waiting_prompt_slots`, `started_prompt_slots`, `continued_prompt_slots`,
+`last_n_batch`, `last_n_ubatch`, `last_prefill_budget_step`, and
+`last_prefill_cap_per_slot`.
+
+- [x] **Step 2: Verify red**
+
+Run:
+
+```bash
+cmake -S . -B build >/tmp/llama-phase51-cmake.log
+cmake --build build --target test-server-admission-trace -j2
+```
+
+Expected and observed: build failed because
+`../tools/server/server-admission-trace.h` did not exist.
+
+### Task 2: Implement opt-in trace
+
+**Files:**
+- Create: `/home/mudler/_git/llama.cpp/tools/server/server-admission-trace.h`
+- Modify: `/home/mudler/_git/llama.cpp/tools/server/CMakeLists.txt`
+- Modify: `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`
+
+- [x] **Step 1: Add accumulator and formatter**
+
+Added `server_admission_trace_step`, `server_admission_trace_totals`,
+`server_admission_trace_accumulate()`, and `server_admission_trace_format()`.
+
+- [x] **Step 2: Wire counters into `pre_decode()`**
+
+`LLAMA_SERVING_TRACE=1` now tracks:
+
+- decode tokens already in the batch
+- prompt tokens admitted this step
+- waiting prompt slots seen by the prompt-admission loop
+- started and continued prompt slots that actually admitted prompt tokens
+- decode-only steps
+- `n_batch`, `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`
+
+The trace is printed once from `server_context_impl` destruction when enabled
+and at least one step was observed.
+
+### Task 3: Verify locally and on DGX
+
+**Files:**
+- DGX artifact: `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`
+
+- [x] **Step 1: Run local unit and server build**
+
+Commands:
+
+```bash
+cmake -S . -B build >/tmp/llama-phase51-cmake.log
+cmake --build build --target test-server-admission-trace -j2
+./build/bin/test-server-admission-trace
+cmake --build build --target llama-server -j2
+ctest --test-dir build -R '^test-server-admission-trace$' --output-on-failure
+```
+
+Observed: unit test passed, `llama-server` built, CTest passed.
+
+- [x] **Step 2: Apply patch to DGX mirror and build**
+
+Applied the local patch to `dgx:~/llama-phase6-source`, then ran:
+
+```bash
+cmake -S . -B build-cuda
+cmake --build build-cuda --target test-server-admission-trace llama-server -j2
+ctest --test-dir build-cuda -R '^test-server-admission-trace$' --output-on-failure
+```
+
+Observed: DGX CTest passed.
+
+- [x] **Step 3: Run canonical inference gate**
+
+Run:
+
+```bash
+BIN=$HOME/llama-phase6-source/build-cuda/bin \
+ART=$HOME/bench/phase51_serving_admission_trace/20260701_110130/gate_post \
+OPS=MUL_MAT,MUL_MAT_ID \
+  $HOME/paged-inference-gates.sh
+```
+
+Observed:
+
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT` `1146/1146`
+- `MUL_MAT_ID` `806/806`
+
+### Task 4: Commit and mirror
+
+**Files:**
+- Modify later: `backend/cpp/llama-cpp-localai-paged/patches/paged/`
+- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+
+- [x] **Step 1: Commit on the llama.cpp fork**
+
+Local fork commit:
+
+```text
+c6cb8460e feat(server): trace serving admission batches
+```
+
+- [ ] **Step 2: Push fork branch**
+
+Blocked by policy: ask before every push. Do not push without explicit approval.
+
+- [ ] **Step 3: Regenerate LocalAI patch series**
+
+Pending until the fork branch is pushed, per the fork-first mirror invariant.
+
+- [x] **Step 4: Record Phase51 status in LocalAI docs**
+
+Record the fork commit, DGX artifact, gates, and pending push/mirror state.
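The accumulator and formatter from Task 2 live in the fork and are not part of this diff. As a rough illustration of the shape described above, here is a minimal sketch. Everything in it is an assumption: the `sketch` namespace, the struct fields and their types, the function signatures, the definition of a decode-only step, the exact-match parsing of `LLAMA_SERVING_TRACE`, and the `key=value` output layout. Only the summary key names come from the unit-test description in Task 1; the real `server-admission-trace.h` may differ.

```cpp
// Hypothetical sketch only: field layout, signatures, env parsing, and the
// key=value output format are assumptions, not the fork's actual header.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <sstream>
#include <string>

namespace sketch {

// Counters observed in one scheduler step (one pre_decode() call).
struct trace_step {
    int32_t decode_tokens          = 0; // decode tokens already in the batch
    int32_t prompt_tokens          = 0; // prompt tokens admitted this step
    int32_t waiting_prompt_slots   = 0; // slots seen by the prompt-admission loop
    int32_t started_prompt_slots   = 0;
    int32_t continued_prompt_slots = 0;
    int32_t n_batch                = 0;
    int32_t n_ubatch               = 0;
    int32_t prefill_budget_step    = 0;
    int32_t prefill_cap_per_slot   = 0;
};

// Aggregate over all observed steps; key names mirror the test's summary keys.
struct trace_totals {
    int64_t steps                     = 0;
    int64_t decode_only_steps         = 0;
    int64_t decode_tokens             = 0;
    int64_t prompt_tokens             = 0;
    int32_t max_waiting_prompt_slots  = 0;
    int64_t started_prompt_slots      = 0;
    int64_t continued_prompt_slots    = 0;
    int32_t last_n_batch              = 0;
    int32_t last_n_ubatch             = 0;
    int32_t last_prefill_budget_step  = 0;
    int32_t last_prefill_cap_per_slot = 0;
};

// Assumed gate: enabled only when LLAMA_SERVING_TRACE is exactly "1".
inline bool trace_enabled() {
    const char * v = std::getenv("LLAMA_SERVING_TRACE");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

inline void accumulate(trace_totals & t, const trace_step & s) {
    assert(s.decode_tokens >= 0 && s.prompt_tokens >= 0);
    t.steps += 1;
    // Assumed definition: decode tokens present and no prompt tokens admitted.
    if (s.decode_tokens > 0 && s.prompt_tokens == 0) {
        t.decode_only_steps += 1;
    }
    t.decode_tokens            += s.decode_tokens;
    t.prompt_tokens            += s.prompt_tokens;
    t.max_waiting_prompt_slots  = std::max(t.max_waiting_prompt_slots, s.waiting_prompt_slots);
    t.started_prompt_slots     += s.started_prompt_slots;
    t.continued_prompt_slots   += s.continued_prompt_slots;
    // Configuration values are not summed; the most recent step wins.
    t.last_n_batch              = s.n_batch;
    t.last_n_ubatch             = s.n_ubatch;
    t.last_prefill_budget_step  = s.prefill_budget_step;
    t.last_prefill_cap_per_slot = s.prefill_cap_per_slot;
}

// One-line aggregate summary, intended to be printed once at teardown.
inline std::string format(const trace_totals & t) {
    std::ostringstream os;
    os << "steps=" << t.steps
       << " decode_only_steps=" << t.decode_only_steps
       << " decode_tokens=" << t.decode_tokens
       << " prompt_tokens=" << t.prompt_tokens
       << " max_waiting_prompt_slots=" << t.max_waiting_prompt_slots
       << " started_prompt_slots=" << t.started_prompt_slots
       << " continued_prompt_slots=" << t.continued_prompt_slots
       << " last_n_batch=" << t.last_n_batch
       << " last_n_ubatch=" << t.last_n_ubatch
       << " last_prefill_budget_step=" << t.last_prefill_budget_step
       << " last_prefill_cap_per_slot=" << t.last_prefill_cap_per_slot;
    return os.str();
}

} // namespace sketch
```

Under these assumptions, a caller would fill one `trace_step` per scheduler step, call `accumulate()` only when `trace_enabled()` returns true, and print `format(totals)` once at teardown if `totals.steps > 0`. Keeping the accumulator a plain struct with free functions is what lets it be unit-tested without starting a server.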