mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): record serving admission trace
Assisted-by: Codex:gpt-5
@@ -2833,3 +2833,49 @@ Interpretation:
measure decode tokens admitted, prompt tokens admitted, waiting prompt slots,
graph reuse, and prefill starvation. Do not start with another GDN or GEMM
rewrite unless that trace rules the scheduler out.

## Phase 51 Serving Admission Trace

Phase 51 implements the Phase 50 next step in the llama.cpp fork. This is a
trace-only change, gated behind `LLAMA_SERVING_TRACE=1`; default inference and
batch scheduling are unchanged.

Fork commit:

- `/home/mudler/_git/llama.cpp` `localai-paged`
- `c6cb8460e feat(server): trace serving admission batches`

Change:

- Add `tools/server/server-admission-trace.h` with a small accumulator and
  formatter.
- Add `tests/test-server-admission-trace.cpp` and CMake target coverage.
- Wire counters into `server_context_impl::pre_decode()` for:
  decode tokens already in the batch, prompt tokens admitted, waiting prompt
  slots, started/continued prompt slots, decode-only steps, `n_batch`,
  `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`.
- Print one aggregate summary when the server context is destroyed, only when
  `LLAMA_SERVING_TRACE=1` and at least one scheduler step was observed.
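
The default-off gate described above can be illustrated with a minimal sketch. The helper names `serving_trace_enabled_from()` and `serving_trace_enabled()` are hypothetical, and treating only the exact value `1` as "on" is an assumption; how the fork actually parses the variable is not recorded here.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true only when the variable's value is exactly "1".
// Unset, empty, or any other value leaves the trace off (assumed policy).
inline bool serving_trace_enabled_from(const char * value) {
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical helper: reads LLAMA_SERVING_TRACE once and caches the result,
// so a per-step check costs a single bool load.
inline bool serving_trace_enabled() {
    static const bool enabled =
        serving_trace_enabled_from(std::getenv("LLAMA_SERVING_TRACE"));
    return enabled;
}
```

Splitting the parsing from the `std::getenv` call keeps the policy testable without touching the process environment.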

Verification:

- Red test first: `test-server-admission-trace` failed to build before
  `server-admission-trace.h` existed.
- Local fork: `test-server-admission-trace` built and passed, `llama-server`
  built, and `ctest --test-dir build -R '^test-server-admission-trace$'`
  passed.
- DGX artifact:
  `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`
- DGX `build-cuda`: `test-server-admission-trace` and `llama-server` built;
  CTest passed.
- DGX inference gates on the patched `build-cuda` build passed: MoE md5
  `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
  `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
  `MUL_MAT_ID` `806/806`.

Mirror status:

- The fork commit is local and DGX-gated.
- The LocalAI `patches/paged/` series is not regenerated yet because the
  handoff requires pushing the fork branch first, and pushes require explicit
  approval.

@@ -626,6 +626,20 @@ prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next
GB10 code work should instrument batch composition/admission in
`server_context::pre_decode()` before attempting another kernel shortcut.

Phase 51 implements that admission trace in the llama.cpp fork. Local fork
commit: `c6cb8460e feat(server): trace serving admission batches`. The trace is
default-off behind `LLAMA_SERVING_TRACE=1`, adds a small unit-tested accumulator,
and records aggregate `pre_decode()` scheduler shape: decode tokens, prompt
tokens admitted, waiting prompt slots, started/continued prompt slots,
decode-only steps, `n_batch`, `n_ubatch`, `prefill_budget_step`, and
`prefill_cap_per_slot`. DGX artifact:
`/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`. The
patched `build-cuda` CTest passed and inference gates stayed green: MoE md5
`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. Push and LocalAI patch-series regeneration are still pending because
push requires explicit approval.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -719,6 +733,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind.
- `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
- `~/bench/phase51_serving_admission_trace/20260701_110130` - default-off `LLAMA_SERVING_TRACE=1` fork commit `c6cb8460e`; DGX patched `build-cuda` CTest and md5/op gates green; push and LocalAI patch-series mirror pending approval.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -1216,6 +1216,33 @@ TTFT accounting. Next implementation target should be an opt-in
batch-composition/admission trace in `server_context::pre_decode()` before any
new GDN/GEMM shortcut.

### Phase 51 serving admission trace

Phase 51 adds that trace in the llama.cpp fork. Fork commit:
`c6cb8460e feat(server): trace serving admission batches`.

The change is default-off behind `LLAMA_SERVING_TRACE=1` and does not change
inference decisions. It records aggregate scheduler-shape counters from
`server_context_impl::pre_decode()`: decode tokens, prompt tokens admitted,
waiting prompt slots, started/continued prompt slots, decode-only steps,
`n_batch`, `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`.

Verification:

- Red test first: `test-server-admission-trace` failed to build before
  `server-admission-trace.h` existed.
- Local fork: unit test and `llama-server` build passed.
- DGX artifact:
  `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`
- DGX patched `build-cuda` CTest passed.
- DGX patched `build-cuda` inference gates stayed green: MoE md5
  `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
  `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
  `MUL_MAT_ID` `806/806`.

Mirror status: pending explicit approval to push the fork branch, then
regenerate the LocalAI patch series from the pushed fork commit.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update

@@ -0,0 +1,140 @@
# Phase51 Serving Admission Trace Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add an opt-in llama.cpp server trace that reports serving batch admission shape so dense high-N TTFT/aggregate gaps can be separated from true GPU decode speed.

**Architecture:** Implement fork-first on `mudler/llama.cpp:localai-paged`. Keep inference behavior unchanged by gating the trace behind `LLAMA_SERVING_TRACE`. Add a small unit-tested formatter/accumulator and wire counters into `server_context_impl::pre_decode()` without changing scheduling predicates.

**Tech Stack:** llama.cpp fork, `tools/server/server-context.cpp`, CMake unit test, DGX GB10 `build-cuda`, canonical md5 and backend-op gates.

---

### Task 1: Add red unit test

**Files:**
- Modify: `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt`
- Create: `/home/mudler/_git/llama.cpp/tests/test-server-admission-trace.cpp`

- [x] **Step 1: Add the test target and assertions**

Added `test-server-admission-trace.cpp`, asserting summary output includes
`steps`, `decode_only_steps`, `decode_tokens`, `prompt_tokens`,
`max_waiting_prompt_slots`, `started_prompt_slots`, `continued_prompt_slots`,
`last_n_batch`, `last_n_ubatch`, `last_prefill_budget_step`, and
`last_prefill_cap_per_slot`.
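
As an illustration of that kind of assertion, here is a minimal sketch of a key-presence check. It assumes the summary is a single line of space-separated `key=value` pairs, which this plan does not specify; `has_summary_key()` and `missing_summary_keys()` are hypothetical names, not functions from the fork.

```cpp
#include <string>
#include <vector>

// Summary keys named in the step above.
static const std::vector<std::string> k_expected_summary_keys = {
    "steps", "decode_only_steps", "decode_tokens", "prompt_tokens",
    "max_waiting_prompt_slots", "started_prompt_slots",
    "continued_prompt_slots", "last_n_batch", "last_n_ubatch",
    "last_prefill_budget_step", "last_prefill_cap_per_slot",
};

// Hypothetical helper: true if `key=` starts a field in `summary`.
// The boundary check stops "steps" from matching inside "decode_only_steps".
inline bool has_summary_key(const std::string & summary, const std::string & key) {
    const std::string needle = key + "=";
    for (std::string::size_type pos = summary.find(needle);
         pos != std::string::npos;
         pos = summary.find(needle, pos + 1)) {
        if (pos == 0 || summary[pos - 1] == ' ') {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: returns every expected key that is absent.
inline std::vector<std::string> missing_summary_keys(const std::string & summary) {
    std::vector<std::string> missing;
    for (const auto & key : k_expected_summary_keys) {
        if (!has_summary_key(summary, key)) {
            missing.push_back(key);
        }
    }
    return missing;
}
```

Returning the list of missing keys, rather than a single boolean, lets a failing test report exactly which counter dropped out of the summary.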

- [x] **Step 2: Verify red**

Run:

```bash
cmake -S . -B build >/tmp/llama-phase51-cmake.log
cmake --build build --target test-server-admission-trace -j2
```

Expected and observed: build failed because
`../tools/server/server-admission-trace.h` did not exist.

### Task 2: Implement opt-in trace

**Files:**
- Create: `/home/mudler/_git/llama.cpp/tools/server/server-admission-trace.h`
- Modify: `/home/mudler/_git/llama.cpp/tools/server/CMakeLists.txt`
- Modify: `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp`

- [x] **Step 1: Add accumulator and formatter**

Added `server_admission_trace_step`, `server_admission_trace_totals`,
`server_admission_trace_accumulate()`, and `server_admission_trace_format()`.
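
The shape of such an accumulator and formatter can be sketched as follows. This is not the fork's header: the struct layouts, the `sketch` namespace, the `key=value` output format, and the definition of a decode-only step (decode tokens present, no prompt tokens admitted) are all assumptions made for illustration. Only the counter names are taken from this plan.

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>

namespace sketch {

// One scheduler step as observed by the trace (hypothetical layout).
struct admission_step {
    int32_t decode_tokens          = 0;
    int32_t prompt_tokens          = 0;
    int32_t waiting_prompt_slots   = 0;
    int32_t started_prompt_slots   = 0;
    int32_t continued_prompt_slots = 0;
    int32_t n_batch                = 0;
    int32_t n_ubatch               = 0;
    int32_t prefill_budget_step    = 0;
    int32_t prefill_cap_per_slot   = 0;
};

// Aggregates across all observed steps (hypothetical layout).
struct admission_totals {
    int64_t steps                     = 0;
    int64_t decode_only_steps         = 0;
    int64_t decode_tokens             = 0;
    int64_t prompt_tokens             = 0;
    int32_t max_waiting_prompt_slots  = 0;
    int64_t started_prompt_slots      = 0;
    int64_t continued_prompt_slots    = 0;
    int32_t last_n_batch              = 0;
    int32_t last_n_ubatch             = 0;
    int32_t last_prefill_budget_step  = 0;
    int32_t last_prefill_cap_per_slot = 0;
};

inline void accumulate(admission_totals & t, const admission_step & s) {
    t.steps += 1;
    // Assumed definition of a decode-only step.
    if (s.decode_tokens > 0 && s.prompt_tokens == 0) {
        t.decode_only_steps += 1;
    }
    t.decode_tokens          += s.decode_tokens;
    t.prompt_tokens          += s.prompt_tokens;
    t.started_prompt_slots   += s.started_prompt_slots;
    t.continued_prompt_slots += s.continued_prompt_slots;
    t.max_waiting_prompt_slots =
        std::max(t.max_waiting_prompt_slots, s.waiting_prompt_slots);
    // Configuration-like values keep only the most recent observation.
    t.last_n_batch              = s.n_batch;
    t.last_n_ubatch             = s.n_ubatch;
    t.last_prefill_budget_step  = s.prefill_budget_step;
    t.last_prefill_cap_per_slot = s.prefill_cap_per_slot;
}

inline std::string format(const admission_totals & t) {
    std::ostringstream os;
    os << "steps=" << t.steps
       << " decode_only_steps=" << t.decode_only_steps
       << " decode_tokens=" << t.decode_tokens
       << " prompt_tokens=" << t.prompt_tokens
       << " max_waiting_prompt_slots=" << t.max_waiting_prompt_slots
       << " started_prompt_slots=" << t.started_prompt_slots
       << " continued_prompt_slots=" << t.continued_prompt_slots
       << " last_n_batch=" << t.last_n_batch
       << " last_n_ubatch=" << t.last_n_ubatch
       << " last_prefill_budget_step=" << t.last_prefill_budget_step
       << " last_prefill_cap_per_slot=" << t.last_prefill_cap_per_slot;
    return os.str();
}

} // namespace sketch
```

Keeping the accumulator as plain data with free functions means it can be unit-tested without constructing a server context.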

- [x] **Step 2: Wire counters into `pre_decode()`**

`LLAMA_SERVING_TRACE=1` now tracks:

- decode tokens already in the batch
- prompt tokens admitted this step
- waiting prompt slots seen by the prompt-admission loop
- started and continued prompt slots that actually admitted prompt tokens
- decode-only steps
- `n_batch`, `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`

The summary is printed once, when `server_context_impl` is destroyed, and only
if the trace is enabled and at least one step was observed.
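
The print-once-on-destruction behavior can be sketched with a small RAII guard. The type `trace_summary_guard`, its members, and the output text are hypothetical; the fork prints from `server_context_impl` itself, and its exact wording is not recorded in this plan.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>

// Hypothetical guard: counts steps while alive and emits one summary line
// from its destructor, only if enabled and at least one step was seen.
struct trace_summary_guard {
    bool           enabled;
    std::ostream & out;
    int64_t        steps = 0;

    trace_summary_guard(bool is_enabled, std::ostream & sink)
        : enabled(is_enabled), out(sink) {}

    // Called once per scheduler step; a no-op when the trace is off.
    void on_step() {
        if (enabled) {
            steps += 1;
        }
    }

    ~trace_summary_guard() {
        if (enabled && steps > 0) {
            out << "serving admission trace: steps=" << steps << "\n";
        }
    }
};
```

Because the output happens in the destructor, a disabled trace or an idle server produces no log line at all, which matches the "default inference unchanged" requirement.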

### Task 3: Verify locally and on DGX

**Files:**
- DGX artifact: `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`

- [x] **Step 1: Run local unit and server build**

Commands:

```bash
cmake -S . -B build >/tmp/llama-phase51-cmake.log
cmake --build build --target test-server-admission-trace -j2
./build/bin/test-server-admission-trace
cmake --build build --target llama-server -j2
ctest --test-dir build -R '^test-server-admission-trace$' --output-on-failure
```

Observed: unit test passed, `llama-server` built, CTest passed.

- [x] **Step 2: Apply patch to DGX mirror and build**

Applied the local patch to `dgx:~/llama-phase6-source`, then ran:

```bash
cmake -S . -B build-cuda
cmake --build build-cuda --target test-server-admission-trace llama-server -j2
ctest --test-dir build-cuda -R '^test-server-admission-trace$' --output-on-failure
```

Observed: DGX CTest passed.

- [x] **Step 3: Run canonical inference gate**

Run:

```bash
BIN=$HOME/llama-phase6-source/build-cuda/bin \
ART=$HOME/bench/phase51_serving_admission_trace/20260701_110130/gate_post \
OPS=MUL_MAT,MUL_MAT_ID \
$HOME/paged-inference-gates.sh
```

Observed:

- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT` `1146/1146`
- `MUL_MAT_ID` `806/806`
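
The op-gate results above read as `passed/total` counts. A minimal sketch of checking such a string for a fully green result follows; `parse_count()` and `op_gate_green()` are hypothetical helpers, and the assumption that the gate script reports counts in exactly this form comes only from how they are written in this plan.

```cpp
#include <string>

// Hypothetical helper: parses a non-empty string of decimal digits.
inline bool parse_count(const std::string & s, long & out) {
    if (s.empty()) {
        return false;
    }
    long value = 0;
    for (char c : s) {
        if (c < '0' || c > '9') {
            return false;
        }
        value = value * 10 + (c - '0');
    }
    out = value;
    return true;
}

// Hypothetical helper: green only if "passed/total" parses, total is
// non-zero, and every case passed.
inline bool op_gate_green(const std::string & result) {
    const std::string::size_type slash = result.find('/');
    if (slash == std::string::npos) {
        return false;
    }
    long passed = 0;
    long total  = 0;
    if (!parse_count(result.substr(0, slash), passed)) {
        return false;
    }
    if (!parse_count(result.substr(slash + 1), total)) {
        return false;
    }
    return total > 0 && passed == total;
}
```

Rejecting `0/0` guards against a gate that silently ran no cases and would otherwise look green.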

### Task 4: Commit and mirror

**Files:**
- Modify later: `backend/cpp/llama-cpp-localai-paged/patches/paged/`
- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`

- [x] **Step 1: Commit on the llama.cpp fork**

Local fork commit:

```text
c6cb8460e feat(server): trace serving admission batches
```

- [ ] **Step 2: Push fork branch**

Blocked by policy: ask before every push. Do not push without explicit approval.

- [ ] **Step 3: Regenerate LocalAI patch series**

Pending until the fork branch is pushed, per the fork-first mirror invariant.

- [x] **Step 4: Record Phase51 status in LocalAI docs**

Recorded the fork commit, DGX artifact, gates, and pending push/mirror state.