From 3dbf34e7399d5613c0afd34096fbb70a71a0cf27 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 09:40:50 +0000
Subject: [PATCH] docs(paged): record admission histogram trace

Record Phase54 trace-only histogram work, DGX md5/op gates, dense serving
histogram evidence, and the next scheduler decision.

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        |  79 ++++++++++
 .../docs/PARITY_HANDOFF.md                    |  15 +-
 .../docs/VLLM_PARITY_LEVER_MAP.md             |  32 ++++
 ...07-01-admission-histogram-trace-phase54.md | 139 ++++++++++++++++++
 4 files changed, 264 insertions(+), 1 deletion(-)
 create mode 100644 docs/superpowers/plans/2026-07-01-admission-histogram-trace-phase54.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 8c0c1694d..565a92190 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2976,3 +2976,82 @@ Decision:
 - Do not promote simple budget shrinkage as a parity lever. The next useful
   scheduler work is a richer per-step histogram trace or a targeted first-token
   admission policy, not a static lower `LLAMA_MAX_BATCH_TOKENS`.
+
+## Phase 54 Admission Histogram Trace
+
+Phase 54 extends the Phase51 trace with compact per-step histograms for prompt
+tokens, decode tokens, and waiting prompt slots. This is still trace-only and
+default-off behind `LLAMA_SERVING_TRACE=1`; it does not change scheduling or
+inference.
+
+Fork commits:
+
+- `c6cb8460e feat(server): trace serving admission batches`
+- `bd7b2e952 feat(server): add admission trace histograms`
+
+Artifact:
+
+- `/home/mudler/bench/phase54_admission_hist_trace/20260701_113201`
+
+Pre/post gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+Focused test/build:
+
+- Red test first: histogram assertions failed before implementation.
+- Local fork: `test-server-admission-trace` passed, CTest passed, and
+  `llama-server` built.
+- DGX `build-cuda`: `test-server-admission-trace` passed under CTest after the
+  temporary Phase51+Phase54 patch stack was applied.
+
+Phase52-aligned dense trace:
+
+- Dense GGUF: `~/bench/q36-27b-nvfp4.gguf`
+- `LLAMA_SERVING_TRACE=1`
+- `N=128`, `PTOK=168`, `GEN=64`
+- `CTX=131072`, `PARALLEL=128`, `BATCH=2048`, `UBATCH=512`
+
+H2H result:
+
+| n | prompt tokens | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | wall s |
+|---|---------------|---------|-----------------|---------------------|-------------|--------------|--------|
+| 128 | `22913` | `138.1` | `360.2` | `1.92` | `626.7` | `23393.2` | `59.303` |
+
+Trace:
+
+```text
+serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 started_prompt_slots=128 continued_prompt_slots=139 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1
+```
+
+Interpretation:
+
+- The Phase54 run matches the Phase52 serving envelope: same `76` steps, same
+  `8064` trace decode tokens, same `267` waiting prompt slots, and throughput
+  within noise.
+- `63/76` steps have `prompt_tokens=0` and `waiting_prompt_slots=0`.
+- Prompt admission is concentrated in a small number of very large chunks:
+  `prompt_hist=513+:12`.
+- Decode is mostly full-width during active decode:
+  `decode_hist=128-255:53`.
+- The scheduler still emits no pure decode-only steps for this shape.
+
+Decision:
+
+- The histogram strengthens the Phase53 rejection of static lower batch
+  budgets. The issue is not a uniformly oversized prompt budget every step;
+  prompt work arrives in a few large chunks and first-token latency remains high.
+- The next scheduler A/B should be a targeted first-token admission or prompt
+  front-loading policy that changes when first prompt chunks are admitted, while
+  keeping md5/op gates unchanged. Do not reduce `LLAMA_MAX_BATCH_TOKENS` globally
+  as the next parity lever.
+
+Mirror status:
+
+- Both trace commits are local and DGX-gated.
+- The LocalAI `patches/paged/` series is not regenerated yet because the
+  handoff requires pushing the fork branch first, and pushes require explicit
+  approval.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index c4ac3a910..00e26e7cb 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -20,7 +20,20 @@ Read order for a cold start:
 
 ## 1. TL;DR STATE
 
-- The investigation is **CLOSED**. Parity is **not reachable on GB10** silicon; the residual is a hardware ceiling, not engineering debt.
+> 2026-07-01 active update: Phase50-54 reopened the dense serving question.
+> True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`)
+> than the Phase47 h2h aggregate suggested, while traced serving still shows
+> no pure decode-only steps and high TTFT. Phase53 rejected static lower
+> admission budgets; Phase54 histograms show prompt admission concentrated in a
+> few large chunks (`prompt_hist=513+:12`) with mostly full-width decode
+> (`decode_hist=128-255:53`). Next scheduler work should be a targeted
+> first-token admission or prompt-front-loading A/B, not another global
+> `LLAMA_MAX_BATCH_TOKENS` reduction. The trace commits are local and DGX-gated
+> but not pushed, so the LocalAI patch series has not been regenerated.
+
+- Historical verdict: the older investigation marked GB10 parity **CLOSED** and
+  unreachable. Treat that as superseded where Phase50-54 provide newer dense
+  serving evidence.
 - **Prefill** is a genuine floor at **~36% (MoE) / ~43% (dense)** of vLLM. Prefill is **not** CUDA-graph-replayed, so these numbers are real, not measurement artifacts.
 - **Decode** is **near-parity: ~86% of vLLM's TRUE GPU-steady decode** (924 vs 1078 t/s). The long-standing **~56% headline was a CUDA-graph measurement artifact** (nsys without `--cuda-graph-trace=node` collapses each graph replay into one opaque launch). Decode is also **ahead of vLLM at low concurrency** (dense 116.7% at N=8) and uses **1.5-3x less memory**, bit-exact per-path.
 - The lever search was **exhaustive**: every attempt (prefill GEMM, GDN chunked scan, decode fusions, serving/scheduler) is recorded with its verdict and number so it is **not re-run**.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index b6eaf4489..402a36806 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1291,6 +1291,38 @@ throughput and prefill throughput fall, and TTFT does not materially improve.
 Next scheduler work should collect per-step histograms or test a targeted
 first-token admission policy.
 
+### Phase 54 admission histogram trace
+
+Phase54 extended the Phase51 default-off trace with prompt-token,
+decode-token, and waiting-slot histograms. Fork stack:
+`c6cb8460e feat(server): trace serving admission batches` and
+`bd7b2e952 feat(server): add admission trace histograms`.
+
+Artifact:
+`/home/mudler/bench/phase54_admission_hist_trace/20260701_113201`.
+
+Pre/post md5 and op gates stayed green on the temporary DGX patch stack:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+`MUL_MAT_ID` `806/806`.
+
+The Phase52-aligned dense run used `n=128`, `ptok=168`, `gen=64`, producing
+`prompt_tok_total=22913`, `agg_tps=138.1`, `decode_agg_tps=360.2`,
+`prefill_tps=626.7`, `ttft_mean_ms=23393.2`, and `wall_s=59.303`.
+
+Trace:
+
+```text
+steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1
+```
+
+Decision: the scheduler does not spend every step over-admitting prompt work.
+Most steps have no waiting prompts and no prompt tokens, while prompt admission
+is concentrated into a small number of large chunks. This rejects global
+budget-shrinkage as the next path and points to a targeted first-token
+admission or prompt-front-loading A/B, gated by the same md5 and backend-op
+checks.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-admission-histogram-trace-phase54.md b/docs/superpowers/plans/2026-07-01-admission-histogram-trace-phase54.md
new file mode 100644
index 000000000..bed2b83f3
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-admission-histogram-trace-phase54.md
@@ -0,0 +1,139 @@
+# Phase54 Admission Histogram Trace Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Extend the Phase51 default-off serving trace with compact per-step histograms so scheduler work can see whether the dense high-N run is dominated by a few very large prompt-admission steps, many small mixed steps, or waiting-slot tails.
+
+**Architecture:** Keep the trace fork-first and default-off behind `LLAMA_SERVING_TRACE=1`. Add only accumulator buckets and formatter output, then temporarily apply the Phase51+Phase54 stack to the DGX mirror, bracket with canonical md5/op gates, run the Phase52-aligned dense trace, and revert the DGX mirror.
+
+**Tech Stack:** llama.cpp fork, `tools/server/server-admission-trace.h`, CMake unit test, DGX GB10 `build-cuda`, `h2h_cli.py`, `paged-inference-gates.sh`.
+
+---
+
+### Task 1: Add red histogram assertions
+
+- [x] **Step 1: Extend the focused unit test**
+
+Added assertions to `tests/test-server-admission-trace.cpp` requiring:
+
+- `prompt_hist=0:1,257-512:1`
+- `decode_hist=128-255:2`
+- `waiting_hist=1-7:2`
+
+- [x] **Step 2: Verify red**
+
+Observed failure before implementation:
+
+```text
+missing 'prompt_hist=0:1,257-512:1'
+```
+
+### Task 2: Implement histogram counters
+
+- [x] **Step 1: Add bucket counters and formatting**
+
+Added prompt-token, decode-token, and waiting-slot histograms to
+`server_admission_trace_totals`. The formatter emits only nonzero buckets.
+
+- [x] **Step 2: Verify local green**
+
+Commands:
+
+```bash
+cmake --build build --target test-server-admission-trace -j2
+./build/bin/test-server-admission-trace
+ctest --test-dir build -R '^test-server-admission-trace$' --output-on-failure
+cmake --build build --target llama-server -j2
+```
+
+Observed: focused unit test passed, CTest passed, and `llama-server` built. The
+local UI asset build first hit a Node engine mismatch and then recovered through
+the repo's downloaded UI bundle path.
+
+### Task 3: Commit fork patch
+
+- [x] **Step 1: Commit on the llama.cpp fork**
+
+Local fork commit:
+
+```text
+bd7b2e952 feat(server): add admission trace histograms
+```
+
+Fork stack now has two unpushed trace commits:
+
+- `c6cb8460e feat(server): trace serving admission batches`
+- `bd7b2e952 feat(server): add admission trace histograms`
+
+- [ ] **Step 2: Push fork branch**
+
+Blocked by policy: ask before every push. Do not push without explicit approval.
+
+- [ ] **Step 3: Regenerate LocalAI patch series**
+
+Pending until the fork branch is pushed, per the fork-first mirror invariant.
+
+### Task 4: Verify on DGX
+
+- [x] **Step 1: Apply temporary stack and build**
+
+Applied `/tmp/phase54-admission-trace-stack.patch` to the clean
+`~/llama-phase6-source` mirror. Built `test-server-admission-trace`,
+`llama-server`, `llama-cli`, and `test-backend-ops` in `build-cuda`.
+
+DGX CTest passed:
+
+```bash
+ctest --test-dir build-cuda -R '^test-server-admission-trace$' --output-on-failure
+```
+
+- [x] **Step 2: Run canonical pre/post inference gates**
+
+Artifact:
+`/home/mudler/bench/phase54_admission_hist_trace/20260701_113201`.
+
+Pre and post gates both matched:
+
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT` `1146/1146`
+- `MUL_MAT_ID` `806/806`
+
+- [x] **Step 3: Run dense histogram trace**
+
+First diagnostic run used `--ptok 128` and produced `prompt_tok_total=17793`;
+kept as `paged_hist/`.
+
+The Phase52-aligned run used `--ptok 168`, matching the prior prompt envelope:
+
+```json
+{"n": 128, "reqs": 128, "gen_total": 8192, "prompt_tok_total": 22913, "gen_per_req": 64.0, "agg_tps": 138.1, "decode_agg_tps": 360.2, "decode_perseq_tps": 1.92, "prefill_tps": 626.7, "ttft_mean_ms": 23393.2, "ttft_max_ms": 36560.5, "wall_s": 59.303}
+```
+
+Trace:
+
+```text
+serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 started_prompt_slots=128 continued_prompt_slots=139 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1
+```
+
+### Task 5: Clean up and decide
+
+- [x] **Step 1: Revert temporary DGX stack**
+
+Reverted the temporary patch stack and removed the two untracked trace files it
+created on the DGX mirror. Final source tree was clean.
+
+Final DGX state:
+
+- Docker containers: `0`
+- GPU compute apps: `0`
+- Lock: `FREE released-by-codex-phase54-hist 1782898659`
+
+- [x] **Step 2: Record decision**
+
+The histogram shows the default scheduler spends `63/76` steps with no prompt
+tokens and no waiting prompts, then admits prompt work in a small number of very
+large prompt chunks (`prompt_hist=513+:12`). Decode remains mostly full-width
+(`decode_hist=128-255:53`) and there are still no pure decode-only steps. Static
+budget shrinkage is already rejected; the next scheduler A/B should target
+first-token admission or prompt-front loading, not lower global batch budgets.
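
Reviewer note (not part of the patch): the recorded decision depends on reading bucket counts out of the trace line, so a quick consistency check may be useful. The sketch below is a minimal example, assuming the space-separated `key=value` layout and the `label:count` bucket syntax seen in the abbreviated trace from the lever map. The helper names `parse_trace` and `parse_hist` are hypothetical and are not part of the fork. It also assumes each step is counted in exactly one bucket per histogram, which the recorded counts are consistent with but which this patch does not state.

```python
import re

# Abbreviated trace line as recorded in VLLM_PARITY_LEVER_MAP.md.
TRACE = (
    "steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 "
    "waiting_prompt_slots=267 max_waiting_prompt_slots=34 "
    "prompt_hist=0:63,1-64:1,513+:12 "
    "decode_hist=0:3,1-63:10,64-127:10,128-255:53 "
    "waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1"
)


def parse_trace(line):
    """Split a trace line into a dict of raw key=value strings (hypothetical helper)."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def parse_hist(value):
    """Turn '0:63,1-64:1,513+:12' into {'0': 63, '1-64': 1, '513+': 12}."""
    buckets = {}
    for item in value.split(","):
        label, count = item.rsplit(":", 1)
        buckets[label] = int(count)
    return buckets


fields = parse_trace(TRACE)
steps = int(fields["steps"])
hists = {k: parse_hist(v) for k, v in fields.items() if k.endswith("_hist")}

for name, buckets in hists.items():
    # Assumption: every step lands in exactly one bucket of each histogram,
    # so the bucket counts should add up to the step count.
    assert sum(buckets.values()) == steps, name

# Share of steps that admitted no prompt tokens (the `63/76` figure above).
idle_share = hists["prompt_hist"]["0"] / steps
print(f"steps without prompt tokens: {hists['prompt_hist']['0']}/{steps} ({idle_share:.1%})")
```

All three histograms sum to `steps=76` for the recorded run, which supports the `63/76` reading in the decision. A trace where a histogram total differs from `steps` would suggest either a different bucketing rule or a truncated line.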