diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index bff1af61f..126f7fb3e 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2879,3 +2879,62 @@ Mirror status:
 
 - The LocalAI `patches/paged/` series is not regenerated yet because the handoff
   requires pushing the fork branch first, and pushes require explicit approval.
+
+## Phase 52 Dense Admission Trace
+
+Phase 52 uses the Phase51 trace to capture the actual dense `n=128` serving
+admission shape. The Phase51 patch was applied temporarily to the clean DGX
+mirror, built, gated, used for the trace, and then reverted from the mirror.
+
+Artifact:
+
+- `/home/mudler/bench/phase52_dense_admission_trace/20260701_111017`
+
+Pre/post gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+Clean run shape:
+
+- Dense GGUF: `~/bench/q36-27b-nvfp4.gguf`
+- `LLAMA_SERVING_TRACE=1`
+- `N=128`, `PTOK=128`, `GEN=64`
+- `CTX=131072`, `PARALLEL=128`, `BATCH=2048`, `UBATCH=512`
+
+H2H result:
+
+| n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | wall s |
+|---|---------|-----------------|---------------------|-------------|--------------|--------|
+| 128 | `139.0` | `360.5` | `1.93` | `629.5` | `23171.5` | `58.921` |
+
+Admission trace:
+
+| steps | decode-only steps | decode tokens | prompt tokens | waiting prompt slots | max waiting prompt slots | started prompt slots | continued prompt slots |
+|-------|-------------------|---------------|---------------|----------------------|--------------------------|----------------------|------------------------|
+| `76` | `0` | `8064` | `22785` | `267` | `35` | `128` | `139` |
+
+Derived values:
+
+- `prompt_tokens` matched h2h `prompt_tok_total` exactly: `22785`.
+- `decode_tokens` were `128` fewer than h2h `gen_total`, which is expected for
+  one first-token transition per request.
+- Average prompt tokens per scheduler step: `299.8`.
+- Average decode tokens per scheduler step: `106.11`.
+- Average waiting prompt slots per scheduler step: `3.51`.
+- `prefill_budget_step=0` and `prefill_cap_per_slot=0`, confirming the default
+  stock n-batch-only prompt admission path.
+
+Decision:
+
+- The default dense `n=128` scheduler emits no pure decode steps
+  (`decode_only_steps=0`) and admits prompt work across mixed steps. That
+  explains why Phase47 h2h serving decode can lag the Phase50 true-decode ratio:
+  serving is shaped by mixed prompt/decode admission and TTFT, not just dense
+  decode kernels.
+- The next code phase should be a small, default-off scheduler A/B or a richer
+  per-step histogram trace to test whether prefill chunking/admission can reduce
+  TTFT without regressing aggregate throughput. Do not move to another GDN/GEMM
+  rewrite until this scheduler hypothesis is tested.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index eb68bbb19..82a9c65ef 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -640,6 +640,18 @@ patched `build-cuda` CTest passed and inference gates stayed green: MoE
 `806/806`. Push and LocalAI patch-series regeneration are still pending because
 push requires explicit approval.
 
+Phase 52 uses the Phase51 trace on DGX for dense `n=128`, `ptok=128`, `gen=64`.
+Artifact: `/home/mudler/bench/phase52_dense_admission_trace/20260701_111017`.
+Pre/post md5 and op gates stayed green. The clean traced h2h row was
+`decode_agg_tps=360.5`, `prefill_tps=629.5`, `ttft_mean_ms=23171.5`, wall
+`58.921s`. The admission trace reported `steps=76`, `decode_only_steps=0`,
+`decode_tokens=8064`, `prompt_tokens=22785`, `max_waiting_prompt_slots=35`,
+`started_prompt_slots=128`, `continued_prompt_slots=139`,
+`prefill_budget_step=0`, and `prefill_cap_per_slot=0`. The prompt token count
+matches h2h exactly, so this is the target request. The next GB10 lever should
+be a default-off scheduler/admission A/B or a per-step histogram trace, not an
+immediate GDN/GEMM rewrite.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -734,6 +746,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
 - `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
 - `~/bench/phase51_serving_admission_trace/20260701_110130` - default-off `LLAMA_SERVING_TRACE=1` fork commit `c6cb8460e`; DGX patched `build-cuda` CTest and md5/op gates green; push and LocalAI patch-series mirror pending approval.
+- `~/bench/phase52_dense_admission_trace/20260701_111017` - clean dense `n=128` admission trace; pre/post gates green; `decode_only_steps=0`, `prompt_tokens=22785`, `max_waiting_prompt_slots=35`; next lever is scheduler/admission A/B or per-step histogram trace.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 15bb6c7b7..16db37360 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1243,6 +1243,30 @@ Verification:
 Mirror status: pending explicit approval to push the fork branch, then
 regenerate the LocalAI patch series from the pushed fork commit.
 
+### Phase 52 dense admission trace
+
+Phase52 used the Phase51 trace on DGX to measure dense `n=128`, `ptok=128`,
+`gen=64` llama-server admission. Artifact:
+`/home/mudler/bench/phase52_dense_admission_trace/20260701_111017`.
+
+The traced build was bracketed by canonical gates, all green before and after:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+`MUL_MAT_ID` `806/806`.
+
+Clean trace:
+
+| h2h wall s | decode agg t/s | TTFT mean ms | steps | decode-only steps | decode tokens | prompt tokens | max waiting prompt slots |
+|------------|-----------------|--------------|-------|-------------------|---------------|---------------|--------------------------|
+| `58.921` | `360.5` | `23171.5` | `76` | `0` | `8064` | `22785` | `35` |
+
+Decision: the default scheduler never emitted pure decode steps for this
+high-N dense run. Prompt tokens matched h2h exactly, and prompt admission used
+the stock path (`prefill_budget_step=0`, `prefill_cap_per_slot=0`). This
+supports the Phase50 conclusion that the remaining high-N serving gap is
+scheduler/admission and TTFT shaped. Next lever should be a default-off
+admission-policy A/B or per-step histogram trace, not immediate kernel work.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-dense-admission-trace-phase52.md b/docs/superpowers/plans/2026-07-01-dense-admission-trace-phase52.md
new file mode 100644
index 000000000..81ccea8de
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-dense-admission-trace-phase52.md
@@ -0,0 +1,105 @@
+# Phase52 Dense Admission Trace Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Use the Phase51 `LLAMA_SERVING_TRACE=1` fork patch to capture dense `n=128` llama-server admission counters and determine whether high-N serving loss is scheduler/admission-driven.
+
+**Architecture:** Temporarily apply the Phase51 fork patch to the clean DGX mirror, build the patched server, bracket the traced serving run with canonical md5/op gates, run one dense `n=128`, `ptok=128`, `gen=64` h2h workload, parse the aggregate trace, then revert the DGX mirror.
+
+**Tech Stack:** DGX GB10, `~/llama-phase6-source/build-cuda`, `h2h_cli3.py`, `paged-inference-gates.sh`, LocalAI parity docs.
+
+---
+
+### Task 1: Prepare patched DGX build
+
+**Files:**
+- DGX artifact: `/home/mudler/bench/phase52_dense_admission_trace/20260701_111017`
+
+- [x] **Step 1: Check DGX preflight**
+
+Observed before applying the patch: docker `0`, `local-ai-worker` `0`,
+compute `0`, owner `FREE released-by-codex-phase50-dense-true-decode
+1782895927`.
+
+- [x] **Step 2: Apply Phase51 patch and build**
+
+Applied `/tmp/phase51-serving-admission-trace.patch` to
+`~/llama-phase6-source`. Built `llama-server`, `llama-completion`, and
+`test-backend-ops` in `build-cuda`.
+
+### Task 2: Gate before trace
+
+- [x] **Step 1: Run canonical pre-trace inference gate**
+
+Observed:
+
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT` `1146/1146`
+- `MUL_MAT_ID` `806/806`
+
+### Task 3: Run dense admission trace
+
+- [x] **Step 1: Run warm trace**
+
+First trace included warmup and was kept only as a secondary artifact:
+`paged/`. Because `started_prompt_slots=136`, it combined warmup `n=8` and the
+target `n=128` request.
+
+- [x] **Step 2: Run clean `n=128` trace**
+
+Clean artifact: `paged_clean/`.
+
+H2H row:
+
+```json
+{"n": 128, "reqs": 128, "gen_total": 8192, "prompt_tok_total": 22785, "gen_per_req": 64.0, "agg_tps": 139.0, "decode_agg_tps": 360.5, "decode_perseq_tps": 1.93, "prefill_tps": 629.5, "ttft_mean_ms": 23171.5, "ttft_max_ms": 36195.3, "wall_s": 58.921}
+```
+
+Trace row:
+
+```text
+serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22785 waiting_prompt_slots=267 max_waiting_prompt_slots=35 started_prompt_slots=128 continued_prompt_slots=139 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0
+```
+
+Parsed summary: `phase52_summary.json`.
+
+### Task 4: Gate after trace and clean DGX
+
+- [x] **Step 1: Run canonical post-trace inference gate**
+
+Observed:
+
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT` `1146/1146`
+- `MUL_MAT_ID` `806/806`
+
+- [x] **Step 2: Revert temporary DGX patch**
+
+Reverted `/tmp/phase51-serving-admission-trace.patch` from
+`~/llama-phase6-source`. Final DGX state: docker `0`, `local-ai-worker` `0`,
+compute `0`, owner `FREE released-by-codex-phase52-dense-admission-trace-clean
+1782897309`.
+
+### Task 5: Record decision
+
+- [x] **Step 1: Update parity docs**
+
+Record Phase52 artifact and interpretation:
+
+- Prompt tokens admitted by the server trace exactly match h2h
+  `prompt_tok_total`, so the trace maps to the target request.
+- `decode_only_steps=0`, so the default scheduler never emits pure decode steps
+  for this dense high-N serving shape.
+- Prompt admission happens in `76` scheduler steps, averaging `299.8` prompt
+  tokens and `106.11` decode tokens per step, with up to `35` waiting prompt
+  slots.
+- `prefill_budget_step=0` and `prefill_cap_per_slot=0` confirm stock
+  n-batch-only prompt admission was used.
+- Next candidate should be an A/B of a small, default-off admission policy or a
+  trace extension with per-step histograms, not another immediate kernel rewrite.
+
+- [x] **Step 2: Commit LocalAI docs**
+
+Commit this plan and parity doc updates with `Assisted-by: Codex:gpt-5`.
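
Review note: the derived values recorded in the parity docs (per-step averages, the `128`-token gap between `gen_total` and `decode_tokens`, and the prompt-token match) can be recomputed from the two rows quoted in Task 3. The following is a minimal sketch, not the script that produced `phase52_summary.json`; the function names `parse_trace` and `derive` and the output keys are hypothetical, and the only inputs assumed are the H2H JSON row and the `serving admission trace:` line exactly as quoted above.

```python
import json

# Rows copied verbatim from Task 3 of the plan.
H2H_ROW = (
    '{"n": 128, "reqs": 128, "gen_total": 8192, "prompt_tok_total": 22785, '
    '"gen_per_req": 64.0, "agg_tps": 139.0, "decode_agg_tps": 360.5, '
    '"decode_perseq_tps": 1.93, "prefill_tps": 629.5, "ttft_mean_ms": 23171.5, '
    '"ttft_max_ms": 36195.3, "wall_s": 58.921}'
)
TRACE_ROW = (
    "serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 "
    "prompt_tokens=22785 waiting_prompt_slots=267 max_waiting_prompt_slots=35 "
    "started_prompt_slots=128 continued_prompt_slots=139 last_n_batch=2048 "
    "last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0"
)

TRACE_PREFIX = "serving admission trace:"


def parse_trace(line):
    """Parse the space-separated key=value counters after the trace prefix."""
    if not line.startswith(TRACE_PREFIX):
        raise ValueError("not a serving admission trace line")
    fields = line[len(TRACE_PREFIX):].split()
    return {key: int(value) for key, value in (f.split("=", 1) for f in fields)}


def derive(trace, h2h):
    """Recompute the derived values listed in the parity docs (hypothetical keys)."""
    steps = trace["steps"]
    return {
        # The trace maps to the target request only if prompt totals agree.
        "prompt_tokens_match": trace["prompt_tokens"] == h2h["prompt_tok_total"],
        # One first-token transition per request is not counted as a decode token.
        "first_token_gap": h2h["gen_total"] - trace["decode_tokens"],
        "avg_prompt_tokens_per_step": round(trace["prompt_tokens"] / steps, 2),
        "avg_decode_tokens_per_step": round(trace["decode_tokens"] / steps, 2),
        "avg_waiting_prompt_slots_per_step": round(
            trace["waiting_prompt_slots"] / steps, 2
        ),
        # Zero pure-decode steps means every step carried prompt work.
        "has_pure_decode_steps": trace["decode_only_steps"] > 0,
    }


h2h = json.loads(H2H_ROW)
trace = parse_trace(TRACE_ROW)
summary = derive(trace, h2h)
print(json.dumps(summary, indent=2))
```

Keeping the check as a pure function of the two quoted rows means a later scheduler A/B run can reuse it unchanged, provided the trace line keeps the same `key=value` layout.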