From cced07c7feaaf16e1c400b1a30a32f3bfdbe94e1 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 02:54:29 +0000 Subject: [PATCH] docs(paged): add MTP shape trace patch Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results. Assisted-by: Codex:gpt-5 --- backend/cpp/llama-cpp-localai-paged/README.md | 9 +- .../docs/GB10_PARITY_PHASE0_RESULTS.md | 50 ++++++++ .../docs/PARITY_HANDOFF.md | 18 +++ .../docs/VLLM_PARITY_LEVER_MAP.md | 32 +++++ ...erver-trace-speculative-batch-shapes.patch | 57 +++++++++ .../2026-07-01-mtp-shape-trace-phase18.md | 112 ++++++++++++++++++ 6 files changed, 277 insertions(+), 1 deletion(-) create mode 100644 backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-server-trace-speculative-batch-shapes.patch create mode 100644 docs/superpowers/plans/2026-07-01-mtp-shape-trace-phase18.md diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md index 0ad0604a0..18875a98c 100644 --- a/backend/cpp/llama-cpp-localai-paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -87,7 +87,7 @@ orthogonal to the paged allocator. --- -## 3. Patch series (0001-0047) +## 3. Patch series (0001-0055) Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 / @@ -207,6 +207,13 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact | 0047 | **GDN M5 tensor-core chunked-scan prefill, f32-only re-port, default-ON under paged KV** - the f32/tf32 tensor-core forms of 0031's scan (KK/QK Gram = M2, KS/QS state-boundary 3xtf32 = M3, P*U output = M4, full form-T solve + state-update mma = M5), single build, runtime-selected by `GDN_TC`. Ships **M5 default-on when `LLAMA_KV_PAGED` is set** (`GDN_TC=5` + `GDN_CHUNK_MIN=64`, both env-overridable; OFF/`INT_MAX` when not paged). 
`GDN_CHUNK_MIN` is the per-call engage threshold and stays > 1 so decode (1 tok/call) keeps the sequential recurrence (at 1 it swallows decode and drops S_TG ~25%); 64 tuned from a {1,32,64,128,256} sweep. The bf16/hybrid dev-tree machinery (STATE_BF16/HYBRID, the dropped 0026 ssm_bf16_tau) and the bf16 CONFIG-C (M8) plus register-resident M6/M7 variants are NOT part of this f32-only series. MoE prefill S_PP +3.5% @npp512 (3x A/B), +17.7% @npp2048; decode S_TG unchanged. | NEW per-path, benign (`test-backend-ops` GATED_DELTA_NET 46/46 default AND force-M5, incl. multi-chunk/tail-chunk/multi-seq; greedy md5 default-on == M5-forced == canonical on the gate prompt: paged-MoE `8cb0ce23`, dense `5951a5b4`; long MoE prompt = one benign greedy flip vs sequential, dense byte-identical) | | 0046 | **GDN prefill geometry gated by scan length** - patch 0022's `(NUM_WARPS=16, COLS_PER_WARP=8)` column-fold of the GDN sequential-recurrence dispatch (`case 128`) is a decode win but was applied UNCONDITIONALLY, so it also hit dense prefill (~-6% vs stock): on a long sequential scan the launch `grid.z` collapses from `S_v/4 = 32` to `S_v/(16*8) = 1` and the SMs starve (profiled: `gated_delta_net` +54% GPU time = the whole dense-prefill regression). Gate the geometry by per-call scan length: long scans (prefill, `n_tokens >= GDN_PREFILL_NTOK`, default 256) take stock's high-grid.z `(4,1)` geometry; short scans (decode) keep the `(16,8)` retune. Recovers dense prefill +7.2% back to stock parity, keeps the decode win. `GDN_PREFILL_NTOK` tunes the crossover; an explicit `GDN_NW`/`GDN_CPW` sweep still overrides (gate yields when either is set), so the one-build %peak A/B harness is unchanged. 
| yes (patch 0022 proved every `{NW,CPW}` variant byte-identical, so switching geometry by scan length cannot move the md5) | +### Speculative / MTP investigation (0054, 0055) + +| # | What it does | Bit-exact / effect | +|---|---|---| +| 0054 | **Disable backend sampling for MTP drafts** - forces server MTP draft generation through the target-side sampler acceptance path instead of letting the draft backend sample independently. This was required for the Phase 14 rollback/prefix safety gate. | yes for canonical non-MTP gates; Phase 14 MTP normalized greedy-prefix gate passed | +| 0055 | **Trace speculative batch shapes** - adds default-off `LLAMA_SPEC_SHAPE_TRACE=1` server logs around `server_slot::handle_last_sampled_token()`, reporting normal decode rows and MTP verification `K + 1` rows (`draft`, `outputs`, `spec_i_first`, `spec_i_last`). This is instrumentation only for Phase 18 shape-entropy measurement before any scheduler experiment. | yes (env unset is silent; DGX gates after patch: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`) | + > **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once > the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) > landed, the bf16-SSM lever bought nothing: a clean re-measurement forcing **all** diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index abe98d9fc..961734ea3 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -1253,3 +1253,53 @@ Conclusion: production candidate must reduce `mmq_nvfp4` or activation movement directly, stay free of D2H id readback and new stream synchronizations, and then pass the same md5/op gates before any serving A/B is considered. 
+ +## Phase 18 MTP Shape Trace + +Phase 18 implemented the Phase 17 instrumentation-only recommendation as +patch `0055-feat-server-trace-speculative-batch-shapes.patch`. + +Implementation summary: + +- Added default-off `LLAMA_SPEC_SHAPE_TRACE=1` logging in + `server_slot::handle_last_sampled_token()`. +- Normal decode logs one row/output per slot. +- MTP verification logs `K + 1` rows/outputs per speculative slot, including + draft length and `slot.spec_i_batch` range. +- No scheduler, graph-key, KV, logits, acceptance, or rollback behavior changed. + +Red/green trace artifacts: + +- Red check before patch: `/home/mudler/bench/phase18_mtp_shape_trace_red` +- Green check after patch: `/home/mudler/bench/phase18_mtp_shape_trace_green` + +Green trace sample: + +```text +spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5 +spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=6 slot_tokens=6 +spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9 +``` + +Disabled-env check: + +- `LLAMA_SPEC_SHAPE_TRACE` unset emitted no `spec shape:` lines. + +Inference gate artifact: + +- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after` + +Safety result: + +- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Full `MUL_MAT_ID`: `806/806` on CUDA0. + +Conclusion: + +- Patch 0055 is safe instrumentation and does not break inferencing on the + canonical gated paths. +- The trace confirms per-step MTP verification shape variation even in a tiny + request (`rows=4` and `rows=3`). +- A follow-up scheduler experiment is not yet justified. First use this trace + under real serving load to measure draft-length bucket entropy. 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 56cfde7da..5516e197e 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -244,6 +244,24 @@ rollback semantics. If reopened, start with a server-only shape counter around group/defer-by-draft-length scheduler experiment, with TTFT/throughput and md5/op gates as kill criteria. +Phase 18 added the server-only shape trace as patch 0055. Set +`LLAMA_SPEC_SHAPE_TRACE=1` to log `kind=decode` rows and MTP `kind=verify` +`K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`. +This is default-off instrumentation only. DGX green check after the patch saw +MTP verify shapes vary (`rows=4`, then `rows=3`) on a tiny request, while the +env-unset run emitted no `spec shape:` lines. Canonical post-patch gates passed: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. +Artifacts: +`/home/mudler/bench/phase18_mtp_shape_trace_green` and +`/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`. + +Next MTP step, if any: trace real serving shape entropy first. Do not implement +a scheduler change until the trace shows repeatable draft-length buckets worth +grouping. Any scheduler experiment must be opt-in/default-off and killed by +TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP +rollback/prefix gate failure. + --- ## 5. 
METHODOLOGY LESSONS (so you do not repeat the mistakes) diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index a3b95e7fd..2ed083ee7 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -554,6 +554,38 @@ Only after that should an opt-in scheduling experiment group/defer MTP verification by `1 + spec_draft.size()`. Keep it default-off and kill it if TTFT or throughput regresses, graph reuse does not recover, or the md5/op gates drift. +### Phase 18 MTP shape trace + +Phase 18 added that instrumentation-only patch as 0055. Set +`LLAMA_SPEC_SHAPE_TRACE=1` to log normal decode rows and MTP verification +`K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`. +It is default-off and does not change scheduling, graph keys, logits, KV state, +acceptance, or rollback behavior. + +Red/green result: + +- before patch, `LLAMA_SPEC_SHAPE_TRACE=1` emitted no `spec shape:` lines; +- after patch, a tiny MTP request emitted `kind=verify` shapes with `rows=4` + and `rows=3`; +- with the env var unset, the patched server emitted no `spec shape:` lines. + +Canonical post-patch inference gates stayed green: + +- MoE `8cb0ce23777bf55f92f63d0292c756b0`; +- dense `5951a5b4d624ce891e22ab5fca9bc439`; +- `MUL_MAT_ID` `806/806`. + +Artifacts: + +- `/home/mudler/bench/phase18_mtp_shape_trace_green` +- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after` + +Follow-up scope: before any source behavior change, run a trace-only real +serving entropy measurement. Only if repeatable draft-length buckets appear +should an opt-in group/defer-by-draft-length scheduler be built; kill it on +TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP +rollback/prefix gate failure. 
+ Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. ### Phase 10 GDN C32 slab update diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-server-trace-speculative-batch-shapes.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-server-trace-speculative-batch-shapes.patch new file mode 100644 index 000000000..3d2081631 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-server-trace-speculative-batch-shapes.patch @@ -0,0 +1,57 @@ +From fb9402661291e0488a3e2bf2f3948ebcd18e18c9 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Wed, 1 Jul 2026 02:41:22 +0000 +Subject: [PATCH] feat(server): trace speculative batch shapes + +Add an env-gated LLAMA_SPEC_SHAPE_TRACE log around the server batch rows emitted by normal decode and speculative verification slots. This keeps the instrumentation default-off while exposing the row/output shape entropy that prevents CUDA graph reuse under MTP serving. 
+ +Assisted-by: Codex:gpt-5 +--- + tools/server/server-context.cpp | 20 ++++++++++++++++++-- + 1 file changed, 18 insertions(+), 2 deletions(-) + +diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp +index a77e2676d..fd8348af6 100644 +--- a/tools/server/server-context.cpp ++++ b/tools/server/server-context.cpp +@@ -457,12 +457,22 @@ struct server_slot { + + // add sampled token of this slot to the batch, optionally add the speculative draft tokens if any + void handle_last_sampled_token(server_batch & batch) { ++ static const bool spec_shape_trace = getenv("LLAMA_SPEC_SHAPE_TRACE") != nullptr; ++ const int32_t batch_before = batch.size(); ++ + bool add_ok = true; + if (spec_draft.empty()) { + // no speculative decoding +- i_batch = batch.size(); ++ i_batch = batch_before; ++ ++ const int32_t pos0 = prompt.tokens.pos_next(); ++ ++ add_ok &= batch.add(id, sampled, pos0, true); + +- add_ok &= batch.add(id, sampled, prompt.tokens.pos_next(), true); ++ if (spec_shape_trace) { ++ SLT_INF(*this, "spec shape: kind=decode batch_before=%d rows=1 outputs=1 draft=0 pos0=%d slot_tokens=%zu\n", ++ batch_before, pos0, prompt.tokens.size()); ++ } + + SLT_DBG(*this, "slot decode token, id=%d, n_ctx = %d, n_tokens = %d, truncated = %d\n", + sampled, n_ctx, prompt.n_tokens(), truncated); +@@ -479,6 +489,12 @@ struct server_slot { + + auto pos0 = prompt.tokens.pos_next(); + ++ if (spec_shape_trace) { ++ SLT_INF(*this, "spec shape: kind=verify batch_before=%d rows=%zu outputs=%zu draft=%zu spec_i_first=%d spec_i_last=%d pos0=%d slot_tokens=%zu\n", ++ batch_before, spec_draft.size() + 1, spec_draft.size() + 1, spec_draft.size(), ++ spec_i_batch.front(), spec_i_batch.back(), pos0, prompt.tokens.size()); ++ } ++ + add_ok &= batch.add(id, sampled, pos0++, true); + for (auto token : spec_draft) { + add_ok &= batch.add(this->id, token, pos0++, true); +-- +2.43.0 + diff --git a/docs/superpowers/plans/2026-07-01-mtp-shape-trace-phase18.md 
b/docs/superpowers/plans/2026-07-01-mtp-shape-trace-phase18.md new file mode 100644 index 000000000..c4aaa0c44 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-mtp-shape-trace-phase18.md @@ -0,0 +1,112 @@ +# MTP Shape Trace Phase 18 Plan + +> **For agentic workers:** REQUIRED SUB-SKILLS: Use +> superpowers:test-driven-development before source edits and +> superpowers:verification-before-completion before commit. Steps use checkbox +> (`- [ ]`) syntax for tracking. + +**Goal:** add a default-off, inference-safe trace for speculative/MTP server +batch shape entropy before considering any scheduler experiment. + +**Architecture:** keep this as a server-only instrumentation patch in +`server_slot::handle_last_sampled_token()`. Do not change speculative +acceptance, rollback, logits, KV writes, graph-reuse keys, or scheduling. + +**Tech Stack:** llama.cpp `tools/server/server-context.cpp`, LocalAI paged +patch stack, DGX GB10 validation. + +--- + +## Task 1: Red Check + +- [x] **Step 1: Prove the trace does not already exist** + + Ran a direct MTP `llama-server` request on DGX with + `LLAMA_SPEC_SHAPE_TRACE=1` before the source patch. + + Result: + + - no `spec shape:` lines were emitted, + - artifact: `/home/mudler/bench/phase18_mtp_shape_trace_red`. + +## Task 2: Instrumentation Patch + +- [x] **Step 1: Add an env-gated trace** + + Added `LLAMA_SPEC_SHAPE_TRACE=1` logging in + `server_slot::handle_last_sampled_token()`: + + - normal decode rows: `kind=decode`, `rows=1`, `outputs=1`, `draft=0`, + - speculative verification rows: `kind=verify`, `rows=K+1`, + `outputs=K+1`, `draft=K`, `spec_i_first`, `spec_i_last`. + + The env var is default-off and does not alter batch contents. 
+ +- [x] **Step 2: Keep the patch incremental** + + Local fork commit: + + - `fb9402661 feat(server): trace speculative batch shapes` + + LocalAI patch: + + - `0055-feat-server-trace-speculative-batch-shapes.patch` + +## Task 3: Green Checks + +- [x] **Step 1: Build and validate trace behavior on DGX** + + DGX mirror commit: + + - `f2521ab12 feat(server): trace speculative batch shapes` + + Build: + + - `cmake --build build-cuda --target llama-server -j$(nproc)` + + Trace-enabled result: + + ```text + spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5 + spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=6 slot_tokens=6 + spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9 + ``` + + Trace-disabled result: + + ```text + trace disabled: no spec shape lines + ``` + + Artifact: + + - `/home/mudler/bench/phase18_mtp_shape_trace_green` + +- [x] **Step 2: Run canonical inference gates** + + Artifact: + + - `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after` + + Result: + + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT_ID`: `806/806` + +## Task 4: Follow-Up Boundary + +- [x] **Step 1: Scope Phase 19** + + Use the trace to measure shape entropy under real serving load before any + behavior change. A Phase 19 scheduler experiment is allowed only if the trace + shows repeatable draft-length buckets worth grouping. It must be opt-in, + default-off, and killed by TTFT/throughput regression, graph-reuse failure, + md5/op drift, or MTP rollback/prefix gate failure. + +## Self-Review + +- No default behavior changed. +- The trace is read-only with respect to batch contents and slot state. +- The post-patch canonical md5/op gates passed, so this instrumentation did not + break inferencing on the gated paths.
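
The Phase 19 scope asks for a draft-length bucket measurement from the trace, but the patch itself only emits log lines. The following is a minimal sketch of one way such a measurement could be taken, assuming the `spec shape:` line format shown in the green trace sample and assuming that "shape entropy" means the Shannon entropy (in bits) of the `draft` length distribution over `kind=verify` lines. That definition is an assumption, not something the patch or plan states, and `parse_spec_shape` and `draft_length_entropy` are hypothetical helper names that are not part of llama.cpp or LocalAI.

```python
import math
import re
from collections import Counter

# Matches key=value pairs such as "kind=verify" or "draft=3".
_FIELD = re.compile(r"(\w+)=(\S+)")
_MARKER = "spec shape:"


def parse_spec_shape(line):
    """Return the key/value fields of a `spec shape:` line, or None.

    Anything before the marker (timestamps, slot prefixes) is ignored.
    """
    idx = line.find(_MARKER)
    if idx < 0:
        return None
    return dict(_FIELD.findall(line[idx + len(_MARKER):]))


def draft_length_entropy(lines):
    """Count verify rows per draft length and return (counts, entropy in bits).

    Assumption: entropy here is plain Shannon entropy of the draft-length
    histogram; the source documents do not define the metric.
    """
    counts = Counter()
    for line in lines:
        fields = parse_spec_shape(line)
        if fields and fields.get("kind") == "verify":
            counts[int(fields["draft"])] += 1
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return counts, entropy


# The three lines from the green trace sample.
sample = [
    "spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5",
    "spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=6 slot_tokens=6",
    "spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9",
]
counts, entropy = draft_length_entropy(sample)
print(dict(counts), round(entropy, 3))  # {3: 2, 2: 1} 0.918
```

On the three-line sample the histogram is two verify steps at `draft=3` and one at `draft=2`, roughly 0.918 bits. A real serving capture would need far more lines before the bucket distribution says anything about whether grouping by draft length is worthwhile.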