docs(paged): add MTP shape trace patch

Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results.

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 02:54:29 +00:00
parent 6e35476340
commit cced07c7fe
6 changed files with 277 additions and 1 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -87,7 +87,7 @@ orthogonal to the paged allocator.

 ---

-## 3. Patch series (0001-0047)
+## 3. Patch series (0001-0055)

 Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
 decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
@@ -207,6 +207,13 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 | 0047 | **GDN M5 tensor-core chunked-scan prefill, f32-only re-port, default-ON under paged KV** - the f32/tf32 tensor-core forms of 0031's scan (KK/QK Gram = M2, KS/QS state-boundary 3xtf32 = M3, P*U output = M4, full form-T solve + state-update mma = M5), single build, runtime-selected by `GDN_TC`. Ships **M5 default-on when `LLAMA_KV_PAGED` is set** (`GDN_TC=5` + `GDN_CHUNK_MIN=64`, both env-overridable; OFF/`INT_MAX` when not paged). `GDN_CHUNK_MIN` is the per-call engage threshold and stays > 1 so decode (1 tok/call) keeps the sequential recurrence (at 1 it swallows decode and drops S_TG ~25%); 64 tuned from a {1,32,64,128,256} sweep. The bf16/hybrid dev-tree machinery (STATE_BF16/HYBRID, the dropped 0026 ssm_bf16_tau) and the bf16 CONFIG-C (M8) plus register-resident M6/M7 variants are NOT part of this f32-only series. MoE prefill S_PP +3.5% @npp512 (3x A/B), +17.7% @npp2048; decode S_TG unchanged. | NEW per-path, benign (`test-backend-ops` GATED_DELTA_NET 46/46 default AND force-M5, incl. multi-chunk/tail-chunk/multi-seq; greedy md5 default-on == M5-forced == canonical on the gate prompt: paged-MoE `8cb0ce23`, dense `5951a5b4`; long MoE prompt = one benign greedy flip vs sequential, dense byte-identical) |
 | 0046 | **GDN prefill geometry gated by scan length** - patch 0022's `(NUM_WARPS=16, COLS_PER_WARP=8)` column-fold of the GDN sequential-recurrence dispatch (`case 128`) is a decode win but was applied UNCONDITIONALLY, so it also hit dense prefill (~-6% vs stock): on a long sequential scan the launch `grid.z` collapses from `S_v/4 = 32` to `S_v/(16*8) = 1` and the SMs starve (profiled: `gated_delta_net` +54% GPU time = the whole dense-prefill regression). Gate the geometry by per-call scan length: long scans (prefill, `n_tokens >= GDN_PREFILL_NTOK`, default 256) take stock's high-grid.z `(4,1)` geometry; short scans (decode) keep the `(16,8)` retune. Recovers dense prefill +7.2% back to stock parity, keeps the decode win. `GDN_PREFILL_NTOK` tunes the crossover; an explicit `GDN_NW`/`GDN_CPW` sweep still overrides (gate yields when either is set), so the one-build %peak A/B harness is unchanged. | yes (patch 0022 proved every `{NW,CPW}` variant byte-identical, so switching geometry by scan length cannot move the md5) |

+### Speculative / MTP investigation (0054, 0055)
+
+| # | What it does | Bit-exact / effect |
+|---|---|---|
+| 0054 | **Disable backend sampling for MTP drafts** - forces server MTP draft generation through the target-side sampler acceptance path instead of letting the draft backend sample independently. This was required for the Phase 14 rollback/prefix safety gate. | yes for canonical non-MTP gates; Phase 14 MTP normalized greedy-prefix gate passed |
+| 0055 | **Trace speculative batch shapes** - adds default-off server logging (enabled with `LLAMA_SPEC_SHAPE_TRACE=1`) inside `server_slot::handle_last_sampled_token()`, reporting one row per normal decode step and `K + 1` rows per MTP verification batch (`draft`, `outputs`, `spec_i_first`, `spec_i_last`). Instrumentation only: it feeds the Phase 18 shape-entropy measurement that must precede any scheduler experiment. | yes (env unset is silent; DGX gates after patch: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`) |
+
 > **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
 > the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
 > landed, the bf16-SSM lever bought nothing: a clean re-measurement forcing **all**
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1253,3 +1253,53 @@ Conclusion:
  production candidate must reduce `mmq_nvfp4` or activation movement directly,
  stay free of D2H id readback and new stream synchronizations, and then pass
  the same md5/op gates before any serving A/B is considered.
+
+## Phase 18 MTP Shape Trace
+
+Phase 18 implemented the Phase 17 instrumentation-only recommendation as
+patch `0055-feat-server-trace-speculative-batch-shapes.patch`.
+
+Implementation summary:
+
+- Added default-off `LLAMA_SPEC_SHAPE_TRACE=1` logging in
+  `server_slot::handle_last_sampled_token()` (presence-only check: any value enables it).
+- Normal decode logs one row/output per slot.
+- MTP verification logs `K + 1` rows/outputs per speculative slot, including
+  draft length and `slot.spec_i_batch` range.
+- No scheduler, graph-key, KV, logits, acceptance, or rollback behavior changed.
+
+Red/green trace artifacts:
+
+- Red check before patch: `/home/mudler/bench/phase18_mtp_shape_trace_red`
+- Green check after patch: `/home/mudler/bench/phase18_mtp_shape_trace_green`
+
+Green trace sample:
+
+```text
+spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5
+spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=6 slot_tokens=6
+spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9
+```
+
+Disabled-env check:
+
+- `LLAMA_SPEC_SHAPE_TRACE` unset emitted no `spec shape:` lines.
+
+Inference gate artifact:
+
+- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`
+
+Safety result:
+
+- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Full `MUL_MAT_ID`: `806/806` on CUDA0.
+
+Conclusion:
+
+- Patch 0055 is safe instrumentation: it does not change inference output on the
+  canonical gated paths.
+- The trace confirms per-step MTP verification shape variation even in a tiny
+  request (`rows=4` and `rows=3`).
+- A follow-up scheduler experiment is not yet justified. First use this trace
+  under real serving load to measure draft-length bucket entropy.
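A possible shape for the draft-length bucket entropy measurement named in the conclusion above, as a minimal C++ sketch rather than anything shipped in patch 0055. It assumes the trace has been captured to a text stream, that "entropy" means the Shannon entropy of the empirical distribution of `rows` values across `kind=verify` lines (the docs do not define it), and that the helper names `field_value`, `verify_rows_histogram`, and `entropy_bits` are hypothetical.

```cpp
#include <cmath>
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of patch 0055): return the integer that
// follows " <key>=" in one log line, or -1 if the key is absent.
static long field_value(const std::string & line, const std::string & key) {
    const std::string needle = " " + key + "=";
    const size_t at = line.find(needle);
    if (at == std::string::npos) {
        return -1;
    }
    return std::strtol(line.c_str() + at + needle.size(), nullptr, 10);
}

// Count how often each verify-batch row count (K + 1) appears.
// Lines without "spec shape:" or without "kind=verify" are skipped.
static std::map<long, long> verify_rows_histogram(std::istream & in) {
    std::map<long, long> hist;
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("spec shape:") == std::string::npos ||
            line.find("kind=verify") == std::string::npos) {
            continue;
        }
        const long rows = field_value(line, "rows");
        if (rows > 0) {
            ++hist[rows];
        }
    }
    return hist;
}

// Shannon entropy, in bits, of the bucket distribution (assumed metric).
// An empty histogram yields 0.
static double entropy_bits(const std::map<long, long> & hist) {
    long total = 0;
    for (const auto & kv : hist) {
        total += kv.second;
    }
    double h = 0.0;
    for (const auto & kv : hist) {
        const double p = double(kv.second) / double(total);
        h -= p * std::log2(p);
    }
    return h;
}
```

Fed the three lines of the green trace sample, this gives buckets `rows=4` twice and `rows=3` once, about 0.92 bits. A value near 0 under real serving load would mean one dominant verify shape; higher values would mean shapes are spread across several buckets.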
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -244,6 +244,24 @@ rollback semantics. If reopened, start with a server-only shape counter around
 group/defer-by-draft-length scheduler experiment, with TTFT/throughput and
 md5/op gates as kill criteria.

+Phase 18 added the server-only shape trace as patch 0055. Set
+`LLAMA_SPEC_SHAPE_TRACE=1` to log `kind=decode` rows and MTP `kind=verify`
+`K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`.
+This is default-off instrumentation only. The DGX green check after the patch saw
+MTP verify shapes vary (`rows=4`, then `rows=3`) on a tiny request, while the
+env-unset run emitted no `spec shape:` lines. Canonical post-patch gates passed:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+Artifacts:
+`/home/mudler/bench/phase18_mtp_shape_trace_green` and
+`/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`.
+
+Next MTP step, if any: first measure shape entropy under real serving load with
+the trace. Do not implement a scheduler change until the trace shows repeatable
+draft-length buckets worth grouping. Any scheduler experiment must be
+opt-in/default-off and killed on TTFT/throughput regression, graph-reuse
+failure, md5/op drift, or MTP rollback/prefix gate failure.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -554,6 +554,38 @@ Only after that should an opt-in scheduling experiment group/defer MTP
 verification by `1 + spec_draft.size()`. Keep it default-off and kill it if TTFT
 or throughput regresses, graph reuse does not recover, or the md5/op gates drift.

+### Phase 18 MTP shape trace
+
+Phase 18 added that instrumentation-only patch as 0055. Set
+`LLAMA_SPEC_SHAPE_TRACE=1` to log normal decode rows and MTP verification
+`K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`.
+It is default-off and does not change scheduling, graph keys, logits, KV state,
+acceptance, or rollback behavior.
+
+Red/green result:
+
+- before patch, `LLAMA_SPEC_SHAPE_TRACE=1` emitted no `spec shape:` lines;
+- after patch, a tiny MTP request emitted `kind=verify` shapes with `rows=4`
+  and `rows=3`;
+- with the env var unset, the patched server emitted no `spec shape:` lines.
+
+Canonical post-patch inference gates stayed green:
+
+- MoE `8cb0ce23777bf55f92f63d0292c756b0`;
+- dense `5951a5b4d624ce891e22ab5fca9bc439`;
+- `MUL_MAT_ID` `806/806`.
+
+Artifacts:
+
+- `/home/mudler/bench/phase18_mtp_shape_trace_green`
+- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`
+
+Follow-up scope: before changing any source behavior, measure shape entropy
+under real serving load using the trace alone. Only if repeatable draft-length
+buckets appear should an opt-in group/defer-by-draft-length scheduler be built;
+kill it on TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
+rollback/prefix gate failure.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-server-trace-speculative-batch-shapes.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-server-trace-speculative-batch-shapes.patch
@@ -0,0 +1,57 @@
+From fb9402661291e0488a3e2bf2f3948ebcd18e18c9 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Wed, 1 Jul 2026 02:41:22 +0000
+Subject: [PATCH] feat(server): trace speculative batch shapes
+
+Add an env-gated LLAMA_SPEC_SHAPE_TRACE log around the server batch rows emitted by normal decode and speculative verification slots. This keeps the instrumentation default-off while exposing the row/output shape entropy that prevents CUDA graph reuse under MTP serving.
+
+Assisted-by: Codex:gpt-5
+---
+ tools/server/server-context.cpp | 20 ++++++++++++++++++--
+ 1 file changed, 18 insertions(+), 2 deletions(-)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index a77e2676d..fd8348af6 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -457,12 +457,22 @@ struct server_slot {
+ 
+     // add sampled token of this slot to the batch, optionally add the speculative draft tokens if any
+     void handle_last_sampled_token(server_batch & batch) {
++        static const bool spec_shape_trace = getenv("LLAMA_SPEC_SHAPE_TRACE") != nullptr;
++        const int32_t batch_before = batch.size();
++
+         bool add_ok = true;
+         if (spec_draft.empty()) {
+             // no speculative decoding
+-            i_batch = batch.size();
++            i_batch = batch_before;
++
++            const int32_t pos0 = prompt.tokens.pos_next();
++
++            add_ok &= batch.add(id, sampled, pos0, true);
+ 
+-            add_ok &= batch.add(id, sampled, prompt.tokens.pos_next(), true);
++            if (spec_shape_trace) {
++                SLT_INF(*this, "spec shape: kind=decode batch_before=%d rows=1 outputs=1 draft=0 pos0=%d slot_tokens=%zu\n",
++                        batch_before, pos0, prompt.tokens.size());
++            }
+ 
+             SLT_DBG(*this, "slot decode token, id=%d, n_ctx = %d, n_tokens = %d, truncated = %d\n",
+                     sampled, n_ctx, prompt.n_tokens(), truncated);
+@@ -479,6 +489,12 @@ struct server_slot {
+ 
+             auto pos0 = prompt.tokens.pos_next();
+ 
++            if (spec_shape_trace) {
++                SLT_INF(*this, "spec shape: kind=verify batch_before=%d rows=%zu outputs=%zu draft=%zu spec_i_first=%d spec_i_last=%d pos0=%d slot_tokens=%zu\n",
++                        batch_before, spec_draft.size() + 1, spec_draft.size() + 1, spec_draft.size(),
++                        spec_i_batch.front(), spec_i_batch.back(), pos0, prompt.tokens.size());
++            }
++
+             add_ok &= batch.add(id, sampled, pos0++, true);
+             for (auto token : spec_draft) {
+                 add_ok &= batch.add(this->id, token, pos0++, true);
+-- 
+2.43.0
+
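A note on the gate in the patch above: it is `getenv("LLAMA_SPEC_SHAPE_TRACE") != nullptr`, cached in a function-local `static const bool`. The check is therefore presence-only and is evaluated once, on the first call. The sketch below isolates that behavior and contrasts it with a stricter value-parsing gate. The names `env_present` and `env_truthy` are hypothetical, and the stricter variant is an illustration only, not something the patch does or the docs propose.

```cpp
#include <cstdlib>
#include <cstring>

// Presence-only gate, the same shape as the check in patch 0055:
// true whenever the variable exists, whatever its value ("1", "0", or "").
static bool env_present(const char * name) {
    return std::getenv(name) != nullptr;
}

// Hypothetical stricter gate (NOT what patch 0055 does):
// unset, empty, and "0" all count as off.
static bool env_truthy(const char * name) {
    const char * v = std::getenv(name);
    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
}
```

With the presence-only form, `LLAMA_SPEC_SHAPE_TRACE=0` still enables tracing, so the variable has to be left unset (as in the disabled-env check) to keep the server silent.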