docs(paged): add MTP shape trace patch

Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 02:54:29 +00:00
parent 6e35476340
commit cced07c7fe
6 changed files with 277 additions and 1 deletions

@@ -87,7 +87,7 @@ orthogonal to the paged allocator.
---
-## 3. Patch series (0001-0047)
+## 3. Patch series (0001-0055)
Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
@@ -207,6 +207,13 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
| 0047 | **GDN M5 tensor-core chunked-scan prefill, f32-only re-port, default-ON under paged KV** - the f32/tf32 tensor-core forms of 0031's scan (KK/QK Gram = M2, KS/QS state-boundary 3xtf32 = M3, P*U output = M4, full form-T solve + state-update mma = M5), single build, runtime-selected by `GDN_TC`. Ships **M5 default-on when `LLAMA_KV_PAGED` is set** (`GDN_TC=5` + `GDN_CHUNK_MIN=64`, both env-overridable; OFF/`INT_MAX` when not paged). `GDN_CHUNK_MIN` is the per-call engage threshold and stays > 1 so decode (1 tok/call) keeps the sequential recurrence (at 1 it swallows decode and drops S_TG ~25%); 64 tuned from a {1,32,64,128,256} sweep. The bf16/hybrid dev-tree machinery (STATE_BF16/HYBRID, the dropped 0026 ssm_bf16_tau) and the bf16 CONFIG-C (M8) plus register-resident M6/M7 variants are NOT part of this f32-only series. MoE prefill S_PP +3.5% @npp512 (3x A/B), +17.7% @npp2048; decode S_TG unchanged. | NEW per-path, benign (`test-backend-ops` GATED_DELTA_NET 46/46 default AND force-M5, incl. multi-chunk/tail-chunk/multi-seq; greedy md5 default-on == M5-forced == canonical on the gate prompt: paged-MoE `8cb0ce23`, dense `5951a5b4`; long MoE prompt = one benign greedy flip vs sequential, dense byte-identical) |
| 0046 | **GDN prefill geometry gated by scan length** - patch 0022's `(NUM_WARPS=16, COLS_PER_WARP=8)` column-fold of the GDN sequential-recurrence dispatch (`case 128`) is a decode win but was applied UNCONDITIONALLY, so it also hit dense prefill (~-6% vs stock): on a long sequential scan the launch `grid.z` collapses from `S_v/4 = 32` to `S_v/(16*8) = 1` and the SMs starve (profiled: `gated_delta_net` +54% GPU time = the whole dense-prefill regression). Gate the geometry by per-call scan length: long scans (prefill, `n_tokens >= GDN_PREFILL_NTOK`, default 256) take stock's high-grid.z `(4,1)` geometry; short scans (decode) keep the `(16,8)` retune. Recovers dense prefill +7.2% back to stock parity, keeps the decode win. `GDN_PREFILL_NTOK` tunes the crossover; an explicit `GDN_NW`/`GDN_CPW` sweep still overrides (gate yields when either is set), so the one-build %peak A/B harness is unchanged. | yes (patch 0022 proved every `{NW,CPW}` variant byte-identical, so switching geometry by scan length cannot move the md5) |
### Speculative / MTP investigation (0054, 0055)
| # | What it does | Bit-exact / effect |
|---|---|---|
| 0054 | **Disable backend sampling for MTP drafts** - forces server MTP draft generation through the target-side sampler acceptance path instead of letting the draft backend sample independently. This was required for the Phase 14 rollback/prefix safety gate. | yes for canonical non-MTP gates; Phase 14 MTP normalized greedy-prefix gate passed |
| 0055 | **Trace speculative batch shapes** - adds default-off `LLAMA_SPEC_SHAPE_TRACE=1` server logs around `server_slot::handle_last_sampled_token()`, reporting normal decode rows and MTP verification `K + 1` rows (`draft`, `outputs`, `spec_i_first`, `spec_i_last`). This is instrumentation only for Phase 18 shape-entropy measurement before any scheduler experiment. | yes (env unset is silent; DGX gates after patch: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`) |
> **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
> the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
> landed, the bf16-SSM lever bought nothing: a clean re-measurement forcing **all**

@@ -1253,3 +1253,53 @@ Conclusion:
production candidate must reduce `mmq_nvfp4` or activation movement directly,
stay free of D2H id readback and new stream synchronizations, and then pass
the same md5/op gates before any serving A/B is considered.
## Phase 18 MTP Shape Trace
Phase 18 implemented the Phase 17 instrumentation-only recommendation as
patch `0055-feat-server-trace-speculative-batch-shapes.patch`.
Implementation summary:
- Added default-off `LLAMA_SPEC_SHAPE_TRACE=1` logging in
`server_slot::handle_last_sampled_token()`.
- Normal decode logs one row/output per slot.
- MTP verification logs `K + 1` rows/outputs per speculative slot, including
draft length and `slot.spec_i_batch` range.
- No scheduler, graph-key, KV, logits, acceptance, or rollback behavior changed.
Red/green trace artifacts:
- Red check before patch: `/home/mudler/bench/phase18_mtp_shape_trace_red`
- Green check after patch: `/home/mudler/bench/phase18_mtp_shape_trace_green`
Green trace sample:
```text
spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5
spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=6 slot_tokens=6
spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9
```
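To post-process lines like the sample above, each one can be split into its `key=value` fields. The following is a minimal sketch, not part of patch 0055: `parse_spec_shape` is a hypothetical helper name, and it assumes only that the marker `spec shape:` is followed by whitespace-separated `key=value` tokens, as in the sample. Any log prefix before the marker is ignored.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of patch 0055): split one "spec shape:" log
// line into its key=value fields. Returns an empty map when the marker is
// absent, so unrelated server log lines can be skipped by the caller.
std::map<std::string, std::string> parse_spec_shape(const std::string & line) {
    static const std::string marker = "spec shape:";
    std::map<std::string, std::string> fields;

    const auto at = line.find(marker);
    if (at == std::string::npos) {
        return fields; // not a trace line
    }

    // Everything after the marker is expected to be key=value tokens.
    std::istringstream rest(line.substr(at + marker.size()));
    std::string tok;
    while (rest >> tok) {
        const auto eq = tok.find('=');
        if (eq != std::string::npos && eq > 0) {
            fields[tok.substr(0, eq)] = tok.substr(eq + 1);
        }
    }
    return fields;
}
```

On the first sample line this yields nine fields, with `rows` equal to `draft + 1`, which matches the `K + 1` verification shape described above.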
Disabled-env check:
- `LLAMA_SPEC_SHAPE_TRACE` unset emitted no `spec shape:` lines.
Inference gate artifact:
- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`
Safety result:
- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Full `MUL_MAT_ID`: `806/806` on CUDA0.
Conclusion:
- Patch 0055 is safe instrumentation and does not break inference on the
canonical gated paths.
- The trace confirms per-step MTP verification shape variation even in a tiny
request (`rows=4` and `rows=3`).
- A follow-up scheduler experiment is not yet justified. First use this trace
under real serving load to measure draft-length bucket entropy.
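One way to quantify "draft-length bucket entropy" is the Shannon entropy of the histogram of `rows=` values across `kind=verify` lines. The source does not define the measure, so this is an assumption; `shape_entropy_bits` is a hypothetical name, and the sketch below is not part of the patch series.

```cpp
#include <cmath>
#include <map>
#include <vector>

// Hypothetical helper: Shannon entropy, in bits, of the verify-row histogram.
// `rows` holds the `rows=` value from each `kind=verify` trace line.
// 0 bits means every verify batch had the same shape; higher values mean the
// shapes are spread across more buckets. An empty input yields 0.
double shape_entropy_bits(const std::vector<int> & rows) {
    if (rows.empty()) {
        return 0.0;
    }

    // Bucket the observations by row count.
    std::map<int, int> counts;
    for (int r : rows) {
        counts[r]++;
    }

    const double n = static_cast<double>(rows.size());
    double h = 0.0;
    for (const auto & kv : counts) {
        const double p = kv.second / n;
        h -= p * std::log2(p);
    }
    return h;
}
```

For the three sample lines (`rows` = 4, 4, 3) this gives roughly 0.918 bits. A value near 0 under real serving load would suggest a dominant bucket; the trace alone does not say what threshold would justify a scheduler experiment.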

@@ -244,6 +244,24 @@ rollback semantics. If reopened, start with a server-only shape counter around
group/defer-by-draft-length scheduler experiment, with TTFT/throughput and
md5/op gates as kill criteria.
Phase 18 added the server-only shape trace as patch 0055. Set
`LLAMA_SPEC_SHAPE_TRACE=1` to log `kind=decode` rows and MTP `kind=verify`
`K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`.
This is default-off instrumentation only. The DGX green check after the patch
showed MTP verify shapes varying (`rows=4`, then `rows=3`) on a tiny request, while
the env-unset run emitted no `spec shape:` lines. Canonical post-patch gates passed:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
Artifacts:
`/home/mudler/bench/phase18_mtp_shape_trace_green` and
`/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`.
Next MTP step, if any: trace real serving shape entropy first. Do not implement
a scheduler change until the trace shows repeatable draft-length buckets worth
grouping. Any scheduler experiment must be opt-in/default-off and killed by
TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
rollback/prefix gate failure.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)

@@ -554,6 +554,38 @@ Only after that should an opt-in scheduling experiment group/defer MTP
verification by `1 + spec_draft.size()`. Keep it default-off and kill it if TTFT
or throughput regresses, graph reuse does not recover, or the md5/op gates drift.
### Phase 18 MTP shape trace
Phase 18 added that instrumentation-only patch as 0055. Set
`LLAMA_SPEC_SHAPE_TRACE=1` to log normal decode rows and MTP verification
`K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`.
It is default-off and does not change scheduling, graph keys, logits, KV state,
acceptance, or rollback behavior.
Red/green result:
- before patch, `LLAMA_SPEC_SHAPE_TRACE=1` emitted no `spec shape:` lines;
- after patch, a tiny MTP request emitted `kind=verify` shapes with `rows=4`
and `rows=3`;
- with the env var unset, the patched server emitted no `spec shape:` lines.
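The three checks above all reduce to counting `spec shape:` lines in a captured server log. A minimal sketch of that check follows. The helper and enum names are hypothetical, and how the log is captured is left to the caller.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: count log lines that contain the `spec shape:` marker.
int count_spec_shape_lines(const std::string & log) {
    std::istringstream in(log);
    std::string line;
    int n = 0;
    while (std::getline(in, line)) {
        if (line.find("spec shape:") != std::string::npos) {
            ++n;
        }
    }
    return n;
}

// red:      unpatched server, env set   -> expect no trace lines
// green:    patched server,   env set   -> expect at least one trace line
// disabled: patched server,   env unset -> expect no trace lines
enum class trace_check { red, green, disabled };

bool trace_check_passes(trace_check kind, const std::string & log) {
    const int n = count_spec_shape_lines(log);
    return kind == trace_check::green ? n > 0 : n == 0;
}
```

Counting lines rather than matching exact text keeps the check independent of the slot and task prefix that the server logger prepends.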
Canonical post-patch inference gates stayed green:
- MoE `8cb0ce23777bf55f92f63d0292c756b0`;
- dense `5951a5b4d624ce891e22ab5fca9bc439`;
- `MUL_MAT_ID` `806/806`.
Artifacts:
- `/home/mudler/bench/phase18_mtp_shape_trace_green`
- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`
Follow-up scope: before any source behavior change, run a trace-only real
serving entropy measurement. Only if repeatable draft-length buckets appear
should an opt-in group/defer-by-draft-length scheduler be built; kill it on
TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
rollback/prefix gate failure.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update

@@ -0,0 +1,57 @@
From fb9402661291e0488a3e2bf2f3948ebcd18e18c9 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 1 Jul 2026 02:41:22 +0000
Subject: [PATCH] feat(server): trace speculative batch shapes
Add an env-gated LLAMA_SPEC_SHAPE_TRACE log around the server batch rows emitted by normal decode and speculative verification slots. This keeps the instrumentation default-off while exposing the row/output shape entropy that prevents CUDA graph reuse under MTP serving.
Assisted-by: Codex:gpt-5
---
tools/server/server-context.cpp | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index a77e2676d..fd8348af6 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -457,12 +457,22 @@ struct server_slot {
// add sampled token of this slot to the batch, optionally add the speculative draft tokens if any
void handle_last_sampled_token(server_batch & batch) {
+ static const bool spec_shape_trace = getenv("LLAMA_SPEC_SHAPE_TRACE") != nullptr;
+ const int32_t batch_before = batch.size();
+
bool add_ok = true;
if (spec_draft.empty()) {
// no speculative decoding
- i_batch = batch.size();
+ i_batch = batch_before;
+
+ const int32_t pos0 = prompt.tokens.pos_next();
+
+ add_ok &= batch.add(id, sampled, pos0, true);
- add_ok &= batch.add(id, sampled, prompt.tokens.pos_next(), true);
+ if (spec_shape_trace) {
+ SLT_INF(*this, "spec shape: kind=decode batch_before=%d rows=1 outputs=1 draft=0 pos0=%d slot_tokens=%zu\n",
+ batch_before, pos0, prompt.tokens.size());
+ }
SLT_DBG(*this, "slot decode token, id=%d, n_ctx = %d, n_tokens = %d, truncated = %d\n",
sampled, n_ctx, prompt.n_tokens(), truncated);
@@ -479,6 +489,12 @@ struct server_slot {
auto pos0 = prompt.tokens.pos_next();
+ if (spec_shape_trace) {
+ SLT_INF(*this, "spec shape: kind=verify batch_before=%d rows=%zu outputs=%zu draft=%zu spec_i_first=%d spec_i_last=%d pos0=%d slot_tokens=%zu\n",
+ batch_before, spec_draft.size() + 1, spec_draft.size() + 1, spec_draft.size(),
+ spec_i_batch.front(), spec_i_batch.back(), pos0, prompt.tokens.size());
+ }
+
add_ok &= batch.add(id, sampled, pos0++, true);
for (auto token : spec_draft) {
add_ok &= batch.add(this->id, token, pos0++, true);
--
2.43.0