mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): add MTP shape trace patch
Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results.

Assisted-by: Codex:gpt-5
@@ -87,7 +87,7 @@ orthogonal to the paged allocator.
|
||||
|
||||
---
|
||||
|
||||
-## 3. Patch series (0001-0047)
+## 3. Patch series (0001-0055)
||||
|
||||
Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
|
||||
decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
|
||||
@@ -207,6 +207,13 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
|
||||
| 0047 | **GDN M5 tensor-core chunked-scan prefill, f32-only re-port, default-ON under paged KV** - the f32/tf32 tensor-core forms of 0031's scan (KK/QK Gram = M2, KS/QS state-boundary 3xtf32 = M3, P*U output = M4, full form-T solve + state-update mma = M5), single build, runtime-selected by `GDN_TC`. Ships **M5 default-on when `LLAMA_KV_PAGED` is set** (`GDN_TC=5` + `GDN_CHUNK_MIN=64`, both env-overridable; OFF/`INT_MAX` when not paged). `GDN_CHUNK_MIN` is the per-call engage threshold and stays > 1 so decode (1 tok/call) keeps the sequential recurrence (at 1 it swallows decode and drops S_TG ~25%); 64 tuned from a {1,32,64,128,256} sweep. The bf16/hybrid dev-tree machinery (STATE_BF16/HYBRID, the dropped 0026 ssm_bf16_tau) and the bf16 CONFIG-C (M8) plus register-resident M6/M7 variants are NOT part of this f32-only series. MoE prefill S_PP +3.5% @npp512 (3x A/B), +17.7% @npp2048; decode S_TG unchanged. | NEW per-path, benign (`test-backend-ops` GATED_DELTA_NET 46/46 default AND force-M5, incl. multi-chunk/tail-chunk/multi-seq; greedy md5 default-on == M5-forced == canonical on the gate prompt: paged-MoE `8cb0ce23`, dense `5951a5b4`; long MoE prompt = one benign greedy flip vs sequential, dense byte-identical) |
|
||||
| 0046 | **GDN prefill geometry gated by scan length** - patch 0022's `(NUM_WARPS=16, COLS_PER_WARP=8)` column-fold of the GDN sequential-recurrence dispatch (`case 128`) is a decode win but was applied UNCONDITIONALLY, so it also hit dense prefill (~-6% vs stock): on a long sequential scan the launch `grid.z` collapses from `S_v/4 = 32` to `S_v/(16*8) = 1` and the SMs starve (profiled: `gated_delta_net` +54% GPU time = the whole dense-prefill regression). Gate the geometry by per-call scan length: long scans (prefill, `n_tokens >= GDN_PREFILL_NTOK`, default 256) take stock's high-grid.z `(4,1)` geometry; short scans (decode) keep the `(16,8)` retune. Recovers dense prefill +7.2% back to stock parity, keeps the decode win. `GDN_PREFILL_NTOK` tunes the crossover; an explicit `GDN_NW`/`GDN_CPW` sweep still overrides (gate yields when either is set), so the one-build %peak A/B harness is unchanged. | yes (patch 0022 proved every `{NW,CPW}` variant byte-identical, so switching geometry by scan length cannot move the md5) |
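The scan-length gate described for patch 0046 can be sketched as below. This is only an illustration under stated assumptions, not the patch's actual code: `gdn_geometry`, `env_int_or`, `select_gdn_geometry`, and `gdn_grid_z` are hypothetical names, and the fallback used when only one of `GDN_NW`/`GDN_CPW` is set (the unset half takes the decode retune value) is an assumption. The `grid.z` arithmetic follows the table text: `S_v/4 = 32` for `(4,1)` and `S_v/(16*8) = 1` for `(16,8)` when `S_v = 128`.

```cpp
#include <cstdlib>

// Hypothetical launch geometry for the GDN sequential-recurrence kernel.
struct gdn_geometry {
    int num_warps;      // NUM_WARPS
    int cols_per_warp;  // COLS_PER_WARP
};

// Read a positive integer from the environment, or `fallback` when unset/invalid.
static int env_int_or(const char * name, int fallback) {
    const char * v = std::getenv(name);
    if (v == nullptr) {
        return fallback;
    }
    const int parsed = std::atoi(v);
    return parsed > 0 ? parsed : fallback;
}

// Pick the geometry for one call, given the per-call scan length.
static gdn_geometry select_gdn_geometry(int n_tokens) {
    const bool has_override =
        std::getenv("GDN_NW") != nullptr || std::getenv("GDN_CPW") != nullptr;
    if (has_override) {
        // The gate yields when either variable is set. ASSUMPTION: the unset
        // half falls back to the (16,8) decode retune.
        return { env_int_or("GDN_NW", 16), env_int_or("GDN_CPW", 8) };
    }
    const int prefill_ntok = env_int_or("GDN_PREFILL_NTOK", 256);
    if (n_tokens >= prefill_ntok) {
        return { 4, 1 };   // long scan (prefill): stock high-grid.z geometry
    }
    return { 16, 8 };      // short scan (decode): column-fold retune
}

// grid.z as described in the table: S_v / (NUM_WARPS * COLS_PER_WARP).
static int gdn_grid_z(int s_v, gdn_geometry g) {
    return s_v / (g.num_warps * g.cols_per_warp);
}
```

With no overrides set, `gdn_grid_z(128, select_gdn_geometry(512))` gives 32 and `gdn_grid_z(128, select_gdn_geometry(1))` gives 1, which is the collapse the row describes.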
|
||||
|
||||
### Speculative / MTP investigation (0054, 0055)

| # | What it does | Bit-exact / effect |
|---|---|---|
| 0054 | **Disable backend sampling for MTP drafts** - forces server MTP draft generation through the target-side sampler acceptance path instead of letting the draft backend sample independently. This was required for the Phase 14 rollback/prefix safety gate. | yes for canonical non-MTP gates; Phase 14 MTP normalized greedy-prefix gate passed |
| 0055 | **Trace speculative batch shapes** - adds default-off `LLAMA_SPEC_SHAPE_TRACE=1` server logs around `server_slot::handle_last_sampled_token()`, reporting normal decode rows and MTP verification `K + 1` rows (`draft`, `outputs`, `spec_i_first`, `spec_i_last`). This is instrumentation only for Phase 18 shape-entropy measurement before any scheduler experiment. | yes (env unset is silent; DGX gates after patch: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`) |
|
||||
|
||||
> **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
|
||||
> the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
|
||||
> landed, the bf16-SSM lever bought nothing: a clean re-measurement forcing **all**
|
||||
|
||||
@@ -1253,3 +1253,53 @@ Conclusion:
|
||||
production candidate must reduce `mmq_nvfp4` or activation movement directly,
|
||||
stay free of D2H id readback and new stream synchronizations, and then pass
|
||||
the same md5/op gates before any serving A/B is considered.
|
||||
|
||||
## Phase 18 MTP Shape Trace

Phase 18 implemented the Phase 17 instrumentation-only recommendation as
patch `0055-feat-server-trace-speculative-batch-shapes.patch`.

Implementation summary:

- Added default-off `LLAMA_SPEC_SHAPE_TRACE=1` logging in
  `server_slot::handle_last_sampled_token()`.
- Normal decode logs one row/output per slot.
- MTP verification logs `K + 1` rows/outputs per speculative slot, including
  draft length and `slot.spec_i_batch` range.
- No scheduler, graph-key, KV, logits, acceptance, or rollback behavior changed.

Red/green trace artifacts:

- Red check before patch: `/home/mudler/bench/phase18_mtp_shape_trace_red`
- Green check after patch: `/home/mudler/bench/phase18_mtp_shape_trace_green`

Green trace sample:

```text
spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5
spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=6 slot_tokens=6
spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9
```

Disabled-env check:

- `LLAMA_SPEC_SHAPE_TRACE` unset emitted no `spec shape:` lines.

Inference gate artifact:

- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`

Safety result:

- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Full `MUL_MAT_ID`: `806/806` on CUDA0.

Conclusion:

- Patch 0055 is safe instrumentation and does not break inferencing on the
  canonical gated paths.
- The trace confirms per-step MTP verification shape variation even in a tiny
  request (`rows=4` and `rows=3`).
- A follow-up scheduler experiment is not yet justified. First use this trace
  under real serving load to measure draft-length bucket entropy.
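One way the draft-length bucket entropy could be measured from collected trace lines is sketched here. It is a minimal offline helper, not part of patch 0055: `parse_rows` and `rows_entropy_bits` are hypothetical names, and the choice of Shannon entropy in bits over the `rows=` field is an assumption about what "bucket entropy" means in this plan.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Extract the integer after " rows=" from a `spec shape:` log line.
// Returns -1 for lines that are not shape-trace lines.
static int parse_rows(const std::string & line) {
    if (line.find("spec shape:") == std::string::npos) {
        return -1;
    }
    const std::string key = " rows=";
    const std::size_t p = line.find(key);
    if (p == std::string::npos) {
        return -1;
    }
    return std::atoi(line.c_str() + p + key.size());
}

// Shannon entropy (bits) of the distribution of `rows` values across lines.
// 0.0 means every traced step used the same batch shape.
static double rows_entropy_bits(const std::vector<std::string> & lines) {
    std::map<int, std::size_t> counts;
    std::size_t total = 0;
    for (const auto & line : lines) {
        const int rows = parse_rows(line);
        if (rows <= 0) {
            continue;  // skip unrelated log lines
        }
        counts[rows]++;
        total++;
    }
    double h = 0.0;
    for (const auto & kv : counts) {
        const double p = double(kv.second) / double(total);
        h -= p * std::log2(p);
    }
    return h;
}
```

On the three green-sample lines above (`rows=4`, `rows=4`, `rows=3`) this yields about 0.918 bits. A value near zero under real load would suggest that shapes already repeat, while a high value spread over many buckets would argue against grouping.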
|
||||
|
||||
@@ -244,6 +244,24 @@ rollback semantics. If reopened, start with a server-only shape counter around
|
||||
group/defer-by-draft-length scheduler experiment, with TTFT/throughput and
|
||||
md5/op gates as kill criteria.
|
||||
|
||||
Phase 18 added the server-only shape trace as patch 0055. Set
`LLAMA_SPEC_SHAPE_TRACE=1` to log `kind=decode` rows and MTP `kind=verify`
`K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`.
This is default-off instrumentation only. DGX green check after the patch saw
MTP verify shapes vary (`rows=4`, then `rows=3`) on a tiny request, while the
env-unset run emitted no `spec shape:` lines. Canonical post-patch gates passed:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
Artifacts:
`/home/mudler/bench/phase18_mtp_shape_trace_green` and
`/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`.

Next MTP step, if any: trace real serving shape entropy first. Do not implement
a scheduler change until the trace shows repeatable draft-length buckets worth
grouping. Any scheduler experiment must be opt-in/default-off and killed by
TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
rollback/prefix gate failure.
|
||||
|
||||
---
|
||||
|
||||
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
|
||||
|
||||
@@ -554,6 +554,38 @@ Only after that should an opt-in scheduling experiment group/defer MTP
|
||||
verification by `1 + spec_draft.size()`. Keep it default-off and kill it if TTFT
|
||||
or throughput regresses, graph reuse does not recover, or the md5/op gates drift.
|
||||
|
||||
### Phase 18 MTP shape trace

Phase 18 added that instrumentation-only patch as 0055. Set
`LLAMA_SPEC_SHAPE_TRACE=1` to log normal decode rows and MTP verification
`K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`.
It is default-off and does not change scheduling, graph keys, logits, KV state,
acceptance, or rollback behavior.

Red/green result:

- before patch, `LLAMA_SPEC_SHAPE_TRACE=1` emitted no `spec shape:` lines;
- after patch, a tiny MTP request emitted `kind=verify` shapes with `rows=4`
  and `rows=3`;
- with the env var unset, the patched server emitted no `spec shape:` lines.

Canonical post-patch inference gates stayed green:

- MoE `8cb0ce23777bf55f92f63d0292c756b0`;
- dense `5951a5b4d624ce891e22ab5fca9bc439`;
- `MUL_MAT_ID` `806/806`.

Artifacts:

- `/home/mudler/bench/phase18_mtp_shape_trace_green`
- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`

Follow-up scope: before any source behavior change, run a trace-only real
serving entropy measurement. Only if repeatable draft-length buckets appear
should an opt-in group/defer-by-draft-length scheduler be built; kill it on
TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP
rollback/prefix gate failure.
|
||||
|
||||
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
|
||||
|
||||
### Phase 10 GDN C32 slab update
|
||||
|
||||
@@ -0,0 +1,57 @@
|
||||
From fb9402661291e0488a3e2bf2f3948ebcd18e18c9 Mon Sep 17 00:00:00 2001
|
||||
From: Ettore Di Giacinto <mudler@localai.io>
|
||||
Date: Wed, 1 Jul 2026 02:41:22 +0000
|
||||
Subject: [PATCH] feat(server): trace speculative batch shapes
|
||||
|
||||
Add an env-gated LLAMA_SPEC_SHAPE_TRACE log around the server batch rows emitted by normal decode and speculative verification slots. This keeps the instrumentation default-off while exposing the row/output shape entropy that prevents CUDA graph reuse under MTP serving.
|
||||
|
||||
Assisted-by: Codex:gpt-5
|
||||
---
|
||||
tools/server/server-context.cpp | 20 ++++++++++++++++++--
|
||||
1 file changed, 18 insertions(+), 2 deletions(-)
|
||||
|
||||
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
|
||||
index a77e2676d..fd8348af6 100644
|
||||
--- a/tools/server/server-context.cpp
|
||||
+++ b/tools/server/server-context.cpp
|
||||
@@ -457,12 +457,22 @@ struct server_slot {
|
||||
|
||||
// add sampled token of this slot to the batch, optionally add the speculative draft tokens if any
|
||||
void handle_last_sampled_token(server_batch & batch) {
|
||||
+ static const bool spec_shape_trace = getenv("LLAMA_SPEC_SHAPE_TRACE") != nullptr;
|
||||
+ const int32_t batch_before = batch.size();
|
||||
+
|
||||
bool add_ok = true;
|
||||
if (spec_draft.empty()) {
|
||||
// no speculative decoding
|
||||
- i_batch = batch.size();
|
||||
+ i_batch = batch_before;
|
||||
+
|
||||
+ const int32_t pos0 = prompt.tokens.pos_next();
|
||||
+
|
||||
+ add_ok &= batch.add(id, sampled, pos0, true);
|
||||
|
||||
- add_ok &= batch.add(id, sampled, prompt.tokens.pos_next(), true);
|
||||
+ if (spec_shape_trace) {
|
||||
+ SLT_INF(*this, "spec shape: kind=decode batch_before=%d rows=1 outputs=1 draft=0 pos0=%d slot_tokens=%zu\n",
|
||||
+ batch_before, pos0, prompt.tokens.size());
|
||||
+ }
|
||||
|
||||
SLT_DBG(*this, "slot decode token, id=%d, n_ctx = %d, n_tokens = %d, truncated = %d\n",
|
||||
sampled, n_ctx, prompt.n_tokens(), truncated);
|
||||
@@ -479,6 +489,12 @@ struct server_slot {
|
||||
|
||||
auto pos0 = prompt.tokens.pos_next();
|
||||
|
||||
+ if (spec_shape_trace) {
|
||||
+ SLT_INF(*this, "spec shape: kind=verify batch_before=%d rows=%zu outputs=%zu draft=%zu spec_i_first=%d spec_i_last=%d pos0=%d slot_tokens=%zu\n",
|
||||
+ batch_before, spec_draft.size() + 1, spec_draft.size() + 1, spec_draft.size(),
|
||||
+ spec_i_batch.front(), spec_i_batch.back(), pos0, prompt.tokens.size());
|
||||
+ }
|
||||
+
|
||||
add_ok &= batch.add(id, sampled, pos0++, true);
|
||||
for (auto token : spec_draft) {
|
||||
add_ok &= batch.add(this->id, token, pos0++, true);
|
||||
--
|
||||
2.43.0
|
||||
|
||||
112
docs/superpowers/plans/2026-07-01-mtp-shape-trace-phase18.md
Normal file
@@ -0,0 +1,112 @@
|
||||
# MTP Shape Trace Phase 18 Plan

> **For agentic workers:** REQUIRED SUB-SKILLS: Use
> superpowers:test-driven-development before source edits and
> superpowers:verification-before-completion before commit. Steps use checkbox
> (`- [ ]`) syntax for tracking.

**Goal:** add a default-off, inference-safe trace for speculative/MTP server
batch shape entropy before considering any scheduler experiment.

**Architecture:** keep this as a server-only instrumentation patch in
`server_slot::handle_last_sampled_token()`. Do not change speculative
acceptance, rollback, logits, KV writes, graph-reuse keys, or scheduling.

**Tech Stack:** llama.cpp `tools/server/server-context.cpp`, LocalAI paged
patch stack, DGX GB10 validation.

---

## Task 1: Red Check

- [x] **Step 1: Prove the trace does not already exist**

Ran a direct MTP `llama-server` request on DGX with
`LLAMA_SPEC_SHAPE_TRACE=1` before the source patch.

Result:

- no `spec shape:` lines were emitted,
- artifact: `/home/mudler/bench/phase18_mtp_shape_trace_red`.

## Task 2: Instrumentation Patch

- [x] **Step 1: Add an env-gated trace**

Added `LLAMA_SPEC_SHAPE_TRACE=1` logging in
`server_slot::handle_last_sampled_token()`:

- normal decode rows: `kind=decode`, `rows=1`, `outputs=1`, `draft=0`,
- speculative verification rows: `kind=verify`, `rows=K+1`,
  `outputs=K+1`, `draft=K`, `spec_i_first`, `spec_i_last`.

The env var is default-off and does not alter batch contents.
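The gating and row arithmetic of this step can be sketched in isolation as follows. This is an illustrative stand-alone version, not the patch itself: `env_flag_set` and `format_verify_shape` are hypothetical helpers, and the formatted line carries only a subset of the fields the patch logs. The presence-only check mirrors the patch's `getenv("LLAMA_SPEC_SHAPE_TRACE") != nullptr`, which means any value enables the trace, including `0`.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// True when the variable is present in the environment, whatever its value.
// This mirrors a `getenv(name) != nullptr` presence check.
static bool env_flag_set(const char * name) {
    return std::getenv(name) != nullptr;
}

// Format a verification-shape line for a slot with `n_draft` draft tokens (K).
// Verification submits the last sampled token plus K drafts, so rows = K + 1,
// and every row requests an output.
static std::string format_verify_shape(int batch_before, std::size_t n_draft) {
    char buf[160];
    std::snprintf(buf, sizeof(buf),
                  "spec shape: kind=verify batch_before=%d rows=%zu outputs=%zu draft=%zu",
                  batch_before, n_draft + 1, n_draft + 1, n_draft);
    return std::string(buf);
}
```

A caller would emit `format_verify_shape(...)` only when `env_flag_set("LLAMA_SPEC_SHAPE_TRACE")` is true. The patch caches the check in a `static const bool`, so the environment is read once per process; the helper here re-reads it on each call purely to keep the sketch easy to exercise.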
|
||||
|
||||
- [x] **Step 2: Keep the patch incremental**

Local fork commit:

- `fb9402661 feat(server): trace speculative batch shapes`

LocalAI patch:

- `0055-feat-server-trace-speculative-batch-shapes.patch`

## Task 3: Green Checks

- [x] **Step 1: Build and validate trace behavior on DGX**

DGX mirror commit:

- `f2521ab12 feat(server): trace speculative batch shapes`

Build:

- `cmake --build build-cuda --target llama-server -j$(nproc)`

Trace-enabled result:

```text
spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5
spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=6 slot_tokens=6
spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9
```

Trace-disabled result:

```text
trace disabled: no spec shape lines
```

Artifact:

- `/home/mudler/bench/phase18_mtp_shape_trace_green`

- [x] **Step 2: Run canonical inference gates**

Artifact:

- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`

Result:

- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT_ID`: `806/806`

## Task 4: Follow-Up Boundary

- [x] **Step 1: Scope Phase 19**

Use the trace to measure shape entropy under real serving load before any
behavior change. A Phase 19 scheduler experiment is allowed only if the trace
shows repeatable draft-length buckets worth grouping. It must be opt-in,
default-off, and killed by TTFT/throughput regression, md5/op drift, or MTP
rollback/prefix failure.

## Self-Review

- No default behavior changed.
- The trace is read-only with respect to batch contents and slot state.
- The post-patch canonical md5/op gates passed, so this instrumentation did not
  break inferencing on the gated paths.
|
||||