docs(paged): gate MTP rollback safety

Record Phase 14 MTP rollback evidence, normalized greedy-prefix checks, and canonical inference gates.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 02:15:11 +00:00
parent e169058e73
commit 70394364a3
6 changed files with 363 additions and 4 deletions

@@ -840,6 +840,51 @@ Decision:
- Do not benchmark MTP as a parity win until a serving/API phase adds rollback
gates for hybrid SSM/KV state and measures target verification throughput.
## Phase 14 MTP Rollback and Inference-Safety Gate
Phase 14 tested the safety question left open by Phase 9: whether MTP
speculative rejection can run against the actual Qwen3.6 MoE GGUF without
corrupting paged KV or recurrent GDN state.
Artifacts:
- `/home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err`
- `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err`
- `/home/mudler/bench/phase14_mtp_rollback/completion_nocnv_n{8,16,24,32,48}.out`
- `/home/mudler/bench/phase14_mtp_rollback/mtp_n{8,16,24,48}.out`
- `/home/mudler/bench/paged_inference_gates/20260701_041117`
Safety evidence:
- `test-recurrent-state-rollback` on
`/home/mudler/bench/q36-35b-a3b-nvfp4.gguf` exited `0` and logged
`recurrent rollback checkpoint restored successfully`.
- MTP stderr logged bounded recurrent rollback support:
`the context supports bounded partial sequence removal`.
- MTP partial rejection occurred at `temp=0`:
`n_drafted=39`, `n_accept=20`, `accept=51.282%`.
- The backend sampler multi-output error
(`backend sampling requires at most one output token per sequence`) did not
appear; the expected `backend draft sampling is disabled for MTP` warning was
present.
- Raw greedy text was prefix-equivalent after normalization for
`n=8,16,24,32,48`; no first differing token was found. Exact transcript md5
is not used for this cross-frontend gate because `llama-speculative-simple`
emits accepted token groups and can overrun `llama-completion -no-cnv` for
the same `-n`.
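
The acceptance figure in the partial-rejection bullet above can be rechecked from the two counters alone. The following is a minimal sketch; the helper name `accept_rate_percent` is hypothetical and is not part of any llama.cpp tool.

```python
def accept_rate_percent(n_drafted: int, n_accept: int) -> float:
    """Return accepted draft tokens as a percentage of all drafted tokens."""
    if n_drafted <= 0:
        raise ValueError("n_drafted must be positive")
    if not 0 <= n_accept <= n_drafted:
        raise ValueError("n_accept must be within [0, n_drafted]")
    return 100.0 * n_accept / n_drafted


# Counters recorded in the Phase 14 MTP stderr.
n_drafted, n_accept = 39, 20
n_rejected = n_drafted - n_accept
rate = accept_rate_percent(n_drafted, n_accept)
print(f"rejected={n_rejected} accept={rate:.3f}%")  # rejected=19 accept=51.282%
```

The 19 rejected drafts are what make this run a rollback test rather than a pure acceptance run.
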
Normal inference gates after Phase 14:
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- `MUL_MAT_ID`: `806/806`, `Backend CUDA0: OK`.
Decision:
- MTP rollback safety passed every Phase 14 gate, which is enough to scope a
Phase 15 serving/API throughput gate.
- Do not enable MTP by default.
- Do not count MTP as a GB10 speed-parity win until serving results show useful
target-verification throughput under the canonical inference gates.
## Phase 10 GDN C32 Slab Baseline and Source Check
Phase 10 starts a separate GDN prefill path; it does not reopen the rejected

@@ -206,7 +206,7 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher
| S2 double-buffer set_inputs | overlap host input build with GPU | DROPPED | `set_inputs` ~0.05 ms/step, nothing to recover |
| whole-step graph / host loop | host loop as serving residual | CLOSED (~0-1%) | reuse 0% (757.6) == S1+S3 72% (763.3); hostproc only ~4-8% of step wall |
| padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested, commit b028c81e)** | inert (BE) but regresses everywhere; N=8 burst 28.16->6.05 tok/s/seq; serving decode is GPU-compute-bound, dummy-row compute > reuse recovered |
| speculative decode (MTP) | draft + verify | ORTHOGONAL, not pursued | both engines have it; crux is hybrid-SSM in-place-state (0018) rollback; a feature both can add, not a paged-specific gap |
| speculative decode (MTP) | draft + verify | SAFETY-GATED, default-off | Phase 14 passed recurrent rollback, partial rejection, normalized greedy-prefix, and canonical inference gates; still needs serving/API throughput proof before any parity claim |
### 4.5 SHIPPED WINS (all BE / KL-benign) - keep these, do not regress
- **FP4-MMQ MoE/dense GEMM** (native Blackwell FP4-MMA at the FP4 weight-BW floor; reason 4.1 stays default-off).
@@ -225,6 +225,12 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher
The `VLLM_PARITY_LEVER_MAP.md` "pursue list" (A1-A7/B1-B7/C1: graph-safe ragged grouped FP4-MMA MoE kernel, FP8 paged KV, MTP spec-decode, etc.) is the **earlier working brainstorm written before the final profiling**. `VLLM_PARITY_FINAL.md` is the authoritative supersession; treat those buckets as rejected / infeasible / different-hardware unless re-validated on new silicon.
Phase 14 re-validated the MTP bucket as a separate default-off workstream:
rollback and ordinary inference safety are now gated, but speed parity is not
claimed. The serving follow-up must keep the same fixed gates before and after
any benchmark: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)

@@ -281,7 +281,7 @@ operating point where paged is unambiguously faster than vLLM.
| S2 double-buffer set_inputs | overlap host input build with GPU | **DROPPED** | `set_inputs` is **~0.05 ms/step** - nothing to recover (the rebuild was the cost) | DSS |
| whole-step graph / host loop | the host scheduling loop as the serving residual | **CLOSED (~0-1%)** | baseline reuse 0% (agg 757.6) **statistically equal** to S1+S3 reuse 72% (agg 763.3); `hostproc` only ~4-8% of the per-step wall = **measured dead** | DSS |
| padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested)** | inert (md5 bit-exact) but **regresses at every concurrency**; N=8 burst 28.16 -> 6.05 tok/s/seq (~4.6x slower); serving decode is **GPU-compute-bound**, dummy-row compute > reuse recovered | DSS |
| speculative decode (MTP) | draft + verify; greedy is bit-exact | **ORTHOGONAL, not pursued** | both engines have it; the crux is hybrid-SSM in-place-state (0018) rollback. Not a paged-specific gap - a feature both can add | LMAP |
| speculative decode (MTP) | draft + verify; greedy is bit-exact | **SAFETY-GATED, default-off** | Phase 14 passed recurrent rollback on the actual MoE GGUF, partial rejection, normalized greedy-prefix, and canonical inference gates; still not a paged-specific gap and still needs serving/API throughput proof before any parity claim | LMAP |
The serving regime was the one place the static-bench parity did not carry over
(paged ~3.7 vs vLLM ~5.9 tok/s/seq, -39%, DSS). S1 made the decode step reusable

@@ -454,8 +454,32 @@ Phase 9 adds a narrow MTP smoke gate instead of production enablement:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`.
MTP remains opt-in and exploratory. It does not supersede the next GDN prefill
scope until a serving phase proves target-verification cost and rollback safety.
MTP remains opt-in and, after Phase 14, safety-gated but not throughput-proven.
It does not supersede the next GDN prefill scope until a serving phase proves
target-verification cost.
### Phase 14 MTP rollback update
Phase 14 closes the safety gap left open by Phase 9, but still does not claim a
throughput/parity win:
- `test-recurrent-state-rollback` passed on the actual MoE GGUF and logged
`recurrent rollback checkpoint restored successfully`.
- MTP stderr showed bounded recurrent rollback support:
`the context supports bounded partial sequence removal`.
- A partial-rejection run produced `n_drafted=39`, `n_accept=20`,
`accept=51.282%` with no backend sampler multi-output error.
- Canonical inference gates stayed green after the MTP work:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
The greedy-equivalence gate uses normalized raw-output prefix comparison rather
than exact transcript md5 because `llama-speculative-simple` emits accepted
token groups and can produce a longer completion than `llama-completion -no-cnv`
for the same `-n`. Across `n=8,16,24,32,48`, no first differing token was found.
Next step: Phase 15 may benchmark serving/API throughput with MTP still
default-off and only behind the canonical inference gates.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

@@ -0,0 +1,195 @@
# MTP Rollback and Serving Gates Phase 14 Plan
> **For agentic workers:** keep checkboxes current while executing. This phase
> is safety-gated and must not claim an MTP parity win.
**Goal:** prove that MTP speculative decode can reject drafts without corrupting
Qwen3.6 paged KV or recurrent GDN state.
**Design:** `docs/superpowers/specs/2026-07-01-mtp-rollback-serving-gates-design.md`
## Required Safety Gates
- DGX must have no running docker containers, no `local-ai-worker`, no GPU
compute PIDs, and a free or absent `~/gpu_bench_lock/owner`.
- Use `/home/mudler/llama-phase6-source` on DGX and keep it clean unless a
source patch is explicitly required.
- Do not benchmark MTP as a parity win in this phase.
- Do not enable MTP by default in LocalAI or llama-server.
## Task 1: Preflight and Existing Rollback Gate
- [x] **Step 1: Confirm DGX is free**
Result:
```text
docker=0
local_ai_worker=0
compute=0
FREE released-by-codex-phase6-mmq-grid 1782860601
```
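
The preflight result above could be checked mechanically with something like the sketch below. It assumes the report has exactly the shape shown (three `key=value` counters plus an optional lock line whose first word is `FREE`); the function name `dgx_is_free` and that lock-line convention are assumptions made for illustration.

```python
REQUIRED_ZERO = ("docker", "local_ai_worker", "compute")


def dgx_is_free(report: str) -> bool:
    """Return True when all counters are 0 and any lock line starts with FREE."""
    counts = {}
    lock_lines = []
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if sep and key in REQUIRED_ZERO:
            counts[key] = int(value)
        else:
            # Anything that is not a known counter is treated as a lock line.
            lock_lines.append(line)
    counters_ok = all(counts.get(key) == 0 for key in REQUIRED_ZERO)
    # An absent lock line is acceptable: all() over an empty list is True.
    lock_ok = all(line.split()[0] == "FREE" for line in lock_lines)
    return counters_ok and lock_ok


REPORT = """docker=0
local_ai_worker=0
compute=0
FREE released-by-codex-phase6-mmq-grid 1782860601
"""
print(dgx_is_free(REPORT))
```

A missing counter is treated as a failure, so a truncated report cannot pass by accident.
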
- [x] **Step 2: Run recurrent rollback test on actual MoE GGUF**
Command:
```bash
ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda &&
cmake --build . --target test-recurrent-state-rollback -j 8 &&
./bin/test-recurrent-state-rollback \
-m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \
-ngl 99 -fa on -c 4096 -b 64 -ub 64 \
> /home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.out \
2> /home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err'
```
Current evidence from the same command family:
- Artifact:
`/home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err`.
- Result:
`main : recurrent rollback checkpoint restored successfully`.
## Task 2: MTP Greedy-Equivalence Gate
- [x] **Step 1: Build required binaries**
Build `llama-completion`, `llama-speculative-simple`, and
`test-recurrent-state-rollback`.
- [x] **Step 2: Run baseline greedy completion**
Save stdout/stderr and md5 under
`/home/mudler/bench/phase14_mtp_rollback/greedy_baseline.*`.
Additional raw text-generation baselines were saved under
`/home/mudler/bench/phase14_mtp_rollback/completion_nocnv_n{8,16,24,32,48}.*`
because `llama-completion` defaults to conversation mode for this model unless
`-no-cnv` is passed.
- [x] **Step 3: Run MTP speculative completion with the same prompt/seed**
Use:
- `--spec-type draft-mtp`
- `--spec-draft-model /home/mudler/bench/q36-35b-a3b-nvfp4.gguf`
- `--spec-draft-ngl 99`
- `--spec-draft-n-max 3`
- `--temp 0 --seed 1`
Save stdout/stderr and md5 under
`/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.*`.
- [x] **Step 4: Compare outputs**
Exact transcript md5 is not a valid cross-frontend comparator here:
- `llama-speculative-simple --spec-type none` is not a working no-draft
baseline; it still tries to load an empty draft model and exits with
`failed to load draft model, ''`.
- `--spec-draft-n-max 0` is not a no-draft baseline either; the recorded run
still drafted and accepted tokens (`n_drafted=17`, `n_accept=17`).
- `llama-speculative-simple` counts/emits accepted token groups, so the same
`-n` can produce a longer raw completion than `llama-completion -no-cnv`.
Normalized raw-output prefix gate passed for `n=8,16,24,32,48`; no run showed
a first differing token. The MTP output had the `llama-completion -no-cnv`
output as a prefix in each case. The `n=32` MTP artifact was
`/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.out`.
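
The normalized prefix comparison can be sketched as follows. This is an illustration, not the script used in Phase 14: it compares characters rather than tokens, and the exact normalization (strip leading newlines and trailing whitespace) is an assumption.

```python
def normalize(text: str) -> str:
    """Assumed normalization: drop leading newlines and trailing whitespace."""
    return text.lstrip("\n").rstrip()


def first_difference(a: str, b: str):
    """Return the index of the first differing character, or None.

    None means the shorter string is a prefix of the longer one.
    """
    for index, (ch_a, ch_b) in enumerate(zip(a, b)):
        if ch_a != ch_b:
            return index
    return None


def prefix_gate(baseline: str, speculative: str) -> bool:
    """Pass when the two normalized outputs never diverge."""
    return first_difference(normalize(baseline), normalize(speculative)) is None


# Hypothetical outputs: the speculative run overran the baseline's -n budget.
print(prefix_gate("\n\nThe sky is", "The sky is blue today\n"))
```

Comparing prefixes instead of md5 sums tolerates the longer speculative output while still failing on any real divergence.
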
## Task 3: MTP Partial-Rejection Gate
- [x] **Step 1: Confirm rejection occurred**
Parse MTP stderr and require:
- `n_drafted > 0`
- `n_accept >= 0`
- `n_drafted > n_accept`
Result from `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err`:
```text
n_drafted = 39
n_accept = 20
accept = 51.282%
```
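
The three requirements above can be checked with a short parser. This is a sketch that assumes the counters appear in stderr as `name = value` (with or without spaces around `=`); the function names are hypothetical.

```python
import re

# Matches "n_drafted = 39" as well as "n_drafted=39".
COUNTER_RE = re.compile(r"\b(n_drafted|n_accept)\s*=\s*(\d+)")


def parse_counters(stderr_text: str) -> dict:
    """Collect the counters; if a name repeats, the last occurrence wins."""
    return {name: int(value) for name, value in COUNTER_RE.findall(stderr_text)}


def partial_rejection_gate(stderr_text: str) -> bool:
    """Pass only when drafting happened and at least one draft was rejected."""
    counters = parse_counters(stderr_text)
    if "n_drafted" not in counters or "n_accept" not in counters:
        return False
    drafted, accepted = counters["n_drafted"], counters["n_accept"]
    return drafted > 0 and accepted >= 0 and drafted > accepted


SAMPLE = "n_drafted = 39\nn_accept = 20\naccept = 51.282%\n"
print(parse_counters(SAMPLE), partial_rejection_gate(SAMPLE))
```

A run such as `n_drafted=17`, `n_accept=17` fails this gate because no draft was rejected, so it exercises no rollback.
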
- [x] **Step 2: Confirm no backend sampler error**
Fail if stderr contains:
```text
backend sampling requires at most one output token per sequence
```
Result: absent from the MTP stderr. The expected warning was present instead:
`backend draft sampling is disabled for MTP`.
- [x] **Step 3: Record whether bounded recurrent rollback is active**
Record `n_rs_seq` or the log line showing bounded partial sequence removal.
Result from `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err`:
```text
common_context_can_seq_rm: the context supports bounded partial sequence removal
```
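
The stderr checks in Steps 2 and 3 can be combined into one scan. The sketch below uses the three log strings quoted in this plan; treating the warning and the rollback line as required (rather than merely recorded) is an assumption, and `check_mtp_stderr` is a hypothetical helper name.

```python
FORBIDDEN = "backend sampling requires at most one output token per sequence"
EXPECTED_WARNING = "backend draft sampling is disabled for MTP"
ROLLBACK_MARKER = "the context supports bounded partial sequence removal"


def check_mtp_stderr(stderr_text: str) -> list:
    """Return a list of problems; an empty list means the scan passed."""
    problems = []
    if FORBIDDEN in stderr_text:
        problems.append("backend sampler multi-output error present")
    if EXPECTED_WARNING not in stderr_text:
        problems.append("expected MTP draft-sampling warning missing")
    if ROLLBACK_MARKER not in stderr_text:
        problems.append("bounded partial sequence removal not reported")
    return problems


# Hypothetical stderr excerpt built from the lines quoted in this plan.
GOOD_LOG = (
    "backend draft sampling is disabled for MTP\n"
    "common_context_can_seq_rm: the context supports bounded partial "
    "sequence removal\n"
)
print(check_mtp_stderr(GOOD_LOG))
```

Returning the list of problems instead of a boolean keeps the failing condition visible in the recorded artifact.
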
## Task 4: Standard Inference Gates
- [x] **Step 1: Run paged inference gate helper**
Run:
```bash
/tmp/paged-inference-gates.sh
```
Expected:
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`.
- `MUL_MAT_ID` `806/806`.
Result:
```text
moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0
dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439
806/806 tests passed
Backend CUDA0: OK
paged inference gates OK
artifacts: /home/mudler/bench/paged_inference_gates/20260701_041117
```
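
The contents of `/tmp/paged-inference-gates.sh` are not shown here, so the following is only a sketch of the two kinds of check its output implies: an md5 comparison against the canonical values and a pass-count check on the `MUL_MAT_ID` log. The function names and the idea of hashing a transcript file are assumptions.

```python
import hashlib
import re
from pathlib import Path

# Canonical transcript digests quoted in this plan.
EXPECTED_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}

PASS_RE = re.compile(r"(\d+)/(\d+) tests passed")


def md5_hex(data: bytes) -> str:
    """Return the lowercase hex md5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def transcript_matches(path: str, name: str) -> bool:
    """Compare a transcript file against the canonical digest for `name`."""
    return md5_hex(Path(path).read_bytes()) == EXPECTED_MD5[name]


def all_tests_passed(log_text: str) -> bool:
    """Pass when a 'P/T tests passed' line exists with P == T and T > 0."""
    match = PASS_RE.search(log_text)
    if match is None:
        return False
    passed, total = int(match.group(1)), int(match.group(2))
    return total > 0 and passed == total


print(all_tests_passed("806/806 tests passed\nBackend CUDA0: OK"))
```

The same fixed digests are meant to be rechecked before and after any later benchmark, so keeping them in one table avoids drift between runs.
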
## Task 5: Disposition
- [x] **Step 1: If all gates pass**
Update:
- `GB10_PARITY_PHASE0_RESULTS.md`
- `VLLM_PARITY_LEVER_MAP.md`
- `PARITY_HANDOFF.md`
Record that MTP rollback safety is green and Phase 15 can be a serving/API
benchmark, still default-off.
- [x] **Step 2: If any gate fails**
Stop before performance benchmarking, save artifacts, and either implement a
narrow fork-first fix or record the failed gate as a blocker for MTP parity.
Reviewed; this branch was not taken. The original exact-md5 wording was too strict for
this example harness, but there was no token divergence after raw-output
normalization. Do not add a production source patch in Phase 14. Carry the
frontend/token accounting finding into Phase 15 and benchmark serving only
behind the same canonical inference gates.
## Self-Review
- No placeholders remain.
- Scope is limited to rollback and greedy-equivalence safety.
- Phase 14 does not claim or benchmark speed parity.

@@ -0,0 +1,89 @@
# MTP Rollback and Serving Gates Design
## Goal
Move MTP speculative decoding from a smoke-only Phase 9 result to a gated
parity workstream by proving that Qwen3.6 hybrid recurrent state can be rolled
back safely under speculative rejection.
This phase does not enable MTP by default and does not count MTP as a speed
win. It creates the evidence required before any serving benchmark can be
interpreted as valid.
## Current Evidence
Phase 9 proved that:
- The MoE GGUF contains Qwen3.6 `nextn` tensors.
- `draft-mtp` can run with the current model after backend draft sampling is
disabled for MTP.
- Normal MoE and dense transcript md5 gates remain canonical.
The missing proof is that speculative rejection restores both memory systems:
- paged attention KV state,
- gated-DeltaNet recurrent state, including `n_rs_seq` snapshot rollback.
## Existing Mechanism
The current fork already contains the mechanism this phase should validate:
- `common_params_speculative::need_n_rs_seq()` requests recurrent snapshots for
`draft-mtp` and `draft-eagle3`.
- Qwen3.5/Qwen3.6 architectures advertise recurrent rollback support through
`llm_arch_supports_rs_rollback()`.
- `llama_memory_recurrent::seq_rm()` can roll back within the bounded
`n_rs_seq` window by selecting an older recurrent-state snapshot.
- `tests/test-recurrent-state-rollback.cpp` verifies snapshot save/restore and
dirty-context cleanup for recurrent models.
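
To make the bounded-window idea concrete, here is a toy model of snapshot rollback. It is not the fork's implementation: the class, the integer "state", and the update rule are all invented for illustration, and the real meaning of `n_rs_seq` may differ from the `window` parameter used here.

```python
from collections import deque


class BoundedSnapshotState:
    """Toy recurrent state that keeps only the last `window` snapshots."""

    def __init__(self, window: int):
        self.state = 0  # stand-in for the recurrent state tensor
        self.pos = 0    # number of tokens consumed so far
        # Each entry is (pos, state) captured just before consuming a token.
        self.snapshots = deque(maxlen=window)

    def step(self, token: int) -> None:
        self.snapshots.append((self.pos, self.state))
        self.state = (self.state * 31 + token) % 1_000_003
        self.pos += 1

    def rollback_to(self, pos: int) -> bool:
        """Restore the state after `pos` tokens; False if outside the window."""
        if pos == self.pos:
            return True
        for snap_pos, snap_state in self.snapshots:
            if snap_pos == pos:
                break
        else:
            return False  # snapshot already evicted, state left untouched
        self.state, self.pos = snap_state, snap_pos
        # Snapshots at or after `pos` belong to the discarded tokens.
        while self.snapshots and self.snapshots[-1][0] >= pos:
            self.snapshots.pop()
        return True


ctx = BoundedSnapshotState(window=3)
for tok in (1, 2, 3):
    ctx.step(tok)
committed = ctx.state
ctx.step(9)  # two speculative tokens that will be rejected
ctx.step(9)
print(ctx.rollback_to(3), ctx.state == committed)
```

Unlike a KV cache, where rejected positions can simply be dropped, a recurrent state is overwritten in place, which is why a rollback needs a saved snapshot and why the window bounds how far back a rejection can reach.
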
## Phase 14 Gates
Phase 14 has three gates:
1. **Rollback mechanism gate.** Build and run `test-recurrent-state-rollback`
against `/home/mudler/bench/q36-35b-a3b-nvfp4.gguf` on DGX. This proves the
actual model can restore recurrent snapshots and replay logits.
2. **MTP greedy-equivalence gate.** Run baseline greedy completion and MTP
speculative completion on the same prompt/seed and compare normalized raw
text. Exact transcript md5 is only valid when the same frontend emits the
same number of generated tokens. `llama-speculative-simple` commits accepted
token groups, so its output can be longer than `llama-completion -no-cnv`
for the same `-n`. Treat the gate as a safety pass only if one normalized
output is a prefix of the other and there is no first differing token.
3. **MTP partial-rejection gate.** Run an MTP configuration that drafts more
than one token and records `n_drafted > n_accept`, while still matching
greedy output. This proves rejection happened and did not corrupt
inference state.
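
The relationship between gates 2 and 3 can be illustrated with a toy draft-and-verify loop. Everything below is invented for illustration (the "models" are arithmetic functions, and rejected drafts are simply discarded instead of rolling back real KV or recurrent state); it only shows why a correct speculative loop reproduces greedy output, rejects some drafts, and can overrun the requested length.

```python
def target_next(seq):
    """Deterministic stand-in for the target model's greedy next token."""
    return (sum(seq) * 7 + len(seq)) % 11


def draft_next(seq):
    """Imperfect draft: wrong whenever the prefix length is a multiple of 3."""
    tok = target_next(seq)
    return tok if len(seq) % 3 else (tok + 1) % 11


def greedy(prompt, n):
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative(prompt, n, n_draft=3):
    seq = list(prompt)
    n_drafted = n_accept = 0
    while len(seq) - len(prompt) < n:
        # Draft n_draft tokens autoregressively.
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(seq + draft))
        n_drafted += len(draft)
        # Verify: keep the longest draft prefix the target agrees with.
        accepted = []
        for tok in draft:
            if tok != target_next(seq + accepted):
                break
            accepted.append(tok)
        n_accept += len(accepted)
        # Commit the accepted group plus the target's own next token.
        seq += accepted
        seq.append(target_next(seq))
    return seq[len(prompt):], n_drafted, n_accept


out, drafted, accepted = speculative([1, 2], n=6)
print(len(out), drafted, accepted, out[:6] == greedy([1, 2], 6))
```

Because tokens are committed in groups, the loop can stop past `n`, which is the same reason the real comparison uses a prefix check instead of an exact md5.
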
## Source Policy
Do not add a production source patch in this phase unless one of the gates fails
and the root cause is isolated. If all gates pass, record the evidence and then
scope a separate serving/API benchmark phase.
If a source patch is required, it must be fork-first, default-off or
test-only, and must pass:
- MoE transcript md5 `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5 `5951a5b4d624ce891e22ab5fca9bc439`.
- `test-recurrent-state-rollback` on the actual MoE GGUF.
- The MTP greedy-equivalence and partial-rejection gates.
## Stop Conditions
Stop and do not benchmark MTP for speed if:
- rollback test fails,
- MTP output differs from greedy baseline at `temp=0` after normalizing the
example frontend's leading newlines,
- no run can produce both `n_drafted > 0` and `n_drafted > n_accept`,
- any run requires backend draft sampling for MTP,
- DGX is not free of docker containers, `local-ai-worker`, and GPU compute
processes.
## Follow-up
Only after Phase 14 passes should Phase 15 measure serving/API throughput.
Phase 15 must compare non-spec serving against MTP serving with the same prompt
shape, request count, seed behavior, and canonical inference gates.