mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): gate MTP rollback safety
Record Phase 14 MTP rollback evidence, normalized greedy-prefix checks, and canonical inference gates.

Assisted-by: Codex:gpt-5
@@ -840,6 +840,51 @@ Decision:
 - Do not benchmark MTP as a parity win until a serving/API phase adds rollback
 gates for hybrid SSM/KV state and measures target verification throughput.

+## Phase 14 MTP Rollback and Inference-Safety Gate
+
+Phase 14 tested the missing safety question from Phase 9: whether MTP
+speculative rejection can run against the actual Qwen3.6 MoE GGUF without
+corrupting paged KV or recurrent GDN state.
+
+Artifacts:
+
+- `/home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err`
+- `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err`
+- `/home/mudler/bench/phase14_mtp_rollback/completion_nocnv_n{8,16,24,32,48}.out`
+- `/home/mudler/bench/phase14_mtp_rollback/mtp_n{8,16,24,48}.out`
+- `/home/mudler/bench/paged_inference_gates/20260701_041117`
+
+Safety evidence:
+
+- `test-recurrent-state-rollback` on
+`/home/mudler/bench/q36-35b-a3b-nvfp4.gguf` exited `0` and logged
+`recurrent rollback checkpoint restored successfully`.
+- MTP stderr logged bounded recurrent rollback support:
+`the context supports bounded partial sequence removal`.
+- MTP partial rejection occurred at `temp=0`:
+`n_drafted=39`, `n_accept=20`, `accept=51.282%`.
+- The backend sampler multi-output error stayed absent; the expected
+`backend draft sampling is disabled for MTP` warning was present.
+- Raw greedy text was prefix-equivalent after normalization for
+`n=8,16,24,32,48`; no first differing token was found. Exact transcript md5
+is not used for this cross-frontend gate because `llama-speculative-simple`
+emits accepted token groups and can overrun `llama-completion -no-cnv` for
+the same `-n`.
+
+Normal inference gates after Phase 14:
+
+- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- `MUL_MAT_ID`: `806/806`, `Backend CUDA0: OK`.
+
+Decision:
+
+- MTP rollback safety is green enough to scope a Phase 15 serving/API
+throughput gate.
+- Do not enable MTP by default.
+- Do not count MTP as a GB10 speed-parity win until serving results show useful
+target-verification throughput under the canonical inference gates.
+
 ## Phase 10 GDN C32 Slab Baseline and Source Check

 Phase 10 starts a separate GDN prefill path; it does not reopen the rejected
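The normalized greedy-prefix gate recorded for Phase 14 can be sketched as follows. This is a minimal illustration, not the script used for the run: the normalization rule (collapse whitespace runs, strip the ends) and the names `normalize` and `first_divergence` are assumptions, and the comparison works on characters rather than model tokens so that a completion cut mid-word still counts as a prefix.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Collapse whitespace runs and strip both ends (assumed normalization rule)."""
    return re.sub(r"\s+", " ", text).strip()


def first_divergence(reference: str, candidate: str) -> Optional[int]:
    """Return the index of the first differing character in the normalized texts.

    Returns None when the shorter normalized text is a prefix of the longer one,
    which is the passing condition for a prefix-equivalence gate.
    """
    ref, cand = normalize(reference), normalize(candidate)
    for i, (a, b) in enumerate(zip(ref, cand)):
        if a != b:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical outputs: the speculative frontend overruns the plain one.
    plain = "The quick  brown\n"
    speculative = "The quick brown fox jumps"
    print(first_divergence(plain, speculative))
```

Because only the shared prefix is compared, one frontend may emit more text than the other for the same `-n` without failing the gate, which is the case an exact transcript md5 cannot tolerate.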
@@ -206,7 +206,7 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher
 | S2 double-buffer set_inputs | overlap host input build with GPU | DROPPED | `set_inputs` ~0.05 ms/step, nothing to recover |
 | whole-step graph / host loop | host loop as serving residual | CLOSED (~0-1%) | reuse 0% (757.6) == S1+S3 72% (763.3); hostproc only ~4-8% of step wall |
 | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested, commit b028c81e)** | inert (BE) but regresses everywhere; N=8 burst 28.16->6.05 tok/s/seq; serving decode is GPU-compute-bound, dummy-row compute > reuse recovered |
-| speculative decode (MTP) | draft + verify | ORTHOGONAL, not pursued | both engines have it; crux is hybrid-SSM in-place-state (0018) rollback; a feature both can add, not a paged-specific gap |
+| speculative decode (MTP) | draft + verify | SAFETY-GATED, default-off | Phase 14 passed recurrent rollback, partial rejection, normalized greedy-prefix, and canonical inference gates; still needs serving/API throughput proof before any parity claim |

 ### 4.5 SHIPPED WINS (all BE / KL-benign) - keep these, do not regress
 - **FP4-MMQ MoE/dense GEMM** (native Blackwell FP4-MMA at the FP4 weight-BW floor; reason 4.1 stays default-off).
@@ -225,6 +225,12 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher

 The `VLLM_PARITY_LEVER_MAP.md` "pursue list" (A1-A7/B1-B7/C1: graph-safe ragged grouped FP4-MMA MoE kernel, FP8 paged KV, MTP spec-decode, etc.) is the **earlier working brainstorm written before the final profiling**. `VLLM_PARITY_FINAL.md` is the authoritative supersession; treat those buckets as rejected / infeasible / different-hardware unless re-validated on new silicon.

+Phase 14 re-validated the MTP bucket as a separate default-off workstream:
+rollback and ordinary inference safety are now gated, but speed parity is not
+claimed. The serving follow-up must keep the same fixed gates before and after
+any benchmark: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
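The fixed gates named above lend themselves to a small before/after check. The sketch below assumes each md5 is taken over the raw bytes of a captured inference transcript and that the `MUL_MAT_ID` result is available as a passed/total pair; both are assumptions about how the gates are collected, and the names `md5_hex` and `check_gates` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

# Gate values as recorded in the Phase 14 notes.
EXPECTED_MD5: Dict[str, str] = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}
EXPECTED_MUL_MAT_ID: Tuple[int, int] = (806, 806)


def md5_hex(data: bytes) -> str:
    """Hex md5 of a captured transcript (assumed to be hashed as raw bytes)."""
    return hashlib.md5(data).hexdigest()


def check_gates(
    transcripts: Dict[str, bytes],
    mul_mat_id: Tuple[int, int],
    expected_md5: Dict[str, str] = EXPECTED_MD5,
    expected_mul_mat_id: Tuple[int, int] = EXPECTED_MUL_MAT_ID,
) -> List[str]:
    """Return a list of human-readable gate failures; an empty list means green."""
    failures: List[str] = []
    for name, expected in expected_md5.items():
        actual = md5_hex(transcripts[name])
        if actual != expected:
            failures.append(f"{name} md5 {actual} != {expected}")
    if mul_mat_id != expected_mul_mat_id:
        failures.append(f"MUL_MAT_ID {mul_mat_id} != {expected_mul_mat_id}")
    return failures
```

Running the same check once before and once after a benchmark, and refusing to record results unless both calls return an empty list, is one way to keep the gates fixed across the serving follow-up.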
@@ -281,7 +281,7 @@ operating point where paged is unambiguously faster than vLLM.
 | S2 double-buffer set_inputs | overlap host input build with GPU | **DROPPED** | `set_inputs` is **~0.05 ms/step** - nothing to recover (the rebuild was the cost) | DSS |
 | whole-step graph / host loop | the host scheduling loop as the serving residual | **CLOSED (~0-1%)** | baseline reuse 0% (agg 757.6) **statistically equal** to S1+S3 reuse 72% (agg 763.3); `hostproc` only ~4-8% of the per-step wall = **measured dead** | DSS |
 | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested)** | inert (md5 bit-exact) but **regresses at every concurrency**; N=8 burst 28.16 -> 6.05 tok/s/seq (~4.6x slower); serving decode is **GPU-compute-bound**, dummy-row compute > reuse recovered | DSS |
-| speculative decode (MTP) | draft + verify; greedy is bit-exact | **ORTHOGONAL, not pursued** | both engines have it; the crux is hybrid-SSM in-place-state (0018) rollback. Not a paged-specific gap - a feature both can add | LMAP |
+| speculative decode (MTP) | draft + verify; greedy is bit-exact | **SAFETY-GATED, default-off** | Phase 14 passed recurrent rollback on the actual MoE GGUF, partial rejection, normalized greedy-prefix, and canonical inference gates; still not a paged-specific gap and still needs serving/API throughput proof before any parity claim | LMAP |

 The serving regime was the one place the static-bench parity did not carry over
 (paged ~3.7 vs vLLM ~5.9 tok/s/seq, -39%, DSS). S1 made the decode step reusable
@@ -454,8 +454,32 @@ Phase 9 adds a narrow MTP smoke gate instead of production enablement:
 MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
 `5951a5b4d624ce891e22ab5fca9bc439`.

-MTP remains opt-in and exploratory. It does not supersede the next GDN prefill
-scope until a serving phase proves target-verification cost and rollback safety.
+MTP remains opt-in and, after Phase 14, safety-gated but not throughput-proven.
+It does not supersede the next GDN prefill scope until a serving phase proves
+target-verification cost.
+
+### Phase 14 MTP rollback update
+
+Phase 14 closes the safety gap left open by Phase 9, but still does not claim a
+throughput/parity win:
+
+- `test-recurrent-state-rollback` passed on the actual MoE GGUF and logged
+`recurrent rollback checkpoint restored successfully`.
+- MTP stderr showed bounded recurrent rollback support:
+`the context supports bounded partial sequence removal`.
+- A partial-rejection run produced `n_drafted=39`, `n_accept=20`,
+`accept=51.282%` with no backend sampler multi-output error.
+- Canonical inference gates stayed green after the MTP work:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+The greedy-equivalence gate uses normalized raw-output prefix comparison rather
+than exact transcript md5 because `llama-speculative-simple` emits accepted
+token groups and can produce a longer completion than `llama-completion -no-cnv`
+for the same `-n`. Across `n=8,16,24,32,48`, no first differing token was found.
+
+Next step: Phase 15 may benchmark serving/API throughput with MTP still
+default-off and only behind the canonical inference gates.

 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
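To make the partial-rejection and rollback evidence above concrete, here is a toy model of greedy draft verification over a context that holds both KV entries and in-place recurrent state. It is a sketch under stated assumptions, not the llama.cpp or LocalAI implementation: `ToyContext`, `greedy_target`, `verify_draft`, and `acceptance_rate` are hypothetical names, the "model" is a small integer recurrence, and restoring a checkpoint and replaying the accepted tokens is only one possible rollback strategy.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyContext:
    """Hypothetical hybrid context: a KV list plus in-place recurrent state."""
    kv: List[int] = field(default_factory=list)
    recurrent: int = 1

    def advance(self, token: int) -> None:
        # KV growth is easy to truncate; the recurrent update overwrites state.
        self.kv.append(token)
        self.recurrent = (self.recurrent * 31 + token) % 1_000_003

    def checkpoint(self) -> Tuple[int, int]:
        return len(self.kv), self.recurrent

    def restore(self, ckpt: Tuple[int, int]) -> None:
        n_kv, recurrent = ckpt
        del self.kv[n_kv:]
        self.recurrent = recurrent


def greedy_target(ctx: ToyContext) -> int:
    """Toy deterministic target: the next token depends only on recurrent state."""
    return ctx.recurrent % 7


def verify_draft(ctx: ToyContext, draft: List[int]) -> int:
    """Verify a draft greedily and return how many drafted tokens were accepted."""
    ckpt = ctx.checkpoint()
    predictions: List[int] = []
    # Emulate batched verification: state advances past every drafted token.
    for tok in draft:
        predictions.append(greedy_target(ctx))
        ctx.advance(tok)
    predictions.append(greedy_target(ctx))  # extra token if the whole draft matches

    n_accept = 0
    while n_accept < len(draft) and draft[n_accept] == predictions[n_accept]:
        n_accept += 1

    if n_accept < len(draft):
        # Partial rejection: undo the over-advanced state, replay accepted tokens.
        ctx.restore(ckpt)
        for tok in draft[:n_accept]:
            ctx.advance(tok)
    ctx.advance(predictions[n_accept])  # the target's own token at that position
    return n_accept


def acceptance_rate(n_drafted: int, n_accept: int) -> float:
    """Accepted drafted tokens as a percentage of all drafted tokens."""
    return 100.0 * n_accept / n_drafted
```

In this toy, the context after `verify_draft` matches plain greedy decoding token for token, which is the property the greedy-equivalence gate checks on real output. With the figures logged in Phase 14, `acceptance_rate(39, 20)` evaluates to about 51.282, consistent with the reported `accept=51.282%`.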