diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index d8bde023c..5bb0de66a 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -840,6 +840,51 @@ Decision:
 - Do not benchmark MTP as a parity win until a serving/API phase adds rollback
   gates for hybrid SSM/KV state and measures target verification throughput.
 
+## Phase 14 MTP Rollback and Inference-Safety Gate
+
+Phase 14 tested the missing safety question from Phase 9: whether MTP
+speculative rejection can run against the actual Qwen3.6 MoE GGUF without
+corrupting paged KV or recurrent GDN state.
+
+Artifacts:
+
+- `/home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err`
+- `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err`
+- `/home/mudler/bench/phase14_mtp_rollback/completion_nocnv_n{8,16,24,32,48}.out`
+- `/home/mudler/bench/phase14_mtp_rollback/mtp_n{8,16,24,48}.out`
+- `/home/mudler/bench/paged_inference_gates/20260701_041117`
+
+Safety evidence:
+
+- `test-recurrent-state-rollback` on
+  `/home/mudler/bench/q36-35b-a3b-nvfp4.gguf` exited `0` and logged
+  `recurrent rollback checkpoint restored successfully`.
+- MTP stderr logged bounded recurrent rollback support:
+  `the context supports bounded partial sequence removal`.
+- MTP partial rejection occurred at `temp=0`:
+  `n_drafted=39`, `n_accept=20`, `accept=51.282%`.
+- The backend sampler multi-output error stayed absent; the expected
+  `backend draft sampling is disabled for MTP` warning was present.
+- Raw greedy text was prefix-equivalent after normalization for
+  `n=8,16,24,32,48`; no first differing token was found. Exact transcript md5
+  is not used for this cross-frontend gate because `llama-speculative-simple`
+  emits accepted token groups and can overrun `llama-completion -no-cnv` for
+  the same `-n`.
+
+Normal inference gates after Phase 14:
+
+- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- `MUL_MAT_ID`: `806/806`, `Backend CUDA0: OK`.
+
+Decision:
+
+- MTP rollback safety is green enough to scope a Phase 15 serving/API
+  throughput gate.
+- Do not enable MTP by default.
+- Do not count MTP as a GB10 speed-parity win until serving results show useful
+  target-verification throughput under the canonical inference gates.
+
 ## Phase 10 GDN C32 Slab Baseline and Source Check
 
 Phase 10 starts a separate GDN prefill path; it does not reopen the rejected
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index ece552a5e..93d38b969 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -206,7 +206,7 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher
 | S2 double-buffer set_inputs | overlap host input build with GPU | DROPPED | `set_inputs` ~0.05 ms/step, nothing to recover |
 | whole-step graph / host loop | host loop as serving residual | CLOSED (~0-1%) | reuse 0% (757.6) == S1+S3 72% (763.3); hostproc only ~4-8% of step wall |
 | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested, commit b028c81e)** | inert (BE) but regresses everywhere; N=8 burst 28.16->6.05 tok/s/seq; serving decode is GPU-compute-bound, dummy-row compute > reuse recovered |
-| speculative decode (MTP) | draft + verify | ORTHOGONAL, not pursued | both engines have it; crux is hybrid-SSM in-place-state (0018) rollback; a feature both can add, not a paged-specific gap |
+| speculative decode (MTP) | draft + verify | SAFETY-GATED, default-off | Phase 14 passed recurrent rollback, partial rejection, normalized greedy-prefix, and canonical inference gates; still needs serving/API throughput proof before any parity claim |
 
 ### 4.5 SHIPPED WINS (all BE / KL-benign) - keep these, do not regress
 - **FP4-MMQ MoE/dense GEMM** (native Blackwell FP4-MMA at the FP4 weight-BW floor; reason 4.1 stays default-off).
@@ -225,6 +225,12 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher
 
 The `VLLM_PARITY_LEVER_MAP.md` "pursue list" (A1-A7/B1-B7/C1: graph-safe ragged grouped FP4-MMA MoE kernel, FP8 paged KV, MTP spec-decode, etc.) is the **earlier working brainstorm written before the final profiling**. `VLLM_PARITY_FINAL.md` is the authoritative supersession; treat those buckets as rejected / infeasible / different-hardware unless re-validated on new silicon.
 
+Phase 14 re-validated the MTP bucket as a separate default-off workstream:
+rollback and ordinary inference safety are now gated, but speed parity is not
+claimed. The serving follow-up must keep the same fixed gates before and after
+any benchmark: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
index ffd86a5cf..bc9ac3a40 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -281,7 +281,7 @@ operating point where paged is unambiguously faster than vLLM.
 | S2 double-buffer set_inputs | overlap host input build with GPU | **DROPPED** | `set_inputs` is **~0.05 ms/step** - nothing to recover (the rebuild was the cost) | DSS |
 | whole-step graph / host loop | the host scheduling loop as the serving residual | **CLOSED (~0-1%)** | baseline reuse 0% (agg 757.6) **statistically equal** to S1+S3 reuse 72% (agg 763.3); `hostproc` only ~4-8% of the per-step wall = **measured dead** | DSS |
 | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested)** | inert (md5 bit-exact) but **regresses at every concurrency**; N=8 burst 28.16 -> 6.05 tok/s/seq (~4.6x slower); serving decode is **GPU-compute-bound**, dummy-row compute > reuse recovered | DSS |
-| speculative decode (MTP) | draft + verify; greedy is bit-exact | **ORTHOGONAL, not pursued** | both engines have it; the crux is hybrid-SSM in-place-state (0018) rollback. Not a paged-specific gap - a feature both can add | LMAP |
+| speculative decode (MTP) | draft + verify; greedy is bit-exact | **SAFETY-GATED, default-off** | Phase 14 passed recurrent rollback on the actual MoE GGUF, partial rejection, normalized greedy-prefix, and canonical inference gates; still not a paged-specific gap and still needs serving/API throughput proof before any parity claim | LMAP |
 
 The serving regime was the one place the static-bench parity did not carry over
 (paged ~3.7 vs vLLM ~5.9 tok/s/seq, -39%, DSS). S1 made the decode step reusable
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 05b1ca092..886eb5939 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -454,8 +454,32 @@ Phase 9 adds a narrow MTP smoke gate instead of production enablement:
   MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
   `5951a5b4d624ce891e22ab5fca9bc439`.
 
-MTP remains opt-in and exploratory. It does not supersede the next GDN prefill
-scope until a serving phase proves target-verification cost and rollback safety.
+MTP remains opt-in and, after Phase 14, safety-gated but not throughput-proven.
+It does not supersede the next GDN prefill scope until a serving phase proves
+target-verification cost.
+
+### Phase 14 MTP rollback update
+
+Phase 14 closes the safety gap left open by Phase 9, but still does not claim a
+throughput/parity win:
+
+- `test-recurrent-state-rollback` passed on the actual MoE GGUF and logged
+  `recurrent rollback checkpoint restored successfully`.
+- MTP stderr showed bounded recurrent rollback support:
+  `the context supports bounded partial sequence removal`.
+- A partial-rejection run produced `n_drafted=39`, `n_accept=20`,
+  `accept=51.282%` with no backend sampler multi-output error.
+- Canonical inference gates stayed green after the MTP work:
+  MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+  `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+The greedy-equivalence gate uses normalized raw-output prefix comparison rather
+than exact transcript md5 because `llama-speculative-simple` emits accepted
+token groups and can produce a longer completion than `llama-completion -no-cnv`
+for the same `-n`. Across `n=8,16,24,32,48`, no first differing token was found.
+
+Next step: Phase 15 may benchmark serving/API throughput with MTP still
+default-off and only behind the canonical inference gates.
 
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
diff --git a/docs/superpowers/plans/2026-07-01-mtp-rollback-serving-gates-phase14.md b/docs/superpowers/plans/2026-07-01-mtp-rollback-serving-gates-phase14.md
new file mode 100644
index 000000000..3b4c900ff
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mtp-rollback-serving-gates-phase14.md
@@ -0,0 +1,195 @@
+# MTP Rollback and Serving Gates Phase 14 Plan
+
+> **For agentic workers:** keep checkboxes current while executing. This phase
+> is safety-gated and must not claim an MTP parity win.
+
+**Goal:** prove that MTP speculative decode can reject drafts without corrupting
+Qwen3.6 paged KV or recurrent GDN state.
+
+**Design:** `docs/superpowers/specs/2026-07-01-mtp-rollback-serving-gates-design.md`
+
+## Required Safety Gates
+
+- DGX must have no running docker containers, no `local-ai-worker`, no GPU
+  compute PIDs, and a free or absent `~/gpu_bench_lock/owner`.
+- Use `/home/mudler/llama-phase6-source` on DGX and keep it clean unless a
+  source patch is explicitly required.
+- Do not benchmark MTP as a parity win in this phase.
+- Do not enable MTP by default in LocalAI or llama-server.
+
+## Task 1: Preflight and Existing Rollback Gate
+
+- [x] **Step 1: Confirm DGX is free**
+
+  Result:
+
+  ```text
+  docker=0
+  local_ai_worker=0
+  compute=0
+  FREE released-by-codex-phase6-mmq-grid 1782860601
+  ```
+
+- [x] **Step 2: Run recurrent rollback test on actual MoE GGUF**
+
+  Command:
+
+  ```bash
+  ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda &&
+    cmake --build . --target test-recurrent-state-rollback -j 8 &&
+    ./bin/test-recurrent-state-rollback \
+      -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \
+      -ngl 99 -fa on -c 4096 -b 64 -ub 64 \
+      > /home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.out \
+      2> /home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err'
+  ```
+
+  Current evidence from the same command family:
+
+  - Artifact:
+    `/home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err`.
+  - Result:
+    `main : recurrent rollback checkpoint restored successfully`.
+
+## Task 2: MTP Greedy-Equivalence Gate
+
+- [x] **Step 1: Build required binaries**
+
+  Build `llama-completion`, `llama-speculative-simple`, and
+  `test-recurrent-state-rollback`.
+
+- [x] **Step 2: Run baseline greedy completion**
+
+  Save stdout/stderr and md5 under
+  `/home/mudler/bench/phase14_mtp_rollback/greedy_baseline.*`.
+
+  Additional raw text-generation baselines were saved under
+  `/home/mudler/bench/phase14_mtp_rollback/completion_nocnv_n{8,16,24,32,48}.*`
+  because `llama-completion` defaults to conversation mode for this model unless
+  `-no-cnv` is passed.
+
+- [x] **Step 3: Run MTP speculative completion with the same prompt/seed**
+
+  Use:
+
+  - `--spec-type draft-mtp`
+  - `--spec-draft-model /home/mudler/bench/q36-35b-a3b-nvfp4.gguf`
+  - `--spec-draft-ngl 99`
+  - `--spec-draft-n-max 3`
+  - `--temp 0 --seed 1`
+
+  Save stdout/stderr and md5 under
+  `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.*`.
+
+- [x] **Step 4: Compare outputs**
+
+  Exact transcript md5 is not a valid cross-frontend comparator here:
+
+  - `llama-speculative-simple --spec-type none` is not a working no-draft
+    baseline; it still tries to load an empty draft model and exits with
+    `failed to load draft model, ''`.
+  - `--spec-draft-n-max 0` is not a no-draft baseline either; the recorded run
+    still drafted and accepted tokens (`n_drafted=17`, `n_accept=17`).
+  - `llama-speculative-simple` counts/emits accepted token groups, so the same
+    `-n` can produce a longer raw completion than `llama-completion -no-cnv`.
+
+  Normalized raw-output prefix gate passed for `n=8,16,24,32,48`; no run showed
+  a first differing token. The MTP output had the `llama-completion -no-cnv`
+  output as a prefix in each case. The `n=32` MTP artifact was
+  `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.out`.
+
+## Task 3: MTP Partial-Rejection Gate
+
+- [x] **Step 1: Confirm rejection occurred**
+
+  Parse MTP stderr and require:
+
+  - `n_drafted > 0`
+  - `n_accept >= 0`
+  - `n_drafted > n_accept`
+
+  Result from `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err`:
+
+  ```text
+  n_drafted = 39
+  n_accept = 20
+  accept = 51.282%
+  ```
+
+- [x] **Step 2: Confirm no backend sampler error**
+
+  Fail if stderr contains:
+
+  ```text
+  backend sampling requires at most one output token per sequence
+  ```
+
+  Result: absent from the MTP stderr. The expected warning was present instead:
+  `backend draft sampling is disabled for MTP`.
+
+- [x] **Step 3: Record whether bounded recurrent rollback is active**
+
+  Record `n_rs_seq` or the log line showing bounded partial sequence removal.
+
+  Result from `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err`:
+
+  ```text
+  common_context_can_seq_rm: the context supports bounded partial sequence removal
+  ```
+
+## Task 4: Standard Inference Gates
+
+- [x] **Step 1: Run paged inference gate helper**
+
+  Run:
+
+  ```bash
+  /tmp/paged-inference-gates.sh
+  ```
+
+  Expected:
+
+  - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`.
+  - `MUL_MAT_ID` `806/806`.
+
+  Result:
+
+  ```text
+  moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0
+  dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439
+  806/806 tests passed
+  Backend CUDA0: OK
+  paged inference gates OK
+  artifacts: /home/mudler/bench/paged_inference_gates/20260701_041117
+  ```
+
+## Task 5: Disposition
+
+- [x] **Step 1: If all gates pass**
+
+  Update:
+
+  - `GB10_PARITY_PHASE0_RESULTS.md`
+  - `VLLM_PARITY_LEVER_MAP.md`
+  - `PARITY_HANDOFF.md`
+
+  Record that MTP rollback safety is green and Phase 15 can be a serving/API
+  benchmark, still default-off.
+
+- [x] **Step 2: If any gate fails**
+
+  Stop before performance benchmarking, save artifacts, and either implement a
+  narrow fork-first fix or record the failed gate as a blocker for MTP parity.
+
+  Reviewed and not taken. The original exact-md5 wording was too strict for
+  this example harness, but there was no token divergence after raw-output
+  normalization. Do not add a production source patch in Phase 14. Carry the
+  frontend/token accounting finding into Phase 15 and benchmark serving only
+  behind the same canonical inference gates.
+
+## Self-Review
+
+- No placeholders remain.
+- Scope is limited to rollback and greedy-equivalence safety.
+- Phase 14 does not claim or benchmark speed parity.
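
The Task 3 checks in the plan (counters parsed from MTP stderr, plus absence of the backend sampler error) could be automated roughly as follows. This is a minimal sketch and not part of the patch: `check_partial_rejection` is a hypothetical helper, and the `name = value` counter format is assumed from the log excerpts quoted in the plan rather than from the `llama-speculative-simple` source.

```python
import re

# Error string quoted in the plan; its presence in stderr fails the gate.
SAMPLER_ERROR = "backend sampling requires at most one output token per sequence"


def check_partial_rejection(stderr_text: str) -> dict:
    """Parse n_drafted / n_accept and raise ValueError if a gate condition fails.

    Assumption: counters appear as `n_drafted = 39` or `n_drafted=39`.
    """
    counters = {}
    for name in ("n_drafted", "n_accept"):
        # Take the last occurrence in case the counter is logged more than once.
        matches = re.findall(rf"\b{name}\s*=\s*(\d+)", stderr_text)
        if not matches:
            raise ValueError(f"{name} not found in stderr")
        counters[name] = int(matches[-1])

    if SAMPLER_ERROR in stderr_text:
        raise ValueError("backend sampler multi-output error present")
    if counters["n_drafted"] <= 0:
        raise ValueError("no tokens were drafted")
    if counters["n_drafted"] <= counters["n_accept"]:
        raise ValueError("no rejection occurred (n_drafted <= n_accept)")
    return counters


if __name__ == "__main__":
    sample = "n_drafted = 39\nn_accept = 20\naccept = 51.282%\n"
    print(check_partial_rejection(sample))
```

With the recorded Phase 14 counters the check passes, and 20 of 39 drafted tokens corresponds to the logged 51.282% acceptance. The `--spec-draft-n-max 0` run (`n_drafted=17`, `n_accept=17`) would be rejected by this check because nothing was rolled back.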
diff --git a/docs/superpowers/specs/2026-07-01-mtp-rollback-serving-gates-design.md b/docs/superpowers/specs/2026-07-01-mtp-rollback-serving-gates-design.md
new file mode 100644
index 000000000..cf76ed130
--- /dev/null
+++ b/docs/superpowers/specs/2026-07-01-mtp-rollback-serving-gates-design.md
@@ -0,0 +1,89 @@
+# MTP Rollback and Serving Gates Design
+
+## Goal
+
+Move MTP speculative decoding from a smoke-only Phase 9 result to a gated
+parity workstream by proving that Qwen3.6 hybrid recurrent state can be rolled
+back safely under speculative rejection.
+
+This phase does not enable MTP by default and does not count MTP as a speed
+win. It creates the evidence required before any serving benchmark can be
+interpreted as valid.
+
+## Current Evidence
+
+Phase 9 proved that:
+
+- The MoE GGUF contains Qwen3.6 `nextn` tensors.
+- `draft-mtp` can run with the current model after backend draft sampling is
+  disabled for MTP.
+- Normal MoE and dense transcript md5 gates remain canonical.
+
+The missing proof is that speculative rejection restores both memory systems:
+
+- paged attention KV state,
+- gated-DeltaNet recurrent state, including `n_rs_seq` snapshot rollback.
+
+## Existing Mechanism
+
+The current fork already contains the mechanism this phase should validate:
+
+- `common_params_speculative::need_n_rs_seq()` requests recurrent snapshots for
+  `draft-mtp` and `draft-eagle3`.
+- Qwen3.5/Qwen3.6 architectures advertise recurrent rollback support through
+  `llm_arch_supports_rs_rollback()`.
+- `llama_memory_recurrent::seq_rm()` can roll back within the bounded
+  `n_rs_seq` window by selecting an older recurrent-state snapshot.
+- `tests/test-recurrent-state-rollback.cpp` verifies snapshot save/restore and
+  dirty-context cleanup for recurrent models.
+
+## Phase 14 Gates
+
+Phase 14 has three gates:
+
+1. **Rollback mechanism gate.** Build and run `test-recurrent-state-rollback`
+   against `/home/mudler/bench/q36-35b-a3b-nvfp4.gguf` on DGX. This proves the
+   actual model can restore recurrent snapshots and replay logits.
+2. **MTP greedy-equivalence gate.** Run baseline greedy completion and MTP
+   speculative completion on the same prompt/seed and compare normalized raw
+   text. Exact transcript md5 is only valid when the same frontend emits the
+   same number of generated tokens. `llama-speculative-simple` commits accepted
+   token groups, so its output can be longer than `llama-completion -no-cnv`
+   for the same `-n`. Treat the gate as a safety pass only if one normalized
+   output is a prefix of the other and there is no first differing token.
+3. **MTP partial-rejection gate.** Run an MTP configuration that drafts more
+   than one token and records `n_drafted > n_accept`, while still matching
+   greedy output. This proves rejection happened and did not corrupt
+   inferencing state.
+
+## Source Policy
+
+Do not add a production source patch in this phase unless one of the gates fails
+and the root cause is isolated. If all gates pass, record the evidence and then
+scope a separate serving/API benchmark phase.
+
+If a source patch is required, it must be fork-first, default-off or
+test-only, and must pass:
+
+- MoE transcript md5 `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense transcript md5 `5951a5b4d624ce891e22ab5fca9bc439`.
+- `test-recurrent-state-rollback` on the actual MoE GGUF.
+- The MTP greedy-equivalence and partial-rejection gates.
+
+## Stop Conditions
+
+Stop and do not benchmark MTP for speed if:
+
+- rollback test fails,
+- MTP output differs from greedy baseline at `temp=0` after normalizing the
+  example frontend's leading newlines,
+- no run can produce both `n_drafted > 0` and `n_drafted > n_accept`,
+- any run requires backend draft sampling for MTP,
+- DGX is not free of docker containers, `local-ai-worker`, and GPU compute
+  processes.
+
+## Follow-up
+
+Only after Phase 14 passes should Phase 15 measure serving/API throughput.
+Phase 15 must compare non-spec serving against MTP serving with the same prompt
+shape, request count, seed behavior, and canonical inference gates.
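
The greedy-equivalence gate in the design compares normalized raw text instead of transcript md5. One way to express that comparison is sketched below. The helpers `normalize`, `first_divergence`, and `prefix_gate` are hypothetical names, not functions from the repository. The sketch assumes that normalization means stripping the frontend's leading newlines, as the Stop Conditions describe, and it compares characters rather than token ids, so it only approximates a "first differing token" check.

```python
from typing import Optional


def normalize(text: str) -> str:
    # Assumption: the only frontend difference to remove is leading newlines.
    return text.lstrip("\r\n")


def first_divergence(baseline: str, candidate: str) -> Optional[int]:
    """Index of the first differing character within the shared length, else None."""
    a, b = normalize(baseline), normalize(candidate)
    for i, (ch_a, ch_b) in enumerate(zip(a, b)):
        if ch_a != ch_b:
            return i
    return None


def prefix_gate(baseline: str, candidate: str) -> bool:
    """Pass only if both outputs are non-empty and one is a prefix of the other."""
    if not normalize(baseline) or not normalize(candidate):
        # An empty output would trivially be a prefix, so treat it as a failure.
        return False
    return first_divergence(baseline, candidate) is None


if __name__ == "__main__":
    greedy = "\nThe quick brown fox"
    mtp = "The quick brown fox jumps over"  # longer because of accepted token groups
    print(prefix_gate(greedy, mtp))
```

Because the check is symmetric, it passes whether the MTP output or the baseline is the longer one, which matches the design's "one normalized output is a prefix of the other" wording.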