diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md index 2c77be0af..0ad0604a0 100644 --- a/backend/cpp/llama-cpp-localai-paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -344,6 +344,12 @@ md5 checks and selected `test-backend-ops` filters, and refuses to start while docker, `local-ai-worker`, GPU compute processes, or a non-free GPU lock are present. +For direct `llama-server` MTP serving A/B work, use +`paged-mtp-serving-bench.sh`. It runs the same pre/post inference gates, compares +baseline vs `--spec-type draft-mtp`, and captures the h2h client summaries plus +MTP acceptance lines. Phase 15 rejected current MTP serving on GB10 despite +passing safety gates; do not enable it by default. + **The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md)). Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5 (`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 5bb0de66a..ba1adca97 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -885,6 +885,65 @@ Decision: - Do not count MTP as a GB10 speed-parity win until serving results show useful target-verification throughput under the canonical inference gates. +## Phase 15 MTP Serving Throughput Gate + +Phase 15 measured the direct `llama-server` serving path after Phase 14 proved +rollback safety. The test compared two same-shape arms: + +- baseline: no speculative decoding, +- MTP: `--spec-type draft-mtp --spec-draft-n-max 3 + --no-spec-draft-backend-sampling`. 
+
+Artifact:
+
+- `/home/mudler/bench/phase15_mtp_serving/20260701_042005`
+
+Harness:
+
+- `backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh`
+- `NPL="8 32 128" PTOK=128 GEN=128 CTX=131072 PARALLEL=128`
+- client: `/home/mudler/bench/h2h_cli3.py` against `/v1/completions`
+
+Result:
+
+| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | TTFT mean ms | wall s |
+|---|---:|---:|---:|---:|---:|---:|
+| baseline | 8 | 192.5 | 247.8 | 30.70 | 1181.1 | 5.318 |
+| MTP | 8 | 92.9 | 109.8 | 14.26 | 1691.5 | 11.017 |
+| baseline | 32 | 305.4 | 406.0 | 12.02 | 2762.2 | 13.412 |
+| MTP | 32 | 95.8 | 111.7 | 3.61 | 4545.6 | 42.727 |
+| baseline | 128 | 429.5 | 662.4 | 4.31 | 7747.2 | 38.144 |
+| MTP | 128 | 100.3 | 138.5 | 0.97 | 20385.7 | 163.289 |
+
+MTP did actually run:
+
+- server initialized `draft-mtp` with bounded partial sequence removal,
+- response/server timings included draft counters,
+- server log tail included `#gen tokens = 17293`, `#acc tokens = 15493`.
+
+Normal inference gates before and after the A/B:
+
+- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- `MUL_MAT_ID`: `806/806`, `Backend CUDA0: OK`.
+
+Decision:
+
+- Reject current `llama-server` MTP as a GB10 serving parity lever.
+- Do not enable MTP by default in LocalAI or llama-server.
+- Do not tune `spec-draft-n-max` blindly. The regression is large enough that
+  the next MTP phase, if any, must start with graph/batch-shape profiling.
+
+Likely root cause:
+
+- Baseline serving preserved heavy graph reuse (`graphs reused = 361` in the
+  `n=128` tail).
+- MTP serving showed `graphs reused = 1` and high per-slot eval time at high
+  concurrency.
+- The working hypothesis is that MTP verification/draft batch shape churn
+  defeats the paged decode graph-reuse wins, so extra verification dominates
+  despite high draft acceptance.
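As a rough cross-check of the Phase 15 result table, the slowdown factors and the draft acceptance rate can be recomputed from the reported numbers. This is a minimal sketch and not part of the harness: the `slowdown` and `acceptance_rate` helpers are hypothetical names, and reading `#acc tokens / #gen tokens` as the acceptance rate is an assumption about what the server counters mean.

```python
# Decode aggregate throughput (tok/s) copied from the Phase 15 result table.
BASELINE = {8: 247.8, 32: 406.0, 128: 662.4}
MTP = {8: 109.8, 32: 111.7, 128: 138.5}

# Draft counters quoted from the MTP server log tail.
GEN_TOKENS = 17293
ACC_TOKENS = 15493


def slowdown(baseline_tps: float, mtp_tps: float) -> float:
    """Return how many times slower the MTP arm is than the baseline arm."""
    return baseline_tps / mtp_tps


def acceptance_rate(accepted: int, generated: int) -> float:
    """Fraction of drafted tokens accepted (assumed meaning of the counters)."""
    return accepted / generated


if __name__ == "__main__":
    for n in sorted(BASELINE):
        print(f"n={n}: MTP is {slowdown(BASELINE[n], MTP[n]):.2f}x slower")
    print(f"acceptance: {acceptance_rate(ACC_TOKENS, GEN_TOKENS):.1%}")
```

With the reported values this works out to roughly 2.3x, 3.6x, and 4.8x slower at n=8, 32, and 128, with about 90% of drafted tokens accepted, so the regression cannot be attributed to poor acceptance.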
+ ## Phase 10 GDN C32 Slab Baseline and Source Check Phase 10 starts a separate GDN prefill path; it does not reopen the rejected diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 93d38b969..cffbce06d 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -206,7 +206,7 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher | S2 double-buffer set_inputs | overlap host input build with GPU | DROPPED | `set_inputs` ~0.05 ms/step, nothing to recover | | whole-step graph / host loop | host loop as serving residual | CLOSED (~0-1%) | reuse 0% (757.6) == S1+S3 72% (763.3); hostproc only ~4-8% of step wall | | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested, commit b028c81e)** | inert (BE) but regresses everywhere; N=8 burst 28.16->6.05 tok/s/seq; serving decode is GPU-compute-bound, dummy-row compute > reuse recovered | -| speculative decode (MTP) | draft + verify | SAFETY-GATED, default-off | Phase 14 passed recurrent rollback, partial rejection, normalized greedy-prefix, and canonical inference gates; still needs serving/API throughput proof before any parity claim | +| speculative decode (MTP) | draft + verify | **REJECTED for current GB10 serving** | Phase 14 safety passed, but Phase 15 serving A/B regressed hard: n128 decode agg 662.4 -> 138.5 tok/s; likely graph/batch-shape disruption (`graphs reused` 361 -> 1) | ### 4.5 SHIPPED WINS (all BE / KL-benign) - keep these, do not regress - **FP4-MMQ MoE/dense GEMM** (native Blackwell FP4-MMA at the FP4 weight-BW floor; reason 4.1 stays default-off). 
@@ -225,11 +225,13 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher The `VLLM_PARITY_LEVER_MAP.md` "pursue list" (A1-A7/B1-B7/C1: graph-safe ragged grouped FP4-MMA MoE kernel, FP8 paged KV, MTP spec-decode, etc.) is the **earlier working brainstorm written before the final profiling**. `VLLM_PARITY_FINAL.md` is the authoritative supersession; treat those buckets as rejected / infeasible / different-hardware unless re-validated on new silicon. -Phase 14 re-validated the MTP bucket as a separate default-off workstream: -rollback and ordinary inference safety are now gated, but speed parity is not -claimed. The serving follow-up must keep the same fixed gates before and after -any benchmark: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 -`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. +Phase 14 re-validated the MTP bucket as safe, then Phase 15 rejected it as a +current GB10 serving-throughput lever. Do not enable it by default and do not +keep tuning draft length blindly. The only plausible follow-up is a graph-reuse +and speculative verification batch-shape profile with +`nsys --cuda-graph-trace=node`. The fixed safety gates stayed green before and +after the failed serving A/B: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense +md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. --- diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md index bc9ac3a40..b4d8e55f7 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md @@ -281,7 +281,7 @@ operating point where paged is unambiguously faster than vLLM. 
| S2 double-buffer set_inputs | overlap host input build with GPU | **DROPPED** | `set_inputs` is **~0.05 ms/step** - nothing to recover (the rebuild was the cost) | DSS | | whole-step graph / host loop | the host scheduling loop as the serving residual | **CLOSED (~0-1%)** | baseline reuse 0% (agg 757.6) **statistically equal** to S1+S3 reuse 72% (agg 763.3); `hostproc` only ~4-8% of the per-step wall = **measured dead** | DSS | | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested)** | inert (md5 bit-exact) but **regresses at every concurrency**; N=8 burst 28.16 -> 6.05 tok/s/seq (~4.6x slower); serving decode is **GPU-compute-bound**, dummy-row compute > reuse recovered | DSS | -| speculative decode (MTP) | draft + verify; greedy is bit-exact | **SAFETY-GATED, default-off** | Phase 14 passed recurrent rollback on the actual MoE GGUF, partial rejection, normalized greedy-prefix, and canonical inference gates; still not a paged-specific gap and still needs serving/API throughput proof before any parity claim | LMAP | +| speculative decode (MTP) | draft + verify; greedy is bit-exact | **REJECTED for current GB10 serving** | Phase 14 passed safety, but Phase 15 direct serving A/B regressed at every tested concurrency (n128 decode agg 662.4 -> 138.5 tok/s) despite high acceptance; likely breaks paged decode graph reuse (`graphs reused` 361 -> 1). Not a parity lever unless a future graph/batch-shape fix changes this result | LMAP | The serving regime was the one place the static-bench parity did not carry over (paged ~3.7 vs vLLM ~5.9 tok/s/seq, -39%, DSS). 
S1 made the decode step reusable diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index 886eb5939..077fb34a5 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -454,9 +454,9 @@ Phase 9 adds a narrow MTP smoke gate instead of production enablement: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`. -MTP remains opt-in and, after Phase 14, safety-gated but not throughput-proven. -It does not supersede the next GDN prefill scope until a serving phase proves -target-verification cost. +MTP remains opt-in and, after Phase 15, rejected as a current GB10 serving +throughput lever. It does not supersede the GDN/paged-serving conclusions unless +a future graph/batch-shape fix changes the serving result. ### Phase 14 MTP rollback update @@ -478,8 +478,35 @@ than exact transcript md5 because `llama-speculative-simple` emits accepted token groups and can produce a longer completion than `llama-completion -no-cnv` for the same `-n`. Across `n=8,16,24,32,48`, no first differing token was found. -Next step: Phase 15 may benchmark serving/API throughput with MTP still -default-off and only behind the canonical inference gates. +Phase 15 completed that serving/API benchmark and rejected current MTP serving. + +### Phase 15 MTP serving update + +Phase 15 ran the direct `llama-server` serving A/B that Phase 14 enabled. 
It
+rejects current MTP serving as a parity lever on GB10:
+
+| arm | n | decode agg t/s | decode per-seq t/s | TTFT mean ms |
+|---|---:|---:|---:|---:|
+| baseline | 8 | 247.8 | 30.70 | 1181.1 |
+| MTP | 8 | 109.8 | 14.26 | 1691.5 |
+| baseline | 32 | 406.0 | 12.02 | 2762.2 |
+| MTP | 32 | 111.7 | 3.61 | 4545.6 |
+| baseline | 128 | 662.4 | 4.31 | 7747.2 |
+| MTP | 128 | 138.5 | 0.97 | 20385.7 |
+
+Artifact: `/home/mudler/bench/phase15_mtp_serving/20260701_042005`.
+
+MTP did draft and accept tokens (`#gen tokens = 17293`, `#acc tokens = 15493`),
+so this is not a no-draft false negative. The likely culprit is graph/batch
+shape disruption: baseline logs show heavy graph reuse (`graphs reused = 361`
+in the high-concurrency tail), while MTP logs show `graphs reused = 1` and much
+higher per-slot eval time. Pre/post canonical inference gates stayed green:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Do not keep tuning MTP draft length blindly. A follow-up must first profile
+speculative verification batch shapes and CUDA graph reuse with
+`nsys --cuda-graph-trace=node`.
 
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
diff --git a/backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh b/backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh
new file mode 100755
index 000000000..be4ef58b9
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh
@@ -0,0 +1,200 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage: paged-mtp-serving-bench.sh
+
+Runs a direct llama-server serving A/B on DGX:
+  baseline: no speculative decoding
+  mtp:      --spec-type draft-mtp
+
+Environment overrides:
+  SRC       llama.cpp source dir (default: ~/llama-phase6-source)
+  BIN       binary dir (default: $SRC/build-cuda/bin)
+  MODEL     MoE GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf)
+  ART       artifact dir (default: ~/bench/phase15_mtp_serving/YYYYmmdd_HHMMSS)
+  PORT      server port (default: 8097)
+  NPL       comma/space list of concurrency values (default: "8 32 128")
+  PTOK      prompt filler words for h2h_cli3.py (default: 128)
+  GEN       max generated tokens (default: 128)
+  CTX       server context (default: 131072)
+  PARALLEL  server parallel slots (default: 128)
+  BATCH     server logical batch size (default: 2048)
+  UBATCH    server physical batch size (default: 512)
+  SKIP_GATES=1 to skip pre/post paged inference gates
+EOF
+}
+
+if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+  usage
+  exit 0
+fi
+
+SRC=${SRC:-"$HOME/llama-phase6-source"}
+BIN=${BIN:-"$SRC/build-cuda/bin"}
+MODEL=${MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"}
+ART=${ART:-"$HOME/bench/phase15_mtp_serving/$(date +%Y%m%d_%H%M%S)"}
+PORT=${PORT:-8097}
+NPL=${NPL:-"8 32 128"}
+PTOK=${PTOK:-128}
+GEN=${GEN:-128}
+CTX=${CTX:-131072}
+PARALLEL=${PARALLEL:-128}
+BATCH=${BATCH:-2048}
+UBATCH=${UBATCH:-512}
+SKIP_GATES=${SKIP_GATES:-0}
+
+LOCK_DIR="$HOME/gpu_bench_lock"
+OWNER="$LOCK_DIR/owner"
+SERVER_PID=""
+
+log() {
+  printf '[%s] %s\n' "$(date -Is)" "$*" | tee -a "$ART/run.log"
+}
+
+preflight() {
+  mkdir -p "$ART"
+  local docker_count local_ai compute owner
+  docker_count=$(docker ps -q | wc -l)
+  local_ai=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true)
+  compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed '/^$/d' | wc -l)
+  owner="FREE-no-lock-file"
+  if [[ -f "$OWNER" ]]; then
+    owner=$(cat "$OWNER")
+  fi
+  {
+    echo "docker=$docker_count"
+    echo "local_ai_worker=$local_ai"
+    echo "compute=$compute"
+    echo "$owner"
+  } | tee "$ART/preflight.txt"
+  [[ "$docker_count" == "0" ]]
+  [[ "$local_ai" == "0" ]]
+  [[ "$compute" == "0" ]]
+  case "$owner" in
+    FREE*|FREE-no-lock-file) ;;
+    *) echo "GPU lock is busy: $owner" >&2; exit 2 ;;
+  esac
+}
+
+acquire_lock() {
+  mkdir -p "$LOCK_DIR"
+  echo "codex-phase15-mtp-serving-bench $(date +%s)" > "$OWNER"
+}
+
+release_lock() {
+  if [[ -n "$SERVER_PID" ]]; then
+    kill "$SERVER_PID" >/dev/null 2>&1 || true
+    wait "$SERVER_PID" >/dev/null 2>&1 || true
+    SERVER_PID=""
+  fi
+  mkdir -p "$LOCK_DIR"
+  echo "FREE released-by-codex-phase15-mtp-serving-bench $(date +%s)" > "$OWNER"
+}
+
+wait_server() {
+  local health="$1"
+  for _ in $(seq 1 180); do
+    if curl -fsS "http://127.0.0.1:$PORT/health" > "$health" 2>"$health.err"; then
+      return 0
+    fi
+    if ! kill -0 "$SERVER_PID" 2>/dev/null; then
+      return 1
+    fi
+    sleep 1
+  done
+  return 1
+}
+
+stop_server() {
+  if [[ -n "$SERVER_PID" ]]; then
+    kill "$SERVER_PID" >/dev/null 2>&1 || true
+    wait "$SERVER_PID" >/dev/null 2>&1 || true
+    SERVER_PID=""
+  fi
+}
+
+run_gate() {
+  local name="$1"
+  if [[ "$SKIP_GATES" == "1" ]]; then
+    log "skipping $name inference gate"
+    return
+  fi
+  log "running $name inference gate"
+  ART="$ART/gate_$name" "$HOME/paged-inference-gates.sh" > "$ART/gate_$name.log" 2>&1
+  cat "$ART/gate_$name.log" | tee -a "$ART/run.log"
+}
+
+run_arm() {
+  local arm="$1"
+  shift
+  local arm_dir="$ART/$arm"
+  mkdir -p "$arm_dir"
+  log "starting $arm server"
+  cd "$BIN"
+  env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \
+    ./llama-server \
+    -m "$MODEL" -ngl 99 -fa on -c "$CTX" -b "$BATCH" -ub "$UBATCH" \
+    --parallel "$PARALLEL" --host 127.0.0.1 --port "$PORT" --no-webui "$@" \
+    > "$arm_dir/server.log" 2>&1 &
+  SERVER_PID=$!
+  if ! wait_server "$arm_dir/health.json"; then
+    tail -120 "$arm_dir/server.log" >&2 || true
+    exit 3
+  fi
+
+  for n in ${NPL//,/ }; do
+    log "running $arm n=$n"
+    python3 "$HOME/bench/h2h_cli3.py" \
+      --url "http://127.0.0.1:$PORT/v1/completions" \
+      --model m -n "$n" --ptok "$PTOK" --gen "$GEN" \
+      --nonce "${arm}_${n}_$(date +%s)" --no-cache \
+      > "$arm_dir/n${n}.json"
+    cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log"
+  done
+
+  grep -E "draft acceptance|statistics[[:space:]]+draft-mtp|speculative decoding context|bounded partial|backend sampling|common_speculative_impl_draft_mtp" \
+    "$arm_dir/server.log" > "$arm_dir/spec_lines.txt" || true
+  stop_server
+}
+
+preflight
+
+log "building llama-server and test-backend-ops"
+cmake --build "$SRC/build-cuda" --target llama-server test-backend-ops llama-completion -j 8 \
+  > "$ART/build.log" 2>&1
+
+if [[ ! -x "$HOME/paged-inference-gates.sh" ]]; then
+  echo "missing $HOME/paged-inference-gates.sh; copy paged-inference-gates.sh there first" >&2
+  exit 4
+fi
+
+run_gate pre
+acquire_lock
+trap release_lock EXIT
+run_arm baseline
+run_arm mtp --spec-type draft-mtp --spec-draft-n-max 3 --no-spec-draft-backend-sampling
+release_lock
+trap - EXIT
+run_gate post
+
+python3 - "$ART" <<'PY' | tee "$ART/summary.tsv"
+import json
+import sys
+from pathlib import Path
+
+art = Path(sys.argv[1])
+rows = []
+for arm in ("baseline", "mtp"):
+    for path in sorted((art / arm).glob("n*.json")):
+        data = json.loads(path.read_text())
+        rows.append((arm, data["n"], data["gen_total"], data["agg_tps"],
+                     data["decode_agg_tps"], data["decode_perseq_tps"],
+                     data["ttft_mean_ms"], data["wall_s"]))
+print("arm\tn\tgen_total\tagg_tps\tdecode_agg_tps\tdecode_perseq_tps\tttft_mean_ms\twall_s")
+for row in rows:
+    print("\t".join(str(x) for x in row))
+PY
+
+log "artifacts: $ART"
diff --git a/docs/superpowers/plans/2026-07-01-mtp-serving-throughput-phase15.md b/docs/superpowers/plans/2026-07-01-mtp-serving-throughput-phase15.md
new file mode 100644
index 000000000..0fa142b99
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mtp-serving-throughput-phase15.md
@@ -0,0 +1,191 @@
+# MTP Serving Throughput Phase 15 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:subagent-driven-development or superpowers:executing-plans to
+> implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for
+> tracking.
+
+**Goal:** measure whether Phase 14's safe MTP path improves real
+`llama-server` serving throughput on GB10.
+
+**Architecture:** use direct `llama-server` first, not LocalAI, so the benchmark
+isolates llama.cpp serving behavior. Compare two same-shape arms: baseline with
+no speculative decoding and MTP with `--spec-type draft-mtp`. Run canonical
+inference gates before and after the A/B.
+
+**Tech Stack:** llama.cpp `llama-server`, DGX GB10, `h2h_cli3.py`,
+`paged-inference-gates.sh`.
+
+---
+
+## Files
+
+- Create: `backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+
+## Task 1: Confirm Server MTP Wiring
+
+- [x] **Step 1: Dispatch independent codebase checks**
+
+  Two explorer agents inspected:
+
+  - llama.cpp server speculative/MTP wiring.
+  - existing serving benchmark harnesses and safety-gate discipline.
+
+- [x] **Step 2: Record startup-only control**
+
+  Finding:
+
+  - `llama-server` supports MTP when started with `--spec-type draft-mtp`.
+  - HTTP request JSON cannot enable speculation per request because the
+    speculative request fields in `tools/server/server-schema.cpp` are under
+    `#if 0`.
+
+- [x] **Step 3: Run a one-request server smoke**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase15_mtp_serving_smoke`
+
+  Evidence:
+
+  ```text
+  common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
+  common_context_can_seq_rm: the context supports bounded partial sequence removal
+  timings.draft_n = 33
+  timings.draft_n_accepted = 19
+  ```
+
+## Task 2: Add Repeatable DGX Runner
+
+- [x] **Step 1: Create runner**
+
+  Created:
+
+  - `backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh`
+
+  Responsibilities:
+
+  - check docker, `local-ai-worker`, compute PIDs, and GPU lock owner,
+  - run pre/post `paged-inference-gates.sh`,
+  - run baseline and MTP `llama-server` arms,
+  - drive `/v1/completions` with `/home/mudler/bench/h2h_cli3.py`,
+  - capture server logs, client JSON, MTP acceptance lines, and a summary TSV.
+
+- [x] **Step 2: Fix lock ordering**
+
+  First attempt stopped before benchmarking because the runner acquired the GPU
+  lock and then called `paged-inference-gates.sh`, whose own preflight correctly
+  rejects a non-free lock owner.
+
+  Fix: run the pre-gate before acquiring the benchmark lock and the post-gate
+  after releasing it.
+
+## Task 3: Run Serving A/B
+
+- [x] **Step 1: Run canonical pre-gate**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase15_mtp_serving/20260701_042005/gate_pre`
+
+  Result:
+
+  ```text
+  moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0
+  dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439
+  806/806 tests passed
+  Backend CUDA0: OK
+  paged inference gates OK
+  ```
+
+- [x] **Step 2: Run baseline and MTP arms**
+
+  Command shape:
+
+  ```bash
+  NPL="8 32 128" PTOK=128 GEN=128 CTX=131072 PARALLEL=128 \
+    ~/paged-mtp-serving-bench.sh
+  ```
+
+  Artifact:
+
+  - `/home/mudler/bench/phase15_mtp_serving/20260701_042005`
+
+  Summary:
+
+  ```text
+  arm       n    agg_tps  decode_agg_tps  decode_perseq_tps  ttft_mean_ms  wall_s
+  baseline  8    192.5    247.8           30.70              1181.1        5.318
+  mtp       8    92.9     109.8           14.26              1691.5        11.017
+  baseline  32   305.4    406.0           12.02              2762.2        13.412
+  mtp       32   95.8     111.7           3.61               4545.6        42.727
+  baseline  128  429.5    662.4           4.31               7747.2        38.144
+  mtp       128  100.3    138.5           0.97               20385.7       163.289
+  ```
+
+- [x] **Step 3: Confirm MTP actually drafted**
+
+  MTP server log showed:
+
+  ```text
+  common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00
+  statistics draft-mtp: #gen tokens = 17293, #acc tokens = 15493
+  ```
+
+  Acceptance was high enough that this is not a no-draft false negative.
+
+- [x] **Step 4: Run canonical post-gate**
+
+  Artifact:
+
+  - `/home/mudler/bench/phase15_mtp_serving/20260701_042005/gate_post`
+
+  Result:
+
+  ```text
+  moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0
+  dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439
+  806/806 tests passed
+  Backend CUDA0: OK
+  paged inference gates OK
+  ```
+
+## Task 4: Disposition
+
+- [x] **Step 1: Reject current MTP serving as a parity lever**
+
+  Current `llama-server` MTP is slower at every tested concurrency:
+
+  - `n=8`: decode aggregate `247.8 -> 109.8` tok/s.
+  - `n=32`: decode aggregate `406.0 -> 111.7` tok/s.
+  - `n=128`: decode aggregate `662.4 -> 138.5` tok/s.
+
+- [x] **Step 2: Record likely root cause**
+
+  Baseline logs show heavy graph reuse in the serving run (`graphs reused = 361`
+  in the `n=128` tail). MTP logs show `graphs reused = 1` and per-slot eval
+  around `900-1200 ms/token` at high concurrency. The working hypothesis is that
+  MTP verification/draft batch shape churn defeats the paged decode graph-reuse
+  wins, and the extra target verification work dominates despite high acceptance.
+
+- [x] **Step 3: Scope follow-up**
+
+  Do not continue by tuning `spec-draft-n-max` blindly. The next scoped phase,
+  if pursued, must first inspect MTP serving graph reuse and batch shapes:
+
+  - confirm whether speculative verification batches bypass the reusable
+    pure-decode graph key,
+  - measure with `nsys --cuda-graph-trace=node`,
+  - test whether MTP can share the default decode graph path or must remain a
+    non-parity feature on GB10.
+
+## Self-Review
+
+- No placeholders remain.
+- Phase 15 does not enable MTP by default.
+- Phase 15 keeps pre/post md5 and `test-backend-ops` gates.
+- Result is a rejected serving-throughput lever, not a parity win.
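The Task 3, Step 3 check (did MTP actually draft?) could be automated against the `spec_lines.txt` file the runner captures. The following is a minimal sketch, assuming the statistics line keeps the exact `#gen tokens = N, #acc tokens = M` shape quoted in the plan; `parse_draft_stats` is a hypothetical helper and is not part of `paged-mtp-serving-bench.sh`.

```python
import re
from typing import Optional, Tuple

# Matches the counter pair as it appears in the quoted server log line.
STATS_RE = re.compile(r"#gen tokens\s*=\s*(\d+),\s*#acc tokens\s*=\s*(\d+)")


def parse_draft_stats(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (generated, accepted) from the last matching line, or None."""
    matches = STATS_RE.findall(log_text)
    if not matches:
        return None
    gen, acc = matches[-1]
    return int(gen), int(acc)


if __name__ == "__main__":
    sample = "statistics draft-mtp: #gen tokens = 17293, #acc tokens = 15493"
    stats = parse_draft_stats(sample)
    if stats is None or stats[0] == 0:
        # No counters means the arm may not have drafted at all.
        print("no draft activity: result may be a no-draft false negative")
    else:
        gen, acc = stats
        print(f"drafted={gen} accepted={acc} rate={acc / gen:.3f}")
```

Taking the last match rather than the first is a design choice for cumulative counters: if the server prints the statistics more than once, the final line is the one that covers the whole run.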