docs(paged): reject MTP serving lever

Add the repeatable MTP serving A/B runner and record the Phase 15 results: current llama-server MTP regresses GB10 serving throughput despite passing the inference gates.

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 02:29:28 +00:00
parent 70394364a3
commit 4d171e62bb
7 changed files with 497 additions and 12 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -344,6 +344,12 @@ md5 checks and selected `test-backend-ops` filters, and refuses to start while
 docker, `local-ai-worker`, GPU compute processes, or a non-free GPU lock are
 present.

+For direct `llama-server` MTP serving A/B work, use
+`paged-mtp-serving-bench.sh`. It runs the same pre/post inference gates, compares
+baseline vs `--spec-type draft-mtp`, and captures the h2h client summaries plus
+MTP acceptance lines. Phase 15 rejected current MTP serving on GB10: it passed
+the safety gates but regressed throughput. Do not enable it by default.
+
 **The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md)).
 Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5
 (`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -885,6 +885,65 @@ Decision:
 - Do not count MTP as a GB10 speed-parity win until serving results show useful
  target-verification throughput under the canonical inference gates.

+## Phase 15 MTP Serving Throughput Gate
+
+Phase 15 measured the direct `llama-server` serving path after Phase 14 proved
+rollback safety. The test compared two same-shape arms:
+
+- baseline: no speculative decoding,
+- MTP: `--spec-type draft-mtp --spec-draft-n-max 3
+  --no-spec-draft-backend-sampling`.
+
+Artifact:
+
+- `/home/mudler/bench/phase15_mtp_serving/20260701_042005`
+
+Harness:
+
+- `backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh`
+- `NPL="8 32 128" PTOK=128 GEN=128 CTX=131072 PARALLEL=128`
+- client: `/home/mudler/bench/h2h_cli3.py` against `/v1/completions`
+
+Result:
+
+| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | TTFT mean ms | wall s |
+|---|---:|---:|---:|---:|---:|---:|
+| baseline | 8 | 192.5 | 247.8 | 30.70 | 1181.1 | 5.318 |
+| MTP | 8 | 92.9 | 109.8 | 14.26 | 1691.5 | 11.017 |
+| baseline | 32 | 305.4 | 406.0 | 12.02 | 2762.2 | 13.412 |
+| MTP | 32 | 95.8 | 111.7 | 3.61 | 4545.6 | 42.727 |
+| baseline | 128 | 429.5 | 662.4 | 4.31 | 7747.2 | 38.144 |
+| MTP | 128 | 100.3 | 138.5 | 0.97 | 20385.7 | 163.289 |
+
+MTP did actually run:
+
+- server initialized `draft-mtp` with bounded partial sequence removal,
+- response/server timings included draft counters,
+- server log tail included `#gen tokens = 17293`, `#acc tokens = 15493`.
+
+Normal inference gates before and after the A/B:
+
+- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- `MUL_MAT_ID`: `806/806`, `Backend CUDA0: OK`.
+
+Decision:
+
+- Reject current `llama-server` MTP as a GB10 serving parity lever.
+- Do not enable MTP by default in LocalAI or llama-server.
+- Do not tune `spec-draft-n-max` blindly. The regression is large enough that
+  the next MTP phase, if any, must start with graph/batch-shape profiling.
+
+Likely root cause:
+
+- Baseline serving preserved heavy graph reuse (`graphs reused = 361` in the
+  `n=128` tail).
+- MTP serving showed `graphs reused = 1` and high per-slot eval time at high
+  concurrency.
+- The working hypothesis is that MTP verification/draft batch shape churn
+  defeats the paged decode graph-reuse wins, so extra verification dominates
+  despite high draft acceptance.
+
 ## Phase 10 GDN C32 Slab Baseline and Source Check

 Phase 10 starts a separate GDN prefill path; it does not reopen the rejected
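
Review note: the size of the Phase 15 regression is easier to judge as a ratio than as raw throughput. The sketch below only recomputes slowdown factors from the decode aggregate numbers in the result table of this hunk. The dictionary and function names are illustrative and are not part of the harness.

```python
# Decode aggregate throughput (tok/s) copied from the Phase 15 result table:
# concurrency n -> (baseline, MTP).
PHASE15_DECODE_AGG = {
    8: (247.8, 109.8),
    32: (406.0, 111.7),
    128: (662.4, 138.5),
}


def slowdown(baseline: float, mtp: float) -> float:
    """Return how many times slower the MTP arm is than the baseline arm."""
    if mtp <= 0:
        raise ValueError("MTP throughput must be positive")
    return baseline / mtp


if __name__ == "__main__":
    for n, (base, mtp) in PHASE15_DECODE_AGG.items():
        print(f"n={n}: {base} -> {mtp} tok/s, {slowdown(base, mtp):.2f}x slower")
```

On these numbers the gap widens with concurrency (roughly 2.3x at n=8, 3.6x at n=32, 4.8x at n=128), which is consistent with the graph-reuse hypothesis in the decision notes but does not prove it.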
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -206,7 +206,7 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher
 | S2 double-buffer set_inputs | overlap host input build with GPU | DROPPED | `set_inputs` ~0.05 ms/step, nothing to recover |
 | whole-step graph / host loop | host loop as serving residual | CLOSED (~0-1%) | reuse 0% (757.6) == S1+S3 72% (763.3); hostproc only ~4-8% of step wall |
 | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested, commit b028c81e)** | inert (BE) but regresses everywhere; N=8 burst 28.16->6.05 tok/s/seq; serving decode is GPU-compute-bound, dummy-row compute > reuse recovered |
-| speculative decode (MTP) | draft + verify | SAFETY-GATED, default-off | Phase 14 passed recurrent rollback, partial rejection, normalized greedy-prefix, and canonical inference gates; still needs serving/API throughput proof before any parity claim |
+| speculative decode (MTP) | draft + verify | **REJECTED for current GB10 serving** | Phase 14 safety passed, but Phase 15 serving A/B regressed at every tested concurrency: n=128 decode agg 662.4 -> 138.5 tok/s (~4.8x slower); likely graph/batch-shape disruption (`graphs reused` 361 -> 1) |

 ### 4.5 SHIPPED WINS (all BE / KL-benign) - keep these, do not regress
 - **FP4-MMQ MoE/dense GEMM** (native Blackwell FP4-MMA at the FP4 weight-BW floor; reason 4.1 stays default-off).
@@ -225,11 +225,13 @@ Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point wher

 The `VLLM_PARITY_LEVER_MAP.md` "pursue list" (A1-A7/B1-B7/C1: graph-safe ragged grouped FP4-MMA MoE kernel, FP8 paged KV, MTP spec-decode, etc.) is the **earlier working brainstorm written before the final profiling**. `VLLM_PARITY_FINAL.md` is the authoritative supersession; treat those buckets as rejected / infeasible / different-hardware unless re-validated on new silicon.

-Phase 14 re-validated the MTP bucket as a separate default-off workstream:
-rollback and ordinary inference safety are now gated, but speed parity is not
-claimed. The serving follow-up must keep the same fixed gates before and after
-any benchmark: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
-`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+Phase 14 re-validated the MTP bucket as safe, then Phase 15 rejected it as a
+current GB10 serving-throughput lever. Do not enable it by default and do not
+keep tuning draft length blindly. The only plausible follow-up is a graph-reuse
+and speculative verification batch-shape profile with
+`nsys --cuda-graph-trace=node`. The fixed safety gates stayed green before and
+after the failed serving A/B: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense
+md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.

 ---

--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -281,7 +281,7 @@ operating point where paged is unambiguously faster than vLLM.
 | S2 double-buffer set_inputs | overlap host input build with GPU | **DROPPED** | `set_inputs` is **~0.05 ms/step** - nothing to recover (the rebuild was the cost) | DSS |
 | whole-step graph / host loop | the host scheduling loop as the serving residual | **CLOSED (~0-1%)** | baseline reuse 0% (agg 757.6) **statistically equal** to S1+S3 reuse 72% (agg 763.3); `hostproc` only ~4-8% of the per-step wall = **measured dead** | DSS |
 | padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested)** | inert (md5 bit-exact) but **regresses at every concurrency**; N=8 burst 28.16 -> 6.05 tok/s/seq (~4.6x slower); serving decode is **GPU-compute-bound**, dummy-row compute > reuse recovered | DSS |
-| speculative decode (MTP) | draft + verify; greedy is bit-exact | **SAFETY-GATED, default-off** | Phase 14 passed recurrent rollback on the actual MoE GGUF, partial rejection, normalized greedy-prefix, and canonical inference gates; still not a paged-specific gap and still needs serving/API throughput proof before any parity claim | LMAP |
+| speculative decode (MTP) | draft + verify; greedy is bit-exact | **REJECTED for current GB10 serving** | Phase 14 passed safety, but Phase 15 direct serving A/B regressed at every tested concurrency (n=128 decode agg 662.4 -> 138.5 tok/s, ~4.8x slower) despite high acceptance; likely breaks paged decode graph reuse (`graphs reused` 361 -> 1). Not a parity lever unless a future graph/batch-shape fix changes this result | LMAP |

 The serving regime was the one place the static-bench parity did not carry over
 (paged ~3.7 vs vLLM ~5.9 tok/s/seq, -39%, DSS). S1 made the decode step reusable
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -454,9 +454,9 @@ Phase 9 adds a narrow MTP smoke gate instead of production enablement:
  MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
  `5951a5b4d624ce891e22ab5fca9bc439`.

-MTP remains opt-in and, after Phase 14, safety-gated but not throughput-proven.
-It does not supersede the next GDN prefill scope until a serving phase proves
-target-verification cost.
+MTP remains opt-in and, after Phase 15, is rejected as a current GB10 serving
+throughput lever. It does not supersede the GDN/paged-serving conclusions unless
+a future graph/batch-shape fix changes the serving result.

 ### Phase 14 MTP rollback update

@@ -478,8 +478,35 @@ than exact transcript md5 because `llama-speculative-simple` emits accepted
 token groups and can produce a longer completion than `llama-completion -no-cnv`
 for the same `-n`. Across `n=8,16,24,32,48`, no first differing token was found.

-Next step: Phase 15 may benchmark serving/API throughput with MTP still
-default-off and only behind the canonical inference gates.
+Phase 15 then ran the serving/API benchmark and rejected current MTP serving.
+
+### Phase 15 MTP serving update
+
+Phase 15 ran the direct `llama-server` serving A/B that Phase 14 enabled. It
+rejects current MTP serving as a parity lever on GB10:
+
+| arm | n | decode agg t/s | decode per-seq t/s | TTFT mean ms |
+|---|---:|---:|---:|---:|
+| baseline | 8 | 247.8 | 30.70 | 1181.1 |
+| MTP | 8 | 109.8 | 14.26 | 1691.5 |
+| baseline | 32 | 406.0 | 12.02 | 2762.2 |
+| MTP | 32 | 111.7 | 3.61 | 4545.6 |
+| baseline | 128 | 662.4 | 4.31 | 7747.2 |
+| MTP | 128 | 138.5 | 0.97 | 20385.7 |
+
+Artifact: `/home/mudler/bench/phase15_mtp_serving/20260701_042005`.
+
+MTP did draft and accept tokens (`#gen tokens = 17293`, `#acc tokens = 15493`),
+so this is not a no-draft false negative. The likely culprit is graph/batch
+shape disruption: baseline logs show heavy graph reuse (`graphs reused = 361`
+in the high-concurrency tail), while MTP logs show `graphs reused = 1` and much
+higher per-slot eval time. Pre/post canonical inference gates stayed green:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Do not keep tuning MTP draft length blindly. A follow-up must first profile
+speculative verification batch shapes and CUDA graph reuse with
+`nsys --cuda-graph-trace=node`.

 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
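
Review note: the "not a no-draft false negative" argument in the Phase 15 update rests on the two draft counters quoted from the server log. A minimal sketch of that check follows. It assumes `#gen tokens` counts drafted tokens and `#acc tokens` counts accepted ones, and the sample line layout is hypothetical; only the two counter strings come from the recorded results.

```python
import re

# Counter names as quoted in the Phase 15 notes; the text around them in a
# real llama-server log is an assumption.
GEN_RE = re.compile(r"#gen tokens\s*=\s*(\d+)")
ACC_RE = re.compile(r"#acc tokens\s*=\s*(\d+)")


def acceptance_rate(log_text: str) -> float:
    """Return accepted / generated draft tokens from the last counters found."""
    gen = GEN_RE.findall(log_text)
    acc = ACC_RE.findall(log_text)
    if not gen or not acc:
        raise ValueError("no draft counters found; MTP may not have run")
    generated, accepted = int(gen[-1]), int(acc[-1])
    if generated == 0:
        raise ValueError("zero generated draft tokens")
    return accepted / generated


if __name__ == "__main__":
    # Hypothetical line layout built from the two recorded counters.
    sample = "#gen tokens = 17293, #acc tokens = 15493"
    print(f"draft acceptance: {acceptance_rate(sample):.1%}")
```

Under that reading of the counters, acceptance is close to 90%, so the throughput loss cannot be blamed on rejected drafts.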

--- a/backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh
+++ b/backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh
@@ -0,0 +1,200 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+usage() {
+  cat <<'EOF'
+Usage: paged-mtp-serving-bench.sh
+
+Runs a direct llama-server serving A/B on DGX:
+  baseline: no speculative decoding
+  mtp:      --spec-type draft-mtp
+
+Environment overrides:
+  SRC       llama.cpp source dir (default: ~/llama-phase6-source)
+  BIN       binary dir (default: $SRC/build-cuda/bin)
+  MODEL     MoE GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf)
+  ART       artifact dir (default: ~/bench/phase15_mtp_serving/<timestamp>)
+  PORT      server port (default: 8097)
+  NPL       space-separated list of concurrency values (default: "8 32 128")
+  PTOK      prompt filler words for h2h_cli3.py (default: 128)
+  GEN       max generated tokens (default: 128)
+  CTX       server context (default: 131072)
+  PARALLEL  server parallel slots (default: 128)
+  BATCH     server logical batch size (default: 2048)
+  UBATCH    server physical batch size (default: 512)
+  SKIP_GATES=1 to skip pre/post paged inference gates
+EOF
+}
+
+if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+  usage
+  exit 0
+fi
+
+SRC=${SRC:-"$HOME/llama-phase6-source"}
+BIN=${BIN:-"$SRC/build-cuda/bin"}
+MODEL=${MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"}
+ART=${ART:-"$HOME/bench/phase15_mtp_serving/$(date +%Y%m%d_%H%M%S)"}
+PORT=${PORT:-8097}
+NPL=${NPL:-"8 32 128"}
+PTOK=${PTOK:-128}
+GEN=${GEN:-128}
+CTX=${CTX:-131072}
+PARALLEL=${PARALLEL:-128}
+BATCH=${BATCH:-2048}
+UBATCH=${UBATCH:-512}
+SKIP_GATES=${SKIP_GATES:-0}
+
+LOCK_DIR="$HOME/gpu_bench_lock"
+OWNER="$LOCK_DIR/owner"
+SERVER_PID=""
+
+log() {
+  printf '[%s] %s\n' "$(date -Is)" "$*" | tee -a "$ART/run.log"
+}
+
+preflight() {
+  mkdir -p "$ART"
+  local docker_count local_ai compute owner
+  docker_count=$(docker ps -q | wc -l)
+  local_ai=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true)
+  compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed '/^$/d' | wc -l)
+  owner="FREE-no-lock-file"
+  if [[ -f "$OWNER" ]]; then
+    owner=$(cat "$OWNER")
+  fi
+  {
+    echo "docker=$docker_count"
+    echo "local_ai_worker=$local_ai"
+    echo "compute=$compute"
+    echo "$owner"
+  } | tee "$ART/preflight.txt"
+  [[ "$docker_count" == "0" ]]
+  [[ "$local_ai" == "0" ]]
+  [[ "$compute" == "0" ]]
+  case "$owner" in
+    FREE*|FREE-no-lock-file) ;;
+    *) echo "GPU lock is busy: $owner" >&2; exit 2 ;;
+  esac
+}
+
+acquire_lock() {
+  mkdir -p "$LOCK_DIR"
+  echo "codex-phase15-mtp-serving-bench $(date +%s)" > "$OWNER"
+}
+
+release_lock() {
+  if [[ -n "$SERVER_PID" ]]; then
+    kill "$SERVER_PID" >/dev/null 2>&1 || true
+    wait "$SERVER_PID" >/dev/null 2>&1 || true
+    SERVER_PID=""
+  fi
+  mkdir -p "$LOCK_DIR"
+  echo "FREE released-by-codex-phase15-mtp-serving-bench $(date +%s)" > "$OWNER"
+}
+
+wait_server() {
+  local health="$1"
+  for _ in $(seq 1 180); do
+    if curl -fsS "http://127.0.0.1:$PORT/health" > "$health" 2>"$health.err"; then
+      return 0
+    fi
+    if ! kill -0 "$SERVER_PID" 2>/dev/null; then
+      return 1
+    fi
+    sleep 1
+  done
+  return 1
+}
+
+stop_server() {
+  if [[ -n "$SERVER_PID" ]]; then
+    kill "$SERVER_PID" >/dev/null 2>&1 || true
+    wait "$SERVER_PID" >/dev/null 2>&1 || true
+    SERVER_PID=""
+  fi
+}
+
+run_gate() {
+  local name="$1"
+  if [[ "$SKIP_GATES" == "1" ]]; then
+    log "skipping $name inference gate"
+    return
+  fi
+  log "running $name inference gate"
+  ART="$ART/gate_$name" "$HOME/paged-inference-gates.sh" > "$ART/gate_$name.log" 2>&1
+  cat "$ART/gate_$name.log" | tee -a "$ART/run.log"
+}
+
+run_arm() {
+  local arm="$1"
+  shift
+  local arm_dir="$ART/$arm"
+  mkdir -p "$arm_dir"
+  log "starting $arm server"
+  cd "$BIN"
+  env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \
+    ./llama-server \
+      -m "$MODEL" -ngl 99 -fa on -c "$CTX" -b "$BATCH" -ub "$UBATCH" \
+      --parallel "$PARALLEL" --host 127.0.0.1 --port "$PORT" --no-webui "$@" \
+      > "$arm_dir/server.log" 2>&1 &
+  SERVER_PID=$!
+  if ! wait_server "$arm_dir/health.json"; then
+    tail -120 "$arm_dir/server.log" >&2 || true
+    exit 3
+  fi
+
+  for n in $NPL; do
+    log "running $arm n=$n"
+    python3 "$HOME/bench/h2h_cli3.py" \
+      --url "http://127.0.0.1:$PORT/v1/completions" \
+      --model m -n "$n" --ptok "$PTOK" --gen "$GEN" \
+      --nonce "${arm}_${n}_$(date +%s)" --no-cache \
+      > "$arm_dir/n${n}.json"
+    cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log"
+  done
+
+  grep -E "draft acceptance|statistics[[:space:]]+draft-mtp|speculative decoding context|bounded partial|backend sampling|common_speculative_impl_draft_mtp" \
+    "$arm_dir/server.log" > "$arm_dir/spec_lines.txt" || true
+  stop_server
+}
+
+preflight
+
+log "building llama-server, test-backend-ops, and llama-completion"
+cmake --build "$SRC/build-cuda" --target llama-server test-backend-ops llama-completion -j 8 \
+  > "$ART/build.log" 2>&1
+
+if [[ ! -x "$HOME/paged-inference-gates.sh" ]]; then
+  echo "missing $HOME/paged-inference-gates.sh; copy paged-inference-gates.sh there first" >&2
+  exit 4
+fi
+
+run_gate pre
+acquire_lock
+trap release_lock EXIT
+run_arm baseline
+run_arm mtp --spec-type draft-mtp --spec-draft-n-max 3 --no-spec-draft-backend-sampling
+release_lock
+trap - EXIT
+run_gate post
+
+python3 - "$ART" <<'PY' | tee "$ART/summary.tsv"
+import json
+import sys
+from pathlib import Path
+
+art = Path(sys.argv[1])
+rows = []
+for arm in ("baseline", "mtp"):
+    for path in sorted((art / arm).glob("n*.json"), key=lambda p: int(p.stem[1:])):
+        data = json.loads(path.read_text())
+        rows.append((arm, data["n"], data["gen_total"], data["agg_tps"],
+                     data["decode_agg_tps"], data["decode_perseq_tps"],
+                     data["ttft_mean_ms"], data["wall_s"]))
+print("arm\tn\tgen_total\tagg_tps\tdecode_agg_tps\tdecode_perseq_tps\tttft_mean_ms\twall_s")
+for row in rows:
+    print("\t".join(str(x) for x in row))
+PY
+
+log "artifacts: $ART"
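
Review note: the root-cause hypothesis in the docs relies on `graphs reused = N` lines, but the `grep` in `run_arm` does not collect them into `spec_lines.txt`. A small helper along these lines could pull the last value from each arm's `server.log`. This is a sketch, not part of the commit: the function names are made up, and the pattern assumes the log prints the counter exactly as quoted in the Phase 15 notes.

```python
import re
import sys
from pathlib import Path

# Pattern assumed from the `graphs reused = 361` string quoted in the notes.
GRAPHS_REUSED = re.compile(r"graphs reused\s*=\s*(\d+)")


def last_graphs_reused(log_text: str):
    """Return the last `graphs reused = N` value in a log, or None if absent."""
    matches = GRAPHS_REUSED.findall(log_text)
    return int(matches[-1]) if matches else None


def reuse_by_arm(art: Path, arms=("baseline", "mtp")) -> dict:
    """Map each arm name to its last graph-reuse count under the artifact dir."""
    result = {}
    for arm in arms:
        log = art / arm / "server.log"
        if log.is_file():
            result[arm] = last_graphs_reused(log.read_text(errors="replace"))
        else:
            result[arm] = None
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 this_file.py "$ART"
    for arm, reused in reuse_by_arm(Path(sys.argv[1])).items():
        print(f"{arm}\tgraphs_reused={reused}")
```

Reporting the last value per arm mirrors how the notes cite the high-concurrency tail; a per-`n` breakdown would need the server restarted or the log split between runs.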