docs(paged): scope phase7 serving candidates

Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-06-30 23:12:09 +00:00
parent b647460dee
commit 34c4b5ce8d
5 changed files with 190 additions and 2 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -560,3 +560,5 @@ Result:
 - No current env-only lever clears the serving performance gate. Scope the next
  source candidate against either structural MoE decode fusion or async serving
  input/sampler uploads, with a workload that proves the target bucket matters.
+- Phase 7 must keep the canonical MoE and dense md5 gates as the first
+  inference-safety check before any performance result is accepted.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1,5 +1,11 @@
 # PARITY_HANDOFF: how to pick up the GB10 vLLM-parity work

+> 2026-06-30 update: this handoff is now historical procedure, not the active
+> verdict. The GB10 investigation was reopened in `GB10_PARITY_REOPEN_SPEC.md`
+> and `GB10_PARITY_PHASE0_RESULTS.md`, with Phase 6 serving-nsys evidence and
+> the active follow-up plans under `docs/superpowers/plans/`. Use those files for
+> the current state before relying on the older "closed" conclusion below.
+
 Audience: an agent with **zero prior context** who has been told to "continue the GB10 vLLM-parity investigation" on the `llama-cpp-localai-paged` backend.

 This file is the **operational how-to**. It is the companion to `VLLM_PARITY_FINAL.md`, which is the **why / authoritative record** ("never re-litigate"). If the two ever disagree on a *fact*, `VLLM_PARITY_FINAL.md` and the bench artifacts it cites win; this file wins on *procedure* (how to ssh, lock, build, bench, profile).
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -1,5 +1,10 @@
 # vLLM Parity - Final State (Qwen3.6 NVFP4 on GB10)

+> 2026-06-30 update: this document records the earlier final-state verdict. The
+> investigation has since been reopened; see `GB10_PARITY_REOPEN_SPEC.md`,
+> `GB10_PARITY_PHASE0_RESULTS.md`, and the active `docs/superpowers/plans/`
+> Phase 6/Phase 7 files for the current measured state and follow-up scope.
+
 > **Status: CLOSED.** This is the standing record of the exhaustive GB10 (DGX
 > Spark, sm_121) parity investigation for `llama-cpp-localai-paged` against vLLM
 > on the Qwen3.6 hybrid gated-DeltaNet NVFP4 models. It exists so the
--- a/docs/superpowers/plans/2026-06-30-serving-nsys-phase6.md
+++ b/docs/superpowers/plans/2026-06-30-serving-nsys-phase6.md
@@ -1,6 +1,6 @@
 # Phase 6: Serving nsys Gap Classifier

-**Status:** In progress.
+**Status:** Completed. Phase 6 kept no source changes.

 **Scope:** Measurement-first. Do not edit llama.cpp source in this phase unless
 the serving profiles identify a small, bit-exact, fork-first patch candidate.
@@ -22,6 +22,10 @@ measured evidence.
 - Patch promotion threshold: no semantic gate regression, no generated patch
  hand-editing, and at least one measured serving bucket improvement that explains
  a material share of the vLLM gap.
+- Inference-safety rule: a candidate that changes CUDA routing, sampler inputs,
+  graph construction, or MoE kernels is not kept unless the md5 gates are rerun
+  from the clean candidate binary and still match the canonical values above.
+  Performance-only evidence is insufficient.

 ## Checklist

@@ -84,7 +88,7 @@ measured evidence.

 ## Current Decision

-W4A16 prefill is no longer the highest-leverage path. The accepted Phase 1-4
+W4A16 prefill was not the highest-leverage path for Phase 6. The accepted Phase 1-4
 changes improved forced W4A16 from roughly `1314/1339` to `1466/1495` S_PP, but
 default FP4-MMQ remains around `2303/2423`. The next evidence gate is serving
 nsys, because the committed lever map says the residual gap is in real
@@ -203,3 +207,15 @@ Shape: `n=128`, `ptok=128`, `gen=64`.
 Result: rejected as an env-only lever. Existing grouped-MMQ tile knobs do not
 materially close the serving gap, so a selector-only source patch is not
 justified.
+
+## Completion
+
+Phase 6 completed as a classifier, not as a source patch phase:
+
+- Accepted source patches before Phase 6 remained intact through fork head
+  `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`.
+- The sampler short-circuit candidate passed inference gates but failed the
+  serving performance gate, so it was reverted and not mirrored.
+- GDN and grouped-MMQ env grids did not clear the material-improvement threshold.
+- No LocalAI patch was generated for Phase 6. The next phase must start from a
+  clean fork and keep the same md5/op gates before any source candidate is kept.
--- a/docs/superpowers/plans/2026-06-30-serving-source-phase7.md
+++ b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md
@@ -0,0 +1,159 @@
+# Phase 7: Serving Source Candidate Scope
+
+**Status:** Scoped. Code implementation not started.
+
+**Goal:** Select one maintainable source candidate for the remaining GB10 MoE
+serving gap, then implement only if it can be gated for inference correctness and
+measured against a bucket that Phase 6 proved relevant.
+
+## Entry State
+
+- llama.cpp fork: `/home/mudler/_git/llama.cpp`
+- Required branch: `localai-paged`
+- Required clean head: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
+- LocalAI patch mirror count before Phase 7: `41`, through patch `0050`
+- DGX mirror used by Phase 6: `/home/mudler/llama-phase6-source`
+
+## Required Safety Gates
+
+- Before DGX work:
+  - `docker ps -q | wc -l` must be `0`.
+  - `nvidia-smi --query-compute-apps=pid --format=csv,noheader` must be empty.
+  - `~/gpu_bench_lock/owner` must be absent or start with `FREE`.
+  - No `local-ai-worker` container may be running.
+- Before keeping any source patch:
+  - MoE greedy md5 must be `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense greedy md5 must be `5951a5b4d624ce891e22ab5fca9bc439`.
+  - If W4A16 is touched, forced `bm32` and `base` md5 must both be
+    `07db32c2bcb78d17a43ed18bc22705cd`.
+  - If `MUL_MAT_ID` routing or CUDA MoE kernels are touched, run
+    `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806`.
+- Patch handling:
+  - Source changes are fork-first in `/home/mudler/_git/llama.cpp`.
+  - Keep each patch incremental and additive, with helper functions preferred
+    over invasive rewrites.
+  - Regenerate LocalAI patches with `git format-patch`; do not hand-edit
+    generated patch files.
+
+## Candidate Tracks
+
+### Track A: Structural MoE Decode Kernel
+
+Phase 6 evidence: grouped NVFP4 `mul_mat_q` accounts for roughly 30% of llama.cpp
+GPU kernel time under serving, while vLLM's Marlin-MoE bucket is materially
+smaller in the same workload class.
+
+The candidate must identify a bounded change in the current `MUL_MAT_ID` or
+grouped-MMQ path that reduces actual serving bucket time. Selector-only tile
+retuning is rejected unless new evidence differs from the Phase 6 MMQ grid.
+
+Selected first candidate:
+
+- Add a batched CUDA path that fuses MoE SWIGLU with the NVFP4 activation
+  quantization feeding the **down** `MUL_MAT_ID`.
+- Current graph shape:
+  `ffn_moe_gate_up` `MUL_MAT_ID` -> gate/up views -> `ggml_swiglu_split` ->
+  `ffn_moe_down` `MUL_MAT_ID`.
+- Target: remove or reduce the separate f32 SWIGLU intermediate write/read and
+  `quantize_mmq_nvfp4` pass for the down projection while preserving the existing
+  grouped-MMQ kernel and accumulation order.
+- Keep scope to CUDA, Blackwell native FP4, `GGML_TYPE_NVFP4`, merged gate/up
+  MoE, down projection only, no bias/clamp/OAI/GEGLU.
+
+Important finding:
+
+- Existing CUDA `MUL_MAT_ID + GLU` fusion is vector-only. The fusion predicates
+  reject `MUL_MAT_ID` when `dst->ne[2] != 1`, so it does not cover the Phase 6
+  multi-token serving shape.
+- Existing `MUL_MAT_ID_FUSION` tests cover add/mul after `MUL_MAT_ID`, not the
+  gate_up/SWIGLU/down chain. Do not treat them as sufficient for this candidate.
+
+Initial files to inspect:
+
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu`
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cu`
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu`
+- vLLM Marlin-MoE implementation files in the local vLLM checkout/package.
+
+### Track B: Serving Input And Sampler Synchronization
+
+Phase 6 evidence: `cudaStreamSynchronize` dominates CUDA API time, and many
+syncs follow small `cudaMemcpyAsync` calls. The greedy sampler short-circuit
+passed correctness gates but did not improve serving, so this track needs a
+workload where sampler/input upload cost is proven relevant before patching.
+
+Initial files to inspect:
+
+- `/home/mudler/_git/llama.cpp/src/llama-sampling.cpp`
+- `/home/mudler/_git/llama.cpp/src/llama-context.cpp`
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-backend.cpp`
+- CUDA backend tensor-set paths under `/home/mudler/_git/llama.cpp/ggml/src/`.
+
+Selected secondary candidate:
+
+- Cache backend logit-bias tensor uploads in
+  `/home/mudler/_git/llama.cpp/src/llama-sampler.cpp`
+  `llama_sampler_logit_bias_backend_set_input()`.
+- Today the sampler rebuilds and uploads `logit_bias` and `logit_idxs` every
+  decode step. Those uploads hit the CUDA tensor-set path with immediate
+  `cudaStreamSynchronize`.
+- This is narrow and maintainable, but it is not the default greedy parity
+  lever. Only promote it if a non-greedy backend-sampling workload with non-empty
+  `logit_bias` proves the sync bucket is material.
+
+Required workload:
+
+- Include a non-greedy serving shape if the patch targets sampler randomness or
+  probability upload behavior.
+- Preserve the canonical greedy md5 gates even if the optimization targets
+  non-greedy serving.
+
+## Decision Gate
+
+Only one track may enter implementation at a time. Promote a candidate from scope
+to implementation when all are true:
+
+- It has an exact file/function target.
+- It is additive enough to minimize upstream conflicts.
+- It has a direct measurement bucket from Phase 6 or a fresh bounded profile.
+- It has a clear rollback path.
+- It passes the md5/op gates before any performance result is accepted.
+
+## Checklist
+
+- [x] Close remaining Phase 6 explorer agents or capture their final findings.
+- [x] Reconfirm DGX idle state before any new benchmark.
+  - Docker containers: `0`.
+  - `local-ai-worker`: `0`.
+  - Compute PIDs: `0`.
+  - Lock: `FREE released-by-codex-phase6-mmq-grid 1782860601`.
+- [x] Pick Track A or Track B from concrete code evidence.
+  - Primary: Track A, batched MoE SWIGLU -> NVFP4 down-input quantization.
+  - Secondary: Track B, backend logit-bias upload cache for non-greedy workloads.
+- [ ] Run baseline gates from the clean candidate build.
+- [ ] Implement one fork-first incremental patch.
+- [ ] Run md5/op gates before serving A/B.
+- [ ] Keep only if the serving bucket and h2h result improve materially.
+- [ ] Regenerate LocalAI patch stack and update docs if kept.
+
+## Required Tests Before Track A Source Patch
+
+- Add or extend a whole-graph op test for the batched MoE gate_up/SWIGLU/down
+  chain. Shapes must include `type_a=NVFP4`, `n_mats=128`, `n_used=8`,
+  `m=768`, `k=2048`, and `n in {16, 33, 64, 128, 130, 200}`.
+- Run `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806`
+  until a more specific op name is available.
+- Run canonical MoE and dense greedy md5 gates before serving A/B:
+  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
+- Run a mixed prompt/decode md5 gate (`ptok=512`, `gen=32`) because graph reuse
+  can hide bugs that a decode-only gate misses.
+
+## Required Tests Before Track B Source Patch
+
+- Establish fixed-seed baseline output md5 and token-id parity for a
+  backend-sampling request with non-empty `logit_bias`.
+- Include the canonical greedy MoE and dense md5 gates even though the workload
+  target is non-greedy.
+- Run existing server completion tests covering backend sampling probabilities
+  and logit-bias behavior.
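
Review note on the Phase 7 safety gates: the plan lists the canonical md5 values and the lock-file rule but not how they are checked. The following is a minimal sketch of one way to check them, not the project's actual tooling. The names `md5_hex`, `gate_passes`, and `lock_is_free` are hypothetical, and the sketch assumes the md5 gate hashes the raw bytes of the generated output, which the plan does not state.

```python
import hashlib

# Canonical greedy md5 values copied from the Phase 7 plan.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_hex(data: bytes) -> str:
    """Return the lowercase hex md5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def gate_passes(name: str, output: bytes, expected=CANONICAL_MD5) -> bool:
    """True only if the output's md5 matches the canonical value for `name`.

    Assumption: the gate hashes the raw generated bytes as-is.
    """
    return md5_hex(output) == expected[name]


def lock_is_free(owner_text):
    """Apply the plan's lock rule: the owner file is absent or starts with FREE.

    `owner_text` is the owner file's content, or None when the file is absent.
    """
    return owner_text is None or owner_text.startswith("FREE")
```

A candidate would be kept only if `gate_passes` holds for both `"moe"` and `"dense"` on output produced by the clean candidate binary, before any serving A/B number is considered.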