From 34c4b5ce8d1481a26eb1dab4f9ce3967dae4bb4b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Tue, 30 Jun 2026 23:12:09 +0000
Subject: [PATCH] docs(paged): scope phase7 serving candidates

Mark the Phase 6 serving classifier complete, preserve the old parity final as
historical, and scope Phase 7 source candidates with explicit md5 and op gates.

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        |   2 +
 .../docs/PARITY_HANDOFF.md                    |   6 +
 .../docs/VLLM_PARITY_FINAL.md                 |   5 +
 .../plans/2026-06-30-serving-nsys-phase6.md   |  20 ++-
 .../plans/2026-06-30-serving-source-phase7.md | 159 ++++++++++++++++++
 5 files changed, 190 insertions(+), 2 deletions(-)
 create mode 100644 docs/superpowers/plans/2026-06-30-serving-source-phase7.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 9347d912f..a8084b7f1 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -560,3 +560,5 @@ Result:
 - No current env-only lever clears the serving performance gate. Scope the next
   source candidate against either structural MoE decode fusion or async serving
   input/sampler uploads, with a workload that proves the target bucket matters.
+- Phase 7 must keep the canonical MoE and dense md5 gates as the first
+  inference-safety check before any performance result is accepted.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index c47d36865..92c65028b 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1,5 +1,11 @@
 # PARITY_HANDOFF: how to pick up the GB10 vLLM-parity work
 
+> 2026-06-30 update: this handoff is now historical procedure, not the active
+> verdict. The GB10 investigation was reopened in `GB10_PARITY_REOPEN_SPEC.md`
+> and `GB10_PARITY_PHASE0_RESULTS.md`, with Phase 6 serving-nsys evidence and
+> the active follow-up plans under `docs/superpowers/plans/`. Use those files for
+> the current state before relying on the older "closed" conclusion below.
+
 Audience: an agent with **zero prior context** who has been told to "continue the GB10 vLLM-parity investigation" on the `llama-cpp-localai-paged` backend.
 
 This file is the **operational how-to**. It is the companion to `VLLM_PARITY_FINAL.md`, which is the **why / authoritative record** ("never re-litigate"). If the two ever disagree on a *fact*, `VLLM_PARITY_FINAL.md` and the bench artifacts it cites win; this file wins on *procedure* (how to ssh, lock, build, bench, profile).
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
index 28ee15268..b299e8183 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md
@@ -1,5 +1,10 @@
 # vLLM Parity - Final State (Qwen3.6 NVFP4 on GB10)
 
+> 2026-06-30 update: this document records the earlier final-state verdict. The
+> investigation has since been reopened; see `GB10_PARITY_REOPEN_SPEC.md`,
+> `GB10_PARITY_PHASE0_RESULTS.md`, and the active `docs/superpowers/plans/`
+> Phase 6/Phase 7 files for the current measured state and follow-up scope.
+
 > **Status: CLOSED.** This is the standing record of the exhaustive GB10 (DGX
 > Spark, sm_121) parity investigation for `llama-cpp-localai-paged` against vLLM
 > on the Qwen3.6 hybrid gated-DeltaNet NVFP4 models. It exists so the
diff --git a/docs/superpowers/plans/2026-06-30-serving-nsys-phase6.md b/docs/superpowers/plans/2026-06-30-serving-nsys-phase6.md
index 3163e1707..d63c1092f 100644
--- a/docs/superpowers/plans/2026-06-30-serving-nsys-phase6.md
+++ b/docs/superpowers/plans/2026-06-30-serving-nsys-phase6.md
@@ -1,6 +1,6 @@
 # Phase 6: Serving nsys Gap Classifier
 
-**Status:** In progress.
+**Status:** Completed. Phase 6 kept no source changes.
 
 **Scope:** Measurement-first. Do not edit llama.cpp source in this phase unless
 the serving profiles identify a small, bit-exact, fork-first patch candidate.
@@ -22,6 +22,10 @@ measured evidence.
 - Patch promotion threshold: no semantic gate regression, no generated patch
   hand-editing, and at least one measured serving bucket improvement that
   explains a material share of the vLLM gap.
+- Inference-safety rule: a candidate that changes CUDA routing, sampler inputs,
+  graph construction, or MoE kernels is not kept unless the md5 gates are rerun
+  from the clean candidate binary and still match the canonical values above.
+  Performance-only evidence is insufficient.
 
 ## Checklist
 
@@ -84,7 +88,7 @@ measured evidence.
 
 ## Current Decision
 
-W4A16 prefill is no longer the highest-leverage path. The accepted Phase 1-4
+W4A16 prefill was not the highest-leverage path for Phase 6. The accepted Phase 1-4
 changes improved forced W4A16 from roughly `1314/1339` to `1466/1495` S_PP, but
 default FP4-MMQ remains around `2303/2423`. The next evidence gate is serving
 nsys, because the committed lever map says the residual gap is in real
@@ -203,3 +207,15 @@ Shape: `n=128`, `ptok=128`, `gen=64`.
 Result: rejected as an env-only lever. Existing grouped-MMQ tile knobs do not
 materially close the serving gap, so a selector-only source patch is not
 justified.
+
+## Completion
+
+Phase 6 completed as a classifier, not as a source patch phase:
+
+- Accepted source patches before Phase 6 remained intact through fork head
+  `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`.
+- The sampler short-circuit candidate passed inference gates but failed the
+  serving performance gate, so it was reverted and not mirrored.
+- GDN and grouped-MMQ env grids did not clear the material-improvement threshold.
+- No LocalAI patch was generated for Phase 6. The next phase must start from a
+  clean fork and keep the same md5/op gates before any source candidate is kept.
diff --git a/docs/superpowers/plans/2026-06-30-serving-source-phase7.md b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md
new file mode 100644
index 000000000..26dfeea54
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md
@@ -0,0 +1,159 @@
+# Phase 7: Serving Source Candidate Scope
+
+**Status:** Scoped. Code implementation not started.
+
+**Goal:** Select one maintainable source candidate for the remaining GB10 MoE
+serving gap, then implement only if it can be gated for inference correctness and
+measured against a bucket that Phase 6 proved relevant.
+
+## Entry State
+
+- llama.cpp fork: `/home/mudler/_git/llama.cpp`
+- Required branch: `localai-paged`
+- Required clean head: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
+- LocalAI patch mirror count before Phase 7: `41`, through patch `0050`
+- DGX mirror used by Phase 6: `/home/mudler/llama-phase6-source`
+
+## Required Safety Gates
+
+- Before DGX work:
+  - `docker ps -q | wc -l` must be `0`.
+  - `nvidia-smi --query-compute-apps=pid --format=csv,noheader` must be empty.
+  - `~/gpu_bench_lock/owner` must be absent or start with `FREE`.
+  - No `local-ai-worker` container may be running.
+- Before keeping any source patch:
+  - MoE greedy md5 must be `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense greedy md5 must be `5951a5b4d624ce891e22ab5fca9bc439`.
+  - If W4A16 is touched, forced `bm32` and `base` md5 must both be
+    `07db32c2bcb78d17a43ed18bc22705cd`.
+  - If `MUL_MAT_ID` routing or CUDA MoE kernels are touched, run
+    `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806`.
+- Patch handling:
+  - Source changes are fork-first in `/home/mudler/_git/llama.cpp`.
+  - Keep each patch incremental and additive, with helper functions preferred
+    over invasive rewrites.
+  - Regenerate LocalAI patches with `git format-patch`; do not hand-edit
+    generated patch files.
+
+## Candidate Tracks
+
+### Track A: Structural MoE Decode Kernel
+
+Phase 6 evidence: grouped NVFP4 `mul_mat_q` accounts for roughly 30% of llama.cpp
+GPU kernel time under serving, while vLLM's Marlin-MoE bucket is materially
+smaller in the same workload class.
+
+The candidate must identify a bounded change in the current `MUL_MAT_ID` or
+grouped-MMQ path that reduces actual serving bucket time. Selector-only tile
+retuning is rejected unless new evidence differs from the Phase 6 MMQ grid.
+
+Selected first candidate:
+
+- Add a batched CUDA path that fuses MoE SWIGLU with the NVFP4 activation
+  quantization feeding the **down** `MUL_MAT_ID`.
+- Current graph shape:
+  `ffn_moe_gate_up` `MUL_MAT_ID` -> gate/up views -> `ggml_swiglu_split` ->
+  `ffn_moe_down` `MUL_MAT_ID`.
+- Target: remove or reduce the separate f32 SWIGLU intermediate write/read and
+  `quantize_mmq_nvfp4` pass for the down projection while preserving the existing
+  grouped-MMQ kernel and accumulation order.
+- Keep scope to CUDA, Blackwell native FP4, `GGML_TYPE_NVFP4`, merged gate/up
+  MoE, down projection only, no bias/clamp/OAI/GEGLU.
+
+Important finding:
+
+- Existing CUDA `MUL_MAT_ID + GLU` fusion is vector-only. The fusion predicates
+  reject `MUL_MAT_ID` when `dst->ne[2] != 1`, so it does not cover the Phase 6
+  multi-token serving shape.
+- Existing `MUL_MAT_ID_FUSION` tests cover add/mul after `MUL_MAT_ID`, not the
+  gate_up/SWIGLU/down chain. Do not treat them as sufficient for this candidate.
+
+Initial files to inspect:
+
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu`
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cu`
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu`
+- vLLM Marlin-MoE implementation files in the local vLLM checkout/package.
+
+### Track B: Serving Input And Sampler Synchronization
+
+Phase 6 evidence: `cudaStreamSynchronize` dominates CUDA API time, and many
+syncs follow small `cudaMemcpyAsync` calls. The greedy sampler short-circuit
+passed correctness gates but did not improve serving, so this track needs a
+workload where sampler/input upload cost is proven relevant before patching.
+
+Initial files to inspect:
+
+- `/home/mudler/_git/llama.cpp/src/llama-sampling.cpp`
+- `/home/mudler/_git/llama.cpp/src/llama-context.cpp`
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-backend.cpp`
+- CUDA backend tensor-set paths under `/home/mudler/_git/llama.cpp/ggml/src/`.
+
+Selected secondary candidate:
+
+- Cache backend logit-bias tensor uploads in
+  `/home/mudler/_git/llama.cpp/src/llama-sampler.cpp`
+  `llama_sampler_logit_bias_backend_set_input()`.
+- Today the sampler rebuilds and uploads `logit_bias` and `logit_idxs` every
+  decode step. Those uploads hit the CUDA tensor-set path with immediate
+  `cudaStreamSynchronize`.
+- This is narrow and maintainable, but it is not the default greedy parity
+  lever. Only promote it if a non-greedy backend-sampling workload with non-empty
+  `logit_bias` proves the sync bucket is material.
+
+Required workload:
+
+- Include a non-greedy serving shape if the patch targets sampler randomness or
+  probability upload behavior.
+- Preserve the canonical greedy md5 gates even if the optimization targets
+  non-greedy serving.
+
+## Decision Gate
+
+Only one track may enter implementation at a time. Promote a candidate from scope
+to implementation when all are true:
+
+- It has an exact file/function target.
+- It is additive enough to minimize upstream conflicts.
+- It has a direct measurement bucket from Phase 6 or a fresh bounded profile.
+- It has a clear rollback path.
+- It passes the md5/op gates before any performance result is accepted.
+
+## Checklist
+
+- [x] Close remaining Phase 6 explorer agents or capture their final findings.
+- [x] Reconfirm DGX idle state before any new benchmark.
+  - Docker containers: `0`.
+  - `local-ai-worker`: `0`.
+  - Compute PIDs: `0`.
+  - Lock: `FREE released-by-codex-phase6-mmq-grid 1782860601`.
+- [x] Pick Track A or Track B from concrete code evidence.
+  - Primary: Track A, batched MoE SWIGLU -> NVFP4 down-input quantization.
+  - Secondary: Track B, backend logit-bias upload cache for non-greedy workloads.
+- [ ] Run baseline gates from the clean candidate build.
+- [ ] Implement one fork-first incremental patch.
+- [ ] Run md5/op gates before serving A/B.
+- [ ] Keep only if the serving bucket and h2h result improve materially.
+- [ ] Regenerate LocalAI patch stack and update docs if kept.
+
+## Required Tests Before Track A Source Patch
+
+- Add or extend a whole-graph op test for the batched MoE gate_up/SWIGLU/down
+  chain. Shapes must include `type_a=NVFP4`, `n_mats=128`, `n_used=8`,
+  `m=768`, `k=2048`, and `n in {16, 33, 64, 128, 130, 200}`.
+- Run `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806`
+  until a more specific op name is available.
+- Run canonical MoE and dense greedy md5 gates before serving A/B:
+  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
+- Run a mixed prompt/decode md5 gate (`ptok=512`, `gen=32`) because graph reuse
+  can hide bugs that a decode-only gate misses.
+
+## Required Tests Before Track B Source Patch
+
+- Establish fixed-seed baseline output md5 and token-id parity for a
+  backend-sampling request with non-empty `logit_bias`.
+- Include the canonical greedy MoE and dense md5 gates even though the workload
+  target is non-greedy.
+- Run existing server completion tests covering backend sampling probabilities
+  and logit-bias behavior.
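
The "Before DGX work" checks and the md5 gates in the Phase 7 plan could be wrapped in a small shell helper. The following is a minimal sketch and not part of the patch. The function names `lock_is_free`, `md5_gate`, and `dgx_is_idle` are hypothetical. It assumes the greedy output to be hashed has already been captured to a file and that `md5sum` from GNU coreutils is available. The `docker ps` and `nvidia-smi` invocations are the ones quoted in the plan.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the repository.

# lock_is_free OWNER_FILE
# Succeeds when the lock owner file is absent or its first line starts with FREE.
lock_is_free() {
    local owner_file="$1"
    local first=""
    [ ! -e "$owner_file" ] && return 0
    # read returns non-zero on a file without a trailing newline; the value is still set.
    IFS= read -r first < "$owner_file" || true
    case "$first" in
        FREE*) return 0 ;;
        *)     return 1 ;;
    esac
}

# md5_gate OUTPUT_FILE EXPECTED_MD5
# Succeeds only when the md5 of OUTPUT_FILE equals EXPECTED_MD5.
# Assumption: the generation output was written to OUTPUT_FILE beforehand.
md5_gate() {
    local actual
    actual="$(md5sum "$1" 2>/dev/null | cut -d' ' -f1)"
    [ -n "$actual" ] && [ "$actual" = "$2" ]
}

# dgx_is_idle
# Combines the host checks listed under "Before DGX work".
# A container count of zero also implies no local-ai-worker container is running.
dgx_is_idle() {
    [ "$(docker ps -q | wc -l)" -eq 0 ] || return 1
    [ -z "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ] || return 1
    lock_is_free "$HOME/gpu_bench_lock/owner"
}
```

A run might then look like `dgx_is_idle && md5_gate moe_greedy.out 8cb0ce23777bf55f92f63d0292c756b0`, where `moe_greedy.out` is a placeholder for wherever the greedy MoE output was saved. Keeping each gate as a separate function lets the md5 comparison be reused for the dense and W4A16 values without repeating the host checks.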