docs(paged): scope phase7 serving candidates

Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-06-30 23:12:09 +00:00
parent b647460dee
commit 34c4b5ce8d
5 changed files with 190 additions and 2 deletions

@@ -560,3 +560,5 @@ Result:
- No current env-only lever clears the serving performance gate. Scope the next
source candidate against either structural MoE decode fusion or async serving
input/sampler uploads, with a workload that proves the target bucket matters.
- Phase 7 must keep the canonical MoE and dense md5 gates as the first
inference-safety check before any performance result is accepted.

@@ -1,5 +1,11 @@
# PARITY_HANDOFF: how to pick up the GB10 vLLM-parity work
> 2026-06-30 update: this handoff is now historical procedure, not the active
> verdict. The GB10 investigation was reopened in `GB10_PARITY_REOPEN_SPEC.md`
> and `GB10_PARITY_PHASE0_RESULTS.md`, with Phase 6 serving-nsys evidence and
> the active follow-up plans under `docs/superpowers/plans/`. Use those files for
> the current state before relying on the older "closed" conclusion below.
Audience: an agent with **zero prior context** who has been told to "continue the GB10 vLLM-parity investigation" on the `llama-cpp-localai-paged` backend.
This file is the **operational how-to**. It is the companion to `VLLM_PARITY_FINAL.md`, which is the **why / authoritative record** ("never re-litigate"). If the two ever disagree on a *fact*, `VLLM_PARITY_FINAL.md` and the bench artifacts it cites win; this file wins on *procedure* (how to ssh, lock, build, bench, profile).

@@ -1,5 +1,10 @@
# vLLM Parity - Final State (Qwen3.6 NVFP4 on GB10)
> 2026-06-30 update: this document records the earlier final-state verdict. The
> investigation has since been reopened; see `GB10_PARITY_REOPEN_SPEC.md`,
> `GB10_PARITY_PHASE0_RESULTS.md`, and the active `docs/superpowers/plans/`
> Phase 6/Phase 7 files for the current measured state and follow-up scope.
> **Status: CLOSED.** This is the standing record of the exhaustive GB10 (DGX
> Spark, sm_121) parity investigation for `llama-cpp-localai-paged` against vLLM
> on the Qwen3.6 hybrid gated-DeltaNet NVFP4 models. It exists so the

@@ -1,6 +1,6 @@
# Phase 6: Serving nsys Gap Classifier
-**Status:** In progress.
+**Status:** Completed. Phase 6 kept no source changes.
**Scope:** Measurement-first. Do not edit llama.cpp source in this phase unless
the serving profiles identify a small, bit-exact, fork-first patch candidate.
@@ -22,6 +22,10 @@ measured evidence.
- Patch promotion threshold: no semantic gate regression, no generated patch
hand-editing, and at least one measured serving bucket improvement that explains
a material share of the vLLM gap.
- Inference-safety rule: a candidate that changes CUDA routing, sampler inputs,
graph construction, or MoE kernels is not kept unless the md5 gates are rerun
from the clean candidate binary and still match the canonical values above.
Performance-only evidence is insufficient.
## Checklist
@@ -84,7 +88,7 @@ measured evidence.
## Current Decision
-W4A16 prefill is no longer the highest-leverage path. The accepted Phase 1-4
+W4A16 prefill was not the highest-leverage path for Phase 6. The accepted Phase 1-4
changes improved forced W4A16 from roughly `1314/1339` to `1466/1495` S_PP, but
default FP4-MMQ remains around `2303/2423`. The next evidence gate is serving
nsys, because the committed lever map says the residual gap is in real
@@ -203,3 +207,15 @@ Shape: `n=128`, `ptok=128`, `gen=64`.
Result: rejected as an env-only lever. Existing grouped-MMQ tile knobs do not
materially close the serving gap, so a selector-only source patch is not
justified.
## Completion
Phase 6 completed as a classifier, not as a source patch phase:
- Accepted source patches before Phase 6 remained intact through fork head
`d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`.
- The sampler short-circuit candidate passed inference gates but failed the
serving performance gate, so it was reverted and not mirrored.
- GDN and grouped-MMQ env grids did not clear the material-improvement threshold.
- No LocalAI patch was generated for Phase 6. The next phase must start from a
clean fork and keep the same md5/op gates before any source candidate is kept.

@@ -0,0 +1,159 @@
# Phase 7: Serving Source Candidate Scope
**Status:** Scoped. Code implementation not started.
**Goal:** Select one maintainable source candidate for the remaining GB10 MoE
serving gap, then implement only if it can be gated for inference correctness and
measured against a bucket that Phase 6 proved relevant.
## Entry State
- llama.cpp fork: `/home/mudler/_git/llama.cpp`
- Required branch: `localai-paged`
- Required clean head: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
- LocalAI patch mirror count before Phase 7: `41`, through patch `0050`
- DGX mirror used by Phase 6: `/home/mudler/llama-phase6-source`
## Required Safety Gates
- Before DGX work:
- `docker ps -q | wc -l` must be `0`.
- `nvidia-smi --query-compute-apps=pid --format=csv,noheader` must be empty.
- `~/gpu_bench_lock/owner` must be absent or start with `FREE`.
- No `local-ai-worker` container may be running.
- Before keeping any source patch:
- MoE greedy md5 must be `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense greedy md5 must be `5951a5b4d624ce891e22ab5fca9bc439`.
- If W4A16 is touched, forced `bm32` and `base` md5 must both be
`07db32c2bcb78d17a43ed18bc22705cd`.
- If `MUL_MAT_ID` routing or CUDA MoE kernels are touched, run
`test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806`.
- Patch handling:
- Source changes are fork-first in `/home/mudler/_git/llama.cpp`.
- Keep each patch incremental and additive, with helper functions preferred
over invasive rewrites.
- Regenerate LocalAI patches with `git format-patch`; do not hand-edit
generated patch files.
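
The gates above can be expressed as a small checker. This is a minimal sketch, not project tooling: the function names are hypothetical, the inputs are assumed to be captured command output and digests gathered elsewhere, and the md5 and `806/806` constants are copied from the list above.

```python
from typing import Iterable, Optional

# Canonical values copied from the gates listed above.
MOE_GREEDY_MD5 = "8cb0ce23777bf55f92f63d0292c756b0"
DENSE_GREEDY_MD5 = "5951a5b4d624ce891e22ab5fca9bc439"
W4A16_MD5 = "07db32c2bcb78d17a43ed18bc22705cd"
MUL_MAT_ID_REQUIRED = (806, 806)  # (passed, total)


def dgx_is_idle(docker_ps_q: str, compute_apps: str, lock_owner: Optional[str]) -> bool:
    """Return True only if every pre-DGX gate holds.

    docker_ps_q  -- captured stdout of `docker ps -q`
    compute_apps -- captured stdout of the nvidia-smi compute-apps query
    lock_owner   -- contents of ~/gpu_bench_lock/owner, or None if the file is absent
    """
    # Zero running containers also implies no local-ai-worker container.
    no_containers = len(docker_ps_q.split()) == 0
    no_compute = compute_apps.strip() == ""
    lock_free = lock_owner is None or lock_owner.startswith("FREE")
    return no_containers and no_compute and lock_free


def md5_gates_pass(moe_md5: str, dense_md5: str, w4a16_md5s: Iterable[str] = ()) -> bool:
    """Check the greedy md5 gates.

    w4a16_md5s carries the forced bm32 and base digests and is only
    supplied when the candidate touches W4A16.
    """
    if moe_md5 != MOE_GREEDY_MD5 or dense_md5 != DENSE_GREEDY_MD5:
        return False
    return all(digest == W4A16_MD5 for digest in w4a16_md5s)


def op_gate_passes(passed: int, total: int) -> bool:
    """MUL_MAT_ID gate: every one of the 806 cases must pass."""
    return (passed, total) == MUL_MAT_ID_REQUIRED
```

Keeping the checks as pure functions over captured strings means the same logic can be exercised without touching the DGX host.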
## Candidate Tracks
### Track A: Structural MoE Decode Kernel
Phase 6 evidence: grouped NVFP4 `mul_mat_q` accounts for roughly 30% of llama.cpp
GPU kernel time under serving, while vLLM's Marlin-MoE bucket is materially
smaller in the same workload class.
The candidate must identify a bounded change in the current `MUL_MAT_ID` or
grouped-MMQ path that reduces actual serving bucket time. Selector-only tile
retuning stays rejected unless new evidence contradicts the Phase 6 MMQ grid result.
Selected first candidate:
- Add a batched CUDA path that fuses MoE SWIGLU with the NVFP4 activation
quantization feeding the **down** `MUL_MAT_ID`.
- Current graph shape:
`ffn_moe_gate_up` `MUL_MAT_ID` -> gate/up views -> `ggml_swiglu_split` ->
`ffn_moe_down` `MUL_MAT_ID`.
- Target: remove or reduce the separate f32 SWIGLU intermediate write/read and
`quantize_mmq_nvfp4` pass for the down projection while preserving the existing
grouped-MMQ kernel and accumulation order.
- Keep scope to CUDA, Blackwell native FP4, `GGML_TYPE_NVFP4`, merged gate/up
MoE, down projection only, no bias/clamp/OAI/GEGLU.
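
The intent of the fusion can be illustrated with a host-side reference. This is a simplified sketch under stated assumptions, not the CUDA kernel and not the real NVFP4 layout: the block length of 16, the float (rather than FP8) scale, and the nearest-value rounding are illustrative choices, and all function names are hypothetical. It only shows that applying SWIGLU block by block and quantizing immediately yields the same result as materializing the full intermediate first.

```python
import math
from typing import List, Sequence, Tuple

BLOCK = 16  # assumed quantization block length for this sketch
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # FP4 E2M1 magnitudes

QuantBlock = Tuple[float, List[int]]


def swiglu(gate: float, up: float) -> float:
    """silu(gate) * up for one element."""
    return gate / (1.0 + math.exp(-gate)) * up


def quantize_block(values: Sequence[float]) -> QuantBlock:
    """Map one block to (scale, signed grid indices).

    Simplification: the scale stays a Python float instead of an FP8 value.
    """
    amax = max(abs(v) for v in values)
    scale = amax / E2M1_GRID[-1] if amax > 0.0 else 1.0
    codes = []
    for v in values:
        mag = abs(v) / scale
        idx = min(range(len(E2M1_GRID)), key=lambda i: abs(E2M1_GRID[i] - mag))
        codes.append(-idx if v < 0 else idx)
    return scale, codes


def two_pass(gate: Sequence[float], up: Sequence[float]) -> List[QuantBlock]:
    """Current shape: write the full SWIGLU intermediate, then quantize it."""
    inter = [swiglu(g, u) for g, u in zip(gate, up)]
    return [quantize_block(inter[i:i + BLOCK]) for i in range(0, len(inter), BLOCK)]


def fused(gate: Sequence[float], up: Sequence[float]) -> List[QuantBlock]:
    """Target shape: compute one block of SWIGLU and quantize it right away."""
    out = []
    for i in range(0, len(gate), BLOCK):
        block = [swiglu(g, u) for g, u in zip(gate[i:i + BLOCK], up[i:i + BLOCK])]
        out.append(quantize_block(block))
    return out
```

Because both paths perform the same per-element arithmetic in the same order, their outputs compare equal, which is the property the md5 gates are meant to protect on the real kernel.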
Important finding:
- Existing CUDA `MUL_MAT_ID + GLU` fusion is vector-only. The fusion predicates
reject `MUL_MAT_ID` when `dst->ne[2] != 1`, so it does not cover the Phase 6
multi-token serving shape.
- Existing `MUL_MAT_ID_FUSION` tests cover add/mul after `MUL_MAT_ID`, not the
gate_up/SWIGLU/down chain. Do not treat them as sufficient for this candidate.
Initial files to inspect:
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu`
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cu`
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu`
- vLLM Marlin-MoE implementation files in the local vLLM checkout/package.
### Track B: Serving Input And Sampler Synchronization
Phase 6 evidence: `cudaStreamSynchronize` dominates CUDA API time, and many
syncs follow small `cudaMemcpyAsync` calls. The greedy sampler short-circuit
passed correctness gates but did not improve serving, so this track needs a
workload where sampler/input upload cost is proven relevant before patching.
Initial files to inspect:
- `/home/mudler/_git/llama.cpp/src/llama-sampling.cpp`
- `/home/mudler/_git/llama.cpp/src/llama-context.cpp`
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-backend.cpp`
- CUDA backend tensor-set paths under `/home/mudler/_git/llama.cpp/ggml/src/`.
Selected secondary candidate:
- Cache backend logit-bias tensor uploads in
`/home/mudler/_git/llama.cpp/src/llama-sampler.cpp`
`llama_sampler_logit_bias_backend_set_input()`.
- Today the sampler rebuilds and uploads `logit_bias` and `logit_idxs` on every
decode step. Those uploads hit the CUDA tensor-set path, which issues an
immediate `cudaStreamSynchronize`.
- This is narrow and maintainable, but it is not the default greedy parity
lever. Only promote it if a non-greedy backend-sampling workload with non-empty
`logit_bias` proves the sync bucket is material.
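
The caching idea can be sketched independently of the sampler code. This is a hypothetical illustration, not the llama.cpp implementation: the class name, the `upload` callback, and the dict-based bias table are assumptions made for the example. A real patch would presumably also have to invalidate the cache whenever the destination tensors are reallocated.

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[Tuple[int, ...], Tuple[float, ...]]


class LogitBiasUploadCache:
    """Skip the device upload when the bias table is unchanged since the last step."""

    def __init__(self, upload: Callable[[Tuple[int, ...], Tuple[float, ...]], None]):
        # `upload` stands in for the backend tensor-set calls (hypothetical hook).
        self._upload = upload
        self._last: Optional[Key] = None

    def set_input(self, logit_bias: Dict[int, float]) -> bool:
        """Return True if an upload was issued, False on a cache hit."""
        idxs = tuple(sorted(logit_bias))
        vals = tuple(logit_bias[i] for i in idxs)
        key = (idxs, vals)
        if key == self._last:
            return False  # same indices and values: nothing to re-upload
        self._upload(idxs, vals)
        self._last = key
        return True

    def invalidate(self) -> None:
        """Force the next set_input call to upload again."""
        self._last = None
```

Sorting the indices makes the comparison independent of dict insertion order, so an unchanged request hits the cache on every decode step after the first.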
Required workload:
- Include a non-greedy serving shape if the patch targets sampler randomness or
probability upload behavior.
- Preserve the canonical greedy md5 gates even if the optimization targets
non-greedy serving.
## Decision Gate
Only one track may enter implementation at a time. Promote a candidate from scope
to implementation only when all of the following are true:
- It has an exact file/function target.
- It is additive enough to minimize upstream conflicts.
- It has a direct measurement bucket from Phase 6 or a fresh bounded profile.
- It has a clear rollback path.
- It passes the md5/op gates before any performance result is accepted.
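
As a sketch of how the gate could be tracked mechanically, the criteria above map onto one boolean per bullet. The `Candidate` type and `may_promote` function are hypothetical names for this illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    has_exact_target: bool        # exact file/function target
    is_additive: bool             # additive enough to minimize upstream conflicts
    has_measurement_bucket: bool  # Phase 6 bucket or a fresh bounded profile
    has_rollback_path: bool       # clear rollback path
    passes_md5_op_gates: bool     # md5/op gates pass before perf results count


def may_promote(candidate: Candidate, in_implementation: Optional[str]) -> bool:
    """Apply the decision gate.

    in_implementation is the name of the track already being implemented,
    or None if no track has entered implementation yet.
    """
    # Only one track may be in implementation at a time.
    if in_implementation is not None and in_implementation != candidate.name:
        return False
    return all([
        candidate.has_exact_target,
        candidate.is_additive,
        candidate.has_measurement_bucket,
        candidate.has_rollback_path,
        candidate.passes_md5_op_gates,
    ])
```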
## Checklist
- [x] Close remaining Phase 6 explorer agents or capture their final findings.
- [x] Reconfirm DGX idle state before any new benchmark.
- Docker containers: `0`.
- `local-ai-worker`: `0`.
- Compute PIDs: `0`.
- Lock: `FREE released-by-codex-phase6-mmq-grid 1782860601`.
- [x] Pick Track A or Track B from concrete code evidence.
- Primary: Track A, batched MoE SWIGLU -> NVFP4 down-input quantization.
- Secondary: Track B, backend logit-bias upload cache for non-greedy workloads.
- [ ] Run baseline gates from the clean candidate build.
- [ ] Implement one fork-first incremental patch.
- [ ] Run md5/op gates before serving A/B.
- [ ] Keep only if the serving bucket and h2h result improve materially.
- [ ] Regenerate LocalAI patch stack and update docs if kept.
## Required Tests Before Track A Source Patch
- Add or extend a whole-graph op test for the batched MoE gate_up/SWIGLU/down
chain. Shapes must include `type_a=NVFP4`, `n_mats=128`, `n_used=8`,
`m=768`, `k=2048`, and `n in {16, 33, 64, 128, 130, 200}`.
- Run `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806`
until a more specific op name is available.
- Run canonical MoE and dense greedy md5 gates before serving A/B:
- MoE `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense `5951a5b4d624ce891e22ab5fca9bc439`.
- Run a mixed prompt/decode md5 gate (`ptok=512`, `gen=32`) because graph reuse
can hide bugs that a decode-only gate misses.
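
The shape list for the whole-graph op test can be enumerated up front. This is a small sketch with hypothetical helper names; the parameter values are the ones listed above. The `tile` argument is an assumed tile width used only to show which `n` values are not multiples of it.

```python
from typing import Dict, List

# n values required by the Track A test list above.
N_VALUES = (16, 33, 64, 128, 130, 200)


def track_a_shapes() -> List[Dict[str, object]]:
    """One test case per n, with the fixed MoE parameters from the list above."""
    return [
        {"type_a": "NVFP4", "n_mats": 128, "n_used": 8, "m": 768, "k": 2048, "n": n}
        for n in N_VALUES
    ]


def off_tile_ns(tile: int = 16) -> List[int]:
    """n values that are not a multiple of `tile` (tile width is an assumption)."""
    return [n for n in N_VALUES if n % tile != 0]
```

With the assumed tile width of 16, `off_tile_ns()` returns `[33, 130, 200]`, the cases that would exercise a partial final tile.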
## Required Tests Before Track B Source Patch
- Establish fixed-seed baseline output md5 and token-id parity for a
backend-sampling request with non-empty `logit_bias`.
- Include the canonical greedy MoE and dense md5 gates even though the workload
target is non-greedy.
- Run existing server completion tests covering backend sampling probabilities
and logit-bias behavior.
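
A baseline-versus-candidate comparison for the first item could look like the following. This is a minimal sketch: the helper names are hypothetical, and hashing the UTF-8 bytes of the generated text is an assumption about how the output md5 is produced.

```python
import hashlib
from typing import Optional, Sequence


def output_md5(text: str) -> str:
    """md5 of the generated text, hashed as UTF-8 bytes (assumed convention)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def first_token_mismatch(baseline: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Index of the first differing token id, or None if the sequences match."""
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    if len(baseline) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(baseline), len(candidate))
    return None


def parity_holds(base_text: str, cand_text: str,
                 base_ids: Sequence[int], cand_ids: Sequence[int]) -> bool:
    """Fixed-seed parity: same output digest and identical token ids."""
    return (output_md5(base_text) == output_md5(cand_text)
            and first_token_mismatch(base_ids, cand_ids) is None)
```

Reporting the first mismatching index, rather than only a pass/fail digest, makes it easier to see at which decode step a candidate diverges.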