mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): scope phase7 serving candidates
Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.

Assisted-by: Codex:gpt-5
@@ -560,3 +560,5 @@ Result:
- No current env-only lever clears the serving performance gate. Scope the next
  source candidate against either structural MoE decode fusion or async serving
  input/sampler uploads, with a workload that proves the target bucket matters.
- Phase 7 must keep the canonical MoE and dense md5 gates as the first
  inference-safety check before any performance result is accepted.

@@ -1,5 +1,11 @@
# PARITY_HANDOFF: how to pick up the GB10 vLLM-parity work

> 2026-06-30 update: this handoff is now historical procedure, not the active
> verdict. The GB10 investigation was reopened in `GB10_PARITY_REOPEN_SPEC.md`
> and `GB10_PARITY_PHASE0_RESULTS.md`, with Phase 6 serving-nsys evidence and
> the active follow-up plans under `docs/superpowers/plans/`. Use those files for
> the current state before relying on the older "closed" conclusion below.

Audience: an agent with **zero prior context** who has been told to "continue the GB10 vLLM-parity investigation" on the `llama-cpp-localai-paged` backend.

This file is the **operational how-to**. It is the companion to `VLLM_PARITY_FINAL.md`, which is the **why / authoritative record** ("never re-litigate"). If the two ever disagree on a *fact*, `VLLM_PARITY_FINAL.md` and the bench artifacts it cites win; this file wins on *procedure* (how to ssh, lock, build, bench, profile).

@@ -1,5 +1,10 @@
# vLLM Parity - Final State (Qwen3.6 NVFP4 on GB10)

> 2026-06-30 update: this document records the earlier final-state verdict. The
> investigation has since been reopened; see `GB10_PARITY_REOPEN_SPEC.md`,
> `GB10_PARITY_PHASE0_RESULTS.md`, and the active `docs/superpowers/plans/`
> Phase 6/Phase 7 files for the current measured state and follow-up scope.

> **Status: CLOSED.** This is the standing record of the exhaustive GB10 (DGX
> Spark, sm_121) parity investigation for `llama-cpp-localai-paged` against vLLM
> on the Qwen3.6 hybrid gated-DeltaNet NVFP4 models. It exists so the

@@ -1,6 +1,6 @@
# Phase 6: Serving nsys Gap Classifier

**Status:** In progress.
**Status:** Completed. Phase 6 kept no source changes.

**Scope:** Measurement-first. Do not edit llama.cpp source in this phase unless
the serving profiles identify a small, bit-exact, fork-first patch candidate.
@@ -22,6 +22,10 @@ measured evidence.
- Patch promotion threshold: no semantic gate regression, no generated patch
  hand-editing, and at least one measured serving bucket improvement that explains
  a material share of the vLLM gap.
- Inference-safety rule: a candidate that changes CUDA routing, sampler inputs,
  graph construction, or MoE kernels is not kept unless the md5 gates are rerun
  from the clean candidate binary and still match the canonical values above.
  Performance-only evidence is insufficient.

## Checklist

@@ -84,7 +88,7 @@ measured evidence.

## Current Decision

W4A16 prefill is no longer the highest-leverage path. The accepted Phase 1-4
W4A16 prefill was not the highest-leverage path for Phase 6. The accepted Phase 1-4
changes improved forced W4A16 from roughly `1314/1339` to `1466/1495` S_PP, but
default FP4-MMQ remains around `2303/2423`. The next evidence gate is serving
nsys, because the committed lever map says the residual gap is in real
@@ -203,3 +207,15 @@ Shape: `n=128`, `ptok=128`, `gen=64`.
Result: rejected as an env-only lever. Existing grouped-MMQ tile knobs do not
materially close the serving gap, so a selector-only source patch is not
justified.

## Completion

Phase 6 completed as a classifier, not as a source patch phase:

- Accepted source patches before Phase 6 remained intact through fork head
  `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`.
- The sampler short-circuit candidate passed inference gates but failed the
  serving performance gate, so it was reverted and not mirrored.
- GDN and grouped-MMQ env grids did not clear the material-improvement threshold.
- No LocalAI patch was generated for Phase 6. The next phase must start from a
  clean fork and keep the same md5/op gates before any source candidate is kept.

159
docs/superpowers/plans/2026-06-30-serving-source-phase7.md
Normal file
@@ -0,0 +1,159 @@
# Phase 7: Serving Source Candidate Scope

**Status:** Scoped. Code implementation not started.

**Goal:** Select one maintainable source candidate for the remaining GB10 MoE
serving gap, then implement only if it can be gated for inference correctness and
measured against a bucket that Phase 6 proved relevant.

## Entry State

- llama.cpp fork: `/home/mudler/_git/llama.cpp`
- Required branch: `localai-paged`
- Required clean head: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
- LocalAI patch mirror count before Phase 7: `41`, through patch `0050`
- DGX mirror used by Phase 6: `/home/mudler/llama-phase6-source`

## Required Safety Gates

- Before DGX work:
  - `docker ps -q | wc -l` must be `0`.
  - `nvidia-smi --query-compute-apps=pid --format=csv,noheader` must be empty.
  - `~/gpu_bench_lock/owner` must be absent or start with `FREE`.
  - No `local-ai-worker` container may be running.
- Before keeping any source patch:
  - MoE greedy md5 must be `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense greedy md5 must be `5951a5b4d624ce891e22ab5fca9bc439`.
  - If W4A16 is touched, forced `bm32` and `base` md5 must both be
    `07db32c2bcb78d17a43ed18bc22705cd`.
  - If `MUL_MAT_ID` routing or CUDA MoE kernels are touched, run
    `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806`.
- Patch handling:
  - Source changes are fork-first in `/home/mudler/_git/llama.cpp`.
  - Keep each patch incremental and additive, with helper functions preferred
    over invasive rewrites.
  - Regenerate LocalAI patches with `git format-patch`; do not hand-edit
    generated patch files.
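
The "Before DGX work" checks can be evaluated mechanically once the command outputs are captured. The sketch below is illustrative only: the function name `dgx_idle_problems` and its parameters are hypothetical, and it assumes the caller has already collected the raw text of each command listed above.

```python
from typing import List, Optional


def dgx_idle_problems(
    docker_ps_q: str,
    compute_apps_csv: str,
    lock_owner: Optional[str],
    running_container_names: List[str],
) -> List[str]:
    """Return the reasons the DGX is not idle; an empty list means go."""
    problems = []

    # `docker ps -q` prints one container id per line; the gate wants zero.
    n_containers = len([ln for ln in docker_ps_q.splitlines() if ln.strip()])
    if n_containers != 0:
        problems.append(f"{n_containers} docker container(s) running")

    # `nvidia-smi --query-compute-apps=pid --format=csv,noheader` must be empty.
    pids = [ln.strip() for ln in compute_apps_csv.splitlines() if ln.strip()]
    if pids:
        problems.append("GPU compute PIDs present: " + ",".join(pids))

    # The lock file must be absent (None here) or start with FREE.
    if lock_owner is not None and not lock_owner.startswith("FREE"):
        problems.append("gpu_bench_lock is held: " + lock_owner.strip())

    # No local-ai-worker container may be running.
    if any("local-ai-worker" in name for name in running_container_names):
        problems.append("local-ai-worker container is running")

    return problems
```

One way to feed it is `subprocess.run(..., capture_output=True, text=True).stdout` for each command, with the container names taken from `docker ps --format '{{.Names}}'`. Keeping the decision logic free of subprocess calls means it can be exercised without touching the DGX.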

## Candidate Tracks

### Track A: Structural MoE Decode Kernel

Phase 6 evidence: grouped NVFP4 `mul_mat_q` accounts for roughly 30% of llama.cpp
GPU kernel time under serving, while vLLM's Marlin-MoE bucket is materially
smaller in the same workload class.

The candidate must identify a bounded change in the current `MUL_MAT_ID` or
grouped-MMQ path that reduces actual serving bucket time. Selector-only tile
retuning is rejected unless new evidence differs from the Phase 6 MMQ grid.

Selected first candidate:

- Add a batched CUDA path that fuses MoE SWIGLU with the NVFP4 activation
  quantization feeding the **down** `MUL_MAT_ID`.
- Current graph shape:
  `ffn_moe_gate_up` `MUL_MAT_ID` -> gate/up views -> `ggml_swiglu_split` ->
  `ffn_moe_down` `MUL_MAT_ID`.
- Target: remove or reduce the separate f32 SWIGLU intermediate write/read and
  `quantize_mmq_nvfp4` pass for the down projection while preserving the existing
  grouped-MMQ kernel and accumulation order.
- Keep scope to CUDA, Blackwell native FP4, `GGML_TYPE_NVFP4`, merged gate/up
  MoE, down projection only, no bias/clamp/OAI/GEGLU.
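
The fusion above is only safe if it is bit-exact against the two-pass path. A minimal CPU reference sketch of that property follows. It assumes the usual SwiGLU definition `silu(gate) * up` and a 16-element quantization block, and it uses a toy integer quantizer as a stand-in for `quantize_mmq_nvfp4`; it does not model the real NVFP4 encoding or the CUDA kernels.

```python
import math
from typing import List, Tuple

BLOCK = 16  # assumed quantization block length for this sketch


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def quantize_block(vals: List[float]) -> Tuple[float, List[int]]:
    """Toy symmetric quantizer: one scale per block, integers in [-7, 7].

    Hypothetical stand-in for quantize_mmq_nvfp4, NOT the real NVFP4 format.
    """
    amax = max(abs(v) for v in vals)
    scale = amax / 7.0 if amax > 0.0 else 1.0
    return scale, [round(v / scale) for v in vals]


def unfused(gate: List[float], up: List[float]):
    # Pass 1: materialize the full SWIGLU intermediate.
    inter = [silu(g) * u for g, u in zip(gate, up)]
    # Pass 2: re-read the intermediate and quantize it block by block.
    return [quantize_block(inter[i:i + BLOCK]) for i in range(0, len(inter), BLOCK)]


def fused(gate: List[float], up: List[float]):
    out = []
    for i in range(0, len(gate), BLOCK):
        # Compute one block and quantize it immediately, so the full
        # intermediate is never written out.
        blk = [silu(g) * u for g, u in zip(gate[i:i + BLOCK], up[i:i + BLOCK])]
        out.append(quantize_block(blk))
    return out
```

Because both paths apply the same per-element arithmetic in the same order, `fused(gate, up) == unfused(gate, up)` should hold exactly. The same exact-equality comparison is what a whole-graph op test would need to enforce for the real kernels.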

Important finding:

- Existing CUDA `MUL_MAT_ID + GLU` fusion is vector-only. The fusion predicates
  reject `MUL_MAT_ID` when `dst->ne[2] != 1`, so it does not cover the Phase 6
  multi-token serving shape.
- Existing `MUL_MAT_ID_FUSION` tests cover add/mul after `MUL_MAT_ID`, not the
  gate_up/SWIGLU/down chain. Do not treat them as sufficient for this candidate.

Initial files to inspect:

- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu`
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cu`
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu`
- vLLM Marlin-MoE implementation files in the local vLLM checkout/package.

### Track B: Serving Input And Sampler Synchronization

Phase 6 evidence: `cudaStreamSynchronize` dominates CUDA API time, and many
syncs follow small `cudaMemcpyAsync` calls. The greedy sampler short-circuit
passed correctness gates but did not improve serving, so this track needs a
workload where sampler/input upload cost is proven relevant before patching.

Initial files to inspect:

- `/home/mudler/_git/llama.cpp/src/llama-sampling.cpp`
- `/home/mudler/_git/llama.cpp/src/llama-context.cpp`
- `/home/mudler/_git/llama.cpp/ggml/src/ggml-backend.cpp`
- CUDA backend tensor-set paths under `/home/mudler/_git/llama.cpp/ggml/src/`.

Selected secondary candidate:

- Cache backend logit-bias tensor uploads in
  `/home/mudler/_git/llama.cpp/src/llama-sampler.cpp`
  `llama_sampler_logit_bias_backend_set_input()`.
- Today the sampler rebuilds and uploads `logit_bias` and `logit_idxs` every
  decode step. Those uploads hit the CUDA tensor-set path with immediate
  `cudaStreamSynchronize`.
- This is narrow and maintainable, but it is not the default greedy parity
  lever. Only promote it if a non-greedy backend-sampling workload with non-empty
  `logit_bias` proves the sync bucket is material.
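
The caching idea can be sketched independently of the sampler code. The class below is hypothetical (`LogitBiasUploadCache` and its `upload` callback are not llama.cpp names); the callback stands in for the tensor-set call that currently triggers a sync on every decode step.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Stand-in for the backend tensor upload: receives (logit_idxs, logit_bias).
Upload = Callable[[List[int], List[float]], None]


class LogitBiasUploadCache:
    """Skip the upload when the bias table is unchanged since the last step."""

    def __init__(self, upload: Upload) -> None:
        self._upload = upload
        self._last_key: Optional[Tuple[Tuple[int, float], ...]] = None
        self.uploads = 0

    def set_input(self, logit_bias: Dict[int, float]) -> bool:
        # Sort so that insertion order does not defeat the cache.
        key = tuple(sorted(logit_bias.items()))
        if key == self._last_key:
            return False  # cache hit: nothing to upload, no sync
        idxs = [tok for tok, _ in key]
        bias = [b for _, b in key]
        self._upload(idxs, bias)
        self._last_key = key
        self.uploads += 1
        return True
```

With an unchanged bias table across a request, `uploads` stays at 1 regardless of how many decode steps run. A real patch would also have to invalidate the cache whenever the destination tensor is reallocated or the graph is rebuilt, which this sketch does not model.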

Required workload:

- Include a non-greedy serving shape if the patch targets sampler randomness or
  probability upload behavior.
- Preserve the canonical greedy md5 gates even if the optimization targets
  non-greedy serving.

## Decision Gate

Only one track may enter implementation at a time. Promote a candidate from scope
to implementation when all of the following are true:

- It has an exact file/function target.
- It is additive enough to minimize upstream conflicts.
- It has a direct measurement bucket from Phase 6 or a fresh bounded profile.
- It has a clear rollback path.
- It passes the md5/op gates before any performance result is accepted.
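
The gate is a plain conjunction, so it can be recorded as data rather than prose. A minimal sketch, with hypothetical field names that mirror the five criteria above:

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Candidate:
    exact_target: bool        # exact file/function target
    additive: bool            # additive enough to minimize upstream conflicts
    measurement_bucket: bool  # Phase 6 bucket or fresh bounded profile
    rollback_path: bool       # clear rollback path
    gates_pass: bool          # md5/op gates pass before perf is accepted


def unmet_criteria(c: Candidate) -> List[str]:
    """Names of the criteria that are still false, in declaration order."""
    return [f.name for f in fields(c) if not getattr(c, f.name)]


def may_promote(c: Candidate, track_in_progress: bool) -> bool:
    # Only one track may enter implementation at a time.
    return not track_in_progress and not unmet_criteria(c)
```

Returning the unmet criteria by name, rather than a bare boolean, keeps the reason for a rejection visible in the plan's checklist.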

## Checklist

- [x] Close remaining Phase 6 explorer agents or capture their final findings.
- [x] Reconfirm DGX idle state before any new benchmark.
  - Docker containers: `0`.
  - `local-ai-worker`: `0`.
  - Compute PIDs: `0`.
  - Lock: `FREE released-by-codex-phase6-mmq-grid 1782860601`.
- [x] Pick Track A or Track B from concrete code evidence.
  - Primary: Track A, batched MoE SWIGLU -> NVFP4 down-input quantization.
  - Secondary: Track B, backend logit-bias upload cache for non-greedy workloads.
- [ ] Run baseline gates from the clean candidate build.
- [ ] Implement one fork-first incremental patch.
- [ ] Run md5/op gates before serving A/B.
- [ ] Keep only if the serving bucket and h2h result improve materially.
- [ ] Regenerate LocalAI patch stack and update docs if kept.

## Required Tests Before Track A Source Patch

- Add or extend a whole-graph op test for the batched MoE gate_up/SWIGLU/down
  chain. Shapes must include `type_a=NVFP4`, `n_mats=128`, `n_used=8`,
  `m=768`, `k=2048`, and `n in {16, 33, 64, 128, 130, 200}`.
- Run `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806`
  until a more specific op name is available.
- Run canonical MoE and dense greedy md5 gates before serving A/B:
  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
- Run a mixed prompt/decode md5 gate (`ptok=512`, `gen=32`) because graph reuse
  can hide bugs that a decode-only gate misses.
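
The md5 gates and the shape grid above are small enough to pin down in a helper. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and exactly which bytes the canonical md5 is taken over (raw generated text, a log file, etc.) is defined by the existing gate scripts, not by this snippet.

```python
import hashlib
from typing import Dict, List

# Canonical values copied from the gates listed above.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_hex(output: bytes) -> str:
    """md5 of captured output bytes, as a lowercase hex string."""
    return hashlib.md5(output).hexdigest()


def failed_gates(observed: Dict[str, str]) -> List[str]:
    """Names of gates whose observed md5 is missing or differs from canonical."""
    return [name for name, want in CANONICAL_MD5.items()
            if observed.get(name) != want]


def track_a_shapes() -> List[dict]:
    """The batched gate_up/SWIGLU/down test shapes required for Track A."""
    return [dict(type_a="NVFP4", n_mats=128, n_used=8, m=768, k=2048, n=n)
            for n in (16, 33, 64, 128, 130, 200)]
```

A missing gate counts as a failure here on purpose, so that skipping a run cannot be mistaken for passing it.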

## Required Tests Before Track B Source Patch

- Establish fixed-seed baseline output md5 and token-id parity for a
  backend-sampling request with non-empty `logit_bias`.
- Include the canonical greedy MoE and dense md5 gates even though the workload
  target is non-greedy.
- Run existing server completion tests covering backend sampling probabilities
  and logit-bias behavior.
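
Token-id parity is more useful than an md5 alone when a run diverges, because it shows where the outputs split. A minimal sketch, assuming the baseline and candidate token ids have already been captured as lists (the function name is hypothetical):

```python
from typing import List, Optional


def first_divergence(baseline: List[int], candidate: List[int]) -> Optional[int]:
    """Index of the first differing token id, or None if the runs match."""
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    # One sequence is a strict prefix of the other: they diverge at its end.
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return None
```

A `None` result together with a matching fixed-seed md5 is the parity condition; any integer result points at the decode step to inspect first.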