Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight.
Assisted-by: Codex:gpt-5
The decode-serving section characterized the high-N gap as "BW-floored, vLLM
pays equally / 56-68%". A clean uncontended graph-node-traced profile
(dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling
artifact: decode runs as a replayed CUDA graph, and nsys without
--cuda-graph-trace=node collapses each replay into one opaque launch, so every
prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient")
was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16
difference method.
Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound. GDN
recurrent scan 553 us/tok (51%, linear in batch, dominant), NVFP4 expert GEMM
254 (23%), bf16 proj 73 (7%), elementwise 57, SSM conv 31. Gap reconciled:
vLLM-server 1177 -> vLLM true GPU-steady 1078 (chunked-prefill overlap inflates
its window ~8pt) -> llama GPU-steady 924 (= 86% of 1078) -> llama-server 718
(61%, the ~17pt S3-recoverable serving graph-reuse overhead). So vs vLLM's true
GPU-steady decode we are ~86%, not 56%. GDN is a shared BW floor where paged
leads (83% vs 79% of 273 GB/s peak; both 1.17-1.18x for 2x batch).
The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton
elementwise +10ms); both ggml fusions rejected: act-quant-into-MMQ -79.4%
(ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling),
norm+quant+silu infeasible via ggml_cuda_can_fuse. Added rejected levers:
Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from
hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block
GDN occupancy to -1.04% (wave-hidden).
Revised verdict: PREFILL genuinely capped (36-43%, not graph-replayed so real);
DECODE-SERVING near-parity ~86% of vLLM true GPU-steady (headline 56% was a
measurement/operating-point artifact). GB10-vs-datacenter framing kept.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the
exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid
models. Captures the definitive same-session both-engine benchmark (prefill
S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for
both the MoE 35B-A3B and dense 27B models), the complete lever map (every
prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict
and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality,
GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the
shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on
GB10, not missing optimizations; the path to parity is datacenter Blackwell.
Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the
marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP,
DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are
marked estimated. Add a section-9 summary + link in the backend README.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>