mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-30 19:37:00 -04:00
The decode-serving section characterized the high-N gap as "BW-floored, vLLM pays equally / 56-68%". A clean uncontended graph-node-traced profile (dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling artifact: decode runs as a replayed CUDA graph, and nsys without --cuda-graph-trace=node collapses each replay into one opaque launch, so every prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient") was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16 difference method.

Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound.
- GDN recurrent scan 553 us/tok (51%, linear in batch, dominant)
- NVFP4 expert GEMM 254 (23%)
- bf16 proj 73 (7%)
- elementwise 57
- SSM conv 31

Gap reconciled:
vLLM-server 1177
-> vLLM true GPU-steady 1078 (chunked-prefill overlap inflates its window ~8pt)
-> llama GPU-steady 924 (= 86% of 1078)
-> llama-server 718 (61%, the ~17pt S3-recoverable serving graph-reuse overhead)

So vs vLLM's true GPU-steady decode we are ~86%, not 56%. GDN is a shared BW floor where paged leads (83% vs 79% of 273 GB/s peak; both 1.17-1.18x for 2x batch). The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton elementwise +10ms); both ggml fusions rejected:
- act-quant-into-MMQ -79.4% (ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling)
- norm+quant+silu infeasible via ggml_cuda_can_fuse

Added rejected levers: Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block GDN occupancy to -1.04% (wave-hidden).

Revised verdict: PREFILL genuinely capped (36-43%, not graph-replayed so real); DECODE-SERVING near-parity ~86% of vLLM true GPU-steady (headline 56% was a measurement/operating-point artifact). GB10-vs-datacenter framing kept.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
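The percentages in the gap reconciliation above can be rechecked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the profiling tooling. It assumes the four throughput figures are in tokens/s (the message does not state the unit), and under that assumption the 924 figure also reproduces the per-kernel shares from the us/tok breakdown. The helper `per_token_us` and its arguments are hypothetical names showing one plausible reading of the ntg=64-minus-ntg=16 difference method: subtracting two run lengths cancels the fixed per-run cost.

```python
# Figures quoted in the commit message. ASSUMPTION: unit is tokens/s.
VLLM_SERVER = 1177
VLLM_GPU_STEADY = 1078
LLAMA_GPU_STEADY = 924
LLAMA_SERVER = 718

# Per-token kernel times (us/tok) quoted for paged npl=256.
KERNEL_US_PER_TOK = {
    "gdn_scan": 553,
    "nvfp4_expert_gemm": 254,
    "bf16_proj": 73,
    "elementwise": 57,
    "ssm_conv": 31,
}


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


def per_token_us(t_long_us, t_short_us, n_long=64, n_short=16):
    """Hypothetical helper: steady per-token cost from two run lengths.

    Any fixed cost shared by both runs (prefill, setup) cancels in the
    subtraction, leaving only the marginal cost per generated token.
    """
    return (t_long_us - t_short_us) / (n_long - n_short)


# GPU-steady vs GPU-steady: the "~86%" figure.
steady_ratio = pct(LLAMA_GPU_STEADY, VLLM_GPU_STEADY)
# Server vs server: the "61%" figure.
server_ratio = pct(LLAMA_SERVER, VLLM_SERVER)
# How much the vLLM server window overstates GPU-steady: the "~8pt".
vllm_inflation_pt = 100.0 - pct(VLLM_GPU_STEADY, VLLM_SERVER)
# llama GPU-steady minus llama-server, relative to vLLM-server: the "~17pt".
serving_overhead_pt = pct(LLAMA_GPU_STEADY - LLAMA_SERVER, VLLM_SERVER)

# Under the tokens/s assumption, 924 tok/s is about 1082 us per token.
total_us_per_tok = 1e6 / LLAMA_GPU_STEADY
kernel_share = {k: pct(v, total_us_per_tok) for k, v in KERNEL_US_PER_TOK.items()}

if __name__ == "__main__":
    print(f"llama GPU-steady / vLLM GPU-steady: {steady_ratio:.1f}%")
    print(f"llama-server / vLLM-server:         {server_ratio:.1f}%")
    print(f"vLLM window inflation:              {vllm_inflation_pt:.1f}pt")
    print(f"serving overhead:                   {serving_overhead_pt:.1f}pt")
    for name, share in kernel_share.items():
        print(f"{name:>18}: {share:.1f}% of {total_us_per_tok:.0f} us/tok")
```

Read this way, the ~17pt serving overhead is measured against the vLLM-server figure (1177), which is the denominator that makes 61% + ~17pt line up with 924/1177.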