Fresh nsys --cuda-graph-trace=node capture of one steady-state decode step on q36-27b-nvfp4 dense at npl128 (clean Lever-1 build-cuda-base).

The decode step is a single CUDA graph. Node-level expansion shows it is 99.94% GPU-busy on a single stream, with 0.225 ms/step of inter-kernel idle (0.06%) and zero gaps >5us. This refutes the "~60% idle bubbles / 57 ms = 100% bubble" hypothesis and confirms the cudagraph-coverage source verdict.

Real decode mix:
- gated_delta_net: 196 ms = 51.6% of the step (4.08 ms/call x48). The prior 1.47 ms/call "near-vLLM" figure was a prefill-contaminated eager average.
- FP4 GEMM+quantize: 29%
- gating glue (Lever 3 target): only 3.35%
- gdn_gather: 0.06 ms

By roofline-decode's own sizing test (idle < 57 ms => gap is elsewhere), the 14% gap to vLLM lives in kernel GPU time, dominated by the bandwidth-bound GDN recurrence, not in bubbles. Lever 3 fusion is resized to ~3% and reframed as byte reduction, not bubble removal.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
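The arithmetic behind the sizing test in the message above can be re-derived from the quoted figures alone. The following is a minimal sketch, not part of the profiling tooling: the constant and function names are hypothetical, and the step time is back-computed from the GDN share (so it is an estimate of roughly 380 ms, not a measured value).

```python
# Figures quoted in the commit message; all names here are hypothetical.
GDN_MS_PER_CALL = 4.08        # gated_delta_net time per call
GDN_CALLS = 48                # calls per decode step
GDN_SHARE = 0.516             # gated_delta_net share of the step
IDLE_MS = 0.225               # inter-kernel idle per step
BUBBLE_HYPOTHESIS_MS = 57.0   # idle needed for the gap to be "all bubbles"


def decode_step_budget():
    """Re-derive the step time and idle share from the quoted numbers."""
    gdn_ms = GDN_MS_PER_CALL * GDN_CALLS      # total GDN time per step
    step_ms = gdn_ms / GDN_SHARE              # estimated step time (assumption)
    idle_pct = 100.0 * IDLE_MS / step_ms      # idle as a percentage of the step
    return {
        "gdn_ms": gdn_ms,
        "step_ms": step_ms,
        "idle_pct": idle_pct,
        "busy_pct": 100.0 - idle_pct,
        # Sizing test: the gap can live in bubbles only if idle reaches 57 ms.
        "gap_in_bubbles": IDLE_MS >= BUBBLE_HYPOTHESIS_MS,
    }


if __name__ == "__main__":
    for key, value in decode_step_budget().items():
        print(f"{key}: {value}")
```

With these inputs the idle share rounds to 0.06% and the sizing test comes out false, which matches the conclusion that the gap sits in kernel GPU time rather than in bubbles.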